├── LICENSE
├── README.md
├── Section 1
│   ├── Video 1.1 Improving your models using Feature engineering.ipynb
│   ├── Video 1.2 Implementing feature engineering with logistic regression.ipynb
│   ├── Video 1.3 Extracting data with feature selection and interaction.ipynb
│   ├── video 1.4 Combining all together.ipynb
│   └── video 1.5 Build model based on real-world problem.ipynb
├── Section 2
│   ├── video 2.1 Support Vector Machines.ipynb
│   ├── video 2.2 Implementing kNN on the dataset.ipynb
│   ├── video 2.3 Decision Tree as predictive model.ipynb
│   ├── video 2.4 Tricks with dimensionality reduction method.ipynb
│   └── video 2.5 combining all together.ipynb
├── Section 3
│   ├── 3.1 Random Forest for classification.ipynb
│   ├── 3.2 Gradient boosting trees and bayes optimization.ipynb
│   ├── 3.2 gradient boosting.ipynb
│   ├── 3.4 Implement blending.ipynb
│   └── 3.5 stacking.ipynb
├── Section 4
│   ├── 4_1_Memory based collaborative filtering.ipynb
│   ├── 4_2_Item to item recommendation with kNN.ipynb
│   ├── 4_3_Applying Matrix Factorization on dataset.ipynb
│   └── 4_4_Wordbach at use.ipynb
└── Section 5
    ├── 5_1_Validation dataset tuning.ipynb
    ├── 5_2_Regularizing model to avoid overfitting.ipynb
    ├── 5_3_Adversarial Validation.ipynb
    └── 5_4_Perform metric selection on real data.ipynb
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2018 Packt
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Python-Machine-Learning-Tips-Tricks-and-Techniques
2 | Python Machine Learning: Tips, Tricks, and Techniques, published by Packt
3 | # Python Machine Learning Tips, Tricks, and Techniques [Video]
4 | This is the code repository for [Python Machine Learning Tips, Tricks, and Techniques [Video]](https://www.packtpub.com/big-data-and-business-intelligence/python-machine-learning-tips-tricks-and-techniques-video?utm_source=github&utm_medium=repository&utm_campaign=9781789135817), published by [Packt](https://www.packtpub.com/?utm_source=github). It contains all the supporting project files necessary to work through the video course from start to finish.
5 | ## About the Video Course
6 | Machine learning allows us to interpret data structures and fit that data into models to identify patterns and make predictions. Python makes this easier with its huge set of libraries that can be easily used for machine learning. In this course, you will learn from a top Kaggle master how to upgrade your Python machine learning skills with the latest advancements in the field.
7 | It is essential to keep upgrading your machine learning skills as there are immense advancements taking place every day. In this course, you will get hands-on experience of solving real problems by implementing cutting-edge techniques to significantly boost your Python Machine Learning skills and, as a consequence, achieve optimized results in almost any project you are working on.
8 | Each technique we cover is itself enough to improve your results. However, combining them is where the real magic happens. Throughout the course, you will work on real datasets to increase your expertise and keep adding new tools to your machine learning toolbox.
9 | By the end of this course, you will know various tips, tricks, and techniques to upgrade your machine learning algorithms to reduce common problems, all the while building efficient machine learning models.
10 |
11 | ## What You Will Learn
12 |
13 | - Tips and tricks to speed up your modeling process and obtain better results
14 | - Make predictions using advanced regression analysis with Python
15 | - Modern techniques for solving supervised learning problems
16 | - Various ways to use ensemble learning with Python to derive optimum results
17 | - Build your own recommendation engine and perform collaborative filtering
18 | - Give your production machine learning system improved reliability
19 |
20 |
21 | ## Instructions and Navigation
22 | ### Assumed Knowledge
23 | To fully benefit from the coverage included in this course, you will need to be familiar with basic Python programming and the common machine learning libraries.
24 | This course is for aspiring data science professionals and machine learning practitioners.
25 | ### Technical Requirements
26 | This course has the following software requirements:
27 | Python 3, with the pandas, NumPy, scikit-learn, and Matplotlib libraries used throughout the notebooks
28 |
29 | ## Related Products
30 | * [Python Data Structures and Algorithms [Video]](https://www.packtpub.com/application-development/python-data-structures-and-algorithms-video?utm_source=github&utm_medium=repository&utm_campaign=9781788622066)
31 |
32 | * [Hands-on Webpack for React Development [Video]](https://www.packtpub.com/application-development/hands-webpack-react-development-video?utm_source=github&utm_medium=repository&utm_campaign=9781789139808)
33 |
34 | * [Object-oriented and Functional Programming with Java 8 [Integrated Course]](https://www.packtpub.com/application-development/object-oriented-and-functional-programming-java-8-integrated-course?utm_source=github&utm_medium=repository&utm_campaign=9781788294027)
35 |
36 |
--------------------------------------------------------------------------------
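Note on setup: the notebooks in this repository import pandas, NumPy, and scikit-learn, and the Section 2 notebooks also use Matplotlib. A minimal environment sketch, assuming Python 3 and pip; the package list is inferred from the notebooks' imports rather than stated in the README:

```python
# Install the dependencies first (shell): pip install numpy pandas scikit-learn matplotlib jupyter
# Then verify that everything the notebooks import is importable.
import numpy as np
import pandas as pd
import sklearn
import matplotlib

for mod in (np, pd, sklearn, matplotlib):
    print(mod.__name__, mod.__version__)
```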
/Section 1/Video 1.1 Improving your models using Feature engineering.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "**Types of scaling**:\n",
8 | "\n",
9 | "* MinMaxScaler - scales all features to $[a, b]$ range\n",
10 | "\n",
11 | "* StandardScaler - removes mean and divides by variance of all features. $X^{new}_i = \\frac{X_i - \\mu}{\\sigma}$, where $\\mu $is for mean and $\\sigma$ is for variance\n",
12 | "\n",
13 | "* RobustScaler - same as StandardScaler but removes median and divides by IQR\n"
14 | ]
15 | },
16 | {
17 | "cell_type": "code",
18 | "execution_count": 7,
19 | "metadata": {
20 | "collapsed": true
21 | },
22 | "outputs": [],
23 | "source": [
24 | "import pandas as pd\n",
25 | "import numpy as np"
26 | ]
27 | },
28 | {
29 | "cell_type": "code",
30 | "execution_count": 8,
31 | "metadata": {
32 | "collapsed": true
33 | },
34 | "outputs": [],
35 | "source": [
36 | "df = pd.read_csv(\"http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv\",sep = ';')"
37 | ]
38 | },
39 | {
40 | "cell_type": "code",
41 | "execution_count": 10,
42 | "metadata": {},
43 | "outputs": [],
44 | "source": [
45 | "y = df.pop('quality')"
46 | ]
47 | },
48 | {
49 | "cell_type": "code",
50 | "execution_count": 11,
51 | "metadata": {
52 | "collapsed": true
53 | },
54 | "outputs": [],
55 | "source": [
56 | "for i in df.columns:\n",
57 | " df[i] = df[i].fillna(np.mean(df[i]))"
58 | ]
59 | },
60 | {
61 | "cell_type": "code",
62 | "execution_count": 12,
63 | "metadata": {
64 | "collapsed": true
65 | },
66 | "outputs": [],
67 | "source": [
68 | "from sklearn.preprocessing import StandardScaler,MinMaxScaler,RobustScaler\n",
69 | "from sklearn.linear_model import Ridge\n",
70 | "from sklearn.model_selection import train_test_split\n",
71 | "from sklearn.metrics import mean_squared_error"
72 | ]
73 | },
74 | {
75 | "cell_type": "code",
76 | "execution_count": 13,
77 | "metadata": {
78 | "collapsed": true
79 | },
80 | "outputs": [],
81 | "source": [
82 | "np.random.seed(42)\n",
83 | "train,test,y_train,y_test = train_test_split(df,y,test_size = 0.1)"
84 | ]
85 | },
86 | {
87 | "cell_type": "code",
88 | "execution_count": 14,
89 | "metadata": {
90 | "collapsed": true
91 | },
92 | "outputs": [],
93 | "source": [
94 | "def fit_predict(train,test,y_train,y_test,scaler = None):\n",
95 | " if scaler is None:\n",
96 | " lr = Ridge()\n",
97 | " lr.fit(train,y_train)\n",
98 | " y_pred = lr.predict(test)\n",
99 | " print('MSE score:', mean_squared_error(y_test,y_pred))\n",
100 | " else:\n",
101 | " train_scaled = scaler.fit_transform(train)\n",
102 | " test_scaled = scaler.transform(test)\n",
103 | " lr = Ridge()\n",
104 | " lr.fit(train_scaled,y_train)\n",
105 | " y_pred = lr.predict(test_scaled)\n",
106 | " print('MSE score:', mean_squared_error(y_test,y_pred))"
107 | ]
108 | },
109 | {
110 | "cell_type": "code",
111 | "execution_count": 15,
112 | "metadata": {},
113 | "outputs": [
114 | {
115 | "name": "stdout",
116 | "output_type": "stream",
117 | "text": [
118 | "MSE score: 0.57404414001\n"
119 | ]
120 | }
121 | ],
122 | "source": [
123 | "fit_predict(train,test,y_train,y_test)"
124 | ]
125 | },
126 | {
127 | "cell_type": "code",
128 | "execution_count": 16,
129 | "metadata": {},
130 | "outputs": [
131 | {
132 | "name": "stdout",
133 | "output_type": "stream",
134 | "text": [
135 | "MSE score: 0.567545067343\n"
136 | ]
137 | }
138 | ],
139 | "source": [
140 | "fit_predict(train,test,y_train,y_test,MinMaxScaler())"
141 | ]
142 | },
143 | {
144 | "cell_type": "code",
145 | "execution_count": 17,
146 | "metadata": {},
147 | "outputs": [
148 | {
149 | "name": "stdout",
150 | "output_type": "stream",
151 | "text": [
152 | "MSE score: 0.558144966334\n"
153 | ]
154 | }
155 | ],
156 | "source": [
157 | "fit_predict(train,test,y_train,y_test,StandardScaler())"
158 | ]
159 | },
160 | {
161 | "cell_type": "code",
162 | "execution_count": 18,
163 | "metadata": {},
164 | "outputs": [
165 | {
166 | "name": "stdout",
167 | "output_type": "stream",
168 | "text": [
169 | "MSE score: 0.55823299573\n"
170 | ]
171 | }
172 | ],
173 | "source": [
174 | "fit_predict(train,test,y_train,y_test,RobustScaler())"
175 | ]
176 | }
177 | ],
178 | "metadata": {
179 | "kernelspec": {
180 | "display_name": "Python 3",
181 | "language": "python",
182 | "name": "python3"
183 | },
184 | "language_info": {
185 | "codemirror_mode": {
186 | "name": "ipython",
187 | "version": 3
188 | },
189 | "file_extension": ".py",
190 | "mimetype": "text/x-python",
191 | "name": "python",
192 | "nbconvert_exporter": "python",
193 | "pygments_lexer": "ipython3",
194 | "version": "3.6.2"
195 | }
196 | },
197 | "nbformat": 4,
198 | "nbformat_minor": 2
199 | }
200 |
--------------------------------------------------------------------------------
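The markdown cell in Video 1.1 describes MinMaxScaler, StandardScaler, and RobustScaler; a minimal sketch comparing them on a single feature with an outlier (the toy array is illustrative, not course data):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

# One feature with an outlier (100.0), to show how each scaler reacts to it.
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

for scaler in (MinMaxScaler(), StandardScaler(), RobustScaler()):
    # fit_transform learns the statistics (min/max, mean/std, median/IQR)
    # on X and applies them in one step.
    print(scaler.__class__.__name__, scaler.fit_transform(X).ravel().round(2))
```

MinMaxScaler and StandardScaler squeeze the four regular points into a narrow band because the outlier dominates the range and the mean/std, while RobustScaler keeps them spread out since one outlier barely moves the median and IQR.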
/Section 1/Video 1.2 Implementing feature engineering with logistic regression.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {
7 | "collapsed": true
8 | },
9 | "outputs": [],
10 | "source": [
11 | "import pandas as pd\n",
12 | "import numpy as np\n",
13 | "from sklearn.metrics import accuracy_score\n",
14 | "from sklearn.linear_model import LogisticRegression\n",
15 | "from sklearn.model_selection import train_test_split\n",
16 | "import warnings\n",
17 | "warnings.filterwarnings(\"ignore\")\n",
18 | "\n",
19 | "np.random.seed(42)"
20 | ]
21 | },
22 | {
23 | "cell_type": "code",
24 | "execution_count": 2,
25 | "metadata": {
26 | "collapsed": true
27 | },
28 | "outputs": [],
29 | "source": [
30 | "df = pd.read_csv(\"http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv\", sep = ';')\n",
31 | "y = df.pop('quality')\n",
32 | "\n",
33 | "for i in df.columns:\n",
34 | " df[i] = df[i].fillna(np.mean(df[i]))\n",
35 | " \n",
36 | "train, test, y_train, y_test = train_test_split(df, y, test_size = 0.2) "
37 | ]
38 | },
39 | {
40 | "cell_type": "code",
41 | "execution_count": 3,
42 | "metadata": {},
43 | "outputs": [
44 | {
45 | "name": "stdout",
46 | "output_type": "stream",
47 | "text": [
48 | "Accuracy score baseline: 0.514285714286\n"
49 | ]
50 | }
51 | ],
52 | "source": [
53 | "lr = LogisticRegression()\n",
54 | "lr.fit(train, y_train)\n",
55 | "y_pred = lr.predict(test)\n",
56 | "print('Accuracy score baseline:', accuracy_score(y_test, y_pred))"
57 | ]
58 | },
59 | {
60 | "cell_type": "code",
61 | "execution_count": 6,
62 | "metadata": {},
63 | "outputs": [
64 | {
65 | "data": {
66 | "text/html": [
67 | "\n",
68 | "\n",
81 | "
\n",
82 | " \n",
83 | " \n",
84 | " | \n",
85 | " fixed acidity | \n",
86 | " volatile acidity | \n",
87 | " citric acid | \n",
88 | " residual sugar | \n",
89 | " chlorides | \n",
90 | " free sulfur dioxide | \n",
91 | " total sulfur dioxide | \n",
92 | " density | \n",
93 | " pH | \n",
94 | " sulphates | \n",
95 | " alcohol | \n",
96 | "
\n",
97 | " \n",
98 | " \n",
99 | " \n",
100 | " 4665 | \n",
101 | " 7.3 | \n",
102 | " 0.17 | \n",
103 | " 0.36 | \n",
104 | " 8.20 | \n",
105 | " 0.028 | \n",
106 | " 44.0 | \n",
107 | " 111.0 | \n",
108 | " 0.99272 | \n",
109 | " 3.14 | \n",
110 | " 0.41 | \n",
111 | " 12.4 | \n",
112 | "
\n",
113 | " \n",
114 | " 1943 | \n",
115 | " 6.3 | \n",
116 | " 0.25 | \n",
117 | " 0.44 | \n",
118 | " 11.60 | \n",
119 | " 0.041 | \n",
120 | " 48.0 | \n",
121 | " 195.0 | \n",
122 | " 0.99680 | \n",
123 | " 3.18 | \n",
124 | " 0.52 | \n",
125 | " 9.5 | \n",
126 | "
\n",
127 | " \n",
128 | " 3399 | \n",
129 | " 5.6 | \n",
130 | " 0.32 | \n",
131 | " 0.33 | \n",
132 | " 7.40 | \n",
133 | " 0.037 | \n",
134 | " 25.0 | \n",
135 | " 95.0 | \n",
136 | " 0.99268 | \n",
137 | " 3.25 | \n",
138 | " 0.49 | \n",
139 | " 11.1 | \n",
140 | "
\n",
141 | " \n",
142 | " 843 | \n",
143 | " 6.9 | \n",
144 | " 0.19 | \n",
145 | " 0.35 | \n",
146 | " 1.70 | \n",
147 | " 0.036 | \n",
148 | " 33.0 | \n",
149 | " 101.0 | \n",
150 | " 0.99315 | \n",
151 | " 3.21 | \n",
152 | " 0.54 | \n",
153 | " 10.8 | \n",
154 | "
\n",
155 | " \n",
156 | " 2580 | \n",
157 | " 7.7 | \n",
158 | " 0.30 | \n",
159 | " 0.26 | \n",
160 | " 18.95 | \n",
161 | " 0.053 | \n",
162 | " 36.0 | \n",
163 | " 174.0 | \n",
164 | " 0.99976 | \n",
165 | " 3.20 | \n",
166 | " 0.50 | \n",
167 | " 10.4 | \n",
168 | "
\n",
169 | " \n",
170 | "
\n",
171 | "
"
172 | ],
173 | "text/plain": [
174 | " fixed acidity volatile acidity citric acid residual sugar chlorides \\\n",
175 | "4665 7.3 0.17 0.36 8.20 0.028 \n",
176 | "1943 6.3 0.25 0.44 11.60 0.041 \n",
177 | "3399 5.6 0.32 0.33 7.40 0.037 \n",
178 | "843 6.9 0.19 0.35 1.70 0.036 \n",
179 | "2580 7.7 0.30 0.26 18.95 0.053 \n",
180 | "\n",
181 | " free sulfur dioxide total sulfur dioxide density pH sulphates \\\n",
182 | "4665 44.0 111.0 0.99272 3.14 0.41 \n",
183 | "1943 48.0 195.0 0.99680 3.18 0.52 \n",
184 | "3399 25.0 95.0 0.99268 3.25 0.49 \n",
185 | "843 33.0 101.0 0.99315 3.21 0.54 \n",
186 | "2580 36.0 174.0 0.99976 3.20 0.50 \n",
187 | "\n",
188 | " alcohol \n",
189 | "4665 12.4 \n",
190 | "1943 9.5 \n",
191 | "3399 11.1 \n",
192 | "843 10.8 \n",
193 | "2580 10.4 "
194 | ]
195 | },
196 | "execution_count": 6,
197 | "metadata": {},
198 | "output_type": "execute_result"
199 | }
200 | ],
201 | "source": [
202 | "train.head()"
203 | ]
204 | },
205 | {
206 | "cell_type": "code",
207 | "execution_count": 7,
208 | "metadata": {
209 | "collapsed": true
210 | },
211 | "outputs": [],
212 | "source": [
213 | "def feat_eng(df):\n",
214 | " df['eng1'] = df['fixed acidity'] * df['pH']\n",
215 | " df['eng2'] = df['total sulfur dioxide'] / df['free sulfur dioxide']\n",
216 | " df['eng3'] = df['sulphates'] / df['chlorides']\n",
217 | " df['eng4'] = df['chlorides'] / df['sulphates']\n",
218 | " return df\n",
219 | "\n",
220 | "train = feat_eng(train)\n",
221 | "test = feat_eng(test)"
222 | ]
223 | },
224 | {
225 | "cell_type": "code",
226 | "execution_count": 8,
227 | "metadata": {},
228 | "outputs": [
229 | {
230 | "data": {
231 | "text/html": [
232 | "\n",
233 | "\n",
246 | "
\n",
247 | " \n",
248 | " \n",
249 | " | \n",
250 | " fixed acidity | \n",
251 | " volatile acidity | \n",
252 | " citric acid | \n",
253 | " residual sugar | \n",
254 | " chlorides | \n",
255 | " free sulfur dioxide | \n",
256 | " total sulfur dioxide | \n",
257 | " density | \n",
258 | " pH | \n",
259 | " sulphates | \n",
260 | " alcohol | \n",
261 | " eng1 | \n",
262 | " eng2 | \n",
263 | " eng3 | \n",
264 | " eng4 | \n",
265 | "
\n",
266 | " \n",
267 | " \n",
268 | " \n",
269 | " 4665 | \n",
270 | " 7.3 | \n",
271 | " 0.17 | \n",
272 | " 0.36 | \n",
273 | " 8.20 | \n",
274 | " 0.028 | \n",
275 | " 44.0 | \n",
276 | " 111.0 | \n",
277 | " 0.99272 | \n",
278 | " 3.14 | \n",
279 | " 0.41 | \n",
280 | " 12.4 | \n",
281 | " 22.922 | \n",
282 | " 2.522727 | \n",
283 | " 14.642857 | \n",
284 | " 0.068293 | \n",
285 | "
\n",
286 | " \n",
287 | " 1943 | \n",
288 | " 6.3 | \n",
289 | " 0.25 | \n",
290 | " 0.44 | \n",
291 | " 11.60 | \n",
292 | " 0.041 | \n",
293 | " 48.0 | \n",
294 | " 195.0 | \n",
295 | " 0.99680 | \n",
296 | " 3.18 | \n",
297 | " 0.52 | \n",
298 | " 9.5 | \n",
299 | " 20.034 | \n",
300 | " 4.062500 | \n",
301 | " 12.682927 | \n",
302 | " 0.078846 | \n",
303 | "
\n",
304 | " \n",
305 | " 3399 | \n",
306 | " 5.6 | \n",
307 | " 0.32 | \n",
308 | " 0.33 | \n",
309 | " 7.40 | \n",
310 | " 0.037 | \n",
311 | " 25.0 | \n",
312 | " 95.0 | \n",
313 | " 0.99268 | \n",
314 | " 3.25 | \n",
315 | " 0.49 | \n",
316 | " 11.1 | \n",
317 | " 18.200 | \n",
318 | " 3.800000 | \n",
319 | " 13.243243 | \n",
320 | " 0.075510 | \n",
321 | "
\n",
322 | " \n",
323 | " 843 | \n",
324 | " 6.9 | \n",
325 | " 0.19 | \n",
326 | " 0.35 | \n",
327 | " 1.70 | \n",
328 | " 0.036 | \n",
329 | " 33.0 | \n",
330 | " 101.0 | \n",
331 | " 0.99315 | \n",
332 | " 3.21 | \n",
333 | " 0.54 | \n",
334 | " 10.8 | \n",
335 | " 22.149 | \n",
336 | " 3.060606 | \n",
337 | " 15.000000 | \n",
338 | " 0.066667 | \n",
339 | "
\n",
340 | " \n",
341 | " 2580 | \n",
342 | " 7.7 | \n",
343 | " 0.30 | \n",
344 | " 0.26 | \n",
345 | " 18.95 | \n",
346 | " 0.053 | \n",
347 | " 36.0 | \n",
348 | " 174.0 | \n",
349 | " 0.99976 | \n",
350 | " 3.20 | \n",
351 | " 0.50 | \n",
352 | " 10.4 | \n",
353 | " 24.640 | \n",
354 | " 4.833333 | \n",
355 | " 9.433962 | \n",
356 | " 0.106000 | \n",
357 | "
\n",
358 | " \n",
359 | "
\n",
360 | "
"
361 | ],
362 | "text/plain": [
363 | " fixed acidity volatile acidity citric acid residual sugar chlorides \\\n",
364 | "4665 7.3 0.17 0.36 8.20 0.028 \n",
365 | "1943 6.3 0.25 0.44 11.60 0.041 \n",
366 | "3399 5.6 0.32 0.33 7.40 0.037 \n",
367 | "843 6.9 0.19 0.35 1.70 0.036 \n",
368 | "2580 7.7 0.30 0.26 18.95 0.053 \n",
369 | "\n",
370 | " free sulfur dioxide total sulfur dioxide density pH sulphates \\\n",
371 | "4665 44.0 111.0 0.99272 3.14 0.41 \n",
372 | "1943 48.0 195.0 0.99680 3.18 0.52 \n",
373 | "3399 25.0 95.0 0.99268 3.25 0.49 \n",
374 | "843 33.0 101.0 0.99315 3.21 0.54 \n",
375 | "2580 36.0 174.0 0.99976 3.20 0.50 \n",
376 | "\n",
377 | " alcohol eng1 eng2 eng3 eng4 \n",
378 | "4665 12.4 22.922 2.522727 14.642857 0.068293 \n",
379 | "1943 9.5 20.034 4.062500 12.682927 0.078846 \n",
380 | "3399 11.1 18.200 3.800000 13.243243 0.075510 \n",
381 | "843 10.8 22.149 3.060606 15.000000 0.066667 \n",
382 | "2580 10.4 24.640 4.833333 9.433962 0.106000 "
383 | ]
384 | },
385 | "execution_count": 8,
386 | "metadata": {},
387 | "output_type": "execute_result"
388 | }
389 | ],
390 | "source": [
391 | "train.head()"
392 | ]
393 | },
394 | {
395 | "cell_type": "code",
396 | "execution_count": 9,
397 | "metadata": {},
398 | "outputs": [
399 | {
400 | "name": "stdout",
401 | "output_type": "stream",
402 | "text": [
403 | "Accuracy score feat eng: 0.523469387755\n"
404 | ]
405 | }
406 | ],
407 | "source": [
408 | "lr = LogisticRegression()\n",
409 | "lr.fit(train, y_train)\n",
410 | "y_pred = lr.predict(test)\n",
411 | "print('Accuracy score feat eng:', accuracy_score(y_test, y_pred))"
412 | ]
413 | },
414 | {
415 | "cell_type": "code",
416 | "execution_count": null,
417 | "metadata": {
418 | "collapsed": true
419 | },
420 | "outputs": [],
421 | "source": []
422 | }
423 | ],
424 | "metadata": {
425 | "kernelspec": {
426 | "display_name": "Python 3",
427 | "language": "python",
428 | "name": "python3"
429 | },
430 | "language_info": {
431 | "codemirror_mode": {
432 | "name": "ipython",
433 | "version": 3
434 | },
435 | "file_extension": ".py",
436 | "mimetype": "text/x-python",
437 | "name": "python",
438 | "nbconvert_exporter": "python",
439 | "pygments_lexer": "ipython3",
440 | "version": "3.6.2"
441 | }
442 | },
443 | "nbformat": 4,
444 | "nbformat_minor": 2
445 | }
446 |
--------------------------------------------------------------------------------
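Video 1.2 applies `feat_eng` to `train` and `test` by hand. A sketch of the same four engineered ratios wrapped in a `FunctionTransformer`, so the transformation and the classifier can be fitted as a single `Pipeline`; this is an alternative composition I'm adding, not code from the course:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

def feat_eng(df):
    df = df.copy()  # work on a copy so the caller's frame is untouched
    df['eng1'] = df['fixed acidity'] * df['pH']
    df['eng2'] = df['total sulfur dioxide'] / df['free sulfur dioxide']
    df['eng3'] = df['sulphates'] / df['chlorides']
    df['eng4'] = df['chlorides'] / df['sulphates']
    return df

pipe = Pipeline([
    ('features', FunctionTransformer(feat_eng, validate=False)),  # keep the DataFrame intact
    ('clf', LogisticRegression()),
])
# pipe.fit(train, y_train) followed by pipe.predict(test) reproduces the manual steps.
```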
/Section 1/Video 1.3 Extracting data with feature selection and interaction.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {
7 | "collapsed": true
8 | },
9 | "outputs": [],
10 | "source": [
11 | "import pandas as pd\n",
12 | "import numpy as np\n",
13 | "from sklearn.model_selection import train_test_split\n",
14 | "import warnings\n",
15 | "from sklearn.metrics import mean_absolute_error\n",
16 | "from sklearn.linear_model import Lasso,Ridge\n",
17 | "from sklearn.preprocessing import PolynomialFeatures\n",
18 | "warnings.filterwarnings(\"ignore\")\n",
19 | "np.random.seed(42)"
20 | ]
21 | },
22 | {
23 | "cell_type": "code",
24 | "execution_count": 2,
25 | "metadata": {
26 | "collapsed": true
27 | },
28 | "outputs": [],
29 | "source": [
30 | "df = pd.read_csv(\"http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv\", sep = ';')\n",
31 | "y = df.pop('quality')\n",
32 | "\n",
33 | "for i in df.columns:\n",
34 | " df[i] = df[i].fillna(np.mean(df[i]))\n",
35 | " \n",
36 | "train, test, y_train, y_test = train_test_split(df, y, test_size = 0.2) "
37 | ]
38 | },
39 | {
40 | "cell_type": "code",
41 | "execution_count": 3,
42 | "metadata": {},
43 | "outputs": [
44 | {
45 | "name": "stdout",
46 | "output_type": "stream",
47 | "text": [
48 | "MAE score: 0.59404938801\n"
49 | ]
50 | }
51 | ],
52 | "source": [
53 | "def fit_predict(train, test, y_train, y_test, scaler = None):\n",
54 | " if scaler is None:\n",
55 | " lr = Ridge()\n",
56 | " lr.fit(train, y_train)\n",
57 | " y_pred = lr.predict(test)\n",
58 | " print('MAE score:', mean_absolute_error(y_test, y_pred))\n",
59 | " else:\n",
60 | " train_scaled = scaler.fit_transform(train)\n",
61 | " test_scaled = scaler.transform(test)\n",
62 | " lr = Ridge()\n",
63 | " lr.fit(train_scaled, y_train)\n",
64 | " y_pred = lr.predict(test_scaled)\n",
65 | " print('MAE score:', mean_absolute_error(y_test, y_pred))\n",
66 | "\n",
67 | "fit_predict(train,test,y_train,y_test)"
68 | ]
69 | },
70 | {
71 | "cell_type": "code",
72 | "execution_count": 4,
73 | "metadata": {},
74 | "outputs": [
75 | {
76 | "name": "stdout",
77 | "output_type": "stream",
78 | "text": [
79 | "non zero features: 6\n"
80 | ]
81 | }
82 | ],
83 | "source": [
84 | "def get_feat_imp(train,y_train,alpha=0.01):\n",
85 | " lr = Lasso(alpha=alpha)\n",
86 | " lr.fit(train,y_train)\n",
87 | " return lr.coef_\n",
88 | "fi = get_feat_imp(train,y_train)\n",
89 | "print('non zero features:',np.sum(fi != 0))"
90 | ]
91 | },
92 | {
93 | "cell_type": "code",
94 | "execution_count": 5,
95 | "metadata": {
96 | "collapsed": true
97 | },
98 | "outputs": [],
99 | "source": [
100 | "bestf = np.argwhere(fi)\n",
101 | "train_best = train.iloc[:, [x[0] for x in bestf.tolist()]]\n",
102 | "test_best = test.iloc[:, [x[0] for x in bestf.tolist()]]"
103 | ]
104 | },
105 | {
106 | "cell_type": "code",
107 | "execution_count": 6,
108 | "metadata": {
109 | "collapsed": true
110 | },
111 | "outputs": [],
112 | "source": [
113 | "def create_poly(train,test,degree):\n",
114 | " poly = PolynomialFeatures(degree=degree)\n",
115 | " train_poly = poly.fit_transform(train)\n",
116 | " test_poly = poly.fit_transform(test)\n",
117 | " return train_poly,test_poly"
118 | ]
119 | },
120 | {
121 | "cell_type": "code",
122 | "execution_count": 7,
123 | "metadata": {},
124 | "outputs": [
125 | {
126 | "name": "stdout",
127 | "output_type": "stream",
128 | "text": [
129 | "No feature selection degree 1\n",
130 | "MAE score: 0.59404938801\n",
131 | "----------\n",
132 | "No feature selection degree 2\n",
133 | "MAE score: 0.577238983011\n",
134 | "----------\n",
135 | "No feature selection degree 3\n",
136 | "MAE score: 0.596958634563\n",
137 | "----------\n"
138 | ]
139 | }
140 | ],
141 | "source": [
142 | "for degree in [1,2,3]:\n",
143 | " train_poly,test_poly = create_poly(train,test,degree)\n",
144 | " print('No feature selection degree',degree)\n",
145 | " fit_predict(train_poly,test_poly,y_train,y_test)\n",
146 | " print(10*'-')"
147 | ]
148 | },
149 | {
150 | "cell_type": "code",
151 | "execution_count": 8,
152 | "metadata": {},
153 | "outputs": [
154 | {
155 | "name": "stdout",
156 | "output_type": "stream",
157 | "text": [
158 | "Feature selection degree 1\n",
159 | "MAE score: 0.597972321004\n",
160 | "----------\n",
161 | "Feature selection degree 2\n",
162 | "MAE score: 0.591541808012\n",
163 | "----------\n",
164 | "Feature selection degree 3\n",
165 | "MAE score: 0.597630769778\n",
166 | "----------\n"
167 | ]
168 | }
169 | ],
170 | "source": [
171 | "for degree in [1,2,3]:\n",
172 | " train_poly,test_poly = create_poly(train_best,test_best,degree)\n",
173 | " print('Feature selection degree',degree)\n",
174 | " fit_predict(train_poly,test_poly,y_train,y_test)\n",
175 | " print(10*'-')"
176 | ]
177 | }
178 | ],
179 | "metadata": {
180 | "kernelspec": {
181 | "display_name": "Python 3",
182 | "language": "python",
183 | "name": "python3"
184 | },
185 | "language_info": {
186 | "codemirror_mode": {
187 | "name": "ipython",
188 | "version": 3
189 | },
190 | "file_extension": ".py",
191 | "mimetype": "text/x-python",
192 | "name": "python",
193 | "nbconvert_exporter": "python",
194 | "pygments_lexer": "ipython3",
195 | "version": "3.6.2"
196 | }
197 | },
198 | "nbformat": 4,
199 | "nbformat_minor": 2
200 | }
201 |
--------------------------------------------------------------------------------
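Video 1.3 keeps the features with non-zero Lasso coefficients via `np.argwhere`; scikit-learn's `SelectFromModel` packages the same rule. A minimal sketch, assuming the notebook's `train`/`test`/`y_train` objects:

```python
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# With L1 regularization most coefficients become exactly zero, so a tiny
# threshold keeps precisely the features the notebook selects by hand.
selector = SelectFromModel(Lasso(alpha=0.01), threshold=1e-10)
train_best = selector.fit_transform(train, y_train)
test_best = selector.transform(test)
print('non zero features:', train_best.shape[1])
```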
/Section 1/video 1.4 Combining all together.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {
7 | "collapsed": true
8 | },
9 | "outputs": [],
10 | "source": [
11 | "import pandas as pd\n",
12 | "import numpy as np\n",
13 | "from sklearn.model_selection import train_test_split\n",
14 | "import warnings\n",
15 | "from sklearn.metrics import mean_absolute_error\n",
16 | "from sklearn.linear_model import Lasso,Ridge\n",
17 | "from sklearn.preprocessing import PolynomialFeatures\n",
18 | "from sklearn.preprocessing import StandardScaler\n",
19 | "warnings.filterwarnings(\"ignore\")\n",
20 | "np.random.seed(42)"
21 | ]
22 | },
23 | {
24 | "cell_type": "code",
25 | "execution_count": 2,
26 | "metadata": {
27 | "collapsed": true
28 | },
29 | "outputs": [],
30 | "source": [
31 | "df = pd.read_csv(\"http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv\", sep = ';')\n",
32 | "y = df.pop('quality')\n",
33 | "\n",
34 | "for i in df.columns:\n",
35 | " df[i] = df[i].fillna(np.mean(df[i]))\n",
36 | " \n",
37 | "train, test, y_train, y_test = train_test_split(df, y, test_size = 0.2) "
38 | ]
39 | },
40 | {
41 | "cell_type": "code",
42 | "execution_count": 3,
43 | "metadata": {},
44 | "outputs": [
45 | {
46 | "name": "stdout",
47 | "output_type": "stream",
48 | "text": [
49 | "MAE score: 0.59404938801\n"
50 | ]
51 | }
52 | ],
53 | "source": [
54 | "def fit_predict(train, test, y_train, y_test, scaler = None):\n",
55 | " if scaler is None:\n",
56 | " lr = Ridge()\n",
57 | " lr.fit(train, y_train)\n",
58 | " y_pred = lr.predict(test)\n",
59 | " print('MAE score:', mean_absolute_error(y_test, y_pred))\n",
60 | " else:\n",
61 | " train_scaled = scaler.fit_transform(train)\n",
62 | " test_scaled = scaler.transform(test)\n",
63 | " lr = Ridge()\n",
64 | " lr.fit(train_scaled, y_train)\n",
65 | " y_pred = lr.predict(test_scaled)\n",
66 | " print('MAE score:', mean_absolute_error(y_test, y_pred))\n",
67 | "\n",
68 | "fit_predict(train,test,y_train,y_test)"
69 | ]
70 | },
71 | {
72 | "cell_type": "code",
73 | "execution_count": 4,
74 | "metadata": {
75 | "collapsed": true
76 | },
77 | "outputs": [],
78 | "source": [
79 | "def feat_eng(df):\n",
80 | " df['eng1'] = df['fixed acidity'] * df['pH']\n",
81 | " df['eng2'] = df['total sulfur dioxide'] / df['free sulfur dioxide']\n",
82 | " df['eng3'] = df['sulphates'] / df['chlorides']\n",
83 | " df['eng4'] = df['chlorides'] / df['sulphates']\n",
84 | " return df\n",
85 | "\n",
86 | "train = feat_eng(train)\n",
87 | "test = feat_eng(test)"
88 | ]
89 | },
90 | {
91 | "cell_type": "code",
92 | "execution_count": 5,
93 | "metadata": {},
94 | "outputs": [
95 | {
96 | "name": "stdout",
97 | "output_type": "stream",
98 | "text": [
99 | "non zero features: 8\n"
100 | ]
101 | }
102 | ],
103 | "source": [
104 | "def get_feat_imp(train,y_train,alpha=0.01):\n",
105 | " lr = Lasso(alpha=alpha)\n",
106 | " lr.fit(train,y_train)\n",
107 | " return lr.coef_\n",
108 | "fi = get_feat_imp(train,y_train)\n",
109 | "print('non zero features:',np.sum(fi != 0))"
110 | ]
111 | },
112 | {
113 | "cell_type": "code",
114 | "execution_count": 6,
115 | "metadata": {
116 | "collapsed": true
117 | },
118 | "outputs": [],
119 | "source": [
120 | "bestf = np.argwhere(fi)\n",
121 | "train_best = train.iloc[:, [x[0] for x in bestf.tolist()]]\n",
122 | "test_best = test.iloc[:, [x[0] for x in bestf.tolist()]]"
123 | ]
124 | },
125 | {
126 | "cell_type": "code",
127 | "execution_count": 7,
128 | "metadata": {
129 | "collapsed": true
130 | },
131 | "outputs": [],
132 | "source": [
133 | "def create_poly(train,test,degree):\n",
134 | " poly = PolynomialFeatures(degree=degree)\n",
135 | " train_poly = poly.fit_transform(train)\n",
136 | " test_poly = poly.fit_transform(test)\n",
137 | " return train_poly,test_poly"
138 | ]
139 | },
140 | {
141 | "cell_type": "code",
142 | "execution_count": 8,
143 | "metadata": {},
144 | "outputs": [
145 | {
146 | "name": "stdout",
147 | "output_type": "stream",
148 | "text": [
149 | "No feature selection degree 1\n",
150 | "MAE score: 0.579732117236\n",
151 | "----------\n",
152 | "No feature selection degree 2\n",
153 | "MAE score: 0.566008983832\n",
154 | "----------\n",
155 | "No feature selection degree 3\n",
156 | "MAE score: 0.557362504716\n",
157 | "----------\n",
158 | "No feature selection degree 4\n",
159 | "MAE score: 0.569031127402\n",
160 | "----------\n"
161 | ]
162 | }
163 | ],
164 | "source": [
165 | "for degree in [1,2,3,4]:\n",
166 | " train_poly,test_poly = create_poly(train,test,degree)\n",
167 | " print('No feature selection degree',degree)\n",
168 | " fit_predict(train_poly,test_poly,y_train,y_test,StandardScaler())\n",
169 | " print(10*'-')"
170 | ]
171 | },
172 | {
173 | "cell_type": "code",
174 | "execution_count": 9,
175 | "metadata": {},
176 | "outputs": [
177 | {
178 | "name": "stdout",
179 | "output_type": "stream",
180 | "text": [
181 | "Feature selection degree 1\n",
182 | "MAE score: 0.586503532959\n",
183 | "----------\n",
184 | "Feature selection degree 2\n",
185 | "MAE score: 0.575863518226\n",
186 | "----------\n",
187 | "Feature selection degree 3\n",
188 | "MAE score: 0.571050244983\n",
189 | "----------\n",
190 | "Feature selection degree 4\n",
191 | "MAE score: 0.626462712407\n",
192 | "----------\n"
193 | ]
194 | }
195 | ],
196 | "source": [
197 | "for degree in [1,2,3,4]:\n",
198 | " train_poly,test_poly = create_poly(train_best,test_best,degree)\n",
199 | " print('Feature selection degree',degree)\n",
200 | " fit_predict(train_poly,test_poly,y_train,y_test,StandardScaler())\n",
201 | " print(10*'-')"
202 | ]
203 | },
204 | {
205 | "cell_type": "code",
206 | "execution_count": 10,
207 | "metadata": {},
208 | "outputs": [
209 | {
210 | "name": "stdout",
211 | "output_type": "stream",
212 | "text": [
213 | "overall improvement is 6.18 %\n"
214 | ]
215 | }
216 | ],
217 | "source": [
218 | "original_score = 0.59404938801\n",
219 | "best_score = 0.557362504716\n",
220 | "improvement = np.round(100*(original_score - best_score)/original_score,2)\n",
221 | "print('overall improvement is {} %'.format(improvement))"
222 | ]
223 | }
224 | ],
225 | "metadata": {
226 | "kernelspec": {
227 | "display_name": "Python 3",
228 | "language": "python",
229 | "name": "python3"
230 | },
231 | "language_info": {
232 | "codemirror_mode": {
233 | "name": "ipython",
234 | "version": 3
235 | },
236 | "file_extension": ".py",
237 | "mimetype": "text/x-python",
238 | "name": "python",
239 | "nbconvert_exporter": "python",
240 | "pygments_lexer": "ipython3",
241 | "version": "3.6.2"
242 | }
243 | },
244 | "nbformat": 4,
245 | "nbformat_minor": 2
246 | }
247 |
--------------------------------------------------------------------------------
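Video 1.4's best setting (degree-3 polynomial features followed by standard scaling and Ridge) can be expressed as one estimator. A sketch using a `Pipeline`, equivalent in spirit to calling `create_poly` and then `fit_predict(..., StandardScaler())`:

```python
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

pipe = Pipeline([
    ('poly', PolynomialFeatures(degree=3)),  # expand features and interactions
    ('scale', StandardScaler()),             # statistics learned on train only
    ('ridge', Ridge()),
])
pipe.fit(train, y_train)
print('MAE score:', mean_absolute_error(y_test, pipe.predict(test)))
```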
/Section 1/video 1.5 Build model based on real-world problem.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 17,
6 | "metadata": {
7 | "collapsed": true
8 | },
9 | "outputs": [],
10 | "source": [
11 | "import pandas as pd\n",
12 | "import numpy as np\n",
13 | "from sklearn.model_selection import train_test_split\n",
14 | "import warnings\n",
15 | "from sklearn.metrics import mean_absolute_error\n",
16 | "from sklearn.linear_model import Lasso,Ridge\n",
17 | "from sklearn.preprocessing import PolynomialFeatures\n",
18 | "from sklearn.preprocessing import StandardScaler\n",
19 | "warnings.filterwarnings(\"ignore\")\n",
20 | "np.random.seed(42)"
21 | ]
22 | },
23 | {
24 | "cell_type": "markdown",
25 | "metadata": {},
26 | "source": [
27 | "Data Set Information:\n",
28 | "\n",
29 | "Bike sharing systems are new generation of traditional bike rentals where whole process from membership, rental and return back has become automatic. Through these systems, user is able to easily rent a bike from a particular position and return back at another position. Currently, there are about over 500 bike-sharing programs around the world which is composed of over 500 thousands bicycles. Today, there exists great interest in these systems due to their important role in traffic, environmental and health issues. \n",
30 | "\n",
31 | "Apart from interesting real world applications of bike sharing systems, the characteristics of data being generated by these systems make them attractive for the research. Opposed to other transport services such as bus or subway, the duration of travel, departure and arrival position is explicitly recorded in these systems. This feature turns bike sharing system into a virtual sensor network that can be used for sensing mobility in the city. Hence, it is expected that most of important events in the city could be detected via monitoring these data.\n",
32 | "\n",
33 | "cnt: count of total rental bikes including both casual and registered"
34 | ]
35 | },
36 | {
37 | "cell_type": "code",
38 | "execution_count": 18,
39 | "metadata": {},
40 | "outputs": [
41 | {
42 | "data": {
43 | "text/html": [
44 | "\n",
45 | "\n",
58 | "
\n",
59 | " \n",
60 | " \n",
61 | " | \n",
62 | " season | \n",
63 | " yr | \n",
64 | " mnth | \n",
65 | " hr | \n",
66 | " holiday | \n",
67 | " weekday | \n",
68 | " workingday | \n",
69 | " weathersit | \n",
70 | " temp | \n",
71 | " atemp | \n",
72 | " hum | \n",
73 | " windspeed | \n",
74 | "
\n",
75 | " \n",
76 | " \n",
77 | " \n",
78 | " 0 | \n",
79 | " 1 | \n",
80 | " 0 | \n",
81 | " 1 | \n",
82 | " 0 | \n",
83 | " 0 | \n",
84 | " 6 | \n",
85 | " 0 | \n",
86 | " 1 | \n",
87 | " 0.24 | \n",
88 | " 0.2879 | \n",
89 | " 0.81 | \n",
90 | " 0.0 | \n",
91 | "
\n",
92 | " \n",
93 | " 1 | \n",
94 | " 1 | \n",
95 | " 0 | \n",
96 | " 1 | \n",
97 | " 1 | \n",
98 | " 0 | \n",
99 | " 6 | \n",
100 | " 0 | \n",
101 | " 1 | \n",
102 | " 0.22 | \n",
103 | " 0.2727 | \n",
104 | " 0.80 | \n",
105 | " 0.0 | \n",
106 | "
\n",
107 | " \n",
108 | " 2 | \n",
109 | " 1 | \n",
110 | " 0 | \n",
111 | " 1 | \n",
112 | " 2 | \n",
113 | " 0 | \n",
114 | " 6 | \n",
115 | " 0 | \n",
116 | " 1 | \n",
117 | " 0.22 | \n",
118 | " 0.2727 | \n",
119 | " 0.80 | \n",
120 | " 0.0 | \n",
121 | "
\n",
122 | " \n",
123 | " 3 | \n",
124 | " 1 | \n",
125 | " 0 | \n",
126 | " 1 | \n",
127 | " 3 | \n",
128 | " 0 | \n",
129 | " 6 | \n",
130 | " 0 | \n",
131 | " 1 | \n",
132 | " 0.24 | \n",
133 | " 0.2879 | \n",
134 | " 0.75 | \n",
135 | " 0.0 | \n",
136 | "
\n",
137 | " \n",
138 | " 4 | \n",
139 | " 1 | \n",
140 | " 0 | \n",
141 | " 1 | \n",
142 | " 4 | \n",
143 | " 0 | \n",
144 | " 6 | \n",
145 | " 0 | \n",
146 | " 1 | \n",
147 | " 0.24 | \n",
148 | " 0.2879 | \n",
149 | " 0.75 | \n",
150 | " 0.0 | \n",
151 | "
\n",
152 | " \n",
153 | "
\n",
154 | "
"
155 | ],
156 | "text/plain": [
157 | " season yr mnth hr holiday weekday workingday weathersit temp \\\n",
158 | "0 1 0 1 0 0 6 0 1 0.24 \n",
159 | "1 1 0 1 1 0 6 0 1 0.22 \n",
160 | "2 1 0 1 2 0 6 0 1 0.22 \n",
161 | "3 1 0 1 3 0 6 0 1 0.24 \n",
162 | "4 1 0 1 4 0 6 0 1 0.24 \n",
163 | "\n",
164 | " atemp hum windspeed \n",
165 | "0 0.2879 0.81 0.0 \n",
166 | "1 0.2727 0.80 0.0 \n",
167 | "2 0.2727 0.80 0.0 \n",
168 | "3 0.2879 0.75 0.0 \n",
169 | "4 0.2879 0.75 0.0 "
170 | ]
171 | },
172 | "execution_count": 18,
173 | "metadata": {},
174 | "output_type": "execute_result"
175 | }
176 | ],
177 | "source": [
178 | "df = pd.read_csv('hour.csv')\n",
179 | "y = df.pop('cnt')\n",
180 | "df.drop(['instant', 'casual', 'dteday', 'registered'], axis = 1, inplace = True)\n",
181 | "df.head()"
182 | ]
183 | },
184 | {
185 | "cell_type": "code",
186 | "execution_count": 3,
187 | "metadata": {
188 | "collapsed": true
189 | },
190 | "outputs": [],
191 | "source": [
192 | "for i in df.columns:\n",
193 | " df[i] = df[i].fillna(np.mean(df[i]))\n",
194 | " \n",
195 | "train, test, y_train, y_test = train_test_split(df, y, test_size = 0.2) "
196 | ]
197 | },
198 | {
199 | "cell_type": "code",
200 | "execution_count": 4,
201 | "metadata": {
202 | "collapsed": true
203 | },
204 | "outputs": [],
205 | "source": [
206 | "def fit_predict(train, test, y_train, y_test, scaler = None):\n",
207 | " if scaler is None:\n",
208 | " lr = Ridge()\n",
209 | " lr.fit(train, y_train)\n",
210 | " y_pred = lr.predict(test)\n",
211 | " print('MAE score:', mean_absolute_error(y_test, y_pred))\n",
212 | " else:\n",
213 | " train_scaled = scaler.fit_transform(train)\n",
214 | " test_scaled = scaler.transform(test)\n",
215 | " lr = Ridge()\n",
216 | " lr.fit(train_scaled, y_train)\n",
217 | " y_pred = lr.predict(test_scaled)\n",
218 | " print('MAE score:', mean_absolute_error(y_test, y_pred))"
219 | ]
220 | },
221 | {
222 | "cell_type": "code",
223 | "execution_count": 5,
224 | "metadata": {},
225 | "outputs": [
226 | {
227 | "name": "stdout",
228 | "output_type": "stream",
229 | "text": [
230 | "Baseline MAE score: 104.802725573\n"
231 | ]
232 | }
233 | ],
234 | "source": [
235 | "print('Baseline', end = ' ')\n",
236 | "fit_predict(train, test, y_train, y_test)"
237 | ]
238 | },
239 | {
240 | "cell_type": "code",
241 | "execution_count": 6,
242 | "metadata": {
243 | "collapsed": true
244 | },
245 | "outputs": [],
246 | "source": [
247 | "def feat_eng(df):\n",
248 | " df['eng1'] = df['hum'] / df['temp']\n",
249 | " df['eng2'] = df['windspeed'] * df['hum']\n",
250 | " df['eng3'] = df['temp'] * df['hum']\n",
251 | " df['eng4'] = df['temp'] * df['atemp']\n",
252 | " return df\n",
253 | "\n",
254 | "train = feat_eng(train)\n",
255 | "test = feat_eng(test)"
256 | ]
257 | },
258 | {
259 | "cell_type": "code",
260 | "execution_count": 7,
261 | "metadata": {},
262 | "outputs": [
263 | {
264 | "name": "stdout",
265 | "output_type": "stream",
266 | "text": [
267 | "number of features is 16\n",
268 | "number of non zero features: 16\n"
269 | ]
270 | }
271 | ],
272 | "source": [
273 | "def get_feat_imp(train,y_train,alpha=0.01):\n",
274 | " lr = Lasso(alpha=alpha)\n",
275 | " lr.fit(train,y_train)\n",
276 | " return lr.coef_\n",
277 | "fi = get_feat_imp(train,y_train)\n",
278 | "print('number of features is {}'.format(train.shape[1]))\n",
279 | "print('number of non zero features:',np.sum(fi != 0))"
280 | ]
281 | },
282 | {
283 | "cell_type": "code",
284 | "execution_count": 8,
285 | "metadata": {
286 | "collapsed": true
287 | },
288 | "outputs": [],
289 | "source": [
290 | "def create_poly(train,test,degree):\n",
291 | " poly = PolynomialFeatures(degree=degree)\n",
292 | " train_poly = poly.fit_transform(train)\n",
293 | " test_poly = poly.fit_transform(test)\n",
294 | " return train_poly,test_poly"
295 | ]
296 | },
297 | {
298 | "cell_type": "code",
299 | "execution_count": 11,
300 | "metadata": {},
301 | "outputs": [
302 | {
303 | "name": "stdout",
304 | "output_type": "stream",
305 | "text": [
306 | "No feature selection degree 1\n",
307 | "MAE score: 103.454477177\n",
308 | "----------\n",
309 | "No feature selection degree 2\n",
310 | "MAE score: 89.5565130706\n",
311 | "----------\n",
312 | "No feature selection degree 3\n",
313 | "MAE score: 77.537496916\n",
314 | "----------\n",
315 | "No feature selection degree 4\n",
316 | "MAE score: 71.9204017223\n",
317 | "----------\n",
318 | "No feature selection degree 5\n",
319 | "MAE score: 345.019388746\n",
320 | "----------\n"
321 | ]
322 | }
323 | ],
324 | "source": [
325 | "for degree in [1,2,3,4,5]:\n",
326 | " train_poly,test_poly = create_poly(train,test,degree)\n",
327 | " print('No feature selection degree',degree)\n",
328 | " fit_predict(train_poly,test_poly,y_train,y_test)\n",
329 | " print(10*'-')"
330 | ]
331 | },
332 | {
333 | "cell_type": "code",
334 | "execution_count": 10,
335 | "metadata": {},
336 | "outputs": [
337 | {
338 | "name": "stdout",
339 | "output_type": "stream",
340 | "text": [
341 | "overall improvement is 31.38 %\n"
342 | ]
343 | }
344 | ],
345 | "source": [
346 | "original_score = 104.802725573\n",
347 | "best_score = 71.9204017223\n",
348 | "improvement = np.round(100*(original_score - best_score)/original_score,2)\n",
349 | "print('overall improvement is {} %'.format(improvement))"
350 | ]
351 | }
352 | ],
353 | "metadata": {
354 | "kernelspec": {
355 | "display_name": "Python 3",
356 | "language": "python",
357 | "name": "python3"
358 | },
359 | "language_info": {
360 | "codemirror_mode": {
361 | "name": "ipython",
362 | "version": 3
363 | },
364 | "file_extension": ".py",
365 | "mimetype": "text/x-python",
366 | "name": "python",
367 | "nbconvert_exporter": "python",
368 | "pygments_lexer": "ipython3",
369 | "version": "3.6.2"
370 | }
371 | },
372 | "nbformat": 4,
373 | "nbformat_minor": 2
374 | }
375 |
--------------------------------------------------------------------------------
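Video 1.5 reads `hour.csv` from the working directory, but the file is not shipped with the repository. A sketch for fetching it; the URL is my assumption of where the UCI Bike Sharing dataset archive lives, so verify it before relying on it:

```python
import io
import urllib.request
import zipfile

import pandas as pd

# Assumed location of the UCI Bike Sharing dataset archive.
URL = ('https://archive.ics.uci.edu/ml/machine-learning-databases/'
       '00275/Bike-Sharing-Dataset.zip')

with urllib.request.urlopen(URL) as resp:
    archive = zipfile.ZipFile(io.BytesIO(resp.read()))
archive.extract('hour.csv')  # writes hour.csv next to the notebook
df = pd.read_csv('hour.csv')
print(df.shape)  # the hourly file should have 17 columns
```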
/Section 2/video 2.1 Support Vector Machines.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {
7 | "collapsed": true
8 | },
9 | "outputs": [],
10 | "source": [
11 | "import pandas as pd\n",
12 | "import numpy as np\n",
13 | "import matplotlib.pyplot as plt\n",
14 | "%matplotlib inline\n",
15 | "from sklearn.preprocessing import StandardScaler\n",
16 | "from sklearn.svm import SVC\n",
17 | "from sklearn.linear_model import LogisticRegression\n",
18 | "from sklearn.model_selection import train_test_split\n",
19 | "from sklearn.metrics import accuracy_score\n",
20 | "import warnings\n",
21 | "warnings.filterwarnings(\"ignore\")\n",
22 | "np.random.seed(42)"
23 | ]
24 | },
25 | {
26 | "cell_type": "code",
27 | "execution_count": 2,
28 | "metadata": {
29 | "collapsed": true
30 | },
31 | "outputs": [],
32 | "source": [
33 | "df = pd.read_csv(\"http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv\", sep = ';')\n",
34 | "y = df.pop('quality')"
35 | ]
36 | },
37 | {
38 | "cell_type": "code",
39 | "execution_count": 3,
40 | "metadata": {
41 | "collapsed": true
42 | },
43 | "outputs": [],
44 | "source": [
45 | "for i in df.columns:\n",
46 | " df[i] = df[i].fillna(np.mean(df[i]))\n",
47 | "train, test, y_train, y_test = train_test_split(df, y, test_size = 0.2)"
48 | ]
49 | },
50 | {
51 | "cell_type": "code",
52 | "execution_count": 5,
53 | "metadata": {},
54 | "outputs": [
55 | {
56 | "name": "stdout",
57 | "output_type": "stream",
58 | "text": [
59 | "Accuracy score baseline: 0.514285714286\n"
60 | ]
61 | }
62 | ],
63 | "source": [
64 | "lr = LogisticRegression()\n",
65 | "lr.fit(train, y_train)\n",
66 | "y_pred = lr.predict(test)\n",
67 | "print('Accuracy score baseline:', accuracy_score(y_test, y_pred))"
68 | ]
69 | },
70 | {
71 | "cell_type": "code",
72 | "execution_count": 6,
73 | "metadata": {
74 | "collapsed": true
75 | },
76 | "outputs": [],
77 | "source": [
78 | "def fit_predict(train, test, y_train, y_test, scaler, kernel = 'linear', C = 1.0, degree = 3):\n",
79 | " train_scaled = scaler.fit_transform(train)\n",
80 | " test_scaled = scaler.transform(test) \n",
81 | " lr = SVC(kernel = kernel, degree = degree, C = C)\n",
82 | " lr.fit(train_scaled, y_train)\n",
83 | " y_pred = lr.predict(test_scaled)\n",
84 | " print(accuracy_score(y_test, y_pred))"
85 | ]
86 | },
87 | {
88 | "cell_type": "raw",
89 | "metadata": {},
90 | "source": [
91 | "def fit_predict(train, test, y_train, y_test, scaler):\n",
92 | " train_scaled = scaler.fit_transform(train)\n",
93 | " test_scaled = scaler.transform(test) \n",
94 | " lr = \n",
95 | " lr.fit(train_scaled, y_train)\n",
96 | " y_pred = lr.predict(test_scaled)\n",
97 | " print(accuracy_score(y_test, y_pred))"
98 | ]
99 | },
100 | {
101 | "cell_type": "markdown",
102 | "metadata": {},
103 | "source": [
104 | "### Kernel tuning"
105 | ]
106 | },
107 | {
108 | "cell_type": "code",
109 | "execution_count": 7,
110 | "metadata": {},
111 | "outputs": [
112 | {
113 | "name": "stdout",
114 | "output_type": "stream",
115 | "text": [
116 | "Accuracy score using linear kernel: 0.509183673469\n",
117 | "Accuracy score using poly kernel: 0.525510204082\n",
118 | "Accuracy score using rbf kernel: 0.561224489796\n",
119 | "Accuracy score using sigmoid kernel: 0.404081632653\n"
120 | ]
121 | }
122 | ],
123 | "source": [
124 | "for kernel in ['linear', 'poly', 'rbf', 'sigmoid']:\n",
125 | " print('Accuracy score using {0} kernel:'.format(kernel), end = ' ')\n",
126 | " fit_predict(train, test, y_train, y_test, StandardScaler(), kernel)"
127 | ]
128 | },
129 | {
130 | "cell_type": "markdown",
131 | "metadata": {},
132 | "source": [
133 | "### Penalty tuning"
134 | ]
135 | },
136 | {
137 | "cell_type": "code",
138 | "execution_count": 9,
139 | "metadata": {},
140 | "outputs": [
141 | {
142 | "name": "stdout",
143 | "output_type": "stream",
144 | "text": [
145 | "Accuracy score using penalty = 0.5 with rbf kernel: 0.540816326531\n",
146 | "Accuracy score using penalty = 0.8705505632961241 with rbf kernel: 0.560204081633\n",
147 | "Accuracy score using penalty = 1.5157165665103982 with rbf kernel: 0.558163265306\n",
148 | "Accuracy score using penalty = 2.6390158215457893 with rbf kernel: 0.564285714286\n",
149 | "Accuracy score using penalty = 4.59479341998814 with rbf kernel: 0.577551020408\n",
150 | "Accuracy score using penalty = 8.0 with rbf kernel: 0.591836734694\n"
151 | ]
152 | }
153 | ],
154 | "source": [
155 | "for с in np.logspace(-1, 3, base = 2, num = 6):\n",
156 | " print('Accuracy score using penalty = {0} with rbf kernel:'.format(с), end = ' ')\n",
157 | " fit_predict(train, test, y_train, y_test, StandardScaler(), 'rbf', с)"
158 | ]
159 | },
160 | {
161 | "cell_type": "markdown",
162 | "metadata": {},
163 | "source": [
164 | "### Choosing degree for poly kernel"
165 | ]
166 | },
167 | {
168 | "cell_type": "code",
169 | "execution_count": 10,
170 | "metadata": {},
171 | "outputs": [
172 | {
173 | "name": "stdout",
174 | "output_type": "stream",
175 | "text": [
176 | "Accuracy score using degree = 2 with poly kernel: 0.486734693878\n",
177 | "Accuracy score using degree = 3 with poly kernel: 0.518367346939\n",
178 | "Accuracy score using degree = 4 with poly kernel: 0.521428571429\n",
179 | "Accuracy score using degree = 5 with poly kernel: 0.530612244898\n"
180 | ]
181 | }
182 | ],
183 | "source": [
184 | "for degree in range(2, 6):\n",
185 | " print('Accuracy score using degree = {0} with poly kernel:'.format(degree), end = ' ')\n",
186 | " fit_predict(train, test, y_train, y_test, StandardScaler(), 'poly', 1.5, degree = degree)"
187 | ]
188 | },
189 | {
190 | "cell_type": "code",
191 | "execution_count": 11,
192 | "metadata": {},
193 | "outputs": [
194 | {
195 | "name": "stdout",
196 | "output_type": "stream",
197 | "text": [
198 | "overall improvement is 15.08 %\n"
199 | ]
200 | }
201 | ],
202 | "source": [
203 | "original_score = 0.514285714286\n",
204 | "best_score = 0.591836734694\n",
205 | "improvement = np.abs(np.round(100*(original_score - best_score)/original_score,2))\n",
206 | "print('overall improvement is {} %'.format(improvement))"
207 | ]
208 | }
209 | ],
210 | "metadata": {
211 | "kernelspec": {
212 | "display_name": "Python 3",
213 | "language": "python",
214 | "name": "python3"
215 | },
216 | "language_info": {
217 | "codemirror_mode": {
218 | "name": "ipython",
219 | "version": 3
220 | },
221 | "file_extension": ".py",
222 | "mimetype": "text/x-python",
223 | "name": "python",
224 | "nbconvert_exporter": "python",
225 | "pygments_lexer": "ipython3",
226 | "version": "3.6.1"
227 | }
228 | },
229 | "nbformat": 4,
230 | "nbformat_minor": 2
231 | }
232 |
--------------------------------------------------------------------------------
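Video 2.1 tunes the kernel, the penalty C, and the polynomial degree with separate hand-written loops on one held-out split. A sketch of the same search done jointly with cross-validation via `GridSearchCV`; an alternative I'm adding, not course code, and noticeably slower since it fits one SVM per parameter combination and fold:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe = Pipeline([('scale', StandardScaler()), ('svc', SVC())])
param_grid = {
    'svc__kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
    'svc__C': np.logspace(-1, 3, base=2, num=6),  # the notebook's 0.5 .. 8 grid
}
search = GridSearchCV(pipe, param_grid, cv=3, n_jobs=-1)
search.fit(train, y_train)
print(search.best_params_, search.best_score_)
```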
/Section 2/video 2.2 Implementing kNN on the dataset.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {
7 | "collapsed": true
8 | },
9 | "outputs": [],
10 | "source": [
11 | "import pandas as pd\n",
12 | "import numpy as np\n",
13 | "import matplotlib.pyplot as plt\n",
14 | "%matplotlib inline\n",
15 | "from sklearn.linear_model import LogisticRegression\n",
16 | "from sklearn.preprocessing import StandardScaler\n",
17 | "from sklearn.neighbors import KNeighborsClassifier\n",
18 | "from sklearn.model_selection import train_test_split\n",
19 | "from sklearn.metrics import accuracy_score\n",
20 | "from sklearn.preprocessing import PolynomialFeatures\n",
21 | "import warnings\n",
22 | "warnings.filterwarnings(\"ignore\")\n",
23 | "np.random.seed(42)"
24 | ]
25 | },
26 | {
27 | "cell_type": "code",
28 | "execution_count": 2,
29 | "metadata": {
30 | "collapsed": true
31 | },
32 | "outputs": [],
33 | "source": [
34 | "df = pd.read_csv(\"http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv\", sep = ';')\n",
35 | "y = df.pop('quality')"
36 | ]
37 | },
38 | {
39 | "cell_type": "code",
40 | "execution_count": 3,
41 | "metadata": {
42 | "collapsed": true
43 | },
44 | "outputs": [],
45 | "source": [
46 | "for i in df.columns:\n",
47 | " df[i] = df[i].fillna(np.mean(df[i]))\n",
48 | "train, test, y_train, y_test = train_test_split(df, y, test_size = 0.2)"
49 | ]
50 | },
51 | {
52 | "cell_type": "code",
53 | "execution_count": 4,
54 | "metadata": {},
55 | "outputs": [
56 | {
57 | "name": "stdout",
58 | "output_type": "stream",
59 | "text": [
60 | "Accuracy score baseline: 0.514285714286\n"
61 | ]
62 | }
63 | ],
64 | "source": [
65 | "lr = LogisticRegression()\n",
66 | "lr.fit(train, y_train)\n",
67 | "y_pred = lr.predict(test)\n",
68 | "print('Accuracy score baseline:', accuracy_score(y_test, y_pred))"
69 | ]
70 | },
71 | {
72 | "cell_type": "code",
73 | "execution_count": 5,
74 | "metadata": {
75 | "collapsed": true
76 | },
77 | "outputs": [],
78 | "source": [
79 | "def fit_predict(train, test, y_train, y_test, scaler, \n",
80 | " n_neighbours, metric = 'manhattan', weights = 'uniform'):\n",
81 | " train_scaled = scaler.fit_transform(train)\n",
82 | " test_scaled = scaler.transform(test) \n",
83 | " knn = KNeighborsClassifier(n_neighbors=n_neighbours, metric=metric, \n",
84 | " weights=weights, n_jobs = 4)\n",
85 | " knn.fit(train_scaled, y_train)\n",
86 | " y_pred = knn.predict(test_scaled)\n",
87 | " print(accuracy_score(y_test, y_pred))"
88 | ]
89 | },
90 | {
91 | "cell_type": "markdown",
92 | "metadata": {},
93 | "source": [
94 | "### Neighbours tuning"
95 | ]
96 | },
97 | {
98 | "cell_type": "code",
99 | "execution_count": 6,
100 | "metadata": {},
101 | "outputs": [
102 | {
103 | "name": "stdout",
104 | "output_type": "stream",
105 | "text": [
106 | "Accuracy score on kNN using n_neighbours = 2: 0.572448979592\n",
107 | "Accuracy score on kNN using n_neighbours = 4: 0.555102040816\n",
108 | "Accuracy score on kNN using n_neighbours = 8: 0.54387755102\n",
109 | "Accuracy score on kNN using n_neighbours = 16: 0.541836734694\n",
110 | "Accuracy score on kNN using n_neighbours = 32: 0.552040816327\n",
111 | "Accuracy score on kNN using n_neighbours = 64: 0.538775510204\n",
112 | "Accuracy score on kNN using n_neighbours = 128: 0.529591836735\n",
113 | "Accuracy score on kNN using n_neighbours = 256: 0.516326530612\n",
114 | "Accuracy score on kNN using n_neighbours = 512: 0.504081632653\n",
115 | "Accuracy score on kNN using n_neighbours = 1024: 0.472448979592\n"
116 | ]
117 | }
118 | ],
119 | "source": [
120 | "for k in range(1,11):\n",
121 | " print('Accuracy score on kNN using n_neighbours = {0}:'.format(2**k), end = ' ')\n",
122 | " fit_predict(train, test, y_train, y_test, StandardScaler(), 2**k)"
123 | ]
124 | },
125 | {
126 | "cell_type": "raw",
127 | "metadata": {},
128 | "source": [
129 | "for k in np.logspace(2, 11, base = 2, num = 11, dtype=int).tolist():\n",
130 | " print('Accuracy score on kNN using n_neighbours = {0}:'.format(k), end = ' ')\n",
131 | " fit_predict(train, test, y_train, y_test, StandardScaler(), k)"
132 | ]
133 | },
134 | {
135 | "cell_type": "markdown",
136 | "metadata": {},
137 | "source": [
138 | "### Metric tuning"
139 | ]
140 | },
141 | {
142 | "cell_type": "code",
143 | "execution_count": 7,
144 | "metadata": {},
145 | "outputs": [
146 | {
147 | "name": "stdout",
148 | "output_type": "stream",
149 | "text": [
150 | "Accuracy score on kNN using euclidean metric and 10 neighbours: 0.573469387755\n",
151 | "Accuracy score on kNN using cosine metric and 10 neighbours: 0.551020408163\n",
152 | "Accuracy score on kNN using manhattan metric and 10 neighbours: 0.572448979592\n",
153 | "Accuracy score on kNN using chebyshev metric and 10 neighbours: 0.574489795918\n"
154 | ]
155 | }
156 | ],
157 | "source": [
158 | "for metric in ['euclidean', 'cosine', 'manhattan', 'chebyshev']:\n",
159 | " print('Accuracy score on kNN using {} metric and {} neighbours:'.format(metric,k), end = ' ')\n",
160 | " fit_predict(train, test, y_train, y_test, StandardScaler(), 2, metric)"
161 | ]
162 | },
163 | {
164 | "cell_type": "markdown",
165 | "metadata": {},
166 | "source": [
167 | "### Weighted kNN"
168 | ]
169 | },
170 | {
171 | "cell_type": "code",
172 | "execution_count": 8,
173 | "metadata": {},
174 | "outputs": [
175 | {
176 | "name": "stdout",
177 | "output_type": "stream",
178 | "text": [
179 | "Accuracy score on kNN using weights = uniform: 0.574489795918\n",
180 | "Accuracy score on kNN using weights = distance: 0.648979591837\n"
181 | ]
182 | }
183 | ],
184 | "source": [
185 | "for weights in ['uniform', 'distance']:\n",
186 | " print('Accuracy score on kNN using weights = {0}:'.format(weights), end = ' ')\n",
187 | " fit_predict(train, test, y_train, y_test, StandardScaler(), 2, 'chebyshev', weights = weights)"
188 | ]
189 | },
190 | {
191 | "cell_type": "markdown",
192 | "metadata": {},
193 | "source": [
194 | "### Engineering"
195 | ]
196 | },
197 | {
198 | "cell_type": "code",
199 | "execution_count": 9,
200 | "metadata": {
201 | "collapsed": true
202 | },
203 | "outputs": [],
204 | "source": [
205 | "def create_poly(train,test,degree):\n",
206 | " poly = PolynomialFeatures(degree=degree)\n",
207 | " train_poly = poly.fit_transform(train)\n",
208 | " test_poly = poly.fit_transform(test)\n",
209 | " return train_poly,test_poly"
210 | ]
211 | },
212 | {
213 | "cell_type": "code",
214 | "execution_count": 10,
215 | "metadata": {},
216 | "outputs": [
217 | {
218 | "name": "stdout",
219 | "output_type": "stream",
220 | "text": [
221 | "Polynomial degree 1\n",
222 | "0.648979591837\n",
223 | "----------\n",
224 | "Polynomial degree 2\n",
225 | "0.640816326531\n",
226 | "----------\n",
227 | "Polynomial degree 3\n",
228 | "0.642857142857\n",
229 | "----------\n"
230 | ]
231 | }
232 | ],
233 | "source": [
234 | "for degree in [1,2,3]:\n",
235 | " train_poly, test_poly = create_poly(train, test, degree)\n",
236 | " print('Polynomial degree',degree)\n",
237 | " fit_predict(train_poly, test_poly, y_train, y_test, StandardScaler(), 2, 'chebyshev', weights = 'distance')\n",
238 | " print(10*'-')\n",
239 | " \n",
240 | "train_poly, test_poly = create_poly(train, test, 2) "
241 | ]
242 | },
243 | {
244 | "cell_type": "code",
245 | "execution_count": 11,
246 | "metadata": {
247 | "collapsed": true
248 | },
249 | "outputs": [],
250 | "source": [
251 | "def feat_eng(df):\n",
252 | " df['eng1'] = df['fixed acidity'] * df['pH']\n",
253 | " df['eng2'] = df['total sulfur dioxide'] / df['free sulfur dioxide']\n",
254 | " df['eng3'] = df['sulphates'] / df['chlorides']\n",
255 | " df['eng4'] = df['chlorides'] / df['sulphates']\n",
256 | " return df\n",
257 | "\n",
258 | "train = feat_eng(train)\n",
259 | "test = feat_eng(test)"
260 | ]
261 | },
262 | {
263 | "cell_type": "code",
264 | "execution_count": 12,
265 | "metadata": {},
266 | "outputs": [
267 | {
268 | "name": "stdout",
269 | "output_type": "stream",
270 | "text": [
271 | "Accuracy score after engineering: 0.670408163265\n"
272 | ]
273 | }
274 | ],
275 | "source": [
276 | "print('Accuracy score after engineering:', end = ' ')\n",
277 | "fit_predict(train, test, y_train, y_test, StandardScaler(), 2, 'chebyshev', weights = 'distance')"
278 | ]
279 | },
280 | {
281 | "cell_type": "code",
282 | "execution_count": 15,
283 | "metadata": {},
284 | "outputs": [
285 | {
286 | "name": "stdout",
287 | "output_type": "stream",
288 | "text": [
289 | "overall improvement is 30.36 %\n"
290 | ]
291 | }
292 | ],
293 | "source": [
294 | "original_score = 0.514285714286\n",
295 | "best_score = 0.670408163265\n",
296 | "improvement = np.abs(np.round(100*(original_score - best_score)/original_score,2))\n",
297 | "print('overall improvement is {} %'.format(improvement))"
298 | ]
299 | }
300 | ],
301 | "metadata": {
302 | "kernelspec": {
303 | "display_name": "Python 3",
304 | "language": "python",
305 | "name": "python3"
306 | },
307 | "language_info": {
308 | "codemirror_mode": {
309 | "name": "ipython",
310 | "version": 3
311 | },
312 | "file_extension": ".py",
313 | "mimetype": "text/x-python",
314 | "name": "python",
315 | "nbconvert_exporter": "python",
316 | "pygments_lexer": "ipython3",
317 | "version": "3.6.2"
318 | }
319 | },
320 | "nbformat": 4,
321 | "nbformat_minor": 2
322 | }
323 |
--------------------------------------------------------------------------------
/Section 2/video 2.3 Decision Tree as predictive model.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {
7 | "collapsed": true
8 | },
9 | "outputs": [],
10 | "source": [
11 | "import pandas as pd\n",
12 | "import numpy as np\n",
13 | "import matplotlib.pyplot as plt\n",
14 | "%matplotlib inline\n",
15 | "from sklearn.preprocessing import StandardScaler\n",
16 | "from sklearn.linear_model import LogisticRegression\n",
17 | "from sklearn.tree import DecisionTreeClassifier\n",
18 | "from sklearn.model_selection import train_test_split\n",
19 | "from sklearn.metrics import accuracy_score\n",
20 | "from sklearn.preprocessing import PolynomialFeatures\n",
21 | "import warnings\n",
22 | "warnings.filterwarnings(\"ignore\")\n",
23 | "np.random.seed(42)"
24 | ]
25 | },
26 | {
27 | "cell_type": "code",
28 | "execution_count": 2,
29 | "metadata": {
30 | "collapsed": true
31 | },
32 | "outputs": [],
33 | "source": [
34 | "df = pd.read_csv(\"http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv\", sep = ';')\n",
35 | "y = df.pop('quality')"
36 | ]
37 | },
38 | {
39 | "cell_type": "code",
40 | "execution_count": 3,
41 | "metadata": {
42 | "collapsed": true
43 | },
44 | "outputs": [],
45 | "source": [
46 | "for i in df.columns:\n",
47 | " df[i] = df[i].fillna(np.mean(df[i]))\n",
48 | "train, test, y_train, y_test = train_test_split(df, y, test_size = 0.2)"
49 | ]
50 | },
51 | {
52 | "cell_type": "code",
53 | "execution_count": 4,
54 | "metadata": {},
55 | "outputs": [
56 | {
57 | "name": "stdout",
58 | "output_type": "stream",
59 | "text": [
60 | "Accuracy score baseline: 0.514285714286\n"
61 | ]
62 | }
63 | ],
64 | "source": [
65 | "lr = LogisticRegression()\n",
66 | "lr.fit(train, y_train)\n",
67 | "y_pred = lr.predict(test)\n",
68 | "print('Accuracy score baseline:', accuracy_score(y_test, y_pred))"
69 | ]
70 | },
71 | {
72 | "cell_type": "code",
73 | "execution_count": 5,
74 | "metadata": {
75 | "collapsed": true
76 | },
77 | "outputs": [],
78 | "source": [
79 | "def fit_predict(train, test, y_train, y_test, scaler, max_depth, \n",
80 | " criterion = 'entropy', max_features = 1, min_samples_split = 4):\n",
81 | " train_scaled = scaler.fit_transform(train)\n",
82 | " test_scaled = scaler.transform(test) \n",
83 | " dt = DecisionTreeClassifier(criterion = criterion, max_depth=max_depth, \n",
84 | " random_state=42, max_features=max_features,\n",
85 | " min_samples_split=min_samples_split)\n",
86 | " dt.fit(train_scaled, y_train)\n",
87 | " y_pred = dt.predict(test_scaled)\n",
88 | " print(accuracy_score(y_test, y_pred))"
89 | ]
90 | },
91 | {
92 | "cell_type": "markdown",
93 | "metadata": {},
94 | "source": [
95 | "### Max depth tuning"
96 | ]
97 | },
98 | {
99 | "cell_type": "code",
100 | "execution_count": 6,
101 | "metadata": {},
102 | "outputs": [
103 | {
104 | "name": "stdout",
105 | "output_type": "stream",
106 | "text": [
107 | "Accuracy score using max_depth = 1: 0.440816326531\n",
108 | "Accuracy score using max_depth = 2: 0.440816326531\n",
109 | "Accuracy score using max_depth = 3: 0.45306122449\n",
110 | "Accuracy score using max_depth = 4: 0.460204081633\n",
111 | "Accuracy score using max_depth = 5: 0.486734693878\n",
112 | "Accuracy score using max_depth = 6: 0.460204081633\n",
113 | "Accuracy score using max_depth = 7: 0.497959183673\n",
114 | "Accuracy score using max_depth = 8: 0.50306122449\n",
115 | "Accuracy score using max_depth = 9: 0.518367346939\n",
116 | "Accuracy score using max_depth = 10: 0.49693877551\n",
117 | "Accuracy score using max_depth = 11: 0.514285714286\n",
118 | "Accuracy score using max_depth = 12: 0.492857142857\n",
119 | "Accuracy score using max_depth = 13: 0.575510204082\n",
120 | "Accuracy score using max_depth = 14: 0.571428571429\n",
121 | "Accuracy score using max_depth = 15: 0.533673469388\n",
122 | "Accuracy score using max_depth = 16: 0.548979591837\n",
123 | "Accuracy score using max_depth = 17: 0.548979591837\n",
124 | "Accuracy score using max_depth = 18: 0.577551020408\n",
125 | "Accuracy score using max_depth = 19: 0.561224489796\n"
126 | ]
127 | }
128 | ],
129 | "source": [
130 | "for i in range(1, 20):\n",
131 | " print('Accuracy score using max_depth =', i, end = ': ')\n",
132 | " fit_predict(train, test, y_train, y_test, StandardScaler(), i)"
133 | ]
134 | },
135 | {
136 | "cell_type": "markdown",
137 | "metadata": {},
138 | "source": [
139 | "### Max features tuning"
140 | ]
141 | },
142 | {
143 | "cell_type": "code",
144 | "execution_count": 7,
145 | "metadata": {},
146 | "outputs": [
147 | {
148 | "name": "stdout",
149 | "output_type": "stream",
150 | "text": [
151 | "Accuracy score using max features = 0.1: 0.577551020408\n",
152 | "Accuracy score using max features = 0.2: 0.608163265306\n",
153 | "Accuracy score using max features = 0.3: 0.608163265306\n",
154 | "Accuracy score using max features = 0.4: 0.595918367347\n",
155 | "Accuracy score using max features = 0.5: 0.60612244898\n",
156 | "Accuracy score using max features = 0.6: 0.580612244898\n",
157 | "Accuracy score using max features = 0.7: 0.60306122449\n",
158 | "Accuracy score using max features = 0.8: 0.572448979592\n",
159 | "Accuracy score using max features = 0.9: 0.576530612245\n"
160 | ]
161 | }
162 | ],
163 | "source": [
164 | "for i in np.arange(0.1, 1.0, 0.1):\n",
165 | " print('Accuracy score using max features =', i, end = ': ')\n",
166 | " fit_predict(train, test, y_train, y_test, StandardScaler(), max_depth = 18, max_features=i)"
167 | ]
168 | },
169 | {
170 | "cell_type": "markdown",
171 | "metadata": {},
172 | "source": [
173 | "### Min samples split tuning"
174 | ]
175 | },
176 | {
177 | "cell_type": "code",
178 | "execution_count": 8,
179 | "metadata": {},
180 | "outputs": [
181 | {
182 | "name": "stdout",
183 | "output_type": "stream",
184 | "text": [
185 | "Accuracy score using min samples split = 2: 0.613265306122\n",
186 | "Accuracy score using min samples split = 3: 0.592857142857\n",
187 | "Accuracy score using min samples split = 4: 0.608163265306\n",
188 | "Accuracy score using min samples split = 5: 0.561224489796\n",
189 | "Accuracy score using min samples split = 6: 0.580612244898\n",
190 | "Accuracy score using min samples split = 7: 0.563265306122\n",
191 | "Accuracy score using min samples split = 8: 0.595918367347\n",
192 | "Accuracy score using min samples split = 9: 0.55612244898\n"
193 | ]
194 | }
195 | ],
196 | "source": [
197 | "for i in range(2, 10):\n",
198 | " print('Accuracy score using min samples split =', i, end = ': ')\n",
199 | " fit_predict(train, test, y_train, y_test, StandardScaler(), 18, max_features=0.3, min_samples_split=i)"
200 | ]
201 | },
202 | {
203 | "cell_type": "markdown",
204 | "metadata": {},
205 | "source": [
206 | "### Criterion tuning"
207 | ]
208 | },
209 | {
210 | "cell_type": "code",
211 | "execution_count": 9,
212 | "metadata": {},
213 | "outputs": [
214 | {
215 | "name": "stdout",
216 | "output_type": "stream",
217 | "text": [
218 | "Accuracy score using criterion = gini: 0.613265306122\n",
219 | "Accuracy score using criterion = entropy: 0.613265306122\n"
220 | ]
221 | }
222 | ],
223 | "source": [
224 | "for i in ['gini', 'entropy']:\n",
225 | " print('Accuracy score using criterion =', i, end = ': ')\n",
226 | " fit_predict(train, test, y_train, y_test, StandardScaler(), 18, \n",
227 | " max_features=0.3, min_samples_split=2, criterion = 'entropy')"
228 | ]
229 | },
230 | {
231 | "cell_type": "code",
232 | "execution_count": 10,
233 | "metadata": {
234 | "collapsed": true
235 | },
236 | "outputs": [],
237 | "source": [
238 | "def create_poly(train,test,degree):\n",
239 | " poly = PolynomialFeatures(degree=degree)\n",
240 | " train_poly = poly.fit_transform(train)\n",
241 | " test_poly = poly.fit_transform(test)\n",
242 | " return train_poly,test_poly"
243 | ]
244 | },
245 | {
246 | "cell_type": "code",
247 | "execution_count": 11,
248 | "metadata": {},
249 | "outputs": [
250 | {
251 | "name": "stdout",
252 | "output_type": "stream",
253 | "text": [
254 | "Polynomial degree 1\n",
255 | "0.604081632653\n",
256 | "----------\n",
257 | "Polynomial degree 2\n",
258 | "0.625510204082\n",
259 | "----------\n",
260 | "Polynomial degree 3\n",
261 | "0.625510204082\n",
262 | "----------\n",
263 | "Polynomial degree 4\n",
264 | "0.609183673469\n",
265 | "----------\n"
266 | ]
267 | }
268 | ],
269 | "source": [
270 | "for degree in [1,2,3,4]:\n",
271 | " train_poly, test_poly = create_poly(train, test, degree)\n",
272 | " print('Polynomial degree',degree)\n",
273 | " fit_predict(train_poly, test_poly, y_train, y_test, StandardScaler(), 18, \n",
274 | " max_features=0.3, min_samples_split=2, criterion = 'entropy')\n",
275 | " print(10*'-')\n",
276 | " \n",
277 | "train_poly, test_poly = create_poly(train, test, 2) "
278 | ]
279 | },
280 | {
281 | "cell_type": "code",
282 | "execution_count": 12,
283 | "metadata": {},
284 | "outputs": [
285 | {
286 | "name": "stdout",
287 | "output_type": "stream",
288 | "text": [
289 | "Additional feature engineering:\n",
290 | "0.592857142857\n",
291 | "0.617346938776\n"
292 | ]
293 | }
294 | ],
295 | "source": [
296 | "def feat_eng(df):\n",
297 | " df['eng1'] = df['fixed acidity'] * df['pH']\n",
298 | " df['eng2'] = df['total sulfur dioxide'] / df['free sulfur dioxide']\n",
299 | " df['eng3'] = df['sulphates'] / df['chlorides']\n",
300 | " df['eng4'] = df['chlorides'] / df['sulphates']\n",
301 | " return df\n",
302 | "\n",
303 | "train = feat_eng(train)\n",
304 | "test = feat_eng(test)\n",
305 | "\n",
306 | "print('Additional feature engineering:')\n",
307 | "\n",
308 | "fit_predict(train, test, y_train, y_test, StandardScaler(), 18, \n",
309 | " max_features=0.3, min_samples_split=2, criterion = 'entropy')\n",
310 | "\n",
311 | "train_poly, test_poly = create_poly(train, test, 2)\n",
312 | "\n",
313 | "fit_predict(train_poly, test_poly, y_train, y_test, StandardScaler(), 18, \n",
314 | " max_features=0.3, min_samples_split=2, criterion = 'entropy')\n"
315 | ]
316 | },
317 | {
318 | "cell_type": "code",
319 | "execution_count": 13,
320 | "metadata": {},
321 | "outputs": [
322 | {
323 | "name": "stdout",
324 | "output_type": "stream",
325 | "text": [
326 | "overall improvement is 21.63 %\n"
327 | ]
328 | }
329 | ],
330 | "source": [
331 | "original_score = 0.514285714286\n",
332 | "best_score = 0.625510204082\n",
333 | "improvement = np.abs(np.round(100*(original_score - best_score)/original_score,2))\n",
334 | "print('overall improvement is {} %'.format(improvement))"
335 | ]
336 | }
337 | ],
338 | "metadata": {
339 | "kernelspec": {
340 | "display_name": "Python 3",
341 | "language": "python",
342 | "name": "python3"
343 | },
344 | "language_info": {
345 | "codemirror_mode": {
346 | "name": "ipython",
347 | "version": 3
348 | },
349 | "file_extension": ".py",
350 | "mimetype": "text/x-python",
351 | "name": "python",
352 | "nbconvert_exporter": "python",
353 | "pygments_lexer": "ipython3",
354 | "version": "3.6.1"
355 | }
356 | },
357 | "nbformat": 4,
358 | "nbformat_minor": 2
359 | }
360 |
--------------------------------------------------------------------------------
/Section 2/video 2.5 combining all together.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 9,
6 | "metadata": {},
7 | "outputs": [],
8 | "source": [
9 | "import pandas as pd\n",
10 | "import numpy as np\n",
11 | "import matplotlib.pyplot as plt\n",
12 | "%matplotlib inline\n",
13 | "from sklearn.preprocessing import StandardScaler\n",
14 | "from sklearn.neighbors import KNeighborsClassifier\n",
15 | "from sklearn.tree import DecisionTreeClassifier\n",
16 | "from sklearn.svm import SVC\n",
17 | "from sklearn.model_selection import train_test_split\n",
18 | "from sklearn.metrics import roc_auc_score\n",
19 | "from sklearn.linear_model import LogisticRegression\n",
20 | "from keras.models import Sequential, Input, Model\n",
21 | "from keras.layers import Dense, Activation, Dropout, Flatten, BatchNormalization\n",
22 | "from keras.callbacks import EarlyStopping, ReduceLROnPlateau\n",
23 | "from sklearn.metrics import accuracy_score\n",
24 | "import warnings\n",
25 | "warnings.filterwarnings(\"ignore\")\n",
26 | "np.random.seed(42)\n"
27 | ]
28 | },
29 | {
30 | "cell_type": "code",
31 | "execution_count": 5,
32 | "metadata": {},
33 | "outputs": [],
34 | "source": [
35 | "df = pd.read_csv(\"train.csv\", sep = ',')\n",
36 | "df = df.sample(frac = 0.2, random_state = 123)\n",
37 | "y = df.pop('target')\n",
38 | "df.drop('id', axis = 1, inplace=True)\n",
39 | "train, test, y_train, y_test = train_test_split(df, y, test_size = 0.2)"
40 | ]
41 | },
42 | {
43 | "cell_type": "code",
44 | "execution_count": 18,
45 | "metadata": {},
46 | "outputs": [
47 | {
48 | "name": "stdout",
49 | "output_type": "stream",
50 | "text": [
51 | "AUC baseline: 0.6205363530135511\n"
52 | ]
53 | }
54 | ],
55 | "source": [
56 | "lr = LogisticRegression()\n",
57 | "lr.fit(train, y_train)\n",
58 | "y_pred = lr.predict_proba(test)\n",
59 | "print('AUC baseline:', roc_auc_score(y_test, y_pred[:,1]))"
60 | ]
61 | },
62 | {
63 | "cell_type": "code",
64 | "execution_count": 19,
65 | "metadata": {
66 | "collapsed": true
67 | },
68 | "outputs": [],
69 | "source": [
70 | "scaler = StandardScaler()\n",
71 | "df_values = scaler.fit_transform(df)"
72 | ]
73 | },
74 | {
75 | "cell_type": "code",
76 | "execution_count": 20,
77 | "metadata": {
78 | "collapsed": true
79 | },
80 | "outputs": [],
81 | "source": [
82 | "def fit_knn(train, test, y_train, y_test, \n",
83 | " n_neighbours = 64, metric = 'euclidean', weights = 'distance'): \n",
84 | " knn = KNeighborsClassifier(n_neighbors=n_neighbours, metric=metric, \n",
85 | " weights=weights, n_jobs = 4)\n",
86 | " knn.fit(train, y_train)\n",
87 | " y_pred = knn.predict_proba(test)\n",
88 | " print(roc_auc_score(y_test, y_pred[:, 1]))"
89 | ]
90 | },
91 | {
92 | "cell_type": "code",
93 | "execution_count": 21,
94 | "metadata": {
95 | "collapsed": true
96 | },
97 | "outputs": [],
98 | "source": [
99 | "def fit_svm(train, test, y_train, y_test, kernel = 'linear', C = 1.5, degree = 3): \n",
100 | " svm = SVC(kernel = kernel, degree = degree, C = C, max_iter=100, probability=True)\n",
101 | " svm.fit(train, y_train)\n",
102 | " y_pred = svm.predict_proba(test)\n",
103 | " print(roc_auc_score(y_test, y_pred[:, 1]))"
104 | ]
105 | },
106 | {
107 | "cell_type": "code",
108 | "execution_count": 22,
109 | "metadata": {
110 | "collapsed": true
111 | },
112 | "outputs": [],
113 | "source": [
114 | "def fit_tree(train, test, y_train, y_test, max_depth = 9, \n",
115 | " criterion = 'entropy', max_features = 0.8, min_samples_split = 6):\n",
116 | " tree = DecisionTreeClassifier(criterion = criterion, max_depth=max_depth, \n",
117 | " random_state=111, max_features=max_features,\n",
118 | " min_samples_split=min_samples_split)\n",
119 | " tree.fit(train, y_train)\n",
120 | " y_pred = tree.predict_proba(test)\n",
121 | " print(roc_auc_score(y_test, y_pred[:, 1]))"
122 | ]
123 | },
124 | {
125 | "cell_type": "code",
126 | "execution_count": 23,
127 | "metadata": {
128 | "collapsed": true
129 | },
130 | "outputs": [],
131 | "source": [
132 | "def create_autoencoder_model(object_size=df.shape[1], encoder_layer_shapes=[128, 64, 32], decoder_layer_shapes=[64, 128]):\n",
133 | " # входные параметры:\n",
134 | " # object_size: int, размер входного и выходного слоя автоэнкодера\n",
135 | " # encoder_layer_shapes: list of int, количество нейронов в каждом слое энкодера. \n",
136 | " # последний элемент списка - размер \"бутылочного горлышка\"\n",
137 | " # decoder_layer_shapes: ist of int, количество нейронов в каждом слое декодера\n",
138 | " \n",
139 | " # выход:\n",
140 | " # keras модель\n",
141 | " input_ = Input(shape=(object_size,))\n",
142 | " encoded = Dense(encoder_layer_shapes[0], activation='elu')(input_)\n",
143 | " encoded = BatchNormalization()(encoded)\n",
144 | " encoded = Dense(encoder_layer_shapes[1], activation='elu')(encoded)\n",
145 | " encoded = BatchNormalization()(encoded)\n",
146 | " encoded = Dense(encoder_layer_shapes[2], activation='elu')(encoded)\n",
147 | " encoded = BatchNormalization()(encoded)\n",
148 | " decoded = Dense(decoder_layer_shapes[0], activation='elu')(encoded)\n",
149 | " decoded = BatchNormalization()(decoded)\n",
150 | " decoded = Dense(decoder_layer_shapes[1], activation='elu')(decoded)\n",
151 | " decoded = BatchNormalization()(decoded)\n",
152 | " decoded = Dense(object_size, activation='sigmoid')(decoded)\n",
153 | " \n",
154 | " model = Model(input_, decoded)\n",
155 | " model.compile(optimizer = 'Adam', loss='mean_squared_error')\n",
156 | " return model"
157 | ]
158 | },
159 | {
160 | "cell_type": "code",
161 | "execution_count": 24,
162 | "metadata": {
163 | "collapsed": true
164 | },
165 | "outputs": [],
166 | "source": [
167 | "train, test, y_train, y_test = train_test_split(df_values, y, test_size = 0.2)"
168 | ]
169 | },
170 | {
171 | "cell_type": "code",
172 | "execution_count": 25,
173 | "metadata": {},
174 | "outputs": [
175 | {
176 | "name": "stdout",
177 | "output_type": "stream",
178 | "text": [
179 | "Train on 95233 samples, validate on 23809 samples\n",
180 | "Epoch 1/100\n",
181 | "95233/95233 [==============================] - 4s 39us/step - loss: 0.8554 - val_loss: 0.7237\n",
182 | "Epoch 2/100\n",
183 | "95233/95233 [==============================] - 3s 30us/step - loss: 0.6612 - val_loss: 0.6782\n",
184 | "Epoch 3/100\n",
185 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6390 - val_loss: 0.6564\n",
186 | "Epoch 4/100\n",
187 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6311 - val_loss: 0.6472\n",
188 | "Epoch 5/100\n",
189 | "95233/95233 [==============================] - 3s 28us/step - loss: 0.6258 - val_loss: 0.6417\n",
190 | "Epoch 6/100\n",
191 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6228 - val_loss: 0.6384\n",
192 | "Epoch 7/100\n",
193 | "95233/95233 [==============================] - 3s 28us/step - loss: 0.6210 - val_loss: 0.6368\n",
194 | "Epoch 8/100\n",
195 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6194 - val_loss: 0.6344\n",
196 | "Epoch 9/100\n",
197 | "95233/95233 [==============================] - 3s 28us/step - loss: 0.6179 - val_loss: 0.6328\n",
198 | "Epoch 10/100\n",
199 | "95233/95233 [==============================] - 3s 28us/step - loss: 0.6166 - val_loss: 0.6315\n",
200 | "Epoch 11/100\n",
201 | "95233/95233 [==============================] - 3s 28us/step - loss: 0.6155 - val_loss: 0.6306\n",
202 | "Epoch 12/100\n",
203 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6147 - val_loss: 0.6298\n",
204 | "Epoch 13/100\n",
205 | "95233/95233 [==============================] - 3s 28us/step - loss: 0.6141 - val_loss: 0.6289\n",
206 | "Epoch 14/100\n",
207 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6136 - val_loss: 0.6283\n",
208 | "Epoch 15/100\n",
209 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6130 - val_loss: 0.6281\n",
210 | "Epoch 16/100\n",
211 | "95233/95233 [==============================] - 3s 28us/step - loss: 0.6124 - val_loss: 0.6273\n",
212 | "Epoch 17/100\n",
213 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6119 - val_loss: 0.6268\n",
214 | "Epoch 18/100\n",
215 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6114 - val_loss: 0.6260\n",
216 | "Epoch 19/100\n",
217 | "95233/95233 [==============================] - 3s 28us/step - loss: 0.6111 - val_loss: 0.6261\n",
218 | "Epoch 20/100\n",
219 | "95233/95233 [==============================] - 3s 28us/step - loss: 0.6108 - val_loss: 0.6257\n",
220 | "Epoch 21/100\n",
221 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6105 - val_loss: 0.6253\n",
222 | "Epoch 22/100\n",
223 | "95233/95233 [==============================] - 3s 28us/step - loss: 0.6103 - val_loss: 0.6251\n",
224 | "Epoch 23/100\n",
225 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6102 - val_loss: 0.6249\n",
226 | "Epoch 24/100\n",
227 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6101 - val_loss: 0.6247\n",
228 | "Epoch 25/100\n",
229 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6098 - val_loss: 0.6246\n",
230 | "Epoch 26/100\n",
231 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6098 - val_loss: 0.6247\n",
232 | "Epoch 27/100\n",
233 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6096 - val_loss: 0.6242\n",
234 | "Epoch 28/100\n",
235 | "95233/95233 [==============================] - 3s 28us/step - loss: 0.6095 - val_loss: 0.6240\n",
236 | "Epoch 29/100\n",
237 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6094 - val_loss: 0.6240\n",
238 | "Epoch 30/100\n",
239 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6092 - val_loss: 0.6240\n",
240 | "Epoch 31/100\n",
241 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6091 - val_loss: 0.6235\n",
242 | "Epoch 32/100\n",
243 | "95233/95233 [==============================] - 3s 28us/step - loss: 0.6090 - val_loss: 0.6235\n",
244 | "Epoch 33/100\n",
245 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6089 - val_loss: 0.6234\n",
246 | "Epoch 34/100\n",
247 | "95233/95233 [==============================] - 3s 28us/step - loss: 0.6089 - val_loss: 0.6236\n",
248 | "Epoch 35/100\n",
249 | "95233/95233 [==============================] - 3s 28us/step - loss: 0.6088 - val_loss: 0.6233\n",
250 | "Epoch 36/100\n",
251 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6086 - val_loss: 0.6230\n",
252 | "Epoch 37/100\n",
253 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6087 - val_loss: 0.6233\n",
254 | "Epoch 38/100\n",
255 | "95233/95233 [==============================] - 3s 28us/step - loss: 0.6086 - val_loss: 0.6231\n",
256 | "Epoch 39/100\n",
257 | "95233/95233 [==============================] - 3s 28us/step - loss: 0.6085 - val_loss: 0.6232\n",
258 | "Epoch 40/100\n",
259 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6084 - val_loss: 0.6228\n",
260 | "Epoch 41/100\n",
261 | "95233/95233 [==============================] - 3s 28us/step - loss: 0.6084 - val_loss: 0.6232\n",
262 | "Epoch 42/100\n",
263 | "95233/95233 [==============================] - 3s 28us/step - loss: 0.6083 - val_loss: 0.6230\n",
264 | "Epoch 43/100\n",
265 | "95233/95233 [==============================] - 3s 28us/step - loss: 0.6084 - val_loss: 0.6229\n",
266 | "Epoch 44/100\n",
267 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6082 - val_loss: 0.6230\n",
268 | "Epoch 45/100\n",
269 | "95233/95233 [==============================] - 3s 28us/step - loss: 0.6082 - val_loss: 0.6230\n",
270 | "Epoch 46/100\n",
271 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6081 - val_loss: 0.6227\n",
272 | "\n",
273 | "Epoch 00046: ReduceLROnPlateau reducing learning rate to 0.00010000000474974513.\n",
274 | "Epoch 47/100\n",
275 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6075 - val_loss: 0.6216\n",
276 | "Epoch 48/100\n",
277 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6074 - val_loss: 0.6216\n",
278 | "Epoch 49/100\n",
279 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6074 - val_loss: 0.6215\n",
280 | "Epoch 50/100\n",
281 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6074 - val_loss: 0.6215\n",
282 | "Epoch 51/100\n",
283 | "95233/95233 [==============================] - 3s 28us/step - loss: 0.6074 - val_loss: 0.6215\n",
284 | "Epoch 52/100\n",
285 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6074 - val_loss: 0.6216\n",
286 | "Epoch 53/100\n",
287 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6073 - val_loss: 0.6215\n",
288 | "Epoch 54/100\n",
289 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6073 - val_loss: 0.6215\n",
290 | "Epoch 55/100\n",
291 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6074 - val_loss: 0.6214\n",
292 | "Epoch 56/100\n",
293 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6073 - val_loss: 0.6215\n",
294 | "Epoch 57/100\n",
295 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6073 - val_loss: 0.6215\n",
296 | "Epoch 58/100\n",
297 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6073 - val_loss: 0.6214\n",
298 | "Epoch 59/100\n",
299 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6073 - val_loss: 0.6215\n",
300 | "Epoch 60/100\n",
301 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6073 - val_loss: 0.6214\n",
302 | "Epoch 61/100\n",
303 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6073 - val_loss: 0.6214\n",
304 | "\n",
305 | "Epoch 00061: ReduceLROnPlateau reducing learning rate to 1.0000000474974514e-05.\n",
306 | "Epoch 62/100\n",
307 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6072 - val_loss: 0.6214\n",
308 | "Epoch 63/100\n",
309 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6073 - val_loss: 0.6214\n",
310 | "Epoch 64/100\n",
311 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6072 - val_loss: 0.6214\n",
312 | "Epoch 65/100\n",
313 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6072 - val_loss: 0.6215\n",
314 | "Epoch 66/100\n",
315 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6072 - val_loss: 0.6214\n",
316 | "Epoch 67/100\n",
317 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6072 - val_loss: 0.6214\n",
318 | "Epoch 68/100\n",
319 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6072 - val_loss: 0.6214\n",
320 | "\n",
321 | "Epoch 00068: ReduceLROnPlateau reducing learning rate to 1.0000000656873453e-06.\n",
322 | "Epoch 69/100\n",
323 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6072 - val_loss: 0.6214\n",
324 | "Epoch 70/100\n",
325 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6072 - val_loss: 0.6214\n",
326 | "Epoch 71/100\n",
327 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6072 - val_loss: 0.6214\n",
328 | "Epoch 72/100\n",
329 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6072 - val_loss: 0.6214\n",
330 | "Epoch 73/100\n",
331 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6072 - val_loss: 0.6213\n",
332 | "Epoch 74/100\n",
333 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6072 - val_loss: 0.6214\n"
334 | ]
335 | },
336 | {
337 | "name": "stdout",
338 | "output_type": "stream",
339 | "text": [
340 | "Epoch 75/100\n",
341 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6072 - val_loss: 0.6214\n",
342 | "\n",
343 | "Epoch 00075: ReduceLROnPlateau reducing learning rate to 1.0000001111620805e-07.\n",
344 | "Epoch 76/100\n",
345 | "95233/95233 [==============================] - 3s 28us/step - loss: 0.6072 - val_loss: 0.6215\n",
346 | "Epoch 77/100\n",
347 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6072 - val_loss: 0.6214\n",
348 | "Epoch 78/100\n",
349 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6072 - val_loss: 0.6214\n",
350 | "Epoch 79/100\n",
351 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6072 - val_loss: 0.6214\n",
352 | "Epoch 80/100\n",
353 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6072 - val_loss: 0.6214\n",
354 | "Epoch 81/100\n",
355 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6072 - val_loss: 0.6214\n",
356 | "Epoch 82/100\n",
357 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6072 - val_loss: 0.6214\n",
358 | "\n",
359 | "Epoch 00082: ReduceLROnPlateau reducing learning rate to 1.000000082740371e-08.\n",
360 | "Epoch 83/100\n",
361 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6072 - val_loss: 0.6214\n",
362 | "Epoch 84/100\n",
363 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6072 - val_loss: 0.6215\n",
364 | "Epoch 85/100\n",
365 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6072 - val_loss: 0.6214\n",
366 | "Epoch 86/100\n",
367 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6072 - val_loss: 0.6214\n",
368 | "Epoch 87/100\n",
369 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6072 - val_loss: 0.6214\n",
370 | "Epoch 88/100\n",
371 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6072 - val_loss: 0.6214\n",
372 | "Epoch 89/100\n",
373 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6071 - val_loss: 0.6214\n",
374 | "\n",
375 | "Epoch 00089: ReduceLROnPlateau reducing learning rate to 1.000000082740371e-09.\n",
376 | "Epoch 90/100\n",
377 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6072 - val_loss: 0.6215\n",
378 | "Epoch 00090: early stopping\n"
379 | ]
380 | },
381 | {
382 | "data": {
383 | "text/plain": [
384 | ""
385 | ]
386 | },
387 | "execution_count": 25,
388 | "metadata": {},
389 | "output_type": "execute_result"
390 | }
391 | ],
392 | "source": [
393 | "autoencoder = create_autoencoder_model()\n",
394 | "\n",
395 | "early_stop = EarlyStopping(monitor='val_loss',\n",
396 | " patience=35,\n",
397 | " verbose=1,\n",
398 | " min_delta=1e-4)\n",
399 | "\n",
400 | "reduce_lr = ReduceLROnPlateau(monitor='val_loss',\n",
401 | " factor=0.1,\n",
402 | " patience=5,\n",
403 | " cooldown=2,\n",
404 | " verbose=1)\n",
405 | "\n",
406 | "autoencoder.fit(train, train,\n",
407 | " epochs=100,\n",
408 | " batch_size=512,\n",
409 | " validation_data=(test, test), callbacks = [early_stop, reduce_lr])"
410 | ]
411 | },
412 | {
413 | "cell_type": "code",
414 | "execution_count": 26,
415 | "metadata": {},
416 | "outputs": [
417 | {
418 | "name": "stdout",
419 | "output_type": "stream",
420 | "text": [
421 | "95233/95233 [==============================] - 2s 26us/step\n",
422 | "23809/23809 [==============================] - 1s 25us/step\n"
423 | ]
424 | }
425 | ],
426 | "source": [
427 | "model_bn = Model(autoencoder.input, autoencoder.layers[3].output)\n",
428 | "decompose_train = model_bn.predict(train, verbose = 1)\n",
429 | "decompose_test = model_bn.predict(test, verbose = 1)\n"
430 | ]
431 | },
432 | {
433 | "cell_type": "code",
434 | "execution_count": 27,
435 | "metadata": {},
436 | "outputs": [
437 | {
438 | "name": "stdout",
439 | "output_type": "stream",
440 | "text": [
441 | "ROC-AUC score on kNN:\n",
442 | "0.547487388360222\n",
443 | "ROC-AUC score on SVM:\n",
444 | "0.5264453710504442\n",
445 | "ROC-AUC score on Decision tree:\n",
446 | "0.5499497836853388\n"
447 | ]
448 | }
449 | ],
450 | "source": [
451 | "print('ROC-AUC score on kNN:')\n",
452 | "fit_knn(decompose_train, decompose_test, y_train, y_test)\n",
453 | "print('ROC-AUC score on SVM:')\n",
454 | "fit_svm(decompose_train, decompose_test, y_train, y_test)\n",
455 | "print('ROC-AUC score on Decision tree:')\n",
456 | "fit_tree(decompose_train, decompose_test, y_train, y_test)"
457 | ]
458 | },
459 | {
460 | "cell_type": "code",
461 | "execution_count": 28,
462 | "metadata": {},
463 | "outputs": [
464 | {
465 | "name": "stdout",
466 | "output_type": "stream",
467 | "text": [
468 | "ROC-AUC score on kNN:\n",
469 | "0.5609669115700139\n",
470 | "ROC-AUC score on SVM:\n",
471 | "0.4533384754593952\n",
472 | "ROC-AUC score on Decision tree:\n",
473 | "0.5525507861696763\n"
474 | ]
475 | }
476 | ],
477 | "source": [
478 | "print('ROC-AUC score on kNN:')\n",
479 | "fit_knn(train, test, y_train, y_test)\n",
480 | "print('ROC-AUC score on SVM:')\n",
481 | "fit_svm(train, test, y_train, y_test)\n",
482 | "print('ROC-AUC score on Decision tree:')\n",
483 | "fit_tree(train, test, y_train, y_test)"
484 | ]
485 | }
486 | ],
487 | "metadata": {
488 | "kernelspec": {
489 | "display_name": "Python 3",
490 | "language": "python",
491 | "name": "python3"
492 | },
493 | "language_info": {
494 | "codemirror_mode": {
495 | "name": "ipython",
496 | "version": 3
497 | },
498 | "file_extension": ".py",
499 | "mimetype": "text/x-python",
500 | "name": "python",
501 | "nbconvert_exporter": "python",
502 | "pygments_lexer": "ipython3",
503 | "version": "3.6.1"
504 | }
505 | },
506 | "nbformat": 4,
507 | "nbformat_minor": 2
508 | }
509 |
--------------------------------------------------------------------------------
/Section 3/3.1 Random Forest for classification.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {
7 | "collapsed": true
8 | },
9 | "outputs": [],
10 | "source": [
11 | "import pandas as pd\n",
12 | "import numpy as np\n",
13 | "import matplotlib.pyplot as plt\n",
14 | "%matplotlib inline\n",
15 | "from sklearn.preprocessing import StandardScaler\n",
16 | "from sklearn.linear_model import LogisticRegression\n",
17 | "from sklearn.ensemble import RandomForestClassifier\n",
18 | "from sklearn.model_selection import train_test_split\n",
19 | "from sklearn.metrics import accuracy_score\n",
20 | "from sklearn.preprocessing import PolynomialFeatures\n",
21 | "import warnings\n",
22 | "warnings.filterwarnings(\"ignore\")\n",
23 | "np.random.seed(42)"
24 | ]
25 | },
26 | {
27 | "cell_type": "code",
28 | "execution_count": 2,
29 | "metadata": {
30 | "collapsed": true
31 | },
32 | "outputs": [],
33 | "source": [
34 | "df = pd.read_csv(\"http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv\", sep = ';')\n",
35 | "y = df.pop('quality')\n",
36 | "for i in df.columns:\n",
37 | " df[i] = df[i].fillna(np.mean(df[i]))\n",
38 | "train, test, y_train, y_test = train_test_split(df, y, test_size = 0.2)"
39 | ]
40 | },
41 | {
42 | "cell_type": "code",
43 | "execution_count": 3,
44 | "metadata": {},
45 | "outputs": [
46 | {
47 | "name": "stdout",
48 | "output_type": "stream",
49 | "text": [
50 | "Accuracy score baseline: 0.5142857142857142\n"
51 | ]
52 | }
53 | ],
54 | "source": [
55 | "lr = LogisticRegression()\n",
56 | "lr.fit(train, y_train)\n",
57 | "y_pred = lr.predict(test)\n",
58 | "print('Accuracy score baseline:', accuracy_score(y_test, y_pred))"
59 | ]
60 | },
61 | {
62 | "cell_type": "code",
63 | "execution_count": 4,
64 | "metadata": {
65 | "collapsed": true
66 | },
67 | "outputs": [],
68 | "source": [
69 | "def fit_predict(train, test, y_train, y_test, max_depth = None , \n",
70 | " n_estimators = 10, max_features = 'auto', min_samples_split = 2,scaler = None):\n",
71 | " if scaler:\n",
72 | " train = scaler.fit_transform(train)\n",
73 | " test = scaler.transform(test) \n",
74 | " RF = RandomForestClassifier(n_estimators = n_estimators, max_depth=max_depth, \n",
75 | " random_state = 42, max_features = max_features,\n",
76 | " min_samples_split = min_samples_split)\n",
77 | " RF.fit(train, y_train)\n",
78 | " y_pred = RF.predict(test)\n",
79 | " print(accuracy_score(y_test, y_pred))"
80 | ]
81 | },
82 | {
83 | "cell_type": "code",
84 | "execution_count": 5,
85 | "metadata": {},
86 | "outputs": [
87 | {
88 | "name": "stdout",
89 | "output_type": "stream",
90 | "text": [
91 | "baseline accuracy score: 0.6428571428571429\n",
92 | "baseline accuracy score with scaler: 0.6418367346938776\n"
93 | ]
94 | }
95 | ],
96 | "source": [
97 | "print('baseline accuracy score', end = ': ')\n",
98 | "fit_predict(train,test,y_train,y_test)\n",
99 | "print('baseline accuracy score with scaler', end = ': ')\n",
100 | "fit_predict(train,test,y_train,y_test,scaler=StandardScaler())"
101 | ]
102 | },
103 | {
104 | "cell_type": "code",
105 | "execution_count": 6,
106 | "metadata": {},
107 | "outputs": [
108 | {
109 | "name": "stdout",
110 | "output_type": "stream",
111 | "text": [
112 | "Accuracy score using n_estimators = 20: 0.6591836734693878\n",
113 | "Accuracy score using n_estimators = 40: 0.6744897959183673\n",
114 | "Accuracy score using n_estimators = 60: 0.6816326530612244\n",
115 | "Accuracy score using n_estimators = 80: 0.6877551020408164\n",
116 | "Accuracy score using n_estimators = 100: 0.6908163265306122\n",
117 | "Accuracy score using n_estimators = 120: 0.6979591836734694\n",
118 | "Accuracy score using n_estimators = 140: 0.6908163265306122\n",
119 | "Accuracy score using n_estimators = 160: 0.6959183673469388\n",
120 | "Accuracy score using n_estimators = 180: 0.6948979591836735\n"
121 | ]
122 | }
123 | ],
124 | "source": [
125 | "for n_estimators in range(20,200,20):\n",
126 | " print('Accuracy score using n_estimators =', n_estimators, end = ': ')\n",
127 | " fit_predict(train,test,y_train,y_test,n_estimators = n_estimators)\n"
128 | ]
129 | },
130 | {
131 | "cell_type": "code",
132 | "execution_count": 7,
133 | "metadata": {},
134 | "outputs": [
135 | {
136 | "name": "stdout",
137 | "output_type": "stream",
138 | "text": [
139 | "Accuracy score using max_depth = 1: 0.44081632653061226\n",
140 | "Accuracy score using max_depth = 2: 0.4897959183673469\n",
141 | "Accuracy score using max_depth = 3: 0.49387755102040815\n",
142 | "Accuracy score using max_depth = 4: 0.5051020408163265\n",
143 | "Accuracy score using max_depth = 5: 0.5244897959183673\n",
144 | "Accuracy score using max_depth = 6: 0.5357142857142857\n",
145 | "Accuracy score using max_depth = 7: 0.563265306122449\n",
146 | "Accuracy score using max_depth = 8: 0.5826530612244898\n",
147 | "Accuracy score using max_depth = 9: 0.5959183673469388\n",
148 | "Accuracy score using max_depth = 10: 0.6091836734693877\n",
149 | "Accuracy score using max_depth = 11: 0.6469387755102041\n",
150 | "Accuracy score using max_depth = 12: 0.6744897959183673\n",
151 | "Accuracy score using max_depth = 13: 0.6795918367346939\n",
152 | "Accuracy score using max_depth = 14: 0.6979591836734694\n",
153 | "Accuracy score using max_depth = 15: 0.7010204081632653\n",
154 | "Accuracy score using max_depth = 16: 0.6959183673469388\n",
155 | "Accuracy score using max_depth = 17: 0.6969387755102041\n",
156 | "Accuracy score using max_depth = 18: 0.7020408163265306\n",
157 | "Accuracy score using max_depth = 19: 0.7010204081632653\n"
158 | ]
159 | }
160 | ],
161 | "source": [
162 | "for max_depth in range(1,20):\n",
163 | " print('Accuracy score using max_depth =', max_depth, end = ': ')\n",
164 | " fit_predict(train,test,y_train,y_test,n_estimators = 160,max_depth = max_depth)\n"
165 | ]
166 | },
167 | {
168 | "cell_type": "code",
169 | "execution_count": 8,
170 | "metadata": {},
171 | "outputs": [
172 | {
173 | "name": "stdout",
174 | "output_type": "stream",
175 | "text": [
176 | "Accuracy score using max_features = 0.1: 0.6969387755102041\n",
177 | "Accuracy score using max_features = 0.2: 0.7040816326530612\n",
178 | "Accuracy score using max_features = 0.30000000000000004: 0.7020408163265306\n",
179 | "Accuracy score using max_features = 0.4: 0.6948979591836735\n",
180 | "Accuracy score using max_features = 0.5: 0.6969387755102041\n",
181 | "Accuracy score using max_features = 0.6: 0.6908163265306122\n",
182 | "Accuracy score using max_features = 0.7000000000000001: 0.6969387755102041\n",
183 | "Accuracy score using max_features = 0.8: 0.6989795918367347\n",
184 | "Accuracy score using max_features = 0.9: 0.6918367346938775\n",
185 | "Accuracy score using max_features = 1.0: 0.7020408163265306\n"
186 | ]
187 | }
188 | ],
189 | "source": [
190 | "for max_features in np.linspace(0.1,1,10):\n",
191 | " print('Accuracy score using max_features =', max_features, end = ': ')\n",
192 | " fit_predict(train,test,y_train,y_test,n_estimators = 160,max_features = max_features,max_depth = 18)\n"
193 | ]
194 | },
195 | {
196 | "cell_type": "code",
197 | "execution_count": 9,
198 | "metadata": {},
199 | "outputs": [
200 | {
201 | "name": "stdout",
202 | "output_type": "stream",
203 | "text": [
204 | "Accuracy score using min_samples_split = 2: 0.7040816326530612\n",
205 | "Accuracy score using min_samples_split = 3: 0.7193877551020408\n",
206 | "Accuracy score using min_samples_split = 4: 0.7040816326530612\n",
207 | "Accuracy score using min_samples_split = 5: 0.6938775510204082\n",
208 | "Accuracy score using min_samples_split = 6: 0.6938775510204082\n",
209 | "Accuracy score using min_samples_split = 7: 0.6857142857142857\n",
210 | "Accuracy score using min_samples_split = 8: 0.6806122448979591\n",
211 | "Accuracy score using min_samples_split = 9: 0.6714285714285714\n"
212 | ]
213 | }
214 | ],
215 | "source": [
216 | "for min_samples_split in range(2,10):\n",
217 | " print('Accuracy score using min_samples_split =', min_samples_split, end = ': ')\n",
218 | " fit_predict(train,test,y_train,y_test,n_estimators = 160,max_features = 0.2,min_samples_split=min_samples_split\n",
219 | " ,max_depth = 18)\n"
220 | ]
221 | },
222 | {
223 | "cell_type": "code",
224 | "execution_count": 10,
225 | "metadata": {},
226 | "outputs": [
227 | {
228 | "name": "stdout",
229 | "output_type": "stream",
230 | "text": [
231 | "tuned accuracy score: 0.7193877551020408\n",
232 | "tuned accuracy score with scaler: 0.7173469387755103\n"
233 | ]
234 | }
235 | ],
236 | "source": [
237 | "print('tuned accuracy score', end = ': ')\n",
238 | "fit_predict(train,test,y_train,y_test,n_estimators = 160,max_features = 0.2,min_samples_split=3,max_depth = 18)\n",
239 | "print('tuned accuracy score with scaler', end = ': ')\n",
240 | "\n",
241 | "fit_predict(train,test,y_train,y_test,n_estimators = 160,max_features = 0.2,min_samples_split=3,\n",
242 | " max_depth = 18,scaler=StandardScaler())"
243 | ]
244 | },
245 | {
246 | "cell_type": "code",
247 | "execution_count": 11,
248 | "metadata": {},
249 | "outputs": [
250 | {
251 | "name": "stdout",
252 | "output_type": "stream",
253 | "text": [
254 | "overall improvement is 39.88 %\n"
255 | ]
256 | }
257 | ],
258 | "source": [
259 | "original_score = 0.514285714286\n",
260 | "best_score = 0.7193877551020408\n",
261 | "improvement = np.abs(np.round(100*(original_score - best_score)/original_score,2))\n",
262 | "print('overall improvement is {} %'.format(improvement))"
263 | ]
264 | },
265 | {
266 | "cell_type": "code",
267 | "execution_count": 12,
268 | "metadata": {},
269 | "outputs": [
270 | {
271 | "name": "stdout",
272 | "output_type": "stream",
273 | "text": [
274 | "overall improvement compare to non tuned model is 11.9 %\n"
275 | ]
276 | }
277 | ],
278 | "source": [
279 | "original_score = 0.6428571428571429\n",
280 | "best_score = 0.7193877551020408\n",
281 | "improvement = np.abs(np.round(100*(original_score - best_score)/original_score,2))\n",
282 | "print('overall improvement compare to non tuned model is {} %'.format(improvement))"
283 | ]
284 | }
285 | ],
286 | "metadata": {
287 | "kernelspec": {
288 | "display_name": "Python 3",
289 | "language": "python",
290 | "name": "python3"
291 | },
292 | "language_info": {
293 | "codemirror_mode": {
294 | "name": "ipython",
295 | "version": 3
296 | },
297 | "file_extension": ".py",
298 | "mimetype": "text/x-python",
299 | "name": "python",
300 | "nbconvert_exporter": "python",
301 | "pygments_lexer": "ipython3",
302 | "version": "3.6.1"
303 | }
304 | },
305 | "nbformat": 4,
306 | "nbformat_minor": 2
307 | }
308 |
--------------------------------------------------------------------------------
/Section 3/3.5 stacking.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {
7 | "collapsed": true
8 | },
9 | "outputs": [],
10 | "source": [
11 | "import pandas as pd\n",
12 | "import numpy as np\n",
13 | "import lightgbm as lgb\n",
14 | "from catboost import CatBoostClassifier\n",
15 | "from sklearn.model_selection import train_test_split\n",
16 | "from sklearn.metrics import accuracy_score\n",
17 | "from sklearn.linear_model import LogisticRegression\n",
18 | "import warnings\n",
19 | "from sklearn import preprocessing\n",
20 | "from sklearn.neighbors import KNeighborsClassifier\n",
21 | "from sklearn.preprocessing import StandardScaler\n",
22 | "warnings.filterwarnings(\"ignore\")\n",
23 | "np.random.seed(42)"
24 | ]
25 | },
26 | {
27 | "cell_type": "code",
28 | "execution_count": 2,
29 | "metadata": {},
30 | "outputs": [
31 | {
32 | "data": {
33 | "text/html": [
34 | "\n",
35 | "\n",
48 | "
\n",
49 | " \n",
50 | " \n",
51 | " | \n",
52 | " v1 | \n",
53 | " v2 | \n",
54 | " v3 | \n",
55 | " v4 | \n",
56 | " v5 | \n",
57 | " v6 | \n",
58 | " v7 | \n",
59 | " v8 | \n",
60 | " v9 | \n",
61 | " v10 | \n",
62 | " ... | \n",
63 | " v122 | \n",
64 | " v123 | \n",
65 | " v124 | \n",
66 | " v125 | \n",
67 | " v126 | \n",
68 | " v127 | \n",
69 | " v128 | \n",
70 | " v129 | \n",
71 | " v130 | \n",
72 | " v131 | \n",
73 | "
\n",
74 | " \n",
75 | " \n",
76 | " \n",
77 | " 0 | \n",
78 | " 1.335739 | \n",
79 | " 8.727474 | \n",
80 | " C | \n",
81 | " 3.921026 | \n",
82 | " 7.915266 | \n",
83 | " 2.599278 | \n",
84 | " 3.176895 | \n",
85 | " 0.012941 | \n",
86 | " 9.999999 | \n",
87 | " 0.503281 | \n",
88 | " ... | \n",
89 | " 8.000000 | \n",
90 | " 1.989780 | \n",
91 | " 0.035754 | \n",
92 | " AU | \n",
93 | " 1.804126 | \n",
94 | " 3.113719 | \n",
95 | " 2.024285 | \n",
96 | " 0 | \n",
97 | " 0.636365 | \n",
98 | " 2.857144 | \n",
99 | "
\n",
100 | " \n",
101 | " 1 | \n",
102 | " NaN | \n",
103 | " NaN | \n",
104 | " C | \n",
105 | " NaN | \n",
106 | " 9.191265 | \n",
107 | " NaN | \n",
108 | " NaN | \n",
109 | " 2.301630 | \n",
110 | " NaN | \n",
111 | " 1.312910 | \n",
112 | " ... | \n",
113 | " NaN | \n",
114 | " NaN | \n",
115 | " 0.598896 | \n",
116 | " AF | \n",
117 | " NaN | \n",
118 | " NaN | \n",
119 | " 1.957825 | \n",
120 | " 0 | \n",
121 | " NaN | \n",
122 | " NaN | \n",
123 | "
\n",
124 | " \n",
125 | " 2 | \n",
126 | " 0.943877 | \n",
127 | " 5.310079 | \n",
128 | " C | \n",
129 | " 4.410969 | \n",
130 | " 5.326159 | \n",
131 | " 3.979592 | \n",
132 | " 3.928571 | \n",
133 | " 0.019645 | \n",
134 | " 12.666667 | \n",
135 | " 0.765864 | \n",
136 | " ... | \n",
137 | " 9.333333 | \n",
138 | " 2.477596 | \n",
139 | " 0.013452 | \n",
140 | " AE | \n",
141 | " 1.773709 | \n",
142 | " 3.922193 | \n",
143 | " 1.120468 | \n",
144 | " 2 | \n",
145 | " 0.883118 | \n",
146 | " 1.176472 | \n",
147 | "
\n",
148 | " \n",
149 | " 3 | \n",
150 | " 0.797415 | \n",
151 | " 8.304757 | \n",
152 | " C | \n",
153 | " 4.225930 | \n",
154 | " 11.627438 | \n",
155 | " 2.097700 | \n",
156 | " 1.987549 | \n",
157 | " 0.171947 | \n",
158 | " 8.965516 | \n",
159 | " 6.542669 | \n",
160 | " ... | \n",
161 | " 7.018256 | \n",
162 | " 1.812795 | \n",
163 | " 0.002267 | \n",
164 | " CJ | \n",
165 | " 1.415230 | \n",
166 | " 2.954381 | \n",
167 | " 1.990847 | \n",
168 | " 1 | \n",
169 | " 1.677108 | \n",
170 | " 1.034483 | \n",
171 | "
\n",
172 | " \n",
173 | " 4 | \n",
174 | " NaN | \n",
175 | " NaN | \n",
176 | " C | \n",
177 | " NaN | \n",
178 | " NaN | \n",
179 | " NaN | \n",
180 | " NaN | \n",
181 | " NaN | \n",
182 | " NaN | \n",
183 | " 1.050328 | \n",
184 | " ... | \n",
185 | " NaN | \n",
186 | " NaN | \n",
187 | " NaN | \n",
188 | " Z | \n",
189 | " NaN | \n",
190 | " NaN | \n",
191 | " NaN | \n",
192 | " 0 | \n",
193 | " NaN | \n",
194 | " NaN | \n",
195 | "
\n",
196 | " \n",
197 | "
\n",
198 | "
5 rows × 131 columns
\n",
199 | "
"
200 | ],
201 | "text/plain": [
202 | " v1 v2 v3 v4 v5 v6 v7 v8 \\\n",
203 | "0 1.335739 8.727474 C 3.921026 7.915266 2.599278 3.176895 0.012941 \n",
204 | "1 NaN NaN C NaN 9.191265 NaN NaN 2.301630 \n",
205 | "2 0.943877 5.310079 C 4.410969 5.326159 3.979592 3.928571 0.019645 \n",
206 | "3 0.797415 8.304757 C 4.225930 11.627438 2.097700 1.987549 0.171947 \n",
207 | "4 NaN NaN C NaN NaN NaN NaN NaN \n",
208 | "\n",
209 | " v9 v10 ... v122 v123 v124 v125 \\\n",
210 | "0 9.999999 0.503281 ... 8.000000 1.989780 0.035754 AU \n",
211 | "1 NaN 1.312910 ... NaN NaN 0.598896 AF \n",
212 | "2 12.666667 0.765864 ... 9.333333 2.477596 0.013452 AE \n",
213 | "3 8.965516 6.542669 ... 7.018256 1.812795 0.002267 CJ \n",
214 | "4 NaN 1.050328 ... NaN NaN NaN Z \n",
215 | "\n",
216 | " v126 v127 v128 v129 v130 v131 \n",
217 | "0 1.804126 3.113719 2.024285 0 0.636365 2.857144 \n",
218 | "1 NaN NaN 1.957825 0 NaN NaN \n",
219 | "2 1.773709 3.922193 1.120468 2 0.883118 1.176472 \n",
220 | "3 1.415230 2.954381 1.990847 1 1.677108 1.034483 \n",
221 | "4 NaN NaN NaN 0 NaN NaN \n",
222 | "\n",
223 | "[5 rows x 131 columns]"
224 | ]
225 | },
226 | "execution_count": 2,
227 | "metadata": {},
228 | "output_type": "execute_result"
229 | }
230 | ],
231 | "source": [
232 | "df = pd.read_csv('train.csv')\n",
233 | "y = df.target\n",
234 | "\n",
235 | "df.drop(['ID', 'target'], axis=1, inplace=True)\n",
236 | "df.head()"
237 | ]
238 | },
239 | {
240 | "cell_type": "code",
241 | "execution_count": 3,
242 | "metadata": {
243 | "collapsed": true
244 | },
245 | "outputs": [],
246 | "source": [
247 | "string_type = []\n",
248 | "for column in df.columns:\n",
249 | " if type(df[column].values[0]) == str:\n",
250 | " string_type.append(column)\n",
251 | "string_type.append('v113')\n",
252 | " \n",
253 | "df[string_type] = df[string_type].fillna('zero')\n",
254 | "\n",
255 | "df.fillna(-9999, inplace=True)\n",
256 | "cat_features_ids = np.where(df.apply(pd.Series.nunique) < 10000)[0].tolist()"
257 | ]
258 | },
259 | {
260 | "cell_type": "code",
261 | "execution_count": 4,
262 | "metadata": {},
263 | "outputs": [
264 | {
265 | "data": {
266 | "text/html": [
267 | "\n",
268 | "\n",
281 | "
\n",
282 | " \n",
283 | " \n",
284 | " | \n",
285 | " v1 | \n",
286 | " v2 | \n",
287 | " v3 | \n",
288 | " v4 | \n",
289 | " v5 | \n",
290 | " v6 | \n",
291 | " v7 | \n",
292 | " v8 | \n",
293 | " v9 | \n",
294 | " v10 | \n",
295 | " ... | \n",
296 | " v122 | \n",
297 | " v123 | \n",
298 | " v124 | \n",
299 | " v125 | \n",
300 | " v126 | \n",
301 | " v127 | \n",
302 | " v128 | \n",
303 | " v129 | \n",
304 | " v130 | \n",
305 | " v131 | \n",
306 | "
\n",
307 | " \n",
308 | " \n",
309 | " \n",
310 | " 0 | \n",
311 | " 1.335739 | \n",
312 | " 8.727474 | \n",
313 | " C | \n",
314 | " 3.921026 | \n",
315 | " 7.915266 | \n",
316 | " 2.599278 | \n",
317 | " 3.176895 | \n",
318 | " 0.012941 | \n",
319 | " 9.999999 | \n",
320 | " 0.503281 | \n",
321 | " ... | \n",
322 | " 8.000000 | \n",
323 | " 1.989780 | \n",
324 | " 0.035754 | \n",
325 | " AU | \n",
326 | " 1.804126 | \n",
327 | " 3.113719 | \n",
328 | " 2.024285 | \n",
329 | " 0 | \n",
330 | " 0.636365 | \n",
331 | " 2.857144 | \n",
332 | "
\n",
333 | " \n",
334 | " 1 | \n",
335 | " -9999.000000 | \n",
336 | " -9999.000000 | \n",
337 | " C | \n",
338 | " -9999.000000 | \n",
339 | " 9.191265 | \n",
340 | " -9999.000000 | \n",
341 | " -9999.000000 | \n",
342 | " 2.301630 | \n",
343 | " -9999.000000 | \n",
344 | " 1.312910 | \n",
345 | " ... | \n",
346 | " -9999.000000 | \n",
347 | " -9999.000000 | \n",
348 | " 0.598896 | \n",
349 | " AF | \n",
350 | " -9999.000000 | \n",
351 | " -9999.000000 | \n",
352 | " 1.957825 | \n",
353 | " 0 | \n",
354 | " -9999.000000 | \n",
355 | " -9999.000000 | \n",
356 | "
\n",
357 | " \n",
358 | " 2 | \n",
359 | " 0.943877 | \n",
360 | " 5.310079 | \n",
361 | " C | \n",
362 | " 4.410969 | \n",
363 | " 5.326159 | \n",
364 | " 3.979592 | \n",
365 | " 3.928571 | \n",
366 | " 0.019645 | \n",
367 | " 12.666667 | \n",
368 | " 0.765864 | \n",
369 | " ... | \n",
370 | " 9.333333 | \n",
371 | " 2.477596 | \n",
372 | " 0.013452 | \n",
373 | " AE | \n",
374 | " 1.773709 | \n",
375 | " 3.922193 | \n",
376 | " 1.120468 | \n",
377 | " 2 | \n",
378 | " 0.883118 | \n",
379 | " 1.176472 | \n",
380 | "
\n",
381 | " \n",
382 | " 3 | \n",
383 | " 0.797415 | \n",
384 | " 8.304757 | \n",
385 | " C | \n",
386 | " 4.225930 | \n",
387 | " 11.627438 | \n",
388 | " 2.097700 | \n",
389 | " 1.987549 | \n",
390 | " 0.171947 | \n",
391 | " 8.965516 | \n",
392 | " 6.542669 | \n",
393 | " ... | \n",
394 | " 7.018256 | \n",
395 | " 1.812795 | \n",
396 | " 0.002267 | \n",
397 | " CJ | \n",
398 | " 1.415230 | \n",
399 | " 2.954381 | \n",
400 | " 1.990847 | \n",
401 | " 1 | \n",
402 | " 1.677108 | \n",
403 | " 1.034483 | \n",
404 | "
\n",
405 | " \n",
406 | " 4 | \n",
407 | " -9999.000000 | \n",
408 | " -9999.000000 | \n",
409 | " C | \n",
410 | " -9999.000000 | \n",
411 | " -9999.000000 | \n",
412 | " -9999.000000 | \n",
413 | " -9999.000000 | \n",
414 | " -9999.000000 | \n",
415 | " -9999.000000 | \n",
416 | " 1.050328 | \n",
417 | " ... | \n",
418 | " -9999.000000 | \n",
419 | " -9999.000000 | \n",
420 | " -9999.000000 | \n",
421 | " Z | \n",
422 | " -9999.000000 | \n",
423 | " -9999.000000 | \n",
424 | " -9999.000000 | \n",
425 | " 0 | \n",
426 | " -9999.000000 | \n",
427 | " -9999.000000 | \n",
428 | "
\n",
429 | " \n",
430 | "
\n",
431 | "
5 rows × 131 columns
\n",
432 | "
"
433 | ],
434 | "text/plain": [
435 | " v1 v2 v3 v4 v5 v6 \\\n",
436 | "0 1.335739 8.727474 C 3.921026 7.915266 2.599278 \n",
437 | "1 -9999.000000 -9999.000000 C -9999.000000 9.191265 -9999.000000 \n",
438 | "2 0.943877 5.310079 C 4.410969 5.326159 3.979592 \n",
439 | "3 0.797415 8.304757 C 4.225930 11.627438 2.097700 \n",
440 | "4 -9999.000000 -9999.000000 C -9999.000000 -9999.000000 -9999.000000 \n",
441 | "\n",
442 | " v7 v8 v9 v10 ... v122 \\\n",
443 | "0 3.176895 0.012941 9.999999 0.503281 ... 8.000000 \n",
444 | "1 -9999.000000 2.301630 -9999.000000 1.312910 ... -9999.000000 \n",
445 | "2 3.928571 0.019645 12.666667 0.765864 ... 9.333333 \n",
446 | "3 1.987549 0.171947 8.965516 6.542669 ... 7.018256 \n",
447 | "4 -9999.000000 -9999.000000 -9999.000000 1.050328 ... -9999.000000 \n",
448 | "\n",
449 | " v123 v124 v125 v126 v127 v128 \\\n",
450 | "0 1.989780 0.035754 AU 1.804126 3.113719 2.024285 \n",
451 | "1 -9999.000000 0.598896 AF -9999.000000 -9999.000000 1.957825 \n",
452 | "2 2.477596 0.013452 AE 1.773709 3.922193 1.120468 \n",
453 | "3 1.812795 0.002267 CJ 1.415230 2.954381 1.990847 \n",
454 | "4 -9999.000000 -9999.000000 Z -9999.000000 -9999.000000 -9999.000000 \n",
455 | "\n",
456 | " v129 v130 v131 \n",
457 | "0 0 0.636365 2.857144 \n",
458 | "1 0 -9999.000000 -9999.000000 \n",
459 | "2 2 0.883118 1.176472 \n",
460 | "3 1 1.677108 1.034483 \n",
461 | "4 0 -9999.000000 -9999.000000 \n",
462 | "\n",
463 | "[5 rows x 131 columns]"
464 | ]
465 | },
466 | "execution_count": 4,
467 | "metadata": {},
468 | "output_type": "execute_result"
469 | }
470 | ],
471 | "source": [
472 | "df.head()"
473 | ]
474 | },
475 | {
476 | "cell_type": "code",
477 | "execution_count": 5,
478 | "metadata": {
479 | "collapsed": true
480 | },
481 | "outputs": [],
482 | "source": [
483 | "le = preprocessing.LabelEncoder()\n",
484 | "for column in string_type:\n",
485 | " df[column] = le.fit_transform(df[column])"
486 | ]
487 | },
488 | {
489 | "cell_type": "code",
490 | "execution_count": 6,
491 | "metadata": {
492 | "collapsed": true
493 | },
494 | "outputs": [],
495 | "source": [
496 | "train, test, y_train, y_test = train_test_split(df, y, test_size = 0.2)"
497 | ]
498 | },
499 | {
500 | "cell_type": "code",
501 | "execution_count": 7,
502 | "metadata": {},
503 | "outputs": [
504 | {
505 | "name": "stdout",
506 | "output_type": "stream",
507 | "text": [
508 | "Accuracy score baseline: 0.7616881696916685\n"
509 | ]
510 | }
511 | ],
512 | "source": [
513 | "lr = LogisticRegression()\n",
514 | "lr.fit(train, y_train)\n",
515 | "y_pred = lr.predict(test)\n",
516 | "print('Accuracy score baseline:', accuracy_score(y_test, y_pred))"
517 | ]
518 | },
519 | {
520 | "cell_type": "code",
521 | "execution_count": 8,
522 | "metadata": {
523 | "collapsed": true
524 | },
525 | "outputs": [],
526 | "source": [
527 | "folds = pd.DataFrame(list(range(len(train))))\n",
528 | "folds[0] = folds.values % 3\n",
529 | "folds.rename(columns={0:'fold'},inplace=True)"
530 | ]
531 | },
532 | {
533 | "cell_type": "markdown",
534 | "metadata": {},
535 | "source": [
536 | "## Catboost"
537 | ]
538 | },
539 | {
540 | "cell_type": "code",
541 | "execution_count": 9,
542 | "metadata": {
543 | "collapsed": true
544 | },
545 | "outputs": [],
546 | "source": [
547 | "dict_OOF_predict_cbst = {}\n",
548 | "dict_test_predict_cbst = {}\n",
549 | "for fold in [0,1,2]:\n",
550 | "\n",
551 | " \n",
552 | " clf = CatBoostClassifier(learning_rate=0.1, iterations=500, random_seed=42, logging_level='Silent')\n",
553 | " clf.fit(train.values[folds.fold != fold], y_train.values[folds.fold != fold], cat_features=cat_features_ids)\n",
554 | " \n",
555 | " predicts = clf.predict_proba(test)[:,1]\n",
556 | " dict_test_predict_cbst[fold] = predicts\n",
557 | "\n",
558 | " predicts_OOF = clf.predict_proba(train.values[folds.fold == fold])[:,1]\n",
559 | " dict_OOF_predict_cbst[fold] = predicts_OOF\n",
560 | " del clf\n",
561 | " \n"
562 | ]
563 | },
564 | {
565 | "cell_type": "code",
566 | "execution_count": 20,
567 | "metadata": {
568 | "collapsed": true
569 | },
570 | "outputs": [],
571 | "source": [
572 | " clf = CatBoostClassifier(learning_rate=0.1, iterations=500, random_seed=42, logging_level='Silent')\n",
573 | " clf.fit()"
574 | ]
575 | },
576 | {
577 | "cell_type": "code",
578 | "execution_count": 10,
579 | "metadata": {},
580 | "outputs": [
581 | {
582 | "name": "stdout",
583 | "output_type": "stream",
584 | "text": [
585 | "OOF Catboost accuracy 0.7817748425472358\n"
586 | ]
587 | }
588 | ],
589 | "source": [
590 | "OOF_X_cbst = np.zeros_like(y_train)\n",
591 | "OOF_X_cbst = pd.DataFrame(OOF_X_cbst)\n",
592 | "for fold in dict_OOF_predict_cbst.keys():\n",
593 | " OOF_X_cbst[folds.fold == fold] = dict_OOF_predict_cbst[fold].reshape((dict_OOF_predict_cbst[fold].shape[0],1))\n",
594 | "print('OOF Catboost accuracy',accuracy_score(y_train,np.round(OOF_X_cbst)))"
595 | ]
596 | },
597 | {
598 | "cell_type": "code",
599 | "execution_count": 11,
600 | "metadata": {},
601 | "outputs": [
602 | {
603 | "name": "stdout",
604 | "output_type": "stream",
605 | "text": [
606 | "test Catboost accuracy 0.7865296304395364\n"
607 | ]
608 | }
609 | ],
610 | "source": [
611 | "f_0,f_1,f_2 = dict_test_predict_cbst.keys()\n",
612 | "sub_cbst = dict_test_predict_cbst[f_0] + dict_test_predict_cbst[f_1] + dict_test_predict_cbst[f_2]\n",
613 | "sub_cbst/=3\n",
614 | "print('test Catboost accuracy',accuracy_score(y_test,np.round(sub_cbst)))"
615 | ]
616 | },
617 | {
618 | "cell_type": "markdown",
619 | "metadata": {},
620 | "source": [
621 | "## KNN 2"
622 | ]
623 | },
624 | {
625 | "cell_type": "code",
626 | "execution_count": 12,
627 | "metadata": {},
628 | "outputs": [
629 | {
630 | "name": "stdout",
631 | "output_type": "stream",
632 | "text": [
633 | "OOF kNN 2 accuracy 0.6638820853743876\n",
634 | "test kNN 2 accuracy 0.7161163350098404\n"
635 | ]
636 | }
637 | ],
638 | "source": [
639 | "\n",
640 | "dict_OOF_predict_knn_2 = {}\n",
641 | "dict_test_predict_knn_2 = {}\n",
642 | "for fold in [0,1,2]:\n",
643 | "\n",
644 | " \n",
645 | " clf = KNeighborsClassifier(n_neighbors = 2,weights='distance',n_jobs=32)\n",
646 | " clf.fit(train.values[folds.fold != fold], y_train.values[folds.fold != fold])\n",
647 | " \n",
648 | " predicts = clf.predict_proba(test)[:,1]\n",
649 | " dict_test_predict_knn_2[fold] = predicts\n",
650 | "\n",
651 | " predicts_OOF = clf.predict_proba(train.values[folds.fold == fold])[:,1]\n",
652 | " dict_OOF_predict_knn_2[fold] = predicts_OOF\n",
653 | " del clf\n",
654 | " \n",
655 | "OOF_X_knn_2 = np.zeros_like(y_train)\n",
656 | "OOF_X_knn_2 = pd.DataFrame(OOF_X_knn_2)\n",
657 | "for fold in dict_OOF_predict_knn_2.keys():\n",
658 | " OOF_X_knn_2[folds.fold == fold] = dict_OOF_predict_knn_2[fold].reshape((dict_OOF_predict_knn_2[fold].shape[0],1))\n",
659 | "print('OOF kNN 2 accuracy',accuracy_score(y_train,np.round(OOF_X_knn_2)))\n",
660 | "\n",
661 | "f_0,f_1,f_2 = dict_test_predict_knn_2.keys()\n",
662 | "sub_knn_2 = dict_test_predict_knn_2[f_0] + dict_test_predict_knn_2[f_1] + dict_test_predict_knn_2[f_2]\n",
663 | "sub_knn_2/=3\n",
664 | "print('test kNN 2 accuracy',accuracy_score(y_test,np.round(sub_knn_2)))"
665 | ]
666 | },
667 | {
668 | "cell_type": "markdown",
669 | "metadata": {},
670 | "source": [
671 | "## KNN 4"
672 | ]
673 | },
674 | {
675 | "cell_type": "code",
676 | "execution_count": 13,
677 | "metadata": {},
678 | "outputs": [
679 | {
680 | "name": "stdout",
681 | "output_type": "stream",
682 | "text": [
683 | "OOF kNN 4 accuracy 0.7044808432470259\n",
684 | "test kNN 4 accuracy 0.7345287557402143\n"
685 | ]
686 | }
687 | ],
688 | "source": [
689 | "\n",
690 | "dict_OOF_predict_knn_4 = {}\n",
691 | "dict_test_predict_knn_4 = {}\n",
692 | "for fold in [0,1,2]:\n",
693 | "\n",
694 | " \n",
695 | " clf = KNeighborsClassifier(n_neighbors= 4,weights='distance',n_jobs=32)\n",
696 | " clf.fit(train.values[folds.fold != fold], y_train.values[folds.fold != fold])\n",
697 | " \n",
698 | " predicts = clf.predict_proba(test)[:,1]\n",
699 | " dict_test_predict_knn_4[fold] = predicts\n",
700 | "\n",
701 | " predicts_OOF = clf.predict_proba(train.values[folds.fold == fold])[:,1]\n",
702 | " dict_OOF_predict_knn_4[fold] = predicts_OOF\n",
703 | " del clf\n",
704 | " \n",
705 | "OOF_X_knn_4 = np.zeros_like(y_train)\n",
706 | "OOF_X_knn_4 = pd.DataFrame(OOF_X_knn_4)\n",
707 | "for fold in dict_OOF_predict_knn_4.keys():\n",
708 | " OOF_X_knn_4[folds.fold == fold] = dict_OOF_predict_knn_4[fold].reshape((dict_OOF_predict_knn_4[fold].shape[0],1))\n",
709 | "print('OOF kNN 4 accuracy',accuracy_score(y_train,np.round(OOF_X_knn_4)))\n",
710 | "\n",
711 | "f_0,f_1,f_2 = dict_test_predict_knn_4.keys()\n",
712 | "sub_knn_4 = dict_test_predict_knn_4[f_0] + dict_test_predict_knn_4[f_1] + dict_test_predict_knn_4[f_2]\n",
713 | "sub_knn_4/=3\n",
714 | "print('test kNN 4 accuracy',accuracy_score(y_test,np.round(sub_knn_4)))"
715 | ]
716 | },
717 | {
718 | "cell_type": "markdown",
719 | "metadata": {},
720 | "source": [
721 | "## KNN 8"
722 | ]
723 | },
724 | {
725 | "cell_type": "code",
726 | "execution_count": 14,
727 | "metadata": {},
728 | "outputs": [
729 | {
730 | "name": "stdout",
731 | "output_type": "stream",
732 | "text": [
733 | "OOF kNN 8 accuracy 0.7372397655703289\n",
734 | "test kNN 8 accuracy 0.7567461185217581\n"
735 | ]
736 | }
737 | ],
738 | "source": [
739 | "\n",
740 | "dict_OOF_predict_knn_8 = {}\n",
741 | "dict_test_predict_knn_8 = {}\n",
742 | "for fold in [0,1,2]:\n",
743 | "\n",
744 | " \n",
745 | " clf = KNeighborsClassifier(n_neighbors = 8,weights='distance',n_jobs=32)\n",
746 | " clf.fit(train.values[folds.fold != fold], y_train.values[folds.fold != fold])\n",
747 | " \n",
748 | " predicts = clf.predict_proba(test)[:,1]\n",
749 | " dict_test_predict_knn_8[fold] = predicts\n",
750 | "\n",
751 | " predicts_OOF = clf.predict_proba(train.values[folds.fold == fold])[:,1]\n",
752 | " dict_OOF_predict_knn_8[fold] = predicts_OOF\n",
753 | " del clf\n",
754 | " \n",
755 | "OOF_X_knn_8 = np.zeros_like(y_train)\n",
756 | "OOF_X_knn_8 = pd.DataFrame(OOF_X_knn_8)\n",
757 | "for fold in dict_OOF_predict_knn_8.keys():\n",
758 | " OOF_X_knn_8[folds.fold == fold] = dict_OOF_predict_knn_8[fold].reshape((dict_OOF_predict_knn_8[fold].shape[0],1))\n",
759 | "print('OOF kNN 8 accuracy',accuracy_score(y_train,np.round(OOF_X_knn_8)))\n",
760 | "\n",
761 | "f_0,f_1,f_2 = dict_test_predict_knn_8.keys()\n",
762 | "sub_knn_8 = dict_test_predict_knn_8[f_0] + dict_test_predict_knn_8[f_1] + dict_test_predict_knn_8[f_2]\n",
763 | "sub_knn_8/=3\n",
764 | "print('test kNN 8 accuracy',accuracy_score(y_test,np.round(sub_knn_8)))"
765 | ]
766 | },
767 | {
768 | "cell_type": "markdown",
769 | "metadata": {},
770 | "source": [
771 | "## KNN 16"
772 | ]
773 | },
774 | {
775 | "cell_type": "code",
776 | "execution_count": 15,
777 | "metadata": {},
778 | "outputs": [
779 | {
780 | "name": "stdout",
781 | "output_type": "stream",
782 | "text": [
783 | "OOF kNN 16 accuracy 0.75659333449965\n",
784 | "test kNN 16 accuracy 0.7642685326918872\n"
785 | ]
786 | }
787 | ],
788 | "source": [
789 | "\n",
790 | "dict_OOF_predict_knn_16 = {}\n",
791 | "dict_test_predict_knn_16 = {}\n",
792 | "for fold in [0,1,2]:\n",
793 | "\n",
794 | " \n",
795 | " clf = KNeighborsClassifier(n_neighbors = 16,weights='distance',n_jobs=32)\n",
796 | " clf.fit(train.values[folds.fold != fold], y_train.values[folds.fold != fold])\n",
797 | " \n",
798 | " predicts = clf.predict_proba(test)[:,1]\n",
799 | " dict_test_predict_knn_16[fold] = predicts\n",
800 | "\n",
801 | " predicts_OOF = clf.predict_proba(train.values[folds.fold == fold])[:,1]\n",
802 | " dict_OOF_predict_knn_16[fold] = predicts_OOF\n",
803 | " del clf\n",
804 | " \n",
805 | "OOF_X_knn_16 = np.zeros_like(y_train)\n",
806 | "OOF_X_knn_16 = pd.DataFrame(OOF_X_knn_16)\n",
807 | "for fold in dict_OOF_predict_knn_16.keys():\n",
808 | " OOF_X_knn_16[folds.fold == fold] = dict_OOF_predict_knn_16[fold].reshape((dict_OOF_predict_knn_16[fold].shape[0],1))\n",
809 | "print('OOF kNN 16 accuracy',accuracy_score(y_train,np.round(OOF_X_knn_16)))\n",
810 | "\n",
811 | "\n",
812 | "f_0,f_1,f_2 = dict_test_predict_knn_16.keys()\n",
813 | "sub_knn_16 = dict_test_predict_knn_16[f_0] + dict_test_predict_knn_16[f_1] + dict_test_predict_knn_16[f_2]\n",
814 | "sub_knn_16/=3\n",
815 | "print('test kNN 16 accuracy',accuracy_score(y_test,np.round(sub_knn_16)))"
816 | ]
817 | },
818 | {
819 | "cell_type": "markdown",
820 | "metadata": {},
821 | "source": [
822 | "## Stacking"
823 | ]
824 | },
825 | {
826 | "cell_type": "code",
827 | "execution_count": 16,
828 | "metadata": {
829 | "collapsed": true
830 | },
831 | "outputs": [],
832 | "source": [
833 | "stacked_data_set_train = pd.DataFrame(OOF_X_cbst)\n",
834 | "stacked_data_set_test = pd.DataFrame(sub_cbst)\n",
835 | "\n",
836 | "stacked_data_set_train[1] = OOF_X_knn_2\n",
837 | "stacked_data_set_test[1] = sub_knn_2\n",
838 | "\n",
839 | "stacked_data_set_train[2] = OOF_X_knn_4\n",
840 | "stacked_data_set_test[2] = sub_knn_4 \n",
841 | "\n",
842 | "stacked_data_set_train[3] = OOF_X_knn_8\n",
843 | "stacked_data_set_test[3] = sub_knn_8\n",
844 | "\n",
845 | "stacked_data_set_train[4] = OOF_X_knn_16\n",
846 | "stacked_data_set_test[4] = sub_knn_16 \n"
847 | ]
848 | },
849 | {
850 | "cell_type": "code",
851 | "execution_count": 17,
852 | "metadata": {},
853 | "outputs": [
854 | {
855 | "name": "stdout",
856 | "output_type": "stream",
857 | "text": [
858 | "Accuracy score baseline: 0.787797944456593\n"
859 | ]
860 | }
861 | ],
862 | "source": [
863 | "lr = LogisticRegression()\n",
864 | "lr.fit(stacked_data_set_train, y_train)\n",
865 | "y_pred = lr.predict(stacked_data_set_test)\n",
866 | "print('Accuracy score baseline:', accuracy_score(y_test, y_pred))"
867 | ]
868 | },
869 | {
870 | "cell_type": "code",
871 | "execution_count": 18,
872 | "metadata": {},
873 | "outputs": [
874 | {
875 | "name": "stdout",
876 | "output_type": "stream",
877 | "text": [
878 | "overall improvement with stacking is 0.16 %\n"
879 | ]
880 | }
881 | ],
882 | "source": [
883 | "original_score = 0.7865296304395364\n",
884 | "best_score = 0.787797944456593\n",
885 | "improvement = np.abs(np.round(100*(original_score - best_score)/original_score,2))\n",
886 | "print('overall improvement with stacking is {} %'.format(improvement))"
887 | ]
888 | },
889 | {
890 | "cell_type": "code",
891 | "execution_count": 19,
892 | "metadata": {},
893 | "outputs": [
894 | {
895 | "name": "stdout",
896 | "output_type": "stream",
897 | "text": [
898 | "additional value is 182.9136 samples\n"
899 | ]
900 | }
901 | ],
902 | "source": [
903 | "print('additional value is {} samples'.format(df.shape[0] * 0.16 / 100))"
904 | ]
905 | }
906 | ],
907 | "metadata": {
908 | "kernelspec": {
909 | "display_name": "Python 3",
910 | "language": "python",
911 | "name": "python3"
912 | },
913 | "language_info": {
914 | "codemirror_mode": {
915 | "name": "ipython",
916 | "version": 3
917 | },
918 | "file_extension": ".py",
919 | "mimetype": "text/x-python",
920 | "name": "python",
921 | "nbconvert_exporter": "python",
922 | "pygments_lexer": "ipython3",
923 | "version": "3.6.1"
924 | }
925 | },
926 | "nbformat": 4,
927 | "nbformat_minor": 2
928 | }
929 |
--------------------------------------------------------------------------------
/Section 4/4_1_Memory based collaborative filtering.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {
7 | "collapsed": true
8 | },
9 | "outputs": [],
10 | "source": [
11 | "import pandas as pd\n",
12 | "import numpy as np\n",
13 | "import scipy\n",
14 | "import json\n",
15 | "import os\n",
16 | "\n",
17 | "from tqdm import tqdm_notebook\n",
18 | "from sklearn.model_selection import train_test_split\n",
19 | "from sklearn.metrics.pairwise import pairwise_distances"
20 | ]
21 | },
22 | {
23 | "cell_type": "code",
24 | "execution_count": 2,
25 | "metadata": {
26 | "collapsed": true
27 | },
28 | "outputs": [],
29 | "source": [
30 | "def load_df(): #loading dataframe\n",
31 | " df = pd.DataFrame()\n",
32 | " with open('./mpd.slice.0-999.json') as data_file:\n",
33 | " data_string = data_file.read()\n",
34 | " try:\n",
35 | " data = json.loads(data_string)\n",
36 | " except ValueError:\n",
37 | " print('Failed:')\n",
38 | " print(repr(data_string))\n",
39 | " df = pd.concat([df, pd.DataFrame(data['playlists'])], ignore_index=True)\n",
40 | " return df"
41 | ]
42 | },
43 | {
44 | "cell_type": "code",
45 | "execution_count": 3,
46 | "metadata": {},
47 | "outputs": [
48 | {
49 | "data": {
50 | "application/vnd.jupyter.widget-view+json": {
51 | "model_id": "143c4d7efa9445b99273421fef710c52"
52 | }
53 | },
54 | "metadata": {},
55 | "output_type": "display_data"
56 | },
57 | {
58 | "name": "stdout",
59 | "output_type": "stream",
60 | "text": [
61 | "\n"
62 | ]
63 | }
64 | ],
65 | "source": [
66 | "df = load_df()\n",
67 | "df.drop(['description','name', 'pid', 'num_albums','num_artists', \n",
68 | " 'num_edits', 'num_followers', 'num_tracks', 'collaborative'], axis = 1, inplace = True) #dropping columns \n",
69 | " #that we are not going\n",
70 | " #to use\n",
71 | "\n",
72 | "artist_list = []\n",
73 | "vocab_artist = set()\n",
74 | "\n",
75 | "for row in tqdm_notebook(df.iterrows()): #iterating through df to get sequence of artists name \n",
76 | " #that are contained in playlist\n",
77 | " artists = [x['artist_name'] for x in row[1]['tracks']] #getting artists from playlist(json type)\n",
78 | " for x in row[1]['tracks']:\n",
79 | " vocab_artist.add(x['artist_name']) #creating set with unique artists name\n",
80 | " artist_list.append(artists) \n",
81 | "\n",
82 | "df['artist_list'] = artist_list \n",
83 | "\n",
84 | "w2x_artist = {artist:i for i, artist in enumerate(vocab_artist)} #artist name to index\n",
85 | "x2w_artist = {i:artist for i, artist in enumerate(vocab_artist)} #index to artist name\n",
86 | "\n",
87 | "df['artist_idx'] = df['artist_list'].apply(lambda x: [w2x_artist[a] for a in x]) #converting sequence of artist name \n",
88 | " #to sequence of artists idx\n",
89 | "\n",
90 | "\n",
91 | "df['train_seq_artist'] = df['artist_idx'].apply(lambda x: x[:-3]) #creating train sequence\n",
92 | "df['target_val_artist'] = df['artist_idx'].apply(lambda x: x[-3:]) #creating validation sequence"
93 | ]
94 | },
95 | {
96 | "cell_type": "code",
97 | "execution_count": 4,
98 | "metadata": {},
99 | "outputs": [
100 | {
101 | "data": {
102 | "application/vnd.jupyter.widget-view+json": {
103 | "model_id": "5312415c36e94a17aa363a0e0ace3b57"
104 | }
105 | },
106 | "metadata": {},
107 | "output_type": "display_data"
108 | },
109 | {
110 | "name": "stdout",
111 | "output_type": "stream",
112 | "text": [
113 | "\n"
114 | ]
115 | }
116 | ],
117 | "source": [
118 | "inds = df['train_seq_artist']\n",
119 | "playlist_artist = scipy.sparse.lil_matrix((df.shape[0], len(vocab_artist)), dtype=np.int8) #binary sparse artist-\n",
120 | " #-playlist matrix\n",
121 | "for i, row in tqdm_notebook(enumerate(inds)):\n",
122 | " playlist_artist[i, row] = 1 "
123 | ]
124 | },
125 | {
126 | "cell_type": "code",
127 | "execution_count": 5,
128 | "metadata": {},
129 | "outputs": [
130 | {
131 | "name": "stdout",
132 | "output_type": "stream",
133 | "text": [
134 | "Hit rate using most popular benchmark: 0.037\n",
135 | "Precision using most popular benchmark: 0.013\n"
136 | ]
137 | }
138 | ],
139 | "source": [
140 | "precision = []\n",
141 | "hr = []\n",
142 | "sum_artists = np.asarray(np.sum(playlist_artist, axis = 0)).reshape((9722, ))\n",
143 | "preds = np.argsort(sum_artists)[-3:]\n",
144 | "y_true = df['target_val_artist']\n",
145 | "for y in y_true:\n",
146 | " score = len(set(preds) & set(y))\n",
147 | " precision.append(score/3)\n",
148 | " hr.append(int(score > 0))\n",
149 | "print('Hit rate using most popular benchmark:', np.mean(hr))\n",
150 | "print('Precision using most popular benchmark:', np.mean(precision))"
151 | ]
152 | },
153 | {
154 | "cell_type": "code",
155 | "execution_count": 6,
156 | "metadata": {},
157 | "outputs": [
158 | {
159 | "data": {
160 | "application/vnd.jupyter.widget-view+json": {
161 | "model_id": "6f57dad00099406ab8ca4d1cc45ac0b9"
162 | }
163 | },
164 | "metadata": {},
165 | "output_type": "display_data"
166 | },
167 | {
168 | "name": "stdout",
169 | "output_type": "stream",
170 | "text": [
171 | "\n",
172 | "Hit rate using most popular benchmark: 0.013\n",
173 | "Precision using most popular benchmark: 0.004333333333333333\n"
174 | ]
175 | }
176 | ],
177 | "source": [
178 | "precision = []\n",
179 | "hr = []\n",
180 | "playlist_distances = pairwise_distances(playlist_artist, metric='cosine', n_jobs = 32)\n",
181 | "i = 0\n",
182 | "for row, y in tqdm_notebook(zip(playlist_artist, y_true)):\n",
183 | " distances = playlist_distances[i, :]\n",
184 | " most_similar = np.argsort(distances)[-3:].tolist()\n",
185 | " preds = []\n",
186 | "\n",
187 | " if i in most_similar:\n",
188 | " most_similar.remove(i)\n",
189 | " \n",
190 | " for user in most_similar:\n",
191 | " pred = df.loc[user, 'train_seq_artist']\n",
192 | " preds += pred\n",
193 | " \n",
194 | " preds = np.asarray(np.unique(preds))\n",
195 | " preds_ind = np.argsort(sum_artists[preds])[-3:]\n",
196 | " y_pred = preds[preds_ind]\n",
197 | " score = len(set(y_pred) & set(y))\n",
198 | " precision.append(score/3)\n",
199 | " hr.append(int(score > 0))\n",
200 | " i+=1\n",
201 | "print('Hit rate using most popular benchmark:', np.mean(hr))\n",
202 | "print('Precision using most popular benchmark:', np.mean(precision))"
203 | ]
204 | }
205 | ],
206 | "metadata": {
207 | "kernelspec": {
208 | "display_name": "Python 3",
209 | "language": "python",
210 | "name": "python3"
211 | },
212 | "language_info": {
213 | "codemirror_mode": {
214 | "name": "ipython",
215 | "version": 3
216 | },
217 | "file_extension": ".py",
218 | "mimetype": "text/x-python",
219 | "name": "python",
220 | "nbconvert_exporter": "python",
221 | "pygments_lexer": "ipython3",
222 | "version": "3.6.1"
223 | }
224 | },
225 | "nbformat": 4,
226 | "nbformat_minor": 2
227 | }
228 |
--------------------------------------------------------------------------------
/Section 4/4_2_Item to item recommendation with kNN.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {
7 | "collapsed": true
8 | },
9 | "outputs": [],
10 | "source": [
11 | "import pandas as pd\n",
12 | "import numpy as np\n",
13 | "import scipy\n",
14 | "import json\n",
15 | "import os\n",
16 | "\n",
17 | "from tqdm import tqdm_notebook\n",
18 | "from sklearn.model_selection import train_test_split"
19 | ]
20 | },
21 | {
22 | "cell_type": "code",
23 | "execution_count": 2,
24 | "metadata": {
25 | "collapsed": true
26 | },
27 | "outputs": [],
28 | "source": [
29 | "def load_df():\n",
30 | " df = pd.DataFrame()\n",
31 | " with open('./mpd.slice.0-999.json') as data_file:\n",
32 | " data_string = data_file.read()\n",
33 | " try:\n",
34 | " data = json.loads(data_string)\n",
35 | " except ValueError:\n",
36 | " print('Failed:')\n",
37 | " print(repr(data_string))\n",
38 | " df = pd.concat([df, pd.DataFrame(data['playlists'])], ignore_index=True)\n",
39 | " return df"
40 | ]
41 | },
42 | {
43 | "cell_type": "code",
44 | "execution_count": 3,
45 | "metadata": {},
46 | "outputs": [
47 | {
48 | "data": {
49 | "application/vnd.jupyter.widget-view+json": {
50 | "model_id": "6ed0a8815ac24965a293030c7042189d"
51 | }
52 | },
53 | "metadata": {},
54 | "output_type": "display_data"
55 | },
56 | {
57 | "name": "stdout",
58 | "output_type": "stream",
59 | "text": [
60 | "\n"
61 | ]
62 | }
63 | ],
64 | "source": [
65 | "df = load_df()\n",
66 | "df.drop(['description','name', 'pid', 'num_albums','num_artists', \n",
67 | " 'num_edits', 'num_followers', 'num_tracks', 'collaborative'], axis = 1, inplace = True) #dropping columns \n",
68 | " #that we are not going\n",
69 | " #to use\n",
70 | "\n",
71 | "artist_list = []\n",
72 | "vocab_artist = set()\n",
73 | "\n",
74 | "for row in tqdm_notebook(df.iterrows()): #iterating through df to get sequence of artists name \n",
75 | " #that are contained in playlist\n",
76 | " artists = [x['artist_name'] for x in row[1]['tracks']] #getting artists from playlist(json type)\n",
77 | " for x in row[1]['tracks']:\n",
78 | " vocab_artist.add(x['artist_name']) #creating set with unique artists name\n",
79 | " artist_list.append(artists) \n",
80 | "\n",
81 | "df['artist_list'] = artist_list \n",
82 | "\n",
83 | "w2x_artist = {artist:i for i, artist in enumerate(vocab_artist)} #artist name to index\n",
84 | "x2w_artist = {i:artist for i, artist in enumerate(vocab_artist)} #index to artist name\n",
85 | "\n",
86 | "df['artist_idx'] = df['artist_list'].apply(lambda x: [w2x_artist[a] for a in x]) #converting sequence of artist name \n",
87 | " #to sequence of artists idx\n",
88 | "\n",
89 | "\n",
90 | "df['train_seq_artist'] = df['artist_idx'].apply(lambda x: x[:-3]) #creating train sequence\n",
91 | "df['target_val_artist'] = df['artist_idx'].apply(lambda x: x[-3:]) #creating validation sequence"
92 | ]
93 | },
94 | {
95 | "cell_type": "code",
96 | "execution_count": 4,
97 | "metadata": {},
98 | "outputs": [
99 | {
100 | "data": {
101 | "application/vnd.jupyter.widget-view+json": {
102 | "model_id": "66b4d8229fe94f949d9ec1ad31ee1605"
103 | }
104 | },
105 | "metadata": {},
106 | "output_type": "display_data"
107 | },
108 | {
109 | "name": "stdout",
110 | "output_type": "stream",
111 | "text": [
112 | "\n"
113 | ]
114 | }
115 | ],
116 | "source": [
117 | "inds = df['train_seq_artist']\n",
118 | "playlist_artist = scipy.sparse.lil_matrix((df.shape[0], len(vocab_artist)), dtype=np.int8)\n",
119 | "for i, row in tqdm_notebook(enumerate(inds)):\n",
120 | " playlist_artist[i, row] = 1 "
121 | ]
122 | },
123 | {
124 | "cell_type": "code",
125 | "execution_count": 5,
126 | "metadata": {},
127 | "outputs": [
128 | {
129 | "name": "stdout",
130 | "output_type": "stream",
131 | "text": [
132 | "Hit rate using most popular benchmark: 0.037\n",
133 | "Precision using most popular benchmark: 0.013\n"
134 | ]
135 | }
136 | ],
137 | "source": [
138 | "precision = []\n",
139 | "hr = []\n",
140 | "sum_artists = np.asarray(np.sum(playlist_artist, axis = 0)).reshape((playlist_artist.shape[1], ))\n",
141 | "preds = np.argsort(sum_artists)[-3:]\n",
142 | "y_true = df['target_val_artist']\n",
143 | "for y in y_true:\n",
144 | " score = len(set(preds) & set(y))\n",
145 | " precision.append(score/3)\n",
146 | " hr.append(int(score > 0))\n",
147 | "print('Hit rate using most popular benchmark:', np.mean(hr))\n",
148 | "print('Precision using most popular benchmark:', np.mean(precision))"
149 | ]
150 | },
151 | {
152 | "cell_type": "code",
153 | "execution_count": 6,
154 | "metadata": {
155 | "collapsed": true
156 | },
157 | "outputs": [],
158 | "source": [
159 | "from sklearn.neighbors import NearestNeighbors"
160 | ]
161 | },
162 | {
163 | "cell_type": "code",
164 | "execution_count": 8,
165 | "metadata": {},
166 | "outputs": [
167 | {
168 | "data": {
169 | "application/vnd.jupyter.widget-view+json": {
170 | "model_id": "c7c51a403d4443ed87e1a00b381c5640"
171 | }
172 | },
173 | "metadata": {},
174 | "output_type": "display_data"
175 | },
176 | {
177 | "name": "stdout",
178 | "output_type": "stream",
179 | "text": [
180 | "\n",
181 | "Hit rate using item to item kNN: 0.077\n",
182 | "Precision using item to item kNN: 0.026666666666666665\n"
183 | ]
184 | }
185 | ],
186 | "source": [
187 | "precision = []\n",
188 | "hr = []\n",
189 | "nn = NearestNeighbors(n_jobs=32,n_neighbors=3)\n",
190 | "nn.fit(playlist_artist.T)\n",
191 | "distances = nn.kneighbors(playlist_artist.T)[1]\n",
192 | "for row, y in tqdm_notebook(zip(playlist_artist, y_true)):\n",
193 | " last_listened = np.nonzero(row)[1]\n",
194 | " preds = distances[last_listened[-1]]\n",
195 | " score = len(set(preds) & set(y))\n",
196 | " precision.append(score/3)\n",
197 | " hr.append(int(score > 0))\n",
198 | "print('Hit rate using item to item kNN:', np.mean(hr))\n",
199 | "print('Precision using item to item kNN:', np.mean(precision))"
200 | ]
201 | }
202 | ],
203 | "metadata": {
204 | "kernelspec": {
205 | "display_name": "Python 3",
206 | "language": "python",
207 | "name": "python3"
208 | },
209 | "language_info": {
210 | "codemirror_mode": {
211 | "name": "ipython",
212 | "version": 3
213 | },
214 | "file_extension": ".py",
215 | "mimetype": "text/x-python",
216 | "name": "python",
217 | "nbconvert_exporter": "python",
218 | "pygments_lexer": "ipython3",
219 | "version": "3.6.1"
220 | }
221 | },
222 | "nbformat": 4,
223 | "nbformat_minor": 2
224 | }
225 |
--------------------------------------------------------------------------------
/Section 4/4_3_Applying Matrix Factorization on dataset.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {
7 | "collapsed": true
8 | },
9 | "outputs": [],
10 | "source": [
11 | "import pandas as pd\n",
12 | "import numpy as np\n",
13 | "import scipy.sparse\n",
14 | "import json\n",
15 | "import os\n",
16 | "\n",
17 | "from tqdm import tqdm_notebook"
18 | ]
19 | },
20 | {
21 | "cell_type": "code",
22 | "execution_count": 2,
23 | "metadata": {
24 | "collapsed": true
25 | },
26 | "outputs": [],
27 | "source": [
28 | "def load_df():\n",
29 | " df = pd.DataFrame()\n",
30 | " with open('./mpd.slice.0-999.json') as data_file:\n",
31 | " data_string = data_file.read()\n",
32 | " try:\n",
33 | " data = json.loads(data_string)\n",
34 | " except ValueError:\n",
35 | " print('Failed:')\n",
36 | " print(repr(data_string))\n",
37 | " df = pd.concat([df, pd.DataFrame(data['playlists'])], ignore_index=True)\n",
38 | " return df"
39 | ]
40 | },
41 | {
42 | "cell_type": "code",
43 | "execution_count": 3,
44 | "metadata": {},
45 | "outputs": [
46 | {
47 | "data": {
48 | "application/vnd.jupyter.widget-view+json": {
49 | "model_id": "021cd30c4ab948929279de7ce40ccbed"
50 | }
51 | },
52 | "metadata": {},
53 | "output_type": "display_data"
54 | },
55 | {
56 | "name": "stdout",
57 | "output_type": "stream",
58 | "text": [
59 | "\n"
60 | ]
61 | }
62 | ],
63 | "source": [
64 | "df = load_df()\n",
65 | "df.drop(['description','name', 'pid', 'num_albums','num_artists', \n",
66 | " 'num_edits', 'num_followers', 'num_tracks', 'collaborative'], axis = 1, inplace = True) #dropping columns \n",
67 | " #that we are not going\n",
68 | " #to use\n",
69 | "\n",
70 | "artist_list = []\n",
71 | "vocab_artist = set()\n",
72 | "\n",
73 | "for row in tqdm_notebook(df.iterrows()): #iterating through df to get sequence of artists name \n",
74 | " #that are contained in playlist\n",
75 | " artists = [x['artist_name'] for x in row[1]['tracks']] #getting artists from playlist(json type)\n",
76 | " for x in row[1]['tracks']:\n",
77 | " vocab_artist.add(x['artist_name']) #creating set with unique artists name\n",
78 | " artist_list.append(artists) \n",
79 | "\n",
80 | "df['artist_list'] = artist_list \n",
81 | "\n",
82 | "w2x_artist = {artist:i for i, artist in enumerate(vocab_artist)} #artist name to index\n",
83 | "x2w_artist = {i:artist for i, artist in enumerate(vocab_artist)} #index to artist name\n",
84 | "\n",
85 | "df['artist_idx'] = df['artist_list'].apply(lambda x: [w2x_artist[a] for a in x]) #converting sequence of artist name \n",
86 | " #to sequence of artists idx\n",
87 | "\n",
88 | "\n",
89 | "df['train_seq_artist'] = df['artist_idx'].apply(lambda x: x[:-3]) #creating train sequence\n",
90 | "df['target_val_artist'] = df['artist_idx'].apply(lambda x: x[-3:]) #creating validation sequence"
91 | ]
92 | },
93 | {
94 | "cell_type": "code",
95 | "execution_count": 4,
96 | "metadata": {},
97 | "outputs": [
98 | {
99 | "data": {
100 | "application/vnd.jupyter.widget-view+json": {
101 | "model_id": "f2b51261211746bdb1a82ec4f4a5205a"
102 | }
103 | },
104 | "metadata": {},
105 | "output_type": "display_data"
106 | },
107 | {
108 | "name": "stdout",
109 | "output_type": "stream",
110 | "text": [
111 | "\n"
112 | ]
113 | }
114 | ],
115 | "source": [
116 | "inds = df['train_seq_artist']\n",
117 | "playlist_artist_train = scipy.sparse.lil_matrix((df.shape[0], len(vocab_artist)), dtype=np.int8) \n",
118 | "#creating binary playlist artist matrix for train\n",
119 | "for i, row in tqdm_notebook(enumerate(inds)):\n",
120 | " playlist_artist_train[i, row] = 1 "
121 | ]
122 | },
123 | {
124 | "cell_type": "code",
125 | "execution_count": 5,
126 | "metadata": {},
127 | "outputs": [
128 | {
129 | "data": {
130 | "application/vnd.jupyter.widget-view+json": {
131 | "model_id": "d69d1ef08794412d9c6785ab878d3429"
132 | }
133 | },
134 | "metadata": {},
135 | "output_type": "display_data"
136 | },
137 | {
138 | "name": "stdout",
139 | "output_type": "stream",
140 | "text": [
141 | "\n"
142 | ]
143 | }
144 | ],
145 | "source": [
146 | "inds = df['target_val_artist']\n",
147 | "playlist_artist_val = scipy.sparse.lil_matrix((df.shape[0], len(vocab_artist)), dtype=np.int8)\n",
148 | "#creating binary playlist artist matrix for validation\n",
149 | "for i, row in tqdm_notebook(enumerate(inds)):\n",
150 | " playlist_artist_val[i, row] = 1 "
151 | ]
152 | },
153 | {
154 | "cell_type": "code",
155 | "execution_count": 6,
156 | "metadata": {},
157 | "outputs": [
158 | {
159 | "name": "stdout",
160 | "output_type": "stream",
161 | "text": [
162 | "Hit rate using most popular benchmark: 0.037\n",
163 | "Precision using most popular benchmark: 0.013\n"
164 | ]
165 | }
166 | ],
167 | "source": [
168 | "precision = []\n",
169 | "hr = []\n",
170 | "sum_artists = np.asarray(np.sum(playlist_artist_train, axis = 0)).reshape((9722, ))\n",
171 | "preds = np.argsort(sum_artists)[-3:]\n",
172 | "y_true = df['target_val_artist']\n",
173 | "for y in y_true:\n",
174 | " score = len(set(preds) & set(y))\n",
175 | " precision.append(score/3)\n",
176 | " hr.append(int(score > 0))\n",
177 | "print('Hit rate using most popular benchmark:', np.mean(hr))\n",
178 | "print('Precision using most popular benchmark:', np.mean(precision))"
179 | ]
180 | },
181 | {
182 | "cell_type": "code",
183 | "execution_count": 7,
184 | "metadata": {
185 | "collapsed": true
186 | },
187 | "outputs": [],
188 | "source": [
189 | "def get_neg_candidates_train(i):\n",
190 | " #getting negative candidates for supervised learning algorithm for train\n",
191 | " np.random.seed(42)\n",
192 | " neg = np.where(playlist_artist_train.getrow(i).toarray()[0] == 0)[0]\n",
193 | " ind = np.random.randint(0, neg.shape[0], size = 3).tolist()\n",
194 | " return neg[ind].tolist()"
195 | ]
196 | },
197 | {
198 | "cell_type": "code",
199 | "execution_count": 8,
200 | "metadata": {},
201 | "outputs": [
202 | {
203 | "data": {
204 | "application/vnd.jupyter.widget-view+json": {
205 | "model_id": "75353c20f7f948ca9a4ddf7d95df6dfe"
206 | }
207 | },
208 | "metadata": {},
209 | "output_type": "display_data"
210 | },
211 | {
212 | "name": "stdout",
213 | "output_type": "stream",
214 | "text": [
215 | "\n"
216 | ]
217 | }
218 | ],
219 | "source": [
220 | "f = open('libfm-1.40.windows/train.txt', 'w')\n",
221 | "n_users = 1000\n",
222 | "for row in tqdm_notebook(enumerate(playlist_artist_train)): #converting train data for libfm format \n",
223 | " for j in np.nonzero(row[1].toarray())[1]: #writing down positive candidates for playlist №row\n",
224 | " f.write(str(1) + ' ')\n",
225 | " f.write(str(row[0]) + ':' + '1 ') \n",
226 | " f.write(str(n_users + j) + ':' + '1 ')\n",
227 | " f.write('\\n')\n",
228 | " neg_candidates = get_neg_candidates_train(row[0]) #writing down negative candidates for playlist №row\n",
229 | " for j in neg_candidates:\n",
230 | " f.write(str(0) + ' ')\n",
231 | " f.write(str(row[0]) + ':' + '1 ')\n",
232 | " f.write(str(j) + ':' + '1 ')\n",
233 | " f.write('\\n')\n",
234 | "f.close()"
235 | ]
236 | },
237 | {
238 | "cell_type": "code",
239 | "execution_count": 9,
240 | "metadata": {
241 | "collapsed": true
242 | },
243 | "outputs": [],
244 | "source": [
245 | "def get_neg_candidates_val(i):\n",
246 | " #getting negative candidates for supervised learning algorithm for evaluating MF algorithm\n",
247 | " np.random.seed(42)\n",
248 | " neg = np.where(playlist_artist_val.getrow(i).toarray()[0] == 0)[0]\n",
249 | " ind = np.random.randint(0, neg.shape[0], size = 3).tolist()\n",
250 | " return neg[ind].tolist()"
251 | ]
252 | },
253 | {
254 | "cell_type": "code",
255 | "execution_count": 10,
256 | "metadata": {},
257 | "outputs": [
258 | {
259 | "data": {
260 | "application/vnd.jupyter.widget-view+json": {
261 | "model_id": "b3e185ab54e54bb291e57cad8f925289"
262 | }
263 | },
264 | "metadata": {},
265 | "output_type": "display_data"
266 | },
267 | {
268 | "name": "stdout",
269 | "output_type": "stream",
270 | "text": [
271 | "\n"
272 | ]
273 | }
274 | ],
275 | "source": [
276 | "f = open('libfm-1.40.windows/val.txt', 'w')\n",
277 | "n_users = 1000\n",
278 | "answer_dict = {i:[] for i in range(playlist_artist_val.shape[0])}\n",
279 | "for row in tqdm_notebook(enumerate(playlist_artist_val)): #converting train data for libfm format \n",
280 | " positive_candidates = np.nonzero(row[1].toarray())[1] #writing down positive candidates for playlist №row\n",
281 | " for j in positive_candidates:\n",
282 | " f.write(str(1) + ' ')\n",
283 | " f.write(str(row[0]) + ':' + '1 ')\n",
284 | " f.write(str(n_users + j) + ':' + '1 ')\n",
285 | " f.write('\\n')\n",
286 | " neg_candidates = get_neg_candidates_val(row[0])\n",
287 | " answer_dict[row[0]] += positive_candidates.tolist() + neg_candidates #using dict playlist : pos + neg candidates\n",
288 | " for j in neg_candidates: #writing down negative candidates for playlist №row\n",
289 | " f.write(str(0) + ' ')\n",
290 | " f.write(str(row[0]) + ':' + '1 ')\n",
291 | " f.write(str(n_users + j) + ':' + '1 ')\n",
292 | " f.write('\\n')\n",
293 | "f.close()"
294 | ]
295 | },
296 | {
297 | "cell_type": "code",
298 | "execution_count": 11,
299 | "metadata": {
300 | "collapsed": true
301 | },
302 | "outputs": [],
303 | "source": [
304 | "import subprocess"
305 | ]
306 | },
307 | {
308 | "cell_type": "code",
309 | "execution_count": 12,
310 | "metadata": {},
311 | "outputs": [
312 | {
313 | "name": "stdout",
314 | "output_type": "stream",
315 | "text": [
316 | "Finished training\n"
317 | ]
318 | }
319 | ],
320 | "source": [
321 | "cmd = ' '.join(['libfm-1.40.windows/libFM', '-task', 'r', '-train', 'libfm-1.40.windows/train.txt', \n",
322 | " '-test', '../libfm-1.42.src/bin/val.txt', '-iter', '20', '-method', 'sgd',\n",
323 | " '-regular', '’3,3,15’', '-dim', '’1,1,4’', '-init_stdev',\n",
324 | " '0.1', '-out', 'output.txt', '-learn_rate', '0.001']) #hyperparameters (for mor info see \n",
325 | " #manual attached to the course)\n",
326 | "proc = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT) #starting subprocess \n",
327 | " #in console\n",
328 | "for line in iter(proc.stdout.readline, ''): #evaluating libfm\n",
329 | " if line == b'':\n",
330 | " print('Finished training')\n",
331 | " break"
332 | ]
333 | },
334 | {
335 | "cell_type": "code",
336 | "execution_count": 13,
337 | "metadata": {
338 | "collapsed": true
339 | },
340 | "outputs": [],
341 | "source": [
342 | "with open('./output.txt', 'r') as f:\n",
343 | " val_answers = [float(x.strip()) for x in f.readlines()] #opening file with answers"
344 | ]
345 | },
346 | {
347 | "cell_type": "code",
348 | "execution_count": 14,
349 | "metadata": {},
350 | "outputs": [
351 | {
352 | "data": {
353 | "application/vnd.jupyter.widget-view+json": {
354 | "model_id": "d30cbbacfbe24855a0291de0b6a3b0ca"
355 | }
356 | },
357 | "metadata": {},
358 | "output_type": "display_data"
359 | },
360 | {
361 | "name": "stdout",
362 | "output_type": "stream",
363 | "text": [
364 | "\n",
365 | "MF HR@3 score: 0.939\n",
366 | "MF precision@3 score: 0.569\n"
367 | ]
368 | }
369 | ],
370 | "source": [
371 | "num_read = 0\n",
372 | "precision = []\n",
373 | "hr = []\n",
374 | "for i in tqdm_notebook(range(playlist_artist_val.shape[0])): #calculating metric\n",
375 | " all_answers = np.asarray(answer_dict[i])\n",
376 | " y_true = all_answers[:-3] #true answers (first 3 elements in the array)\n",
377 | " mf_answers = val_answers[num_read:num_read + len(all_answers)] #answers from algorithm\n",
378 | " num_read += len(all_answers) #num of rows that were read from the val_answer\n",
379 | " y_pred_ind = np.argsort(mf_answers)[-3:] #top3 by probability\n",
380 | " y_pred = all_answers[y_pred_ind] #getting idx of these artists\n",
381 | " score = len(set(y_pred) & set(y_true)) #num of guessed artists\n",
382 | " precision.append(score/3)\n",
383 | " hr.append(int(score > 0))\n",
384 | "print('MF HR@3 score:', np.mean(hr))\n",
385 | "print('MF precision@3 score:', np.mean(precision))"
386 | ]
387 | },
388 | {
389 | "cell_type": "code",
390 | "execution_count": 15,
391 | "metadata": {},
392 | "outputs": [
393 | {
394 | "data": {
395 | "text/plain": [
396 | "[1325, 3410, 9260, 7272, 860, 5392]"
397 | ]
398 | },
399 | "execution_count": 15,
400 | "metadata": {},
401 | "output_type": "execute_result"
402 | }
403 | ],
404 | "source": [
405 | "answer_dict[0]"
406 | ]
407 | },
408 | {
409 | "cell_type": "code",
410 | "execution_count": null,
411 | "metadata": {
412 | "collapsed": true
413 | },
414 | "outputs": [],
415 | "source": []
416 | }
417 | ],
418 | "metadata": {
419 | "kernelspec": {
420 | "display_name": "Python 3",
421 | "language": "python",
422 | "name": "python3"
423 | },
424 | "language_info": {
425 | "codemirror_mode": {
426 | "name": "ipython",
427 | "version": 3
428 | },
429 | "file_extension": ".py",
430 | "mimetype": "text/x-python",
431 | "name": "python",
432 | "nbconvert_exporter": "python",
433 | "pygments_lexer": "ipython3",
434 | "version": "3.6.1"
435 | }
436 | },
437 | "nbformat": 4,
438 | "nbformat_minor": 2
439 | }
440 |
--------------------------------------------------------------------------------
/Section 4/4_4_Wordbach at use.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {
7 | "collapsed": true
8 | },
9 | "outputs": [],
10 | "source": [
11 | "from sklearn.feature_extraction.text import TfidfVectorizer\n",
12 | "from sklearn.model_selection import train_test_split\n",
13 | "from scipy.sparse import csr_matrix, hstack\n",
14 | "from wordbatch.models import FTRL, FM_FTRL\n",
15 | "from nltk.corpus import stopwords\n",
16 | "\n",
17 | "import re\n",
18 | "import wordbatch\n",
19 | "import pandas as pd\n",
20 | "import numpy as np\n"
21 | ]
22 | },
23 | {
24 | "cell_type": "code",
25 | "execution_count": 2,
26 | "metadata": {
27 | "collapsed": true
28 | },
29 | "outputs": [],
30 | "source": [
31 | "def rmsle(y, y0): #defining metric\n",
32 | " assert len(y) == len(y0)\n",
33 | " return np.sqrt(np.mean(np.power(np.log1p(y) - np.log1p(y0), 2))) "
34 | ]
35 | },
36 | {
37 | "cell_type": "code",
38 | "execution_count": 3,
39 | "metadata": {
40 | "collapsed": true
41 | },
42 | "outputs": [],
43 | "source": [
44 | "stopwords = {x: 1 for x in stopwords.words('english')}\n",
45 | "non_alphanums = re.compile(u'[^A-Za-z0-9]+') #using only numbers + english alphabet\n",
46 | "\n",
47 | "\n",
48 | "def normalize_text(text):\n",
49 | " return u\" \".join(\n",
50 | " [x for x in [y for y in non_alphanums.sub(' ', text).lower().strip().split(\" \")] \\\n",
51 | " if len(x) > 1 and x not in stopwords]) #removing stop words and using only numbers + english alphabet"
52 | ]
53 | },
54 | {
55 | "cell_type": "code",
56 | "execution_count": 4,
57 | "metadata": {
58 | "collapsed": true
59 | },
60 | "outputs": [],
61 | "source": [
62 | "def handle_missing_inplace(df): #filling all nans\n",
63 | " df['category_name'].fillna(value='missing/missing/missing', inplace=True)\n",
64 | " df['brand_name'].fillna(value='missing', inplace=True)\n",
65 | " df['item_description'].fillna(value='missing', inplace=True)\n",
66 | " return df"
67 | ]
68 | },
69 | {
70 | "cell_type": "code",
71 | "execution_count": 5,
72 | "metadata": {},
73 | "outputs": [
74 | {
75 | "data": {
76 | "text/html": [
77 | "\n",
78 | "\n",
91 | "
\n",
92 | " \n",
93 | " \n",
94 | " | \n",
95 | " train_id | \n",
96 | " name | \n",
97 | " item_condition_id | \n",
98 | " category_name | \n",
99 | " brand_name | \n",
100 | " price | \n",
101 | " shipping | \n",
102 | " item_description | \n",
103 | "
\n",
104 | " \n",
105 | " \n",
106 | " \n",
107 | " 0 | \n",
108 | " 0 | \n",
109 | " MLB Cincinnati Reds T Shirt Size XL | \n",
110 | " 3 | \n",
111 | " Men/Tops/T-shirts | \n",
112 | " NaN | \n",
113 | " 10.0 | \n",
114 | " 1 | \n",
115 | " No description yet | \n",
116 | "
\n",
117 | " \n",
118 | " 1 | \n",
119 | " 1 | \n",
120 | " Razer BlackWidow Chroma Keyboard | \n",
121 | " 3 | \n",
122 | " Electronics/Computers & Tablets/Components & P... | \n",
123 | " Razer | \n",
124 | " 52.0 | \n",
125 | " 0 | \n",
126 | " This keyboard is in great condition and works ... | \n",
127 | "
\n",
128 | " \n",
129 | " 2 | \n",
130 | " 2 | \n",
131 | " AVA-VIV Blouse | \n",
132 | " 1 | \n",
133 | " Women/Tops & Blouses/Blouse | \n",
134 | " Target | \n",
135 | " 10.0 | \n",
136 | " 1 | \n",
137 | " Adorable top with a hint of lace and a key hol... | \n",
138 | "
\n",
139 | " \n",
140 | " 3 | \n",
141 | " 3 | \n",
142 | " Leather Horse Statues | \n",
143 | " 1 | \n",
144 | " Home/Home Décor/Home Décor Accents | \n",
145 | " NaN | \n",
146 | " 35.0 | \n",
147 | " 1 | \n",
148 | " New with tags. Leather horses. Retail for [rm]... | \n",
149 | "
\n",
150 | " \n",
151 | " 4 | \n",
152 | " 4 | \n",
153 | " 24K GOLD plated rose | \n",
154 | " 1 | \n",
155 | " Women/Jewelry/Necklaces | \n",
156 | " NaN | \n",
157 | " 44.0 | \n",
158 | " 0 | \n",
159 | " Complete with certificate of authenticity | \n",
160 | "
\n",
161 | " \n",
162 | "
\n",
163 | "
"
164 | ],
165 | "text/plain": [
166 | " train_id name item_condition_id \\\n",
167 | "0 0 MLB Cincinnati Reds T Shirt Size XL 3 \n",
168 | "1 1 Razer BlackWidow Chroma Keyboard 3 \n",
169 | "2 2 AVA-VIV Blouse 1 \n",
170 | "3 3 Leather Horse Statues 1 \n",
171 | "4 4 24K GOLD plated rose 1 \n",
172 | "\n",
173 | " category_name brand_name price \\\n",
174 | "0 Men/Tops/T-shirts NaN 10.0 \n",
175 | "1 Electronics/Computers & Tablets/Components & P... Razer 52.0 \n",
176 | "2 Women/Tops & Blouses/Blouse Target 10.0 \n",
177 | "3 Home/Home Décor/Home Décor Accents NaN 35.0 \n",
178 | "4 Women/Jewelry/Necklaces NaN 44.0 \n",
179 | "\n",
180 | " shipping item_description \n",
181 | "0 1 No description yet \n",
182 | "1 0 This keyboard is in great condition and works ... \n",
183 | "2 1 Adorable top with a hint of lace and a key hol... \n",
184 | "3 1 New with tags. Leather horses. Retail for [rm]... \n",
185 | "4 0 Complete with certificate of authenticity "
186 | ]
187 | },
188 | "execution_count": 5,
189 | "metadata": {},
190 | "output_type": "execute_result"
191 | }
192 | ],
193 | "source": [
194 | "train = pd.read_csv('./train.tsv', sep = '\\t') #loading train\n",
195 | "train.head()"
196 | ]
197 | },
198 | {
199 | "cell_type": "code",
200 | "execution_count": 13,
201 | "metadata": {
202 | "collapsed": true
203 | },
204 | "outputs": [],
205 | "source": [
206 | "sample = train.sample(frac = 0.05, random_state = 42)#using 5% sample\n",
207 | "sample = handle_missing_inplace(sample) #filling all nans\n",
208 | "y = sample.pop('price')\n",
209 | "\n",
210 | "#splitting categories into 3 sub categories\n",
211 | "sample['cat1'] = sample['category_name'].apply(lambda x: x.split('/')[0])\n",
212 | "sample['cat2'] = sample['category_name'].apply(lambda x: x.split('/')[1])\n",
213 | "sample['cat3'] = sample['category_name'].apply(lambda x: x.split('/')[2])"
214 | ]
215 | },
216 | {
217 | "cell_type": "code",
218 | "execution_count": 16,
219 | "metadata": {},
220 | "outputs": [
221 | {
222 | "data": {
223 | "text/html": [
224 | "\n",
225 | "\n",
238 | "
\n",
239 | " \n",
240 | " \n",
241 | " | \n",
242 | " train_id | \n",
243 | " name | \n",
244 | " item_condition_id | \n",
245 | " category_name | \n",
246 | " brand_name | \n",
247 | " shipping | \n",
248 | " item_description | \n",
249 | " cat1 | \n",
250 | " cat2 | \n",
251 | " cat3 | \n",
252 | "
\n",
253 | " \n",
254 | " \n",
255 | " \n",
256 | " 525834 | \n",
257 | " 525834 | \n",
258 | " Under armor sweatpants | \n",
259 | " 4 | \n",
260 | " Women/Athletic Apparel/Pants, Tights, Leggings | \n",
261 | " Under Armour | \n",
262 | " 1 | \n",
263 | " Used condition size small black in color two p... | \n",
264 | " Women | \n",
265 | " Athletic Apparel | \n",
266 | " Pants, Tights, Leggings | \n",
267 | "
\n",
268 | " \n",
269 | " 149839 | \n",
270 | " 149839 | \n",
271 | " Men's watch | \n",
272 | " 3 | \n",
273 | " Men/Men's Accessories/Watches | \n",
274 | " Tommy Bahama | \n",
275 | " 0 | \n",
276 | " Tommy Bahama watch in good condition with new ... | \n",
277 | " Men | \n",
278 | " Men's Accessories | \n",
279 | " Watches | \n",
280 | "
\n",
281 | " \n",
282 | " 536234 | \n",
283 | " 536234 | \n",
284 | " Eileen Fisher gray Cardigan | \n",
285 | " 3 | \n",
286 | " Women/Sweaters/Cardigan | \n",
287 | " Eileen Fisher | \n",
288 | " 0 | \n",
289 | " Large but fits medium or small | \n",
290 | " Women | \n",
291 | " Sweaters | \n",
292 | " Cardigan | \n",
293 | "
\n",
294 | " \n",
295 | " 427908 | \n",
296 | " 427908 | \n",
297 | " Blue Patagonia | \n",
298 | " 2 | \n",
299 | " Men/Sweats & Hoodies/Sweatshirt, Pullover | \n",
300 | " Patagonia, Inc. | \n",
301 | " 0 | \n",
302 | " No description yet | \n",
303 | " Men | \n",
304 | " Sweats & Hoodies | \n",
305 | " Sweatshirt, Pullover | \n",
306 | "
\n",
307 | " \n",
308 | " 193641 | \n",
309 | " 193641 | \n",
310 | " ✨4 YMED NIKE PRO for Lindsay✨ | \n",
311 | " 1 | \n",
312 | " Kids/Girls (4+)/Other | \n",
313 | " Nike | \n",
314 | " 0 | \n",
315 | " 4 YMED NIKE PRO compression shorts All NWT | \n",
316 | " Kids | \n",
317 | " Girls (4+) | \n",
318 | " Other | \n",
319 | "
\n",
320 | " \n",
321 | "
\n",
322 | "
"
323 | ],
324 | "text/plain": [
325 | " train_id name item_condition_id \\\n",
326 | "525834 525834 Under armor sweatpants 4 \n",
327 | "149839 149839 Men's watch 3 \n",
328 | "536234 536234 Eileen Fisher gray Cardigan 3 \n",
329 | "427908 427908 Blue Patagonia 2 \n",
330 | "193641 193641 ✨4 YMED NIKE PRO for Lindsay✨ 1 \n",
331 | "\n",
332 | " category_name brand_name \\\n",
333 | "525834 Women/Athletic Apparel/Pants, Tights, Leggings Under Armour \n",
334 | "149839 Men/Men's Accessories/Watches Tommy Bahama \n",
335 | "536234 Women/Sweaters/Cardigan Eileen Fisher \n",
336 | "427908 Men/Sweats & Hoodies/Sweatshirt, Pullover Patagonia, Inc. \n",
337 | "193641 Kids/Girls (4+)/Other Nike \n",
338 | "\n",
339 | " shipping item_description cat1 \\\n",
340 | "525834 1 Used condition size small black in color two p... Women \n",
341 | "149839 0 Tommy Bahama watch in good condition with new ... Men \n",
342 | "536234 0 Large but fits medium or small Women \n",
343 | "427908 0 No description yet Men \n",
344 | "193641 0 4 YMED NIKE PRO compression shorts All NWT Kids \n",
345 | "\n",
346 | " cat2 cat3 \n",
347 | "525834 Athletic Apparel Pants, Tights, Leggings \n",
348 | "149839 Men's Accessories Watches \n",
349 | "536234 Sweaters Cardigan \n",
350 | "427908 Sweats & Hoodies Sweatshirt, Pullover \n",
351 | "193641 Girls (4+) Other "
352 | ]
353 | },
354 | "execution_count": 16,
355 | "metadata": {},
356 | "output_type": "execute_result"
357 | }
358 | ],
359 | "source": [
360 | "sample.head()"
361 | ]
362 | },
363 | {
364 | "cell_type": "code",
365 | "execution_count": 8,
366 | "metadata": {
367 | "collapsed": true
368 | },
369 | "outputs": [],
370 | "source": [
371 | "tf = TfidfVectorizer(max_features=10000,\n",
372 | " max_df = 0.95, min_df = 100) #using tf-idf preprocessing to convert text in numerical matrix"
373 | ]
374 | },
375 | {
376 | "cell_type": "code",
377 | "execution_count": 9,
378 | "metadata": {},
379 | "outputs": [
380 | {
381 | "name": "stdout",
382 | "output_type": "stream",
383 | "text": [
384 | "Working with name\n"
385 | ]
386 | },
387 | {
388 | "name": "stderr",
389 | "output_type": "stream",
390 | "text": [
391 | "C:\\Users\\Piboditheowl\\Anaconda3\\lib\\site-packages\\sklearn\\feature_extraction\\text.py:1059: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.\n",
392 | " if hasattr(X, 'dtype') and np.issubdtype(X.dtype, np.float):\n"
393 | ]
394 | },
395 | {
396 | "name": "stdout",
397 | "output_type": "stream",
398 | "text": [
399 | "-------\n",
400 | "Working with item_description\n",
401 | "-------\n",
402 | "Working with cat1\n",
403 | "-------\n",
404 | "Working with cat2\n",
405 | "-------\n",
406 | "Working with cat3\n",
407 | "-------\n"
408 | ]
409 | }
410 | ],
411 | "source": [
412 | "#Evaluating tf-idf (transformig text into matrix)\n",
413 | "\n",
414 | "print('Working with name')\n",
415 | "x_name = tf.fit_transform(sample['name'].values)\n",
416 | "print(7*'-')\n",
417 | "print('Working with item_description')\n",
418 | "x_description = tf.fit_transform(sample['item_description'].values)\n",
419 | "print(7*'-')\n",
420 | "print('Working with cat1')\n",
421 | "x_cat1 = tf.fit_transform(sample['cat1'].values)\n",
422 | "print(7*'-')\n",
423 | "print('Working with cat2')\n",
424 | "x_cat2 = tf.fit_transform(sample['cat2'].values)\n",
425 | "print(7*'-')\n",
426 | "print('Working with cat3')\n",
427 | "x_cat3 = tf.fit_transform(sample['cat3'].values)\n",
428 | "print(7*'-')"
429 | ]
430 | },
431 | {
432 | "cell_type": "code",
433 | "execution_count": 10,
434 | "metadata": {
435 | "collapsed": true
436 | },
437 | "outputs": [],
438 | "source": [
439 | "sample_preprocessed = hstack((x_name, x_description, x_cat1, x_cat2, x_cat3)).tocsr() #concatenating together and \n",
440 | " #using scipy sparse for low-memory\n",
441 | " #allocation of matrix \n",
442 | "mask = np.array(np.clip(sample_preprocessed.getnnz(axis=0) - 1, 0, 1), dtype=bool)\n",
443 | "sample_preprocessed = sample_preprocessed[:, mask]\n",
444 | "\n",
445 | "x_train, x_val, y_train, y_val = train_test_split(sample_preprocessed, y, test_size = 0.15) #splitting into test and train"
446 | ]
447 | },
448 | {
449 | "cell_type": "code",
450 | "execution_count": 11,
451 | "metadata": {
452 | "collapsed": true
453 | },
454 | "outputs": [],
455 | "source": [
456 | "model = FM_FTRL(alpha=0.01, beta=0.01, L1=0.00001, L2=0.1, D = sample_preprocessed.shape[1], alpha_fm=0.01, L2_fm=0.0, init_fm=0.01,\n",
457 | " D_fm=200, e_noise=0.0001, iters=15, inv_link=\"identity\", threads=16) #defining model"
458 | ]
459 | },
460 | {
461 | "cell_type": "code",
462 | "execution_count": 12,
463 | "metadata": {},
464 | "outputs": [
465 | {
466 | "name": "stdout",
467 | "output_type": "stream",
468 | "text": [
469 | "RMSLE score using FM_FTRL: 0.7428922496558461\n"
470 | ]
471 | }
472 | ],
473 | "source": [
474 | "model.fit(x_train, y_train) #training algorithm \n",
475 | "y_pred = model.predict(x_val)#evaluating algorithm \n",
476 | "print('RMSLE score using FM_FTRL:', rmsle(y_val, y_pred))"
477 | ]
478 | }
479 | ],
480 | "metadata": {
481 | "kernelspec": {
482 | "display_name": "Python 3",
483 | "language": "python",
484 | "name": "python3"
485 | },
486 | "language_info": {
487 | "codemirror_mode": {
488 | "name": "ipython",
489 | "version": 3
490 | },
491 | "file_extension": ".py",
492 | "mimetype": "text/x-python",
493 | "name": "python",
494 | "nbconvert_exporter": "python",
495 | "pygments_lexer": "ipython3",
496 | "version": "3.6.1"
497 | }
498 | },
499 | "nbformat": 4,
500 | "nbformat_minor": 2
501 | }
502 |
--------------------------------------------------------------------------------
/Section 5/5_1_Validation dataset tuning.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {
7 | "collapsed": true
8 | },
9 | "outputs": [],
10 | "source": [
11 | "import pandas as pd\n",
12 | "import numpy as np\n",
13 | "from itertools import combinations\n",
14 | "from catboost import CatBoostClassifier\n",
15 | "from sklearn.model_selection import train_test_split, KFold\n",
16 | "from sklearn.metrics import roc_auc_score\n",
17 | "import warnings\n",
18 | "warnings.filterwarnings(\"ignore\")\n",
19 | "np.random.seed(42)"
20 | ]
21 | },
22 | {
23 | "cell_type": "code",
24 | "execution_count": 2,
25 | "metadata": {
26 | "collapsed": true
27 | },
28 | "outputs": [],
29 | "source": [
30 | "df = pd.read_csv('train.csv')\n",
31 | "y = df.target\n",
32 | "\n",
33 | "df.drop(['ID', 'target'], axis=1, inplace=True)\n",
34 | "df.fillna(-9999, inplace=True)\n",
35 | "cat_features_ids = np.where(df.apply(pd.Series.nunique) < 30000)[0].tolist()"
36 | ]
37 | },
38 | {
39 | "cell_type": "code",
40 | "execution_count": 3,
41 | "metadata": {
42 | "collapsed": true
43 | },
44 | "outputs": [],
45 | "source": [
46 | "train, test, y_train, y_test = train_test_split(df, y, test_size = 0.1,random_state = 42)\n",
47 | "train, val, y_train, y_val = train_test_split(train, y_train, test_size = 0.25)"
48 | ]
49 | },
50 | {
51 | "cell_type": "code",
52 | "execution_count": 4,
53 | "metadata": {
54 | "scrolled": false
55 | },
56 | "outputs": [
57 | {
58 | "name": "stdout",
59 | "output_type": "stream",
60 | "text": [
61 | "Roc-auc score with Catboost: 0.7841281938499387\n"
62 | ]
63 | }
64 | ],
65 | "source": [
66 | "clf = CatBoostClassifier(learning_rate=0.1, iterations=100, random_seed=42, eval_metric='AUC', logging_level='Silent')\n",
67 | "clf.fit(train, y_train, cat_features=cat_features_ids, eval_set=(val, y_val))\n",
68 | "prediction = clf.predict_proba(test)\n",
69 | "print('Roc-auc score with Catboost:',roc_auc_score(y_test, prediction[:, 1]))"
70 | ]
71 | },
72 | {
73 | "cell_type": "code",
74 | "execution_count": 5,
75 | "metadata": {},
76 | "outputs": [
77 | {
78 | "name": "stdout",
79 | "output_type": "stream",
80 | "text": [
81 | "Roc-auc score with Catboost: 0.7930162585925847\n"
82 | ]
83 | }
84 | ],
85 | "source": [
86 | "kfold = KFold(n_splits=10)\n",
87 | "pred = []\n",
88 | "train, test, y_train, y_test = train_test_split(df, y, test_size = 0.1,random_state = 42)\n",
89 | "for train_ind, test_ind in kfold.split(train):\n",
90 | " train_val, test_val, y_train_val, y_test_val = train.iloc[train_ind, :], train.iloc[test_ind, :],\\\n",
91 | " y_train.iloc[train_ind], y_train.iloc[test_ind]\n",
92 | " clf.fit(train_val, y_train_val, cat_features=cat_features_ids, eval_set=(test_val, y_test_val))\n",
93 | " prediction = clf.predict_proba(test)\n",
94 | " pred.append(\n",
95 | " prediction[:, 1]\n",
96 | " )\n",
97 | " \n",
98 | "\n",
99 | "print('Roc-auc score with Catboost:',roc_auc_score(y_test, np.mean(pred, axis = 0)))"
100 | ]
101 | },
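102 | {
103 | "cell_type": "code",
104 | "execution_count": null,
105 | "metadata": {
106 | "collapsed": true
107 | },
108 | "outputs": [],
109 | "source": [
110 | "# Variant (minimal sketch): StratifiedKFold keeps the class ratio identical\n",
111 | "# in every fold, which usually gives a more stable estimate on imbalanced\n",
112 | "# targets. Assumes df, y, clf and cat_features_ids from the cells above.\n",
113 | "from sklearn.model_selection import StratifiedKFold\n",
114 | "\n",
115 | "skfold = StratifiedKFold(n_splits=10)\n",
116 | "pred = []\n",
117 | "train, test, y_train, y_test = train_test_split(df, y, test_size=0.1, random_state=42)\n",
118 | "for train_ind, val_ind in skfold.split(train, y_train):\n",
119 | "    clf.fit(train.iloc[train_ind], y_train.iloc[train_ind], cat_features=cat_features_ids,\n",
120 | "            eval_set=(train.iloc[val_ind], y_train.iloc[val_ind]))\n",
121 | "    pred.append(clf.predict_proba(test)[:, 1])\n",
122 | "print('Roc-auc score with stratified folds:', roc_auc_score(y_test, np.mean(pred, axis=0)))"
123 | ]
124 | }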
111 | ],
112 | "metadata": {
113 | "kernelspec": {
114 | "display_name": "Python 3",
115 | "language": "python",
116 | "name": "python3"
117 | },
118 | "language_info": {
119 | "codemirror_mode": {
120 | "name": "ipython",
121 | "version": 3
122 | },
123 | "file_extension": ".py",
124 | "mimetype": "text/x-python",
125 | "name": "python",
126 | "nbconvert_exporter": "python",
127 | "pygments_lexer": "ipython3",
128 | "version": "3.6.1"
129 | }
130 | },
131 | "nbformat": 4,
132 | "nbformat_minor": 2
133 | }
134 |
--------------------------------------------------------------------------------
/Section 5/5_2_Regularizing model to avoid overfitting.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {
7 | "collapsed": true
8 | },
9 | "outputs": [],
10 | "source": [
11 | "import pandas as pd\n",
12 | "import numpy as np\n",
13 | "from itertools import combinations\n",
14 | "from catboost import CatBoostClassifier\n",
15 | "from sklearn.model_selection import train_test_split, KFold\n",
16 | "from sklearn.metrics import roc_auc_score\n",
17 | "import warnings\n",
18 | "warnings.filterwarnings(\"ignore\")\n",
19 | "np.random.seed(42)"
20 | ]
21 | },
22 | {
23 | "cell_type": "code",
24 | "execution_count": 2,
25 | "metadata": {
26 | "collapsed": true
27 | },
28 | "outputs": [],
29 | "source": [
30 | "df = pd.read_csv('train.csv')\n",
31 | "y = df.target\n",
32 | "\n",
33 | "df.drop(['ID', 'target'], axis=1, inplace=True)\n",
34 | "df.fillna(-9999, inplace=True)\n",
35 | "cat_features_ids = np.where(df.apply(pd.Series.nunique) < 30000)[0].tolist()"
36 | ]
37 | },
38 | {
39 | "cell_type": "code",
40 | "execution_count": 3,
41 | "metadata": {
42 | "collapsed": true
43 | },
44 | "outputs": [],
45 | "source": [
46 | "train, test, y_train, y_test = train_test_split(df, y, test_size = 0.1)"
47 | ]
48 | },
49 | {
50 | "cell_type": "code",
51 | "execution_count": 4,
52 | "metadata": {},
53 | "outputs": [
54 | {
55 | "name": "stdout",
56 | "output_type": "stream",
57 | "text": [
58 | "Roc-auc score with Catboost without regularization: 0.7939610054617733\n",
59 | "Roc-auc score with Catboost with regularization: 0.7961023589633582\n"
60 | ]
61 | }
62 | ],
63 | "source": [
64 | "clf = CatBoostClassifier(learning_rate=0.1, iterations=100, random_seed=42, eval_metric='AUC', logging_level='Silent')\n",
65 | "clf.fit(train, y_train, cat_features=cat_features_ids)\n",
66 | "prediction = clf.predict_proba(test)\n",
67 | "print('Roc-auc score with Catboost without regularization:',roc_auc_score(y_test, prediction[:, 1]))\n",
68 | "\n",
69 | "clf = CatBoostClassifier(learning_rate=0.1, iterations=100, random_seed=42, \n",
70 | " eval_metric='AUC', logging_level='Silent', l2_leaf_reg=3, \n",
71 | " model_size_reg = 3)\n",
72 | "clf.fit(train, y_train, cat_features=cat_features_ids)\n",
73 | "prediction = clf.predict_proba(test)\n",
74 | "print('Roc-auc score with Catboost with regularization:',roc_auc_score(y_test, prediction[:, 1]))"
75 | ]
76 | },
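77 | {
78 | "cell_type": "code",
79 | "execution_count": null,
80 | "metadata": {
81 | "collapsed": true
82 | },
83 | "outputs": [],
84 | "source": [
85 | "# Tuning sketch: sweep a few l2_leaf_reg values to see where the AUC peaks.\n",
86 | "# A minimal grid, assuming train/test/y_train/y_test from the split above;\n",
87 | "# the candidate values are illustrative, not tuned.\n",
88 | "for reg in [1, 3, 10, 30]:\n",
89 | "    clf = CatBoostClassifier(learning_rate=0.1, iterations=100, random_seed=42,\n",
90 | "                             eval_metric='AUC', logging_level='Silent', l2_leaf_reg=reg)\n",
91 | "    clf.fit(train, y_train, cat_features=cat_features_ids)\n",
92 | "    print('l2_leaf_reg =', reg, '-> roc-auc:', roc_auc_score(y_test, clf.predict_proba(test)[:, 1]))"
93 | ]
94 | }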
77 | ],
78 | "metadata": {
79 | "kernelspec": {
80 | "display_name": "Python 3",
81 | "language": "python",
82 | "name": "python3"
83 | },
84 | "language_info": {
85 | "codemirror_mode": {
86 | "name": "ipython",
87 | "version": 3
88 | },
89 | "file_extension": ".py",
90 | "mimetype": "text/x-python",
91 | "name": "python",
92 | "nbconvert_exporter": "python",
93 | "pygments_lexer": "ipython3",
94 | "version": "3.6.1"
95 | }
96 | },
97 | "nbformat": 4,
98 | "nbformat_minor": 2
99 | }
100 |
--------------------------------------------------------------------------------
/Section 5/5_3_Adversarial Validation.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {
7 | "collapsed": true
8 | },
9 | "outputs": [],
10 | "source": [
11 | "import pandas as pd\n",
12 | "import numpy as np\n",
13 | "from itertools import combinations\n",
14 | "from catboost import CatBoostClassifier\n",
15 | "from sklearn.model_selection import train_test_split, KFold\n",
16 | "from sklearn.metrics import roc_auc_score\n",
17 | "import warnings\n",
18 | "warnings.filterwarnings(\"ignore\")\n",
19 | "np.random.seed(42)"
20 | ]
21 | },
22 | {
23 | "cell_type": "code",
24 | "execution_count": 2,
25 | "metadata": {
26 | "collapsed": true
27 | },
28 | "outputs": [],
29 | "source": [
30 | "train = pd.read_csv('train.csv')\n",
31 | "y = train.target\n",
32 | "test = pd.read_csv('./test.csv')\n",
33 | "train.drop(['ID', 'target'], axis=1, inplace=True)\n",
34 | "test.drop(['ID'], axis=1, inplace=True)\n",
35 | "train.fillna(-9999, inplace=True)\n",
36 | "test.fillna(-9999, inplace=True)\n",
37 | "cat_features_ids = np.where(train.apply(pd.Series.nunique) < 30000)[0].tolist()"
38 | ]
39 | },
40 | {
41 | "cell_type": "code",
42 | "execution_count": 3,
43 | "metadata": {},
44 | "outputs": [
45 | {
46 | "name": "stdout",
47 | "output_type": "stream",
48 | "text": [
49 | "Number of train samples from test distribution: 49142\n"
50 | ]
51 | }
52 | ],
53 | "source": [
54 | "y1 = np.ones_like(y)\n",
55 | "y2 = np.zeros((test.shape[0],))\n",
56 | "y_all = np.hstack([y1, y2])\n",
57 | "all_ = pd.concat([train, test])\n",
58 | "clf = CatBoostClassifier(learning_rate=0.1, iterations=100, random_seed=42, eval_metric='AUC', logging_level='Silent')\n",
59 | "clf.fit(all_, y_all, cat_features=cat_features_ids)\n",
60 | "prediction = clf.predict(train)\n",
61 | "best_val = train[prediction == 0]\n",
62 | "print('Number of train samples from test distribution:', best_val.shape[0])"
63 | ]
64 | },
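65 | {
66 | "cell_type": "code",
67 | "execution_count": null,
68 | "metadata": {
69 | "collapsed": true
70 | },
71 | "outputs": [],
72 | "source": [
73 | "# Variant (minimal sketch): instead of the hard 0/1 prediction above, rank\n",
74 | "# train rows by P(train) and keep the N most test-like ones for validation.\n",
75 | "# N = 30000 here is an arbitrary illustration, not a tuned value.\n",
76 | "proba_train = clf.predict_proba(train)[:, 1] # probability that a row comes from train\n",
77 | "most_test_like = np.argsort(proba_train)[:30000] # lowest P(train) = most test-like\n",
78 | "best_val_ranked = train.iloc[most_test_like]\n",
79 | "print('Ranked validation set:', best_val_ranked.shape[0], 'rows')"
80 | ]
81 | },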
65 | {
66 | "cell_type": "code",
67 | "execution_count": 4,
68 | "metadata": {},
69 | "outputs": [
70 | {
71 | "name": "stdout",
72 | "output_type": "stream",
73 | "text": [
74 | "Validation score: 0.7470119528903851\n"
75 | ]
76 | }
77 | ],
78 | "source": [
79 | "clf = CatBoostClassifier(learning_rate=0.1, iterations=100, random_seed=42, eval_metric='AUC', logging_level='Silent')\n",
80 | "clf.fit(train.loc[prediction != 0, :], y[prediction != 0], cat_features=cat_features_ids)\n",
81 | "prediction_val = clf.predict_proba(best_val)\n",
82 | "print('Validation score:', roc_auc_score(y[prediction == 0], prediction_val[:, 1]))"
83 | ]
84 | }
85 | ],
86 | "metadata": {
87 | "kernelspec": {
88 | "display_name": "Python 3",
89 | "language": "python",
90 | "name": "python3"
91 | },
92 | "language_info": {
93 | "codemirror_mode": {
94 | "name": "ipython",
95 | "version": 3
96 | },
97 | "file_extension": ".py",
98 | "mimetype": "text/x-python",
99 | "name": "python",
100 | "nbconvert_exporter": "python",
101 | "pygments_lexer": "ipython3",
102 | "version": "3.6.1"
103 | }
104 | },
105 | "nbformat": 4,
106 | "nbformat_minor": 2
107 | }
108 |
--------------------------------------------------------------------------------
/Section 5/5_4_Perform metric selection on real data.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 18,
6 | "metadata": {
7 | "collapsed": true
8 | },
9 | "outputs": [],
10 | "source": [
11 | "import pandas as pd\n",
12 | "import numpy as np\n",
13 | "from itertools import combinations\n",
14 | "from catboost import CatBoostClassifier, CatBoostRegressor\n",
15 | "from sklearn.model_selection import train_test_split, KFold\n",
16 | "from sklearn.metrics import mean_squared_error, accuracy_score, recall_score, precision_score, f1_score, roc_auc_score\n",
17 | "import warnings\n",
18 | "warnings.filterwarnings(\"ignore\")\n",
19 | "np.random.seed(42)"
20 | ]
21 | },
22 | {
23 | "cell_type": "code",
24 | "execution_count": 19,
25 | "metadata": {
26 | "collapsed": true
27 | },
28 | "outputs": [],
29 | "source": [
30 | "df = pd.read_csv('./train_merc.csv')\n",
31 | "y = df.y\n",
32 | "df.drop(['ID', 'y'], axis = 1, inplace=True)\n",
33 | "cat_features_ids = np.where(df.apply(pd.Series.nunique) < 30000)[0].tolist()"
34 | ]
35 | },
36 | {
37 | "cell_type": "code",
38 | "execution_count": 20,
39 | "metadata": {},
40 | "outputs": [
41 | {
42 | "name": "stdout",
43 | "output_type": "stream",
44 | "text": [
45 | "28.460498941515414\n",
46 | "27.65863337187866\n"
47 | ]
48 | }
49 | ],
50 | "source": [
51 | "pred = [10,10,10,10,10,10,10,10,10,10]\n",
52 | "y_real = [10,10,10,10,10,10,10,10,10,100]\n",
53 | "print(np.sqrt(mean_squared_error(pred, y_real)))\n",
54 | "\n",
55 | "pred = [25,25,25,25,25,25,25,25,25,25]\n",
56 | "y_real = [10,10,10,10,10,10,10,10,10,100]\n",
57 | "print(np.sqrt(mean_squared_error(pred, y_real)))"
58 | ]
59 | },
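60 | {
61 | "cell_type": "code",
62 | "execution_count": null,
63 | "metadata": {
64 | "collapsed": true
65 | },
66 | "outputs": [],
67 | "source": [
68 | "# Same toy example under RMSLE (minimal sketch): the log transform shrinks\n",
69 | "# the outlier's influence, so the honest constant 10 now beats the inflated 25.\n",
70 | "pred = [10, 10, 10, 10, 10, 10, 10, 10, 10, 10]\n",
71 | "y_real = [10, 10, 10, 10, 10, 10, 10, 10, 10, 100]\n",
72 | "print(np.sqrt(mean_squared_error(np.log1p(y_real), np.log1p(pred))))\n",
73 | "\n",
74 | "pred = [25, 25, 25, 25, 25, 25, 25, 25, 25, 25]\n",
75 | "print(np.sqrt(mean_squared_error(np.log1p(y_real), np.log1p(pred))))"
76 | ]
77 | },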
60 | {
61 | "cell_type": "code",
62 | "execution_count": 21,
63 | "metadata": {},
64 | "outputs": [
65 | {
66 | "name": "stdout",
67 | "output_type": "stream",
68 | "text": [
69 | "RMSE score: 7.373856282754891\n",
70 | "RMSLE score: 0.06776094386349317\n"
71 | ]
72 | }
73 | ],
74 | "source": [
75 | "train, test, y_train, y_test = train_test_split(df, y, test_size = 0.1)\n",
76 | "clf = CatBoostRegressor(learning_rate=0.1, iterations=100, random_seed=42, logging_level='Silent')\n",
77 | "clf.fit(train, y_train, cat_features=cat_features_ids)\n",
78 | "prediction = clf.predict(test)\n",
79 | "print('RMSE score:', np.sqrt(mean_squared_error(y_test, prediction)))\n",
80 | "print('RMSLE score:', np.sqrt(mean_squared_error(np.log1p(y_test), np.log1p(prediction))))"
81 | ]
82 | },
83 | {
84 | "cell_type": "code",
85 | "execution_count": 23,
86 | "metadata": {
87 | "collapsed": true
88 | },
89 | "outputs": [],
90 | "source": [
91 | "df = pd.read_csv('./train_sample.csv.zip')\n",
92 | "y = df.is_attributed\n",
93 | "df.drop(['click_time', 'attributed_time', 'is_attributed'], axis = 1, inplace=True)\n",
94 | "cat_features_ids = np.where(df.apply(pd.Series.nunique) < 30000)[0].tolist()"
95 | ]
96 | },
97 | {
98 | "cell_type": "code",
99 | "execution_count": 24,
100 | "metadata": {},
101 | "outputs": [
102 | {
103 | "name": "stdout",
104 | "output_type": "stream",
105 | "text": [
106 | "\t\t $a(x)$ = 1 \t\t\n",
107 | "\n",
108 | "Accuracy all positive: 0.00227\n",
109 | "Recall all positive: 1.0\n",
110 | "Precision all positive: 0.00227\n",
111 | "F1 score all positive: 0.004529717541181518\n",
112 | "Roc auc score all positive: 0.5\n",
113 | "\n",
114 | "\n",
115 | "\n",
116 | "\t\t $a(x)$ = 0 \t\t\n",
117 | "\n",
118 | "Accuracy all negative: 0.99773\n",
119 | "Recall all negative: 0.0\n",
120 | "Precision all negative: 0.0\n",
121 | "F1 score all negative: 0.0\n",
122 | "Roc auc score all positive: 0.5\n",
123 | "\n",
124 | "\n",
125 | "\n",
126 | "\t\t Catboost \t\t\n",
127 | "\n",
128 | "Accuracy using Catboost: 0.9986\n",
129 | "Recall using Catboost: 0.391304347826087\n",
130 | "Precision using Catboost: 1.0\n",
131 | "F1 score using Catboost: 0.5625\n",
132 | "Roc auc score using Catboost: 0.9189287535244104\n"
133 | ]
134 | }
135 | ],
136 | "source": [
137 | "y_positive = np.ones_like(y)\n",
138 | "y_negative = np.zeros_like(y)\n",
139 | "print('\\t\\t $a(x)$ = 1 \\t\\t\\n')\n",
140 | "print('Accuracy all positive:', accuracy_score(y, y_positive))\n",
141 | "print('Recall all positive:', recall_score(y, y_positive))\n",
142 | "print('Precision all positive:', precision_score(y, y_positive))\n",
143 | "print('F1 score all positive:', f1_score(y, y_positive))\n",
144 | "print('Roc auc score all positive:', roc_auc_score(y, y_positive))\n",
145 | "print('\\n\\n')\n",
146 | "print('\\t\\t $a(x)$ = 0 \\t\\t\\n')\n",
147 | "print('Accuracy all negative:', accuracy_score(y, y_negative))\n",
148 | "print('Recall all negative:', recall_score(y, y_negative))\n",
149 | "print('Precision all negative:', precision_score(y, y_negative))\n",
150 | "print('F1 score all negative:', f1_score(y, y_negative))\n",
151 | "print('Roc auc score all positive:', roc_auc_score(y, y_negative))\n",
152 | "\n",
153 | "print('\\n\\n')\n",
154 | "print('\\t\\t Catboost \\t\\t\\n')\n",
155 | "train, test, y_train, y_test = train_test_split(df, y, test_size = 0.1)\n",
156 | "\n",
157 | "clf = CatBoostClassifier(learning_rate=0.1, iterations=100, random_seed=42, \n",
158 | " eval_metric='AUC', logging_level='Silent', l2_leaf_reg=3, \n",
159 | " model_size_reg = 3)\n",
160 | "clf.fit(train, y_train, cat_features=cat_features_ids)\n",
161 | "prediction = clf.predict_proba(test)\n",
162 | "\n",
163 | "print('Accuracy using Catboost:', accuracy_score(y_test, prediction[:, 1] > 0.5))\n",
164 | "print('Recall using Catboost:', recall_score(y_test, prediction[:, 1] > 0.5))\n",
165 | "print('Precision using Catboost:', precision_score(y_test, prediction[:, 1] > 0.5))\n",
166 | "print('F1 score using Catboost:', f1_score(y_test, prediction[:, 1] > 0.5))\n",
167 | "print('Roc auc score using Catboost:', roc_auc_score(y_test, prediction[:, 1]))"
168 | ]
169 | },
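170 | {
171 | "cell_type": "code",
172 | "execution_count": null,
173 | "metadata": {
174 | "collapsed": true
175 | },
176 | "outputs": [],
177 | "source": [
178 | "# Threshold sketch: 0.5 is rarely the best cut-off on data this imbalanced;\n",
179 | "# sweep a grid and keep the threshold maximizing F1. Assumes prediction and\n",
180 | "# y_test from the Catboost cell above; the grid is illustrative.\n",
181 | "best_t, best_f1 = 0.5, 0.0\n",
182 | "for t in np.arange(0.05, 1.0, 0.05):\n",
183 | "    score = f1_score(y_test, prediction[:, 1] > t)\n",
184 | "    if score > best_f1:\n",
185 | "        best_t, best_f1 = t, score\n",
186 | "print('best threshold: %.2f, best F1: %.4f' % (best_t, best_f1))"
187 | ]
188 | }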
179 | ],
180 | "metadata": {
181 | "kernelspec": {
182 | "display_name": "Python 3",
183 | "language": "python",
184 | "name": "python3"
185 | },
186 | "language_info": {
187 | "codemirror_mode": {
188 | "name": "ipython",
189 | "version": 3
190 | },
191 | "file_extension": ".py",
192 | "mimetype": "text/x-python",
193 | "name": "python",
194 | "nbconvert_exporter": "python",
195 | "pygments_lexer": "ipython3",
196 | "version": "3.6.1"
197 | }
198 | },
199 | "nbformat": 4,
200 | "nbformat_minor": 2
201 | }
202 |
--------------------------------------------------------------------------------