├── LICENSE
├── README.md
├── Section 1
│   ├── Video 1.1 Improving your models using Feature engineering.ipynb
│   ├── Video 1.2 Implementing feature engineering with logistic regression.ipynb
│   ├── Video 1.3 Extracting data with feature selection and interaction.ipynb
│   ├── video 1.4 Combining all together.ipynb
│   └── video 1.5 Build model based on real-world problem.ipynb
├── Section 2
│   ├── video 2.1 Support Vector Machines.ipynb
│   ├── video 2.2 Implementing kNN on the dataset.ipynb
│   ├── video 2.3 Decision Tree as predictive model.ipynb
│   ├── video 2.4 Tricks with dimensionality reduction method.ipynb
│   └── video 2.5 combining all together.ipynb
├── Section 3
│   ├── 3.1 Random Forest for classification.ipynb
│   ├── 3.2 Gradient boosting trees and bayes optimization.ipynb
│   ├── 3.2 gradient boosting.ipynb
│   ├── 3.4 Implement blending.ipynb
│   └── 3.5 stacking.ipynb
├── Section 4
│   ├── 4_1_Memory based collaborative filtering.ipynb
│   ├── 4_2_Item to item recommendation with kNN.ipynb
│   ├── 4_3_Applying Matrix Factorization on dataset.ipynb
│   └── 4_4_Wordbach at use.ipynb
└── Section 5
    ├── 5_1_Validation dataset tuning.ipynb
    ├── 5_2_Regularizing model to avoid overfitting.ipynb
    ├── 5_3_Adversarial Validation.ipynb
    └── 5_4_Perform metric selection on real data.ipynb
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2018 Packt
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Python-Machine-Learning-Tips-Tricks-and-Techniques
2 | Python Machine Learning: Tips, Tricks, and Techniques, published by Packt
3 | # Python Machine Learning Tips, Tricks, and Techniques [Video]
4 | This is the code repository for [Python Machine Learning Tips, Tricks, and Techniques [Video]](https://www.packtpub.com/big-data-and-business-intelligence/python-machine-learning-tips-tricks-and-techniques-video?utm_source=github&utm_medium=repository&utm_campaign=9781789135817), published by [Packt](https://www.packtpub.com/?utm_source=github). It contains all the supporting project files necessary to work through the video course from start to finish.
5 | ## About the Video Course
6 | Machine learning allows us to interpret data structures and fit that data into models to identify patterns and make predictions. Python makes this easier with its huge set of libraries that can be easily used for machine learning. In this course, you will learn from a top Kaggle master how to upgrade your Python machine learning skills with the latest advancements in the field.
7 | It is essential to keep upgrading your machine learning skills as there are immense advancements taking place every day. In this course, you will get hands-on experience of solving real problems by implementing cutting-edge techniques to significantly boost your Python Machine Learning skills and, as a consequence, achieve optimized results in almost any project you are working on.
8 | Each technique we cover is itself enough to improve your results. However, combining them is where the real magic happens. Throughout the course, you will work on real datasets to increase your expertise and keep adding new tools to your machine learning toolbox.
9 | By the end of this course, you will know various tips, tricks, and techniques to upgrade your machine learning algorithms to reduce common problems, all the while building efficient machine learning models.
10 |
11 | ## What You Will Learn
12 |
13 | - Tips and tricks to speed up your modeling process and obtain better results
14 | - Make predictions using advanced regression analysis with Python
15 | - Modern techniques for solving supervised learning problems
16 | - Various ways to use ensemble learning with Python to derive optimum results
17 | - Build your own recommendation engine and perform collaborative filtering
18 | - Give your production machine learning system improved reliability
19 |
20 |
21 | ## Instructions and Navigation
22 | ### Assumed Knowledge
23 | To fully benefit from the coverage included in this course, you will need to be familiar with basic Python programming and the common machine learning libraries.
24 | This course is for aspiring data science professionals and machine learning practitioners.
25 | ### Technical Requirements
26 | This course has the following software requirements:
27 | Python 3, with the pandas, NumPy, scikit-learn, and Matplotlib libraries used throughout the notebooks
28 |
29 | ## Related Products
30 | * [Python Data Structures and Algorithms [Video]](https://www.packtpub.com/application-development/python-data-structures-and-algorithms-video?utm_source=github&utm_medium=repository&utm_campaign=9781788622066)
31 |
32 | * [Hands-on Webpack for React Development [Video]](https://www.packtpub.com/application-development/hands-webpack-react-development-video?utm_source=github&utm_medium=repository&utm_campaign=9781789139808)
33 |
34 | * [Object-oriented and Functional Programming with Java 8 [Integrated Course]](https://www.packtpub.com/application-development/object-oriented-and-functional-programming-java-8-integrated-course?utm_source=github&utm_medium=repository&utm_campaign=9781788294027)
35 |
36 |
--------------------------------------------------------------------------------
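Note on setup: the notebooks in this repository import pandas, NumPy, and scikit-learn, and the Section 2 notebooks also use Matplotlib. A minimal environment sketch, assuming Python 3 and pip; the package list is inferred from the notebooks' imports rather than stated in the README:

```python
# Install the dependencies first (shell): pip install numpy pandas scikit-learn matplotlib jupyter
# Then verify that everything the notebooks import is importable.
import numpy as np
import pandas as pd
import sklearn
import matplotlib

for mod in (np, pd, sklearn, matplotlib):
    print(mod.__name__, mod.__version__)
```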
/Section 1/Video 1.1 Improving your models using Feature engineering.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "**Types of scaling**:\n",
8 | "\n",
9 | "* MinMaxScaler - scales all features to $[a, b]$ range\n",
10 | "\n",
11 | "* StandardScaler - removes mean and divides by variance of all features. $X^{new}_i = \\frac{X_i - \\mu}{\\sigma}$, where $\\mu $is for mean and $\\sigma$ is for variance\n",
12 | "\n",
13 | "* RobustScaler - same as StandardScaler but removes median and divides by IQR\n"
14 | ]
15 | },
16 | {
17 | "cell_type": "code",
18 | "execution_count": 7,
19 | "metadata": {
20 | "collapsed": true
21 | },
22 | "outputs": [],
23 | "source": [
24 | "import pandas as pd\n",
25 | "import numpy as np"
26 | ]
27 | },
28 | {
29 | "cell_type": "code",
30 | "execution_count": 8,
31 | "metadata": {
32 | "collapsed": true
33 | },
34 | "outputs": [],
35 | "source": [
36 | "df = pd.read_csv(\"http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv\",sep = ';')"
37 | ]
38 | },
39 | {
40 | "cell_type": "code",
41 | "execution_count": 10,
42 | "metadata": {},
43 | "outputs": [],
44 | "source": [
45 | "y = df.pop('quality')"
46 | ]
47 | },
48 | {
49 | "cell_type": "code",
50 | "execution_count": 11,
51 | "metadata": {
52 | "collapsed": true
53 | },
54 | "outputs": [],
55 | "source": [
56 | "for i in df.columns:\n",
57 | " df[i] = df[i].fillna(np.mean(df[i]))"
58 | ]
59 | },
60 | {
61 | "cell_type": "code",
62 | "execution_count": 12,
63 | "metadata": {
64 | "collapsed": true
65 | },
66 | "outputs": [],
67 | "source": [
68 | "from sklearn.preprocessing import StandardScaler,MinMaxScaler,RobustScaler\n",
69 | "from sklearn.linear_model import Ridge\n",
70 | "from sklearn.model_selection import train_test_split\n",
71 | "from sklearn.metrics import mean_squared_error"
72 | ]
73 | },
74 | {
75 | "cell_type": "code",
76 | "execution_count": 13,
77 | "metadata": {
78 | "collapsed": true
79 | },
80 | "outputs": [],
81 | "source": [
82 | "np.random.seed(42)\n",
83 | "train,test,y_train,y_test = train_test_split(df,y,test_size = 0.1)"
84 | ]
85 | },
86 | {
87 | "cell_type": "code",
88 | "execution_count": 14,
89 | "metadata": {
90 | "collapsed": true
91 | },
92 | "outputs": [],
93 | "source": [
94 | "def fit_predict(train,test,y_train,y_test,scaler = None):\n",
95 | " if scaler is None:\n",
96 | " lr = Ridge()\n",
97 | " lr.fit(train,y_train)\n",
98 | " y_pred = lr.predict(test)\n",
99 | " print('MSE score:', mean_squared_error(y_test,y_pred))\n",
100 | " else:\n",
101 | " train_scaled = scaler.fit_transform(train)\n",
102 | " test_scaled = scaler.transform(test)\n",
103 | " lr = Ridge()\n",
104 | " lr.fit(train_scaled,y_train)\n",
105 | " y_pred = lr.predict(test_scaled)\n",
106 | " print('MSE score:', mean_squared_error(y_test,y_pred))"
107 | ]
108 | },
109 | {
110 | "cell_type": "code",
111 | "execution_count": 15,
112 | "metadata": {},
113 | "outputs": [
114 | {
115 | "name": "stdout",
116 | "output_type": "stream",
117 | "text": [
118 | "MSE score: 0.57404414001\n"
119 | ]
120 | }
121 | ],
122 | "source": [
123 | "fit_predict(train,test,y_train,y_test)"
124 | ]
125 | },
126 | {
127 | "cell_type": "code",
128 | "execution_count": 16,
129 | "metadata": {},
130 | "outputs": [
131 | {
132 | "name": "stdout",
133 | "output_type": "stream",
134 | "text": [
135 | "MSE score: 0.567545067343\n"
136 | ]
137 | }
138 | ],
139 | "source": [
140 | "fit_predict(train,test,y_train,y_test,MinMaxScaler())"
141 | ]
142 | },
143 | {
144 | "cell_type": "code",
145 | "execution_count": 17,
146 | "metadata": {},
147 | "outputs": [
148 | {
149 | "name": "stdout",
150 | "output_type": "stream",
151 | "text": [
152 | "MSE score: 0.558144966334\n"
153 | ]
154 | }
155 | ],
156 | "source": [
157 | "fit_predict(train,test,y_train,y_test,StandardScaler())"
158 | ]
159 | },
160 | {
161 | "cell_type": "code",
162 | "execution_count": 18,
163 | "metadata": {},
164 | "outputs": [
165 | {
166 | "name": "stdout",
167 | "output_type": "stream",
168 | "text": [
169 | "MSE score: 0.55823299573\n"
170 | ]
171 | }
172 | ],
173 | "source": [
174 | "fit_predict(train,test,y_train,y_test,RobustScaler())"
175 | ]
176 | }
177 | ],
178 | "metadata": {
179 | "kernelspec": {
180 | "display_name": "Python 3",
181 | "language": "python",
182 | "name": "python3"
183 | },
184 | "language_info": {
185 | "codemirror_mode": {
186 | "name": "ipython",
187 | "version": 3
188 | },
189 | "file_extension": ".py",
190 | "mimetype": "text/x-python",
191 | "name": "python",
192 | "nbconvert_exporter": "python",
193 | "pygments_lexer": "ipython3",
194 | "version": "3.6.2"
195 | }
196 | },
197 | "nbformat": 4,
198 | "nbformat_minor": 2
199 | }
200 |
--------------------------------------------------------------------------------
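The markdown cell in Video 1.1 describes MinMaxScaler, StandardScaler, and RobustScaler; a minimal sketch comparing them on a single feature with an outlier (the toy array is illustrative, not course data):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

# One feature with an outlier (100.0), to show how each scaler reacts to it.
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

for scaler in (MinMaxScaler(), StandardScaler(), RobustScaler()):
    # fit_transform learns the statistics (min/max, mean/std, median/IQR)
    # on X and applies them in one step.
    print(scaler.__class__.__name__, scaler.fit_transform(X).ravel().round(2))
```

MinMaxScaler and StandardScaler squeeze the four regular points into a narrow band because the outlier dominates the range and the mean/std, while RobustScaler keeps them spread out since one outlier barely moves the median and IQR.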
/Section 1/Video 1.2 Implementing feature engineering with logistic regression.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {
7 | "collapsed": true
8 | },
9 | "outputs": [],
10 | "source": [
11 | "import pandas as pd\n",
12 | "import numpy as np\n",
13 | "from sklearn.metrics import accuracy_score\n",
14 | "from sklearn.linear_model import LogisticRegression\n",
15 | "from sklearn.model_selection import train_test_split\n",
16 | "import warnings\n",
17 | "warnings.filterwarnings(\"ignore\")\n",
18 | "\n",
19 | "np.random.seed(42)"
20 | ]
21 | },
22 | {
23 | "cell_type": "code",
24 | "execution_count": 2,
25 | "metadata": {
26 | "collapsed": true
27 | },
28 | "outputs": [],
29 | "source": [
30 | "df = pd.read_csv(\"http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv\", sep = ';')\n",
31 | "y = df.pop('quality')\n",
32 | "\n",
33 | "for i in df.columns:\n",
34 | " df[i] = df[i].fillna(np.mean(df[i]))\n",
35 | " \n",
36 | "train, test, y_train, y_test = train_test_split(df, y, test_size = 0.2) "
37 | ]
38 | },
39 | {
40 | "cell_type": "code",
41 | "execution_count": 3,
42 | "metadata": {},
43 | "outputs": [
44 | {
45 | "name": "stdout",
46 | "output_type": "stream",
47 | "text": [
48 | "Accuracy score baseline: 0.514285714286\n"
49 | ]
50 | }
51 | ],
52 | "source": [
53 | "lr = LogisticRegression()\n",
54 | "lr.fit(train, y_train)\n",
55 | "y_pred = lr.predict(test)\n",
56 | "print('Accuracy score baseline:', accuracy_score(y_test, y_pred))"
57 | ]
58 | },
59 | {
60 | "cell_type": "code",
61 | "execution_count": 6,
62 | "metadata": {},
63 | "outputs": [
64 | {
65 | "data": {
66 | "text/html": [
67 | "\n",
68 | "\n",
81 | "
\n",
82 | " \n",
83 | " \n",
84 | " | \n",
85 | " fixed acidity | \n",
86 | " volatile acidity | \n",
87 | " citric acid | \n",
88 | " residual sugar | \n",
89 | " chlorides | \n",
90 | " free sulfur dioxide | \n",
91 | " total sulfur dioxide | \n",
92 | " density | \n",
93 | " pH | \n",
94 | " sulphates | \n",
95 | " alcohol | \n",
96 | "
\n",
97 | " \n",
98 | " \n",
99 | " \n",
100 | " 4665 | \n",
101 | " 7.3 | \n",
102 | " 0.17 | \n",
103 | " 0.36 | \n",
104 | " 8.20 | \n",
105 | " 0.028 | \n",
106 | " 44.0 | \n",
107 | " 111.0 | \n",
108 | " 0.99272 | \n",
109 | " 3.14 | \n",
110 | " 0.41 | \n",
111 | " 12.4 | \n",
112 | "
\n",
113 | " \n",
114 | " 1943 | \n",
115 | " 6.3 | \n",
116 | " 0.25 | \n",
117 | " 0.44 | \n",
118 | " 11.60 | \n",
119 | " 0.041 | \n",
120 | " 48.0 | \n",
121 | " 195.0 | \n",
122 | " 0.99680 | \n",
123 | " 3.18 | \n",
124 | " 0.52 | \n",
125 | " 9.5 | \n",
126 | "
\n",
127 | " \n",
128 | " 3399 | \n",
129 | " 5.6 | \n",
130 | " 0.32 | \n",
131 | " 0.33 | \n",
132 | " 7.40 | \n",
133 | " 0.037 | \n",
134 | " 25.0 | \n",
135 | " 95.0 | \n",
136 | " 0.99268 | \n",
137 | " 3.25 | \n",
138 | " 0.49 | \n",
139 | " 11.1 | \n",
140 | "
\n",
141 | " \n",
142 | " 843 | \n",
143 | " 6.9 | \n",
144 | " 0.19 | \n",
145 | " 0.35 | \n",
146 | " 1.70 | \n",
147 | " 0.036 | \n",
148 | " 33.0 | \n",
149 | " 101.0 | \n",
150 | " 0.99315 | \n",
151 | " 3.21 | \n",
152 | " 0.54 | \n",
153 | " 10.8 | \n",
154 | "
\n",
155 | " \n",
156 | " 2580 | \n",
157 | " 7.7 | \n",
158 | " 0.30 | \n",
159 | " 0.26 | \n",
160 | " 18.95 | \n",
161 | " 0.053 | \n",
162 | " 36.0 | \n",
163 | " 174.0 | \n",
164 | " 0.99976 | \n",
165 | " 3.20 | \n",
166 | " 0.50 | \n",
167 | " 10.4 | \n",
168 | "
\n",
169 | " \n",
170 | "
\n",
171 | "
"
172 | ],
173 | "text/plain": [
174 | " fixed acidity volatile acidity citric acid residual sugar chlorides \\\n",
175 | "4665 7.3 0.17 0.36 8.20 0.028 \n",
176 | "1943 6.3 0.25 0.44 11.60 0.041 \n",
177 | "3399 5.6 0.32 0.33 7.40 0.037 \n",
178 | "843 6.9 0.19 0.35 1.70 0.036 \n",
179 | "2580 7.7 0.30 0.26 18.95 0.053 \n",
180 | "\n",
181 | " free sulfur dioxide total sulfur dioxide density pH sulphates \\\n",
182 | "4665 44.0 111.0 0.99272 3.14 0.41 \n",
183 | "1943 48.0 195.0 0.99680 3.18 0.52 \n",
184 | "3399 25.0 95.0 0.99268 3.25 0.49 \n",
185 | "843 33.0 101.0 0.99315 3.21 0.54 \n",
186 | "2580 36.0 174.0 0.99976 3.20 0.50 \n",
187 | "\n",
188 | " alcohol \n",
189 | "4665 12.4 \n",
190 | "1943 9.5 \n",
191 | "3399 11.1 \n",
192 | "843 10.8 \n",
193 | "2580 10.4 "
194 | ]
195 | },
196 | "execution_count": 6,
197 | "metadata": {},
198 | "output_type": "execute_result"
199 | }
200 | ],
201 | "source": [
202 | "train.head()"
203 | ]
204 | },
205 | {
206 | "cell_type": "code",
207 | "execution_count": 7,
208 | "metadata": {
209 | "collapsed": true
210 | },
211 | "outputs": [],
212 | "source": [
213 | "def feat_eng(df):\n",
214 | " df['eng1'] = df['fixed acidity'] * df['pH']\n",
215 | " df['eng2'] = df['total sulfur dioxide'] / df['free sulfur dioxide']\n",
216 | " df['eng3'] = df['sulphates'] / df['chlorides']\n",
217 | " df['eng4'] = df['chlorides'] / df['sulphates']\n",
218 | " return df\n",
219 | "\n",
220 | "train = feat_eng(train)\n",
221 | "test = feat_eng(test)"
222 | ]
223 | },
224 | {
225 | "cell_type": "code",
226 | "execution_count": 8,
227 | "metadata": {},
228 | "outputs": [
229 | {
230 | "data": {
231 | "text/html": [
232 | "\n",
233 | "\n",
246 | "
\n",
247 | " \n",
248 | " \n",
249 | " | \n",
250 | " fixed acidity | \n",
251 | " volatile acidity | \n",
252 | " citric acid | \n",
253 | " residual sugar | \n",
254 | " chlorides | \n",
255 | " free sulfur dioxide | \n",
256 | " total sulfur dioxide | \n",
257 | " density | \n",
258 | " pH | \n",
259 | " sulphates | \n",
260 | " alcohol | \n",
261 | " eng1 | \n",
262 | " eng2 | \n",
263 | " eng3 | \n",
264 | " eng4 | \n",
265 | "
\n",
266 | " \n",
267 | " \n",
268 | " \n",
269 | " 4665 | \n",
270 | " 7.3 | \n",
271 | " 0.17 | \n",
272 | " 0.36 | \n",
273 | " 8.20 | \n",
274 | " 0.028 | \n",
275 | " 44.0 | \n",
276 | " 111.0 | \n",
277 | " 0.99272 | \n",
278 | " 3.14 | \n",
279 | " 0.41 | \n",
280 | " 12.4 | \n",
281 | " 22.922 | \n",
282 | " 2.522727 | \n",
283 | " 14.642857 | \n",
284 | " 0.068293 | \n",
285 | "
\n",
286 | " \n",
287 | " 1943 | \n",
288 | " 6.3 | \n",
289 | " 0.25 | \n",
290 | " 0.44 | \n",
291 | " 11.60 | \n",
292 | " 0.041 | \n",
293 | " 48.0 | \n",
294 | " 195.0 | \n",
295 | " 0.99680 | \n",
296 | " 3.18 | \n",
297 | " 0.52 | \n",
298 | " 9.5 | \n",
299 | " 20.034 | \n",
300 | " 4.062500 | \n",
301 | " 12.682927 | \n",
302 | " 0.078846 | \n",
303 | "
\n",
304 | " \n",
305 | " 3399 | \n",
306 | " 5.6 | \n",
307 | " 0.32 | \n",
308 | " 0.33 | \n",
309 | " 7.40 | \n",
310 | " 0.037 | \n",
311 | " 25.0 | \n",
312 | " 95.0 | \n",
313 | " 0.99268 | \n",
314 | " 3.25 | \n",
315 | " 0.49 | \n",
316 | " 11.1 | \n",
317 | " 18.200 | \n",
318 | " 3.800000 | \n",
319 | " 13.243243 | \n",
320 | " 0.075510 | \n",
321 | "
\n",
322 | " \n",
323 | " 843 | \n",
324 | " 6.9 | \n",
325 | " 0.19 | \n",
326 | " 0.35 | \n",
327 | " 1.70 | \n",
328 | " 0.036 | \n",
329 | " 33.0 | \n",
330 | " 101.0 | \n",
331 | " 0.99315 | \n",
332 | " 3.21 | \n",
333 | " 0.54 | \n",
334 | " 10.8 | \n",
335 | " 22.149 | \n",
336 | " 3.060606 | \n",
337 | " 15.000000 | \n",
338 | " 0.066667 | \n",
339 | "
\n",
340 | " \n",
341 | " 2580 | \n",
342 | " 7.7 | \n",
343 | " 0.30 | \n",
344 | " 0.26 | \n",
345 | " 18.95 | \n",
346 | " 0.053 | \n",
347 | " 36.0 | \n",
348 | " 174.0 | \n",
349 | " 0.99976 | \n",
350 | " 3.20 | \n",
351 | " 0.50 | \n",
352 | " 10.4 | \n",
353 | " 24.640 | \n",
354 | " 4.833333 | \n",
355 | " 9.433962 | \n",
356 | " 0.106000 | \n",
357 | "
\n",
358 | " \n",
359 | "
\n",
360 | "
"
361 | ],
362 | "text/plain": [
363 | " fixed acidity volatile acidity citric acid residual sugar chlorides \\\n",
364 | "4665 7.3 0.17 0.36 8.20 0.028 \n",
365 | "1943 6.3 0.25 0.44 11.60 0.041 \n",
366 | "3399 5.6 0.32 0.33 7.40 0.037 \n",
367 | "843 6.9 0.19 0.35 1.70 0.036 \n",
368 | "2580 7.7 0.30 0.26 18.95 0.053 \n",
369 | "\n",
370 | " free sulfur dioxide total sulfur dioxide density pH sulphates \\\n",
371 | "4665 44.0 111.0 0.99272 3.14 0.41 \n",
372 | "1943 48.0 195.0 0.99680 3.18 0.52 \n",
373 | "3399 25.0 95.0 0.99268 3.25 0.49 \n",
374 | "843 33.0 101.0 0.99315 3.21 0.54 \n",
375 | "2580 36.0 174.0 0.99976 3.20 0.50 \n",
376 | "\n",
377 | " alcohol eng1 eng2 eng3 eng4 \n",
378 | "4665 12.4 22.922 2.522727 14.642857 0.068293 \n",
379 | "1943 9.5 20.034 4.062500 12.682927 0.078846 \n",
380 | "3399 11.1 18.200 3.800000 13.243243 0.075510 \n",
381 | "843 10.8 22.149 3.060606 15.000000 0.066667 \n",
382 | "2580 10.4 24.640 4.833333 9.433962 0.106000 "
383 | ]
384 | },
385 | "execution_count": 8,
386 | "metadata": {},
387 | "output_type": "execute_result"
388 | }
389 | ],
390 | "source": [
391 | "train.head()"
392 | ]
393 | },
394 | {
395 | "cell_type": "code",
396 | "execution_count": 9,
397 | "metadata": {},
398 | "outputs": [
399 | {
400 | "name": "stdout",
401 | "output_type": "stream",
402 | "text": [
403 | "Accuracy score feat eng: 0.523469387755\n"
404 | ]
405 | }
406 | ],
407 | "source": [
408 | "lr = LogisticRegression()\n",
409 | "lr.fit(train, y_train)\n",
410 | "y_pred = lr.predict(test)\n",
411 | "print('Accuracy score feat eng:', accuracy_score(y_test, y_pred))"
412 | ]
413 | },
414 | {
415 | "cell_type": "code",
416 | "execution_count": null,
417 | "metadata": {
418 | "collapsed": true
419 | },
420 | "outputs": [],
421 | "source": []
422 | }
423 | ],
424 | "metadata": {
425 | "kernelspec": {
426 | "display_name": "Python 3",
427 | "language": "python",
428 | "name": "python3"
429 | },
430 | "language_info": {
431 | "codemirror_mode": {
432 | "name": "ipython",
433 | "version": 3
434 | },
435 | "file_extension": ".py",
436 | "mimetype": "text/x-python",
437 | "name": "python",
438 | "nbconvert_exporter": "python",
439 | "pygments_lexer": "ipython3",
440 | "version": "3.6.2"
441 | }
442 | },
443 | "nbformat": 4,
444 | "nbformat_minor": 2
445 | }
446 |
--------------------------------------------------------------------------------
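Video 1.2 applies `feat_eng` to `train` and `test` by hand. A sketch of the same four engineered ratios wrapped in a `FunctionTransformer`, so the transformation and the classifier can be fitted as a single `Pipeline`; this is an alternative composition I'm adding, not code from the course:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

def feat_eng(df):
    df = df.copy()  # work on a copy so the caller's frame is untouched
    df['eng1'] = df['fixed acidity'] * df['pH']
    df['eng2'] = df['total sulfur dioxide'] / df['free sulfur dioxide']
    df['eng3'] = df['sulphates'] / df['chlorides']
    df['eng4'] = df['chlorides'] / df['sulphates']
    return df

pipe = Pipeline([
    ('features', FunctionTransformer(feat_eng, validate=False)),  # keep the DataFrame intact
    ('clf', LogisticRegression()),
])
# pipe.fit(train, y_train) followed by pipe.predict(test) reproduces the manual steps.
```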
/Section 1/Video 1.3 Extracting data with feature selection and interaction.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {
7 | "collapsed": true
8 | },
9 | "outputs": [],
10 | "source": [
11 | "import pandas as pd\n",
12 | "import numpy as np\n",
13 | "from sklearn.model_selection import train_test_split\n",
14 | "import warnings\n",
15 | "from sklearn.metrics import mean_absolute_error\n",
16 | "from sklearn.linear_model import Lasso,Ridge\n",
17 | "from sklearn.preprocessing import PolynomialFeatures\n",
18 | "warnings.filterwarnings(\"ignore\")\n",
19 | "np.random.seed(42)"
20 | ]
21 | },
22 | {
23 | "cell_type": "code",
24 | "execution_count": 2,
25 | "metadata": {
26 | "collapsed": true
27 | },
28 | "outputs": [],
29 | "source": [
30 | "df = pd.read_csv(\"http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv\", sep = ';')\n",
31 | "y = df.pop('quality')\n",
32 | "\n",
33 | "for i in df.columns:\n",
34 | " df[i] = df[i].fillna(np.mean(df[i]))\n",
35 | " \n",
36 | "train, test, y_train, y_test = train_test_split(df, y, test_size = 0.2) "
37 | ]
38 | },
39 | {
40 | "cell_type": "code",
41 | "execution_count": 3,
42 | "metadata": {},
43 | "outputs": [
44 | {
45 | "name": "stdout",
46 | "output_type": "stream",
47 | "text": [
48 | "MAE score: 0.59404938801\n"
49 | ]
50 | }
51 | ],
52 | "source": [
53 | "def fit_predict(train, test, y_train, y_test, scaler = None):\n",
54 | " if scaler is None:\n",
55 | " lr = Ridge()\n",
56 | " lr.fit(train, y_train)\n",
57 | " y_pred = lr.predict(test)\n",
58 | " print('MAE score:', mean_absolute_error(y_test, y_pred))\n",
59 | " else:\n",
60 | " train_scaled = scaler.fit_transform(train)\n",
61 | " test_scaled = scaler.transform(test)\n",
62 | " lr = Ridge()\n",
63 | " lr.fit(train_scaled, y_train)\n",
64 | " y_pred = lr.predict(test_scaled)\n",
65 | " print('MAE score:', mean_absolute_error(y_test, y_pred))\n",
66 | "\n",
67 | "fit_predict(train,test,y_train,y_test)"
68 | ]
69 | },
70 | {
71 | "cell_type": "code",
72 | "execution_count": 4,
73 | "metadata": {},
74 | "outputs": [
75 | {
76 | "name": "stdout",
77 | "output_type": "stream",
78 | "text": [
79 | "non zero features: 6\n"
80 | ]
81 | }
82 | ],
83 | "source": [
84 | "def get_feat_imp(train,y_train,alpha=0.01):\n",
85 | " lr = Lasso(alpha=alpha)\n",
86 | " lr.fit(train,y_train)\n",
87 | " return lr.coef_\n",
88 | "fi = get_feat_imp(train,y_train)\n",
89 | "print('non zero features:',np.sum(fi != 0))"
90 | ]
91 | },
92 | {
93 | "cell_type": "code",
94 | "execution_count": 5,
95 | "metadata": {
96 | "collapsed": true
97 | },
98 | "outputs": [],
99 | "source": [
100 | "bestf = np.argwhere(fi)\n",
101 | "train_best = train.iloc[:, [x[0] for x in bestf.tolist()]]\n",
102 | "test_best = test.iloc[:, [x[0] for x in bestf.tolist()]]"
103 | ]
104 | },
105 | {
106 | "cell_type": "code",
107 | "execution_count": 6,
108 | "metadata": {
109 | "collapsed": true
110 | },
111 | "outputs": [],
112 | "source": [
113 | "def create_poly(train,test,degree):\n",
114 | " poly = PolynomialFeatures(degree=degree)\n",
115 | " train_poly = poly.fit_transform(train)\n",
116 | " test_poly = poly.fit_transform(test)\n",
117 | " return train_poly,test_poly"
118 | ]
119 | },
120 | {
121 | "cell_type": "code",
122 | "execution_count": 7,
123 | "metadata": {},
124 | "outputs": [
125 | {
126 | "name": "stdout",
127 | "output_type": "stream",
128 | "text": [
129 | "No feature selection degree 1\n",
130 | "MAE score: 0.59404938801\n",
131 | "----------\n",
132 | "No feature selection degree 2\n",
133 | "MAE score: 0.577238983011\n",
134 | "----------\n",
135 | "No feature selection degree 3\n",
136 | "MAE score: 0.596958634563\n",
137 | "----------\n"
138 | ]
139 | }
140 | ],
141 | "source": [
142 | "for degree in [1,2,3]:\n",
143 | " train_poly,test_poly = create_poly(train,test,degree)\n",
144 | " print('No feature selection degree',degree)\n",
145 | " fit_predict(train_poly,test_poly,y_train,y_test)\n",
146 | " print(10*'-')"
147 | ]
148 | },
149 | {
150 | "cell_type": "code",
151 | "execution_count": 8,
152 | "metadata": {},
153 | "outputs": [
154 | {
155 | "name": "stdout",
156 | "output_type": "stream",
157 | "text": [
158 | "Feature selection degree 1\n",
159 | "MAE score: 0.597972321004\n",
160 | "----------\n",
161 | "Feature selection degree 2\n",
162 | "MAE score: 0.591541808012\n",
163 | "----------\n",
164 | "Feature selection degree 3\n",
165 | "MAE score: 0.597630769778\n",
166 | "----------\n"
167 | ]
168 | }
169 | ],
170 | "source": [
171 | "for degree in [1,2,3]:\n",
172 | " train_poly,test_poly = create_poly(train_best,test_best,degree)\n",
173 | " print('Feature selection degree',degree)\n",
174 | " fit_predict(train_poly,test_poly,y_train,y_test)\n",
175 | " print(10*'-')"
176 | ]
177 | }
178 | ],
179 | "metadata": {
180 | "kernelspec": {
181 | "display_name": "Python 3",
182 | "language": "python",
183 | "name": "python3"
184 | },
185 | "language_info": {
186 | "codemirror_mode": {
187 | "name": "ipython",
188 | "version": 3
189 | },
190 | "file_extension": ".py",
191 | "mimetype": "text/x-python",
192 | "name": "python",
193 | "nbconvert_exporter": "python",
194 | "pygments_lexer": "ipython3",
195 | "version": "3.6.2"
196 | }
197 | },
198 | "nbformat": 4,
199 | "nbformat_minor": 2
200 | }
201 |
--------------------------------------------------------------------------------
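Video 1.3 keeps the features with non-zero Lasso coefficients via `np.argwhere`; scikit-learn's `SelectFromModel` packages the same rule. A minimal sketch, assuming the notebook's `train`/`test`/`y_train` objects:

```python
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# With L1 regularization most coefficients become exactly zero, so a tiny
# threshold keeps precisely the features the notebook selects by hand.
selector = SelectFromModel(Lasso(alpha=0.01), threshold=1e-10)
train_best = selector.fit_transform(train, y_train)
test_best = selector.transform(test)
print('non zero features:', train_best.shape[1])
```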
/Section 1/video 1.4 Combining all together.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {
7 | "collapsed": true
8 | },
9 | "outputs": [],
10 | "source": [
11 | "import pandas as pd\n",
12 | "import numpy as np\n",
13 | "from sklearn.model_selection import train_test_split\n",
14 | "import warnings\n",
15 | "from sklearn.metrics import mean_absolute_error\n",
16 | "from sklearn.linear_model import Lasso,Ridge\n",
17 | "from sklearn.preprocessing import PolynomialFeatures\n",
18 | "from sklearn.preprocessing import StandardScaler\n",
19 | "warnings.filterwarnings(\"ignore\")\n",
20 | "np.random.seed(42)"
21 | ]
22 | },
23 | {
24 | "cell_type": "code",
25 | "execution_count": 2,
26 | "metadata": {
27 | "collapsed": true
28 | },
29 | "outputs": [],
30 | "source": [
31 | "df = pd.read_csv(\"http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv\", sep = ';')\n",
32 | "y = df.pop('quality')\n",
33 | "\n",
34 | "for i in df.columns:\n",
35 | " df[i] = df[i].fillna(np.mean(df[i]))\n",
36 | " \n",
37 | "train, test, y_train, y_test = train_test_split(df, y, test_size = 0.2) "
38 | ]
39 | },
40 | {
41 | "cell_type": "code",
42 | "execution_count": 3,
43 | "metadata": {},
44 | "outputs": [
45 | {
46 | "name": "stdout",
47 | "output_type": "stream",
48 | "text": [
49 | "MAE score: 0.59404938801\n"
50 | ]
51 | }
52 | ],
53 | "source": [
54 | "def fit_predict(train, test, y_train, y_test, scaler = None):\n",
55 | " if scaler is None:\n",
56 | " lr = Ridge()\n",
57 | " lr.fit(train, y_train)\n",
58 | " y_pred = lr.predict(test)\n",
59 | " print('MAE score:', mean_absolute_error(y_test, y_pred))\n",
60 | " else:\n",
61 | " train_scaled = scaler.fit_transform(train)\n",
62 | " test_scaled = scaler.transform(test)\n",
63 | " lr = Ridge()\n",
64 | " lr.fit(train_scaled, y_train)\n",
65 | " y_pred = lr.predict(test_scaled)\n",
66 | " print('MAE score:', mean_absolute_error(y_test, y_pred))\n",
67 | "\n",
68 | "fit_predict(train,test,y_train,y_test)"
69 | ]
70 | },
71 | {
72 | "cell_type": "code",
73 | "execution_count": 4,
74 | "metadata": {
75 | "collapsed": true
76 | },
77 | "outputs": [],
78 | "source": [
79 | "def feat_eng(df):\n",
80 | " df['eng1'] = df['fixed acidity'] * df['pH']\n",
81 | " df['eng2'] = df['total sulfur dioxide'] / df['free sulfur dioxide']\n",
82 | " df['eng3'] = df['sulphates'] / df['chlorides']\n",
83 | " df['eng4'] = df['chlorides'] / df['sulphates']\n",
84 | " return df\n",
85 | "\n",
86 | "train = feat_eng(train)\n",
87 | "test = feat_eng(test)"
88 | ]
89 | },
90 | {
91 | "cell_type": "code",
92 | "execution_count": 5,
93 | "metadata": {},
94 | "outputs": [
95 | {
96 | "name": "stdout",
97 | "output_type": "stream",
98 | "text": [
99 | "non zero features: 8\n"
100 | ]
101 | }
102 | ],
103 | "source": [
104 | "def get_feat_imp(train,y_train,alpha=0.01):\n",
105 | " lr = Lasso(alpha=alpha)\n",
106 | " lr.fit(train,y_train)\n",
107 | " return lr.coef_\n",
108 | "fi = get_feat_imp(train,y_train)\n",
109 | "print('non zero features:',np.sum(fi != 0))"
110 | ]
111 | },
112 | {
113 | "cell_type": "code",
114 | "execution_count": 6,
115 | "metadata": {
116 | "collapsed": true
117 | },
118 | "outputs": [],
119 | "source": [
120 | "bestf = np.argwhere(fi)\n",
121 | "train_best = train.iloc[:, [x[0] for x in bestf.tolist()]]\n",
122 | "test_best = test.iloc[:, [x[0] for x in bestf.tolist()]]"
123 | ]
124 | },
125 | {
126 | "cell_type": "code",
127 | "execution_count": 7,
128 | "metadata": {
129 | "collapsed": true
130 | },
131 | "outputs": [],
132 | "source": [
133 | "def create_poly(train,test,degree):\n",
134 | " poly = PolynomialFeatures(degree=degree)\n",
135 | " train_poly = poly.fit_transform(train)\n",
136 | " test_poly = poly.fit_transform(test)\n",
137 | " return train_poly,test_poly"
138 | ]
139 | },
140 | {
141 | "cell_type": "code",
142 | "execution_count": 8,
143 | "metadata": {},
144 | "outputs": [
145 | {
146 | "name": "stdout",
147 | "output_type": "stream",
148 | "text": [
149 | "No feature selection degree 1\n",
150 | "MAE score: 0.579732117236\n",
151 | "----------\n",
152 | "No feature selection degree 2\n",
153 | "MAE score: 0.566008983832\n",
154 | "----------\n",
155 | "No feature selection degree 3\n",
156 | "MAE score: 0.557362504716\n",
157 | "----------\n",
158 | "No feature selection degree 4\n",
159 | "MAE score: 0.569031127402\n",
160 | "----------\n"
161 | ]
162 | }
163 | ],
164 | "source": [
165 | "for degree in [1,2,3,4]:\n",
166 | " train_poly,test_poly = create_poly(train,test,degree)\n",
167 | " print('No feature selection degree',degree)\n",
168 | " fit_predict(train_poly,test_poly,y_train,y_test,StandardScaler())\n",
169 | " print(10*'-')"
170 | ]
171 | },
172 | {
173 | "cell_type": "code",
174 | "execution_count": 9,
175 | "metadata": {},
176 | "outputs": [
177 | {
178 | "name": "stdout",
179 | "output_type": "stream",
180 | "text": [
181 | "Feature selection degree 1\n",
182 | "MAE score: 0.586503532959\n",
183 | "----------\n",
184 | "Feature selection degree 2\n",
185 | "MAE score: 0.575863518226\n",
186 | "----------\n",
187 | "Feature selection degree 3\n",
188 | "MAE score: 0.571050244983\n",
189 | "----------\n",
190 | "Feature selection degree 4\n",
191 | "MAE score: 0.626462712407\n",
192 | "----------\n"
193 | ]
194 | }
195 | ],
196 | "source": [
197 | "for degree in [1,2,3,4]:\n",
198 | " train_poly,test_poly = create_poly(train_best,test_best,degree)\n",
199 | " print('Feature selection degree',degree)\n",
200 | " fit_predict(train_poly,test_poly,y_train,y_test,StandardScaler())\n",
201 | " print(10*'-')"
202 | ]
203 | },
204 | {
205 | "cell_type": "code",
206 | "execution_count": 10,
207 | "metadata": {},
208 | "outputs": [
209 | {
210 | "name": "stdout",
211 | "output_type": "stream",
212 | "text": [
213 | "overall improvement is 6.18 %\n"
214 | ]
215 | }
216 | ],
217 | "source": [
218 | "original_score = 0.59404938801\n",
219 | "best_score = 0.557362504716\n",
220 | "improvement = np.round(100*(original_score - best_score)/original_score,2)\n",
221 | "print('overall improvement is {} %'.format(improvement))"
222 | ]
223 | }
224 | ],
225 | "metadata": {
226 | "kernelspec": {
227 | "display_name": "Python 3",
228 | "language": "python",
229 | "name": "python3"
230 | },
231 | "language_info": {
232 | "codemirror_mode": {
233 | "name": "ipython",
234 | "version": 3
235 | },
236 | "file_extension": ".py",
237 | "mimetype": "text/x-python",
238 | "name": "python",
239 | "nbconvert_exporter": "python",
240 | "pygments_lexer": "ipython3",
241 | "version": "3.6.2"
242 | }
243 | },
244 | "nbformat": 4,
245 | "nbformat_minor": 2
246 | }
247 |
--------------------------------------------------------------------------------
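Video 1.4's best setting (degree-3 polynomial features followed by standard scaling and Ridge) can be expressed as one estimator. A sketch using a `Pipeline`, equivalent in spirit to calling `create_poly` and then `fit_predict(..., StandardScaler())`:

```python
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

pipe = Pipeline([
    ('poly', PolynomialFeatures(degree=3)),  # expand features and interactions
    ('scale', StandardScaler()),             # statistics learned on train only
    ('ridge', Ridge()),
])
pipe.fit(train, y_train)
print('MAE score:', mean_absolute_error(y_test, pipe.predict(test)))
```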
/Section 1/video 1.5 Build model based on real-world problem.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 17,
6 | "metadata": {
7 | "collapsed": true
8 | },
9 | "outputs": [],
10 | "source": [
11 | "import pandas as pd\n",
12 | "import numpy as np\n",
13 | "from sklearn.model_selection import train_test_split\n",
14 | "import warnings\n",
15 | "from sklearn.metrics import mean_absolute_error\n",
16 | "from sklearn.linear_model import Lasso,Ridge\n",
17 | "from sklearn.preprocessing import PolynomialFeatures\n",
18 | "from sklearn.preprocessing import StandardScaler\n",
19 | "warnings.filterwarnings(\"ignore\")\n",
20 | "np.random.seed(42)"
21 | ]
22 | },
23 | {
24 | "cell_type": "markdown",
25 | "metadata": {},
26 | "source": [
27 | "Data Set Information:\n",
28 | "\n",
29 | "Bike sharing systems are new generation of traditional bike rentals where whole process from membership, rental and return back has become automatic. Through these systems, user is able to easily rent a bike from a particular position and return back at another position. Currently, there are about over 500 bike-sharing programs around the world which is composed of over 500 thousands bicycles. Today, there exists great interest in these systems due to their important role in traffic, environmental and health issues. \n",
30 | "\n",
31 | "Apart from interesting real world applications of bike sharing systems, the characteristics of data being generated by these systems make them attractive for the research. Opposed to other transport services such as bus or subway, the duration of travel, departure and arrival position is explicitly recorded in these systems. This feature turns bike sharing system into a virtual sensor network that can be used for sensing mobility in the city. Hence, it is expected that most of important events in the city could be detected via monitoring these data.\n",
32 | "\n",
33 | "cnt: count of total rental bikes including both casual and registered"
34 | ]
35 | },
36 | {
37 | "cell_type": "code",
38 | "execution_count": 18,
39 | "metadata": {},
40 | "outputs": [
41 | {
42 | "data": {
43 | "text/html": [
44 | "\n",
45 | "\n",
58 | "
\n",
59 | " \n",
60 | " \n",
61 | " | \n",
62 | " season | \n",
63 | " yr | \n",
64 | " mnth | \n",
65 | " hr | \n",
66 | " holiday | \n",
67 | " weekday | \n",
68 | " workingday | \n",
69 | " weathersit | \n",
70 | " temp | \n",
71 | " atemp | \n",
72 | " hum | \n",
73 | " windspeed | \n",
74 | "
\n",
75 | " \n",
76 | " \n",
77 | " \n",
78 | " 0 | \n",
79 | " 1 | \n",
80 | " 0 | \n",
81 | " 1 | \n",
82 | " 0 | \n",
83 | " 0 | \n",
84 | " 6 | \n",
85 | " 0 | \n",
86 | " 1 | \n",
87 | " 0.24 | \n",
88 | " 0.2879 | \n",
89 | " 0.81 | \n",
90 | " 0.0 | \n",
91 | "
\n",
92 | " \n",
93 | " 1 | \n",
94 | " 1 | \n",
95 | " 0 | \n",
96 | " 1 | \n",
97 | " 1 | \n",
98 | " 0 | \n",
99 | " 6 | \n",
100 | " 0 | \n",
101 | " 1 | \n",
102 | " 0.22 | \n",
103 | " 0.2727 | \n",
104 | " 0.80 | \n",
105 | " 0.0 | \n",
106 | "
\n",
107 | " \n",
108 | " 2 | \n",
109 | " 1 | \n",
110 | " 0 | \n",
111 | " 1 | \n",
112 | " 2 | \n",
113 | " 0 | \n",
114 | " 6 | \n",
115 | " 0 | \n",
116 | " 1 | \n",
117 | " 0.22 | \n",
118 | " 0.2727 | \n",
119 | " 0.80 | \n",
120 | " 0.0 | \n",
121 | "
\n",
122 | " \n",
123 | " 3 | \n",
124 | " 1 | \n",
125 | " 0 | \n",
126 | " 1 | \n",
127 | " 3 | \n",
128 | " 0 | \n",
129 | " 6 | \n",
130 | " 0 | \n",
131 | " 1 | \n",
132 | " 0.24 | \n",
133 | " 0.2879 | \n",
134 | " 0.75 | \n",
135 | " 0.0 | \n",
136 | "
\n",
137 | " \n",
138 | " 4 | \n",
139 | " 1 | \n",
140 | " 0 | \n",
141 | " 1 | \n",
142 | " 4 | \n",
143 | " 0 | \n",
144 | " 6 | \n",
145 | " 0 | \n",
146 | " 1 | \n",
147 | " 0.24 | \n",
148 | " 0.2879 | \n",
149 | " 0.75 | \n",
150 | " 0.0 | \n",
151 | "
\n",
152 | " \n",
153 | "
\n",
154 | "
"
155 | ],
156 | "text/plain": [
157 | " season yr mnth hr holiday weekday workingday weathersit temp \\\n",
158 | "0 1 0 1 0 0 6 0 1 0.24 \n",
159 | "1 1 0 1 1 0 6 0 1 0.22 \n",
160 | "2 1 0 1 2 0 6 0 1 0.22 \n",
161 | "3 1 0 1 3 0 6 0 1 0.24 \n",
162 | "4 1 0 1 4 0 6 0 1 0.24 \n",
163 | "\n",
164 | " atemp hum windspeed \n",
165 | "0 0.2879 0.81 0.0 \n",
166 | "1 0.2727 0.80 0.0 \n",
167 | "2 0.2727 0.80 0.0 \n",
168 | "3 0.2879 0.75 0.0 \n",
169 | "4 0.2879 0.75 0.0 "
170 | ]
171 | },
172 | "execution_count": 18,
173 | "metadata": {},
174 | "output_type": "execute_result"
175 | }
176 | ],
177 | "source": [
178 | "df = pd.read_csv('hour.csv')\n",
179 | "y = df.pop('cnt')\n",
180 | "df.drop(['instant', 'casual', 'dteday', 'registered'], axis = 1, inplace = True)\n",
181 | "df.head()"
182 | ]
183 | },
184 | {
185 | "cell_type": "code",
186 | "execution_count": 3,
187 | "metadata": {
188 | "collapsed": true
189 | },
190 | "outputs": [],
191 | "source": [
192 | "for i in df.columns:\n",
193 | " df[i] = df[i].fillna(np.mean(df[i]))\n",
194 | " \n",
195 | "train, test, y_train, y_test = train_test_split(df, y, test_size = 0.2) "
196 | ]
197 | },
198 | {
199 | "cell_type": "code",
200 | "execution_count": 4,
201 | "metadata": {
202 | "collapsed": true
203 | },
204 | "outputs": [],
205 | "source": [
206 | "def fit_predict(train, test, y_train, y_test, scaler = None):\n",
207 | " if scaler is None:\n",
208 | " lr = Ridge()\n",
209 | " lr.fit(train, y_train)\n",
210 | " y_pred = lr.predict(test)\n",
211 | " print('MAE score:', mean_absolute_error(y_test, y_pred))\n",
212 | " else:\n",
213 | " train_scaled = scaler.fit_transform(train)\n",
214 | " test_scaled = scaler.transform(test)\n",
215 | " lr = Ridge()\n",
216 | " lr.fit(train_scaled, y_train)\n",
217 | " y_pred = lr.predict(test_scaled)\n",
218 | " print('MAE score:', mean_absolute_error(y_test, y_pred))"
219 | ]
220 | },
221 | {
222 | "cell_type": "code",
223 | "execution_count": 5,
224 | "metadata": {},
225 | "outputs": [
226 | {
227 | "name": "stdout",
228 | "output_type": "stream",
229 | "text": [
230 | "Baseline MAE score: 104.802725573\n"
231 | ]
232 | }
233 | ],
234 | "source": [
235 | "print('Baseline', end = ' ')\n",
236 | "fit_predict(train, test, y_train, y_test)"
237 | ]
238 | },
239 | {
240 | "cell_type": "code",
241 | "execution_count": 6,
242 | "metadata": {
243 | "collapsed": true
244 | },
245 | "outputs": [],
246 | "source": [
247 | "def feat_eng(df):\n",
248 | " df['eng1'] = df['hum'] / df['temp']\n",
249 | " df['eng2'] = df['windspeed'] * df['hum']\n",
250 | " df['eng3'] = df['temp'] * df['hum']\n",
251 | " df['eng4'] = df['temp'] * df['atemp']\n",
252 | " return df\n",
253 | "\n",
254 | "train = feat_eng(train)\n",
255 | "test = feat_eng(test)"
256 | ]
257 | },
258 | {
259 | "cell_type": "code",
260 | "execution_count": 7,
261 | "metadata": {},
262 | "outputs": [
263 | {
264 | "name": "stdout",
265 | "output_type": "stream",
266 | "text": [
267 | "number of features is 16\n",
268 | "number of non zero features: 16\n"
269 | ]
270 | }
271 | ],
272 | "source": [
273 | "def get_feat_imp(train,y_train,alpha=0.01):\n",
274 | " lr = Lasso(alpha=alpha)\n",
275 | " lr.fit(train,y_train)\n",
276 | " return lr.coef_\n",
277 | "fi = get_feat_imp(train,y_train)\n",
278 | "print('number of features is {}'.format(train.shape[1]))\n",
279 | "print('number of non zero features:',np.sum(fi != 0))"
280 | ]
281 | },
282 | {
283 | "cell_type": "code",
284 | "execution_count": 8,
285 | "metadata": {
286 | "collapsed": true
287 | },
288 | "outputs": [],
289 | "source": [
290 | "def create_poly(train,test,degree):\n",
291 | " poly = PolynomialFeatures(degree=degree)\n",
292 | " train_poly = poly.fit_transform(train)\n",
293 | " test_poly = poly.fit_transform(test)\n",
294 | " return train_poly,test_poly"
295 | ]
296 | },
297 | {
298 | "cell_type": "code",
299 | "execution_count": 11,
300 | "metadata": {},
301 | "outputs": [
302 | {
303 | "name": "stdout",
304 | "output_type": "stream",
305 | "text": [
306 | "No feature selection degree 1\n",
307 | "MAE score: 103.454477177\n",
308 | "----------\n",
309 | "No feature selection degree 2\n",
310 | "MAE score: 89.5565130706\n",
311 | "----------\n",
312 | "No feature selection degree 3\n",
313 | "MAE score: 77.537496916\n",
314 | "----------\n",
315 | "No feature selection degree 4\n",
316 | "MAE score: 71.9204017223\n",
317 | "----------\n",
318 | "No feature selection degree 5\n",
319 | "MAE score: 345.019388746\n",
320 | "----------\n"
321 | ]
322 | }
323 | ],
324 | "source": [
325 | "for degree in [1,2,3,4,5]:\n",
326 | " train_poly,test_poly = create_poly(train,test,degree)\n",
327 | " print('No feature selection degree',degree)\n",
328 | " fit_predict(train_poly,test_poly,y_train,y_test)\n",
329 | " print(10*'-')"
330 | ]
331 | },
332 | {
333 | "cell_type": "code",
334 | "execution_count": 10,
335 | "metadata": {},
336 | "outputs": [
337 | {
338 | "name": "stdout",
339 | "output_type": "stream",
340 | "text": [
341 | "overall improvement is 31.38 %\n"
342 | ]
343 | }
344 | ],
345 | "source": [
346 | "original_score = 104.802725573\n",
347 | "best_score = 71.9204017223\n",
348 | "improvement = np.round(100*(original_score - best_score)/original_score,2)\n",
349 | "print('overall improvement is {} %'.format(improvement))"
350 | ]
351 | }
352 | ],
353 | "metadata": {
354 | "kernelspec": {
355 | "display_name": "Python 3",
356 | "language": "python",
357 | "name": "python3"
358 | },
359 | "language_info": {
360 | "codemirror_mode": {
361 | "name": "ipython",
362 | "version": 3
363 | },
364 | "file_extension": ".py",
365 | "mimetype": "text/x-python",
366 | "name": "python",
367 | "nbconvert_exporter": "python",
368 | "pygments_lexer": "ipython3",
369 | "version": "3.6.2"
370 | }
371 | },
372 | "nbformat": 4,
373 | "nbformat_minor": 2
374 | }
375 |
--------------------------------------------------------------------------------
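Video 1.5 reads `hour.csv` from the working directory, but the file is not shipped with the repository. A sketch for fetching it; the URL is my assumption of where the UCI Bike Sharing dataset archive lives, so verify it before relying on it:

```python
import io
import urllib.request
import zipfile

import pandas as pd

# Assumed location of the UCI Bike Sharing dataset archive.
URL = ('https://archive.ics.uci.edu/ml/machine-learning-databases/'
       '00275/Bike-Sharing-Dataset.zip')

with urllib.request.urlopen(URL) as resp:
    archive = zipfile.ZipFile(io.BytesIO(resp.read()))
archive.extract('hour.csv')  # writes hour.csv next to the notebook
df = pd.read_csv('hour.csv')
print(df.shape)  # the hourly file should have 17 columns
```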
/Section 2/video 2.1 Support Vector Machines.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {
7 | "collapsed": true
8 | },
9 | "outputs": [],
10 | "source": [
11 | "import pandas as pd\n",
12 | "import numpy as np\n",
13 | "import matplotlib.pyplot as plt\n",
14 | "%matplotlib inline\n",
15 | "from sklearn.preprocessing import StandardScaler\n",
16 | "from sklearn.svm import SVC\n",
17 | "from sklearn.linear_model import LogisticRegression\n",
18 | "from sklearn.model_selection import train_test_split\n",
19 | "from sklearn.metrics import accuracy_score\n",
20 | "import warnings\n",
21 | "warnings.filterwarnings(\"ignore\")\n",
22 | "np.random.seed(42)"
23 | ]
24 | },
25 | {
26 | "cell_type": "code",
27 | "execution_count": 2,
28 | "metadata": {
29 | "collapsed": true
30 | },
31 | "outputs": [],
32 | "source": [
33 | "df = pd.read_csv(\"http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv\", sep = ';')\n",
34 | "y = df.pop('quality')"
35 | ]
36 | },
37 | {
38 | "cell_type": "code",
39 | "execution_count": 3,
40 | "metadata": {
41 | "collapsed": true
42 | },
43 | "outputs": [],
44 | "source": [
45 | "for i in df.columns:\n",
46 | " df[i] = df[i].fillna(np.mean(df[i]))\n",
47 | "train, test, y_train, y_test = train_test_split(df, y, test_size = 0.2)"
48 | ]
49 | },
50 | {
51 | "cell_type": "code",
52 | "execution_count": 5,
53 | "metadata": {},
54 | "outputs": [
55 | {
56 | "name": "stdout",
57 | "output_type": "stream",
58 | "text": [
59 | "Accuracy score baseline: 0.514285714286\n"
60 | ]
61 | }
62 | ],
63 | "source": [
64 | "lr = LogisticRegression()\n",
65 | "lr.fit(train, y_train)\n",
66 | "y_pred = lr.predict(test)\n",
67 | "print('Accuracy score baseline:', accuracy_score(y_test, y_pred))"
68 | ]
69 | },
70 | {
71 | "cell_type": "code",
72 | "execution_count": 6,
73 | "metadata": {
74 | "collapsed": true
75 | },
76 | "outputs": [],
77 | "source": [
78 | "def fit_predict(train, test, y_train, y_test, scaler, kernel = 'linear', C = 1.0, degree = 3):\n",
79 | " train_scaled = scaler.fit_transform(train)\n",
80 | " test_scaled = scaler.transform(test) \n",
81 | " lr = SVC(kernel = kernel, degree = degree, C = C)\n",
82 | " lr.fit(train_scaled, y_train)\n",
83 | " y_pred = lr.predict(test_scaled)\n",
84 | " print(accuracy_score(y_test, y_pred))"
85 | ]
86 | },
87 | {
88 | "cell_type": "raw",
89 | "metadata": {},
90 | "source": [
91 | "def fit_predict(train, test, y_train, y_test, scaler):\n",
92 | " train_scaled = scaler.fit_transform(train)\n",
93 | " test_scaled = scaler.transform(test) \n",
94 | " lr = \n",
95 | " lr.fit(train_scaled, y_train)\n",
96 | " y_pred = lr.predict(test_scaled)\n",
97 | " print(accuracy_score(y_test, y_pred))"
98 | ]
99 | },
100 | {
101 | "cell_type": "markdown",
102 | "metadata": {},
103 | "source": [
104 | "### Kernel tuning"
105 | ]
106 | },
107 | {
108 | "cell_type": "code",
109 | "execution_count": 7,
110 | "metadata": {},
111 | "outputs": [
112 | {
113 | "name": "stdout",
114 | "output_type": "stream",
115 | "text": [
116 | "Accuracy score using linear kernel: 0.509183673469\n",
117 | "Accuracy score using poly kernel: 0.525510204082\n",
118 | "Accuracy score using rbf kernel: 0.561224489796\n",
119 | "Accuracy score using sigmoid kernel: 0.404081632653\n"
120 | ]
121 | }
122 | ],
123 | "source": [
124 | "for kernel in ['linear', 'poly', 'rbf', 'sigmoid']:\n",
125 | " print('Accuracy score using {0} kernel:'.format(kernel), end = ' ')\n",
126 | " fit_predict(train, test, y_train, y_test, StandardScaler(), kernel)"
127 | ]
128 | },
129 | {
130 | "cell_type": "markdown",
131 | "metadata": {},
132 | "source": [
133 | "### Penalty tuning"
134 | ]
135 | },
136 | {
137 | "cell_type": "code",
138 | "execution_count": 9,
139 | "metadata": {},
140 | "outputs": [
141 | {
142 | "name": "stdout",
143 | "output_type": "stream",
144 | "text": [
145 | "Accuracy score using penalty = 0.5 with rbf kernel: 0.540816326531\n",
146 | "Accuracy score using penalty = 0.8705505632961241 with rbf kernel: 0.560204081633\n",
147 | "Accuracy score using penalty = 1.5157165665103982 with rbf kernel: 0.558163265306\n",
148 | "Accuracy score using penalty = 2.6390158215457893 with rbf kernel: 0.564285714286\n",
149 | "Accuracy score using penalty = 4.59479341998814 with rbf kernel: 0.577551020408\n",
150 | "Accuracy score using penalty = 8.0 with rbf kernel: 0.591836734694\n"
151 | ]
152 | }
153 | ],
154 | "source": [
155 | "for с in np.logspace(-1, 3, base = 2, num = 6):\n",
156 | " print('Accuracy score using penalty = {0} with rbf kernel:'.format(с), end = ' ')\n",
157 | " fit_predict(train, test, y_train, y_test, StandardScaler(), 'rbf', с)"
158 | ]
159 | },
160 | {
161 | "cell_type": "markdown",
162 | "metadata": {},
163 | "source": [
164 | "### Choosing degree for poly kernel"
165 | ]
166 | },
167 | {
168 | "cell_type": "code",
169 | "execution_count": 10,
170 | "metadata": {},
171 | "outputs": [
172 | {
173 | "name": "stdout",
174 | "output_type": "stream",
175 | "text": [
176 | "Accuracy score using degree = 2 with poly kernel: 0.486734693878\n",
177 | "Accuracy score using degree = 3 with poly kernel: 0.518367346939\n",
178 | "Accuracy score using degree = 4 with poly kernel: 0.521428571429\n",
179 | "Accuracy score using degree = 5 with poly kernel: 0.530612244898\n"
180 | ]
181 | }
182 | ],
183 | "source": [
184 | "for degree in range(2, 6):\n",
185 | " print('Accuracy score using degree = {0} with poly kernel:'.format(degree), end = ' ')\n",
186 | " fit_predict(train, test, y_train, y_test, StandardScaler(), 'poly', 1.5, degree = degree)"
187 | ]
188 | },
189 | {
190 | "cell_type": "code",
191 | "execution_count": 11,
192 | "metadata": {},
193 | "outputs": [
194 | {
195 | "name": "stdout",
196 | "output_type": "stream",
197 | "text": [
198 | "overall improvement is 15.08 %\n"
199 | ]
200 | }
201 | ],
202 | "source": [
203 | "original_score = 0.514285714286\n",
204 | "best_score = 0.591836734694\n",
205 | "improvement = np.abs(np.round(100*(original_score - best_score)/original_score,2))\n",
206 | "print('overall improvement is {} %'.format(improvement))"
207 | ]
208 | }
209 | ],
210 | "metadata": {
211 | "kernelspec": {
212 | "display_name": "Python 3",
213 | "language": "python",
214 | "name": "python3"
215 | },
216 | "language_info": {
217 | "codemirror_mode": {
218 | "name": "ipython",
219 | "version": 3
220 | },
221 | "file_extension": ".py",
222 | "mimetype": "text/x-python",
223 | "name": "python",
224 | "nbconvert_exporter": "python",
225 | "pygments_lexer": "ipython3",
226 | "version": "3.6.1"
227 | }
228 | },
229 | "nbformat": 4,
230 | "nbformat_minor": 2
231 | }
232 |
--------------------------------------------------------------------------------
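Video 2.1 tunes the kernel, the penalty C, and the polynomial degree with separate hand-written loops on one held-out split. A sketch of the same search done jointly with cross-validation via `GridSearchCV`; an alternative I'm adding, not course code, and noticeably slower since it fits one SVM per parameter combination and fold:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe = Pipeline([('scale', StandardScaler()), ('svc', SVC())])
param_grid = {
    'svc__kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
    'svc__C': np.logspace(-1, 3, base=2, num=6),  # the notebook's 0.5 .. 8 grid
}
search = GridSearchCV(pipe, param_grid, cv=3, n_jobs=-1)
search.fit(train, y_train)
print(search.best_params_, search.best_score_)
```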
/Section 2/video 2.2 Implementing kNN on the dataset.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {
7 | "collapsed": true
8 | },
9 | "outputs": [],
10 | "source": [
11 | "import pandas as pd\n",
12 | "import numpy as np\n",
13 | "import matplotlib.pyplot as plt\n",
14 | "%matplotlib inline\n",
15 | "from sklearn.linear_model import LogisticRegression\n",
16 | "from sklearn.preprocessing import StandardScaler\n",
17 | "from sklearn.neighbors import KNeighborsClassifier\n",
18 | "from sklearn.model_selection import train_test_split\n",
19 | "from sklearn.metrics import accuracy_score\n",
20 | "from sklearn.preprocessing import PolynomialFeatures\n",
21 | "import warnings\n",
22 | "warnings.filterwarnings(\"ignore\")\n",
23 | "np.random.seed(42)"
24 | ]
25 | },
26 | {
27 | "cell_type": "code",
28 | "execution_count": 2,
29 | "metadata": {
30 | "collapsed": true
31 | },
32 | "outputs": [],
33 | "source": [
34 | "df = pd.read_csv(\"http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv\", sep = ';')\n",
35 | "y = df.pop('quality')"
36 | ]
37 | },
38 | {
39 | "cell_type": "code",
40 | "execution_count": 3,
41 | "metadata": {
42 | "collapsed": true
43 | },
44 | "outputs": [],
45 | "source": [
46 | "for i in df.columns:\n",
47 | " df[i] = df[i].fillna(np.mean(df[i]))\n",
48 | "train, test, y_train, y_test = train_test_split(df, y, test_size = 0.2)"
49 | ]
50 | },
51 | {
52 | "cell_type": "code",
53 | "execution_count": 4,
54 | "metadata": {},
55 | "outputs": [
56 | {
57 | "name": "stdout",
58 | "output_type": "stream",
59 | "text": [
60 | "Accuracy score baseline: 0.514285714286\n"
61 | ]
62 | }
63 | ],
64 | "source": [
65 | "lr = LogisticRegression()\n",
66 | "lr.fit(train, y_train)\n",
67 | "y_pred = lr.predict(test)\n",
68 | "print('Accuracy score baseline:', accuracy_score(y_test, y_pred))"
69 | ]
70 | },
71 | {
72 | "cell_type": "code",
73 | "execution_count": 5,
74 | "metadata": {
75 | "collapsed": true
76 | },
77 | "outputs": [],
78 | "source": [
79 | "def fit_predict(train, test, y_train, y_test, scaler, \n",
80 | " n_neighbours, metric = 'manhattan', weights = 'uniform'):\n",
81 | " train_scaled = scaler.fit_transform(train)\n",
82 | " test_scaled = scaler.transform(test) \n",
83 | " knn = KNeighborsClassifier(n_neighbors=n_neighbours, metric=metric, \n",
84 | " weights=weights, n_jobs = 4)\n",
85 | " knn.fit(train_scaled, y_train)\n",
86 | " y_pred = knn.predict(test_scaled)\n",
87 | " print(accuracy_score(y_test, y_pred))"
88 | ]
89 | },
90 | {
91 | "cell_type": "markdown",
92 | "metadata": {},
93 | "source": [
94 | "### Neighbours tuning"
95 | ]
96 | },
97 | {
98 | "cell_type": "code",
99 | "execution_count": 6,
100 | "metadata": {},
101 | "outputs": [
102 | {
103 | "name": "stdout",
104 | "output_type": "stream",
105 | "text": [
106 | "Accuracy score on kNN using n_neighbours = 2: 0.572448979592\n",
107 | "Accuracy score on kNN using n_neighbours = 4: 0.555102040816\n",
108 | "Accuracy score on kNN using n_neighbours = 8: 0.54387755102\n",
109 | "Accuracy score on kNN using n_neighbours = 16: 0.541836734694\n",
110 | "Accuracy score on kNN using n_neighbours = 32: 0.552040816327\n",
111 | "Accuracy score on kNN using n_neighbours = 64: 0.538775510204\n",
112 | "Accuracy score on kNN using n_neighbours = 128: 0.529591836735\n",
113 | "Accuracy score on kNN using n_neighbours = 256: 0.516326530612\n",
114 | "Accuracy score on kNN using n_neighbours = 512: 0.504081632653\n",
115 | "Accuracy score on kNN using n_neighbours = 1024: 0.472448979592\n"
116 | ]
117 | }
118 | ],
119 | "source": [
120 | "for k in range(1,11):\n",
121 | " print('Accuracy score on kNN using n_neighbours = {0}:'.format(2**k), end = ' ')\n",
122 | " fit_predict(train, test, y_train, y_test, StandardScaler(), 2**k)"
123 | ]
124 | },
125 | {
126 | "cell_type": "raw",
127 | "metadata": {},
128 | "source": [
129 | "for k in np.logspace(2, 11, base = 2, num = 11, dtype=int).tolist():\n",
130 | " print('Accuracy score on kNN using n_neighbours = {0}:'.format(k), end = ' ')\n",
131 | " fit_predict(train, test, y_train, y_test, StandardScaler(), k)"
132 | ]
133 | },
134 | {
135 | "cell_type": "markdown",
136 | "metadata": {},
137 | "source": [
138 | "### Metric tuning"
139 | ]
140 | },
141 | {
142 | "cell_type": "code",
143 | "execution_count": 7,
144 | "metadata": {},
145 | "outputs": [
146 | {
147 | "name": "stdout",
148 | "output_type": "stream",
149 | "text": [
150 | "Accuracy score on kNN using euclidean metric and 10 neighbours: 0.573469387755\n",
151 | "Accuracy score on kNN using cosine metric and 10 neighbours: 0.551020408163\n",
152 | "Accuracy score on kNN using manhattan metric and 10 neighbours: 0.572448979592\n",
153 | "Accuracy score on kNN using chebyshev metric and 10 neighbours: 0.574489795918\n"
154 | ]
155 | }
156 | ],
157 | "source": [
158 | "for metric in ['euclidean', 'cosine', 'manhattan', 'chebyshev']:\n",
159 | " print('Accuracy score on kNN using {} metric and {} neighbours:'.format(metric,k), end = ' ')\n",
160 | " fit_predict(train, test, y_train, y_test, StandardScaler(), 2, metric)"
161 | ]
162 | },
163 | {
164 | "cell_type": "markdown",
165 | "metadata": {},
166 | "source": [
167 | "### Weighted kNN"
168 | ]
169 | },
170 | {
171 | "cell_type": "code",
172 | "execution_count": 8,
173 | "metadata": {},
174 | "outputs": [
175 | {
176 | "name": "stdout",
177 | "output_type": "stream",
178 | "text": [
179 | "Accuracy score on kNN using weights = uniform: 0.574489795918\n",
180 | "Accuracy score on kNN using weights = distance: 0.648979591837\n"
181 | ]
182 | }
183 | ],
184 | "source": [
185 | "for weights in ['uniform', 'distance']:\n",
186 | " print('Accuracy score on kNN using weights = {0}:'.format(weights), end = ' ')\n",
187 | " fit_predict(train, test, y_train, y_test, StandardScaler(), 2, 'chebyshev', weights = weights)"
188 | ]
189 | },
190 | {
191 | "cell_type": "markdown",
192 | "metadata": {},
193 | "source": [
194 | "### Engineering"
195 | ]
196 | },
197 | {
198 | "cell_type": "code",
199 | "execution_count": 9,
200 | "metadata": {
201 | "collapsed": true
202 | },
203 | "outputs": [],
204 | "source": [
205 | "def create_poly(train,test,degree):\n",
206 | " poly = PolynomialFeatures(degree=degree)\n",
207 | " train_poly = poly.fit_transform(train)\n",
208 | " test_poly = poly.fit_transform(test)\n",
209 | " return train_poly,test_poly"
210 | ]
211 | },
212 | {
213 | "cell_type": "code",
214 | "execution_count": 10,
215 | "metadata": {},
216 | "outputs": [
217 | {
218 | "name": "stdout",
219 | "output_type": "stream",
220 | "text": [
221 | "Polynomial degree 1\n",
222 | "0.648979591837\n",
223 | "----------\n",
224 | "Polynomial degree 2\n",
225 | "0.640816326531\n",
226 | "----------\n",
227 | "Polynomial degree 3\n",
228 | "0.642857142857\n",
229 | "----------\n"
230 | ]
231 | }
232 | ],
233 | "source": [
234 | "for degree in [1,2,3]:\n",
235 | " train_poly, test_poly = create_poly(train, test, degree)\n",
236 | " print('Polynomial degree',degree)\n",
237 | " fit_predict(train_poly, test_poly, y_train, y_test, StandardScaler(), 2, 'chebyshev', weights = 'distance')\n",
238 | " print(10*'-')\n",
239 | " \n",
240 | "train_poly, test_poly = create_poly(train, test, 2) "
241 | ]
242 | },
243 | {
244 | "cell_type": "code",
245 | "execution_count": 11,
246 | "metadata": {
247 | "collapsed": true
248 | },
249 | "outputs": [],
250 | "source": [
251 | "def feat_eng(df):\n",
252 | " df['eng1'] = df['fixed acidity'] * df['pH']\n",
253 | " df['eng2'] = df['total sulfur dioxide'] / df['free sulfur dioxide']\n",
254 | " df['eng3'] = df['sulphates'] / df['chlorides']\n",
255 | " df['eng4'] = df['chlorides'] / df['sulphates']\n",
256 | " return df\n",
257 | "\n",
258 | "train = feat_eng(train)\n",
259 | "test = feat_eng(test)"
260 | ]
261 | },
262 | {
263 | "cell_type": "code",
264 | "execution_count": 12,
265 | "metadata": {},
266 | "outputs": [
267 | {
268 | "name": "stdout",
269 | "output_type": "stream",
270 | "text": [
271 | "Accuracy score after engineering: 0.670408163265\n"
272 | ]
273 | }
274 | ],
275 | "source": [
276 | "print('Accuracy score after engineering:', end = ' ')\n",
277 | "fit_predict(train, test, y_train, y_test, StandardScaler(), 2, 'chebyshev', weights = 'distance')"
278 | ]
279 | },
280 | {
281 | "cell_type": "code",
282 | "execution_count": 15,
283 | "metadata": {},
284 | "outputs": [
285 | {
286 | "name": "stdout",
287 | "output_type": "stream",
288 | "text": [
289 | "overall improvement is 30.36 %\n"
290 | ]
291 | }
292 | ],
293 | "source": [
294 | "original_score = 0.514285714286\n",
295 | "best_score = 0.670408163265\n",
296 | "improvement = np.abs(np.round(100*(original_score - best_score)/original_score,2))\n",
297 | "print('overall improvement is {} %'.format(improvement))"
298 | ]
299 | }
300 | ],
301 | "metadata": {
302 | "kernelspec": {
303 | "display_name": "Python 3",
304 | "language": "python",
305 | "name": "python3"
306 | },
307 | "language_info": {
308 | "codemirror_mode": {
309 | "name": "ipython",
310 | "version": 3
311 | },
312 | "file_extension": ".py",
313 | "mimetype": "text/x-python",
314 | "name": "python",
315 | "nbconvert_exporter": "python",
316 | "pygments_lexer": "ipython3",
317 | "version": "3.6.2"
318 | }
319 | },
320 | "nbformat": 4,
321 | "nbformat_minor": 2
322 | }
323 |
--------------------------------------------------------------------------------
/Section 2/video 2.3 Decision Tree as predictive model.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {
7 | "collapsed": true
8 | },
9 | "outputs": [],
10 | "source": [
11 | "import pandas as pd\n",
12 | "import numpy as np\n",
13 | "import matplotlib.pyplot as plt\n",
14 | "%matplotlib inline\n",
15 | "from sklearn.preprocessing import StandardScaler\n",
16 | "from sklearn.linear_model import LogisticRegression\n",
17 | "from sklearn.tree import DecisionTreeClassifier\n",
18 | "from sklearn.model_selection import train_test_split\n",
19 | "from sklearn.metrics import accuracy_score\n",
20 | "from sklearn.preprocessing import PolynomialFeatures\n",
21 | "import warnings\n",
22 | "warnings.filterwarnings(\"ignore\")\n",
23 | "np.random.seed(42)"
24 | ]
25 | },
26 | {
27 | "cell_type": "code",
28 | "execution_count": 2,
29 | "metadata": {
30 | "collapsed": true
31 | },
32 | "outputs": [],
33 | "source": [
34 | "df = pd.read_csv(\"http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv\", sep = ';')\n",
35 | "y = df.pop('quality')"
36 | ]
37 | },
38 | {
39 | "cell_type": "code",
40 | "execution_count": 3,
41 | "metadata": {
42 | "collapsed": true
43 | },
44 | "outputs": [],
45 | "source": [
46 | "for i in df.columns:\n",
47 | " df[i] = df[i].fillna(np.mean(df[i]))\n",
48 | "train, test, y_train, y_test = train_test_split(df, y, test_size = 0.2)"
49 | ]
50 | },
51 | {
52 | "cell_type": "code",
53 | "execution_count": 4,
54 | "metadata": {},
55 | "outputs": [
56 | {
57 | "name": "stdout",
58 | "output_type": "stream",
59 | "text": [
60 | "Accuracy score baseline: 0.514285714286\n"
61 | ]
62 | }
63 | ],
64 | "source": [
65 | "lr = LogisticRegression()\n",
66 | "lr.fit(train, y_train)\n",
67 | "y_pred = lr.predict(test)\n",
68 | "print('Accuracy score baseline:', accuracy_score(y_test, y_pred))"
69 | ]
70 | },
71 | {
72 | "cell_type": "code",
73 | "execution_count": 5,
74 | "metadata": {
75 | "collapsed": true
76 | },
77 | "outputs": [],
78 | "source": [
79 | "def fit_predict(train, test, y_train, y_test, scaler, max_depth, \n",
80 | " criterion = 'entropy', max_features = 1, min_samples_split = 4):\n",
81 | " train_scaled = scaler.fit_transform(train)\n",
82 | " test_scaled = scaler.transform(test) \n",
83 | " dt = DecisionTreeClassifier(criterion = criterion, max_depth=max_depth, \n",
84 | " random_state=42, max_features=max_features,\n",
85 | " min_samples_split=min_samples_split)\n",
86 | " dt.fit(train_scaled, y_train)\n",
87 | " y_pred = dt.predict(test_scaled)\n",
88 | " print(accuracy_score(y_test, y_pred))"
89 | ]
90 | },
91 | {
92 | "cell_type": "markdown",
93 | "metadata": {},
94 | "source": [
95 | "### Max depth tuning"
96 | ]
97 | },
98 | {
99 | "cell_type": "code",
100 | "execution_count": 6,
101 | "metadata": {},
102 | "outputs": [
103 | {
104 | "name": "stdout",
105 | "output_type": "stream",
106 | "text": [
107 | "Accuracy score using max_depth = 1: 0.440816326531\n",
108 | "Accuracy score using max_depth = 2: 0.440816326531\n",
109 | "Accuracy score using max_depth = 3: 0.45306122449\n",
110 | "Accuracy score using max_depth = 4: 0.460204081633\n",
111 | "Accuracy score using max_depth = 5: 0.486734693878\n",
112 | "Accuracy score using max_depth = 6: 0.460204081633\n",
113 | "Accuracy score using max_depth = 7: 0.497959183673\n",
114 | "Accuracy score using max_depth = 8: 0.50306122449\n",
115 | "Accuracy score using max_depth = 9: 0.518367346939\n",
116 | "Accuracy score using max_depth = 10: 0.49693877551\n",
117 | "Accuracy score using max_depth = 11: 0.514285714286\n",
118 | "Accuracy score using max_depth = 12: 0.492857142857\n",
119 | "Accuracy score using max_depth = 13: 0.575510204082\n",
120 | "Accuracy score using max_depth = 14: 0.571428571429\n",
121 | "Accuracy score using max_depth = 15: 0.533673469388\n",
122 | "Accuracy score using max_depth = 16: 0.548979591837\n",
123 | "Accuracy score using max_depth = 17: 0.548979591837\n",
124 | "Accuracy score using max_depth = 18: 0.577551020408\n",
125 | "Accuracy score using max_depth = 19: 0.561224489796\n"
126 | ]
127 | }
128 | ],
129 | "source": [
130 | "for i in range(1, 20):\n",
131 | " print('Accuracy score using max_depth =', i, end = ': ')\n",
132 | " fit_predict(train, test, y_train, y_test, StandardScaler(), i)"
133 | ]
134 | },
135 | {
136 | "cell_type": "markdown",
137 | "metadata": {},
138 | "source": [
139 | "### Max features tuning"
140 | ]
141 | },
142 | {
143 | "cell_type": "code",
144 | "execution_count": 7,
145 | "metadata": {},
146 | "outputs": [
147 | {
148 | "name": "stdout",
149 | "output_type": "stream",
150 | "text": [
151 | "Accuracy score using max features = 0.1: 0.577551020408\n",
152 | "Accuracy score using max features = 0.2: 0.608163265306\n",
153 | "Accuracy score using max features = 0.3: 0.608163265306\n",
154 | "Accuracy score using max features = 0.4: 0.595918367347\n",
155 | "Accuracy score using max features = 0.5: 0.60612244898\n",
156 | "Accuracy score using max features = 0.6: 0.580612244898\n",
157 | "Accuracy score using max features = 0.7: 0.60306122449\n",
158 | "Accuracy score using max features = 0.8: 0.572448979592\n",
159 | "Accuracy score using max features = 0.9: 0.576530612245\n"
160 | ]
161 | }
162 | ],
163 | "source": [
164 | "for i in np.arange(0.1, 1.0, 0.1):\n",
165 | " print('Accuracy score using max features =', i, end = ': ')\n",
166 | " fit_predict(train, test, y_train, y_test, StandardScaler(), max_depth = 18, max_features=i)"
167 | ]
168 | },
169 | {
170 | "cell_type": "markdown",
171 | "metadata": {},
172 | "source": [
173 | "### Min samples split tuning"
174 | ]
175 | },
176 | {
177 | "cell_type": "code",
178 | "execution_count": 8,
179 | "metadata": {},
180 | "outputs": [
181 | {
182 | "name": "stdout",
183 | "output_type": "stream",
184 | "text": [
185 | "Accuracy score using min samples split = 2: 0.613265306122\n",
186 | "Accuracy score using min samples split = 3: 0.592857142857\n",
187 | "Accuracy score using min samples split = 4: 0.608163265306\n",
188 | "Accuracy score using min samples split = 5: 0.561224489796\n",
189 | "Accuracy score using min samples split = 6: 0.580612244898\n",
190 | "Accuracy score using min samples split = 7: 0.563265306122\n",
191 | "Accuracy score using min samples split = 8: 0.595918367347\n",
192 | "Accuracy score using min samples split = 9: 0.55612244898\n"
193 | ]
194 | }
195 | ],
196 | "source": [
197 | "for i in range(2, 10):\n",
198 | " print('Accuracy score using min samples split =', i, end = ': ')\n",
199 | " fit_predict(train, test, y_train, y_test, StandardScaler(), 18, max_features=0.3, min_samples_split=i)"
200 | ]
201 | },
202 | {
203 | "cell_type": "markdown",
204 | "metadata": {},
205 | "source": [
206 | "### Criterion tuning"
207 | ]
208 | },
209 | {
210 | "cell_type": "code",
211 | "execution_count": 9,
212 | "metadata": {},
213 | "outputs": [
214 | {
215 | "name": "stdout",
216 | "output_type": "stream",
217 | "text": [
218 | "Accuracy score using criterion = gini: 0.613265306122\n",
219 | "Accuracy score using criterion = entropy: 0.613265306122\n"
220 | ]
221 | }
222 | ],
223 | "source": [
224 | "for i in ['gini', 'entropy']:\n",
225 | " print('Accuracy score using criterion =', i, end = ': ')\n",
226 | " fit_predict(train, test, y_train, y_test, StandardScaler(), 18, \n",
227 | " max_features=0.3, min_samples_split=2, criterion = 'entropy')"
228 | ]
229 | },
230 | {
231 | "cell_type": "code",
232 | "execution_count": 10,
233 | "metadata": {
234 | "collapsed": true
235 | },
236 | "outputs": [],
237 | "source": [
238 | "def create_poly(train,test,degree):\n",
239 | " poly = PolynomialFeatures(degree=degree)\n",
240 | " train_poly = poly.fit_transform(train)\n",
241 | " test_poly = poly.fit_transform(test)\n",
242 | " return train_poly,test_poly"
243 | ]
244 | },
245 | {
246 | "cell_type": "code",
247 | "execution_count": 11,
248 | "metadata": {},
249 | "outputs": [
250 | {
251 | "name": "stdout",
252 | "output_type": "stream",
253 | "text": [
254 | "Polynomial degree 1\n",
255 | "0.604081632653\n",
256 | "----------\n",
257 | "Polynomial degree 2\n",
258 | "0.625510204082\n",
259 | "----------\n",
260 | "Polynomial degree 3\n",
261 | "0.625510204082\n",
262 | "----------\n",
263 | "Polynomial degree 4\n",
264 | "0.609183673469\n",
265 | "----------\n"
266 | ]
267 | }
268 | ],
269 | "source": [
270 | "for degree in [1,2,3,4]:\n",
271 | " train_poly, test_poly = create_poly(train, test, degree)\n",
272 | " print('Polynomial degree',degree)\n",
273 | " fit_predict(train_poly, test_poly, y_train, y_test, StandardScaler(), 18, \n",
274 | " max_features=0.3, min_samples_split=2, criterion = 'entropy')\n",
275 | " print(10*'-')\n",
276 | " \n",
277 | "train_poly, test_poly = create_poly(train, test, 2) "
278 | ]
279 | },
280 | {
281 | "cell_type": "code",
282 | "execution_count": 12,
283 | "metadata": {},
284 | "outputs": [
285 | {
286 | "name": "stdout",
287 | "output_type": "stream",
288 | "text": [
289 | "Additional feature engineering:\n",
290 | "0.592857142857\n",
291 | "0.617346938776\n"
292 | ]
293 | }
294 | ],
295 | "source": [
296 | "def feat_eng(df):\n",
297 | " df['eng1'] = df['fixed acidity'] * df['pH']\n",
298 | " df['eng2'] = df['total sulfur dioxide'] / df['free sulfur dioxide']\n",
299 | " df['eng3'] = df['sulphates'] / df['chlorides']\n",
300 | " df['eng4'] = df['chlorides'] / df['sulphates']\n",
301 | " return df\n",
302 | "\n",
303 | "train = feat_eng(train)\n",
304 | "test = feat_eng(test)\n",
305 | "\n",
306 | "print('Additional feature engineering:')\n",
307 | "\n",
308 | "fit_predict(train, test, y_train, y_test, StandardScaler(), 18, \n",
309 | " max_features=0.3, min_samples_split=2, criterion = 'entropy')\n",
310 | "\n",
311 | "train_poly, test_poly = create_poly(train, test, 2)\n",
312 | "\n",
313 | "fit_predict(train_poly, test_poly, y_train, y_test, StandardScaler(), 18, \n",
314 | " max_features=0.3, min_samples_split=2, criterion = 'entropy')\n"
315 | ]
316 | },
317 | {
318 | "cell_type": "code",
319 | "execution_count": 13,
320 | "metadata": {},
321 | "outputs": [
322 | {
323 | "name": "stdout",
324 | "output_type": "stream",
325 | "text": [
326 | "overall improvement is 21.63 %\n"
327 | ]
328 | }
329 | ],
330 | "source": [
331 | "original_score = 0.514285714286\n",
332 | "best_score = 0.625510204082\n",
333 | "improvement = np.abs(np.round(100*(original_score - best_score)/original_score,2))\n",
334 | "print('overall improvement is {} %'.format(improvement))"
335 | ]
336 | }
337 | ],
338 | "metadata": {
339 | "kernelspec": {
340 | "display_name": "Python 3",
341 | "language": "python",
342 | "name": "python3"
343 | },
344 | "language_info": {
345 | "codemirror_mode": {
346 | "name": "ipython",
347 | "version": 3
348 | },
349 | "file_extension": ".py",
350 | "mimetype": "text/x-python",
351 | "name": "python",
352 | "nbconvert_exporter": "python",
353 | "pygments_lexer": "ipython3",
354 | "version": "3.6.1"
355 | }
356 | },
357 | "nbformat": 4,
358 | "nbformat_minor": 2
359 | }
360 |
--------------------------------------------------------------------------------
/Section 2/video 2.5 combining all together.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 9,
6 | "metadata": {},
7 | "outputs": [],
8 | "source": [
9 | "import pandas as pd\n",
10 | "import numpy as np\n",
11 | "import matplotlib.pyplot as plt\n",
12 | "%matplotlib inline\n",
13 | "from sklearn.preprocessing import StandardScaler\n",
14 | "from sklearn.neighbors import KNeighborsClassifier\n",
15 | "from sklearn.tree import DecisionTreeClassifier\n",
16 | "from sklearn.svm import SVC\n",
17 | "from sklearn.model_selection import train_test_split\n",
18 | "from sklearn.metrics import roc_auc_score\n",
19 | "from sklearn.linear_model import LogisticRegression\n",
20 | "from keras.models import Sequential, Input, Model\n",
21 | "from keras.layers import Dense, Activation, Dropout, Flatten, BatchNormalization\n",
22 | "from keras.callbacks import EarlyStopping, ReduceLROnPlateau\n",
23 | "from sklearn.metrics import accuracy_score\n",
24 | "import warnings\n",
25 | "warnings.filterwarnings(\"ignore\")\n",
26 | "np.random.seed(42)\n"
27 | ]
28 | },
29 | {
30 | "cell_type": "code",
31 | "execution_count": 5,
32 | "metadata": {},
33 | "outputs": [],
34 | "source": [
35 | "df = pd.read_csv(\"train.csv\", sep = ',')\n",
36 | "df = df.sample(frac = 0.2, random_state = 123)\n",
37 | "y = df.pop('target')\n",
38 | "df.drop('id', axis = 1, inplace=True)\n",
39 | "train, test, y_train, y_test = train_test_split(df, y, test_size = 0.2)"
40 | ]
41 | },
42 | {
43 | "cell_type": "code",
44 | "execution_count": 18,
45 | "metadata": {},
46 | "outputs": [
47 | {
48 | "name": "stdout",
49 | "output_type": "stream",
50 | "text": [
51 | "AUC baseline: 0.6205363530135511\n"
52 | ]
53 | }
54 | ],
55 | "source": [
56 | "lr = LogisticRegression()\n",
57 | "lr.fit(train, y_train)\n",
58 | "y_pred = lr.predict_proba(test)\n",
59 | "print('AUC baseline:', roc_auc_score(y_test, y_pred[:,1]))"
60 | ]
61 | },
62 | {
63 | "cell_type": "code",
64 | "execution_count": 19,
65 | "metadata": {
66 | "collapsed": true
67 | },
68 | "outputs": [],
69 | "source": [
70 | "scaler = StandardScaler()\n",
71 | "df_values = scaler.fit_transform(df)"
72 | ]
73 | },
74 | {
75 | "cell_type": "code",
76 | "execution_count": 20,
77 | "metadata": {
78 | "collapsed": true
79 | },
80 | "outputs": [],
81 | "source": [
82 | "def fit_knn(train, test, y_train, y_test, \n",
83 | " n_neighbours = 64, metric = 'euclidean', weights = 'distance'): \n",
84 | " knn = KNeighborsClassifier(n_neighbors=n_neighbours, metric=metric, \n",
85 | " weights=weights, n_jobs = 4)\n",
86 | " knn.fit(train, y_train)\n",
87 | " y_pred = knn.predict_proba(test)\n",
88 | " print(roc_auc_score(y_test, y_pred[:, 1]))"
89 | ]
90 | },
91 | {
92 | "cell_type": "code",
93 | "execution_count": 21,
94 | "metadata": {
95 | "collapsed": true
96 | },
97 | "outputs": [],
98 | "source": [
99 | "def fit_svm(train, test, y_train, y_test, kernel = 'linear', C = 1.5, degree = 3): \n",
100 | " svm = SVC(kernel = kernel, degree = degree, C = C, max_iter=100, probability=True)\n",
101 | " svm.fit(train, y_train)\n",
102 | " y_pred = svm.predict_proba(test)\n",
103 | " print(roc_auc_score(y_test, y_pred[:, 1]))"
104 | ]
105 | },
106 | {
107 | "cell_type": "code",
108 | "execution_count": 22,
109 | "metadata": {
110 | "collapsed": true
111 | },
112 | "outputs": [],
113 | "source": [
114 | "def fit_tree(train, test, y_train, y_test, max_depth = 9, \n",
115 | " criterion = 'entropy', max_features = 0.8, min_samples_split = 6):\n",
116 | " tree = DecisionTreeClassifier(criterion = criterion, max_depth=max_depth, \n",
117 | " random_state=111, max_features=max_features,\n",
118 | " min_samples_split=min_samples_split)\n",
119 | " tree.fit(train, y_train)\n",
120 | " y_pred = tree.predict_proba(test)\n",
121 | " print(roc_auc_score(y_test, y_pred[:, 1]))"
122 | ]
123 | },
124 | {
125 | "cell_type": "code",
126 | "execution_count": 23,
127 | "metadata": {
128 | "collapsed": true
129 | },
130 | "outputs": [],
131 | "source": [
132 | "def create_autoencoder_model(object_size=df.shape[1], encoder_layer_shapes=[128, 64, 32], decoder_layer_shapes=[64, 128]):\n",
133 | " # входные параметры:\n",
134 | " # object_size: int, размер входного и выходного слоя автоэнкодера\n",
135 | " # encoder_layer_shapes: list of int, количество нейронов в каждом слое энкодера. \n",
136 | " # последний элемент списка - размер \"бутылочного горлышка\"\n",
137 | " # decoder_layer_shapes: ist of int, количество нейронов в каждом слое декодера\n",
138 | " \n",
139 | " # выход:\n",
140 | " # keras модель\n",
141 | " input_ = Input(shape=(object_size,))\n",
142 | " encoded = Dense(encoder_layer_shapes[0], activation='elu')(input_)\n",
143 | " encoded = BatchNormalization()(encoded)\n",
144 | " encoded = Dense(encoder_layer_shapes[1], activation='elu')(encoded)\n",
145 | " encoded = BatchNormalization()(encoded)\n",
146 | " encoded = Dense(encoder_layer_shapes[2], activation='elu')(encoded)\n",
147 | " encoded = BatchNormalization()(encoded)\n",
148 | " decoded = Dense(decoder_layer_shapes[0], activation='elu')(encoded)\n",
149 | " decoded = BatchNormalization()(decoded)\n",
150 | " decoded = Dense(decoder_layer_shapes[1], activation='elu')(decoded)\n",
151 | " decoded = BatchNormalization()(decoded)\n",
152 | " decoded = Dense(object_size, activation='sigmoid')(decoded)\n",
153 | " \n",
154 | " model = Model(input_, decoded)\n",
155 | " model.compile(optimizer = 'Adam', loss='mean_squared_error')\n",
156 | " return model"
157 | ]
158 | },
159 | {
160 | "cell_type": "code",
161 | "execution_count": 24,
162 | "metadata": {
163 | "collapsed": true
164 | },
165 | "outputs": [],
166 | "source": [
167 | "train, test, y_train, y_test = train_test_split(df_values, y, test_size = 0.2)"
168 | ]
169 | },
170 | {
171 | "cell_type": "code",
172 | "execution_count": 25,
173 | "metadata": {},
174 | "outputs": [
175 | {
176 | "name": "stdout",
177 | "output_type": "stream",
178 | "text": [
179 | "Train on 95233 samples, validate on 23809 samples\n",
180 | "Epoch 1/100\n",
181 | "95233/95233 [==============================] - 4s 39us/step - loss: 0.8554 - val_loss: 0.7237\n",
182 | "Epoch 2/100\n",
183 | "95233/95233 [==============================] - 3s 30us/step - loss: 0.6612 - val_loss: 0.6782\n",
184 | "Epoch 3/100\n",
185 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6390 - val_loss: 0.6564\n",
186 | "Epoch 4/100\n",
187 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6311 - val_loss: 0.6472\n",
188 | "Epoch 5/100\n",
189 | "95233/95233 [==============================] - 3s 28us/step - loss: 0.6258 - val_loss: 0.6417\n",
190 | "Epoch 6/100\n",
191 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6228 - val_loss: 0.6384\n",
192 | "Epoch 7/100\n",
193 | "95233/95233 [==============================] - 3s 28us/step - loss: 0.6210 - val_loss: 0.6368\n",
194 | "Epoch 8/100\n",
195 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6194 - val_loss: 0.6344\n",
196 | "Epoch 9/100\n",
197 | "95233/95233 [==============================] - 3s 28us/step - loss: 0.6179 - val_loss: 0.6328\n",
198 | "Epoch 10/100\n",
199 | "95233/95233 [==============================] - 3s 28us/step - loss: 0.6166 - val_loss: 0.6315\n",
200 | "Epoch 11/100\n",
201 | "95233/95233 [==============================] - 3s 28us/step - loss: 0.6155 - val_loss: 0.6306\n",
202 | "Epoch 12/100\n",
203 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6147 - val_loss: 0.6298\n",
204 | "Epoch 13/100\n",
205 | "95233/95233 [==============================] - 3s 28us/step - loss: 0.6141 - val_loss: 0.6289\n",
206 | "Epoch 14/100\n",
207 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6136 - val_loss: 0.6283\n",
208 | "Epoch 15/100\n",
209 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6130 - val_loss: 0.6281\n",
210 | "Epoch 16/100\n",
211 | "95233/95233 [==============================] - 3s 28us/step - loss: 0.6124 - val_loss: 0.6273\n",
212 | "Epoch 17/100\n",
213 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6119 - val_loss: 0.6268\n",
214 | "Epoch 18/100\n",
215 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6114 - val_loss: 0.6260\n",
216 | "Epoch 19/100\n",
217 | "95233/95233 [==============================] - 3s 28us/step - loss: 0.6111 - val_loss: 0.6261\n",
218 | "Epoch 20/100\n",
219 | "95233/95233 [==============================] - 3s 28us/step - loss: 0.6108 - val_loss: 0.6257\n",
220 | "Epoch 21/100\n",
221 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6105 - val_loss: 0.6253\n",
222 | "Epoch 22/100\n",
223 | "95233/95233 [==============================] - 3s 28us/step - loss: 0.6103 - val_loss: 0.6251\n",
224 | "Epoch 23/100\n",
225 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6102 - val_loss: 0.6249\n",
226 | "Epoch 24/100\n",
227 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6101 - val_loss: 0.6247\n",
228 | "Epoch 25/100\n",
229 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6098 - val_loss: 0.6246\n",
230 | "Epoch 26/100\n",
231 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6098 - val_loss: 0.6247\n",
232 | "Epoch 27/100\n",
233 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6096 - val_loss: 0.6242\n",
234 | "Epoch 28/100\n",
235 | "95233/95233 [==============================] - 3s 28us/step - loss: 0.6095 - val_loss: 0.6240\n",
236 | "Epoch 29/100\n",
237 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6094 - val_loss: 0.6240\n",
238 | "Epoch 30/100\n",
239 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6092 - val_loss: 0.6240\n",
240 | "Epoch 31/100\n",
241 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6091 - val_loss: 0.6235\n",
242 | "Epoch 32/100\n",
243 | "95233/95233 [==============================] - 3s 28us/step - loss: 0.6090 - val_loss: 0.6235\n",
244 | "Epoch 33/100\n",
245 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6089 - val_loss: 0.6234\n",
246 | "Epoch 34/100\n",
247 | "95233/95233 [==============================] - 3s 28us/step - loss: 0.6089 - val_loss: 0.6236\n",
248 | "Epoch 35/100\n",
249 | "95233/95233 [==============================] - 3s 28us/step - loss: 0.6088 - val_loss: 0.6233\n",
250 | "Epoch 36/100\n",
251 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6086 - val_loss: 0.6230\n",
252 | "Epoch 37/100\n",
253 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6087 - val_loss: 0.6233\n",
254 | "Epoch 38/100\n",
255 | "95233/95233 [==============================] - 3s 28us/step - loss: 0.6086 - val_loss: 0.6231\n",
256 | "Epoch 39/100\n",
257 | "95233/95233 [==============================] - 3s 28us/step - loss: 0.6085 - val_loss: 0.6232\n",
258 | "Epoch 40/100\n",
259 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6084 - val_loss: 0.6228\n",
260 | "Epoch 41/100\n",
261 | "95233/95233 [==============================] - 3s 28us/step - loss: 0.6084 - val_loss: 0.6232\n",
262 | "Epoch 42/100\n",
263 | "95233/95233 [==============================] - 3s 28us/step - loss: 0.6083 - val_loss: 0.6230\n",
264 | "Epoch 43/100\n",
265 | "95233/95233 [==============================] - 3s 28us/step - loss: 0.6084 - val_loss: 0.6229\n",
266 | "Epoch 44/100\n",
267 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6082 - val_loss: 0.6230\n",
268 | "Epoch 45/100\n",
269 | "95233/95233 [==============================] - 3s 28us/step - loss: 0.6082 - val_loss: 0.6230\n",
270 | "Epoch 46/100\n",
271 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6081 - val_loss: 0.6227\n",
272 | "\n",
273 | "Epoch 00046: ReduceLROnPlateau reducing learning rate to 0.00010000000474974513.\n",
274 | "Epoch 47/100\n",
275 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6075 - val_loss: 0.6216\n",
276 | "Epoch 48/100\n",
277 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6074 - val_loss: 0.6216\n",
278 | "Epoch 49/100\n",
279 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6074 - val_loss: 0.6215\n",
280 | "Epoch 50/100\n",
281 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6074 - val_loss: 0.6215\n",
282 | "Epoch 51/100\n",
283 | "95233/95233 [==============================] - 3s 28us/step - loss: 0.6074 - val_loss: 0.6215\n",
284 | "Epoch 52/100\n",
285 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6074 - val_loss: 0.6216\n",
286 | "Epoch 53/100\n",
287 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6073 - val_loss: 0.6215\n",
288 | "Epoch 54/100\n",
289 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6073 - val_loss: 0.6215\n",
290 | "Epoch 55/100\n",
291 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6074 - val_loss: 0.6214\n",
292 | "Epoch 56/100\n",
293 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6073 - val_loss: 0.6215\n",
294 | "Epoch 57/100\n",
295 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6073 - val_loss: 0.6215\n",
296 | "Epoch 58/100\n",
297 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6073 - val_loss: 0.6214\n",
298 | "Epoch 59/100\n",
299 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6073 - val_loss: 0.6215\n",
300 | "Epoch 60/100\n",
301 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6073 - val_loss: 0.6214\n",
302 | "Epoch 61/100\n",
303 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6073 - val_loss: 0.6214\n",
304 | "\n",
305 | "Epoch 00061: ReduceLROnPlateau reducing learning rate to 1.0000000474974514e-05.\n",
306 | "Epoch 62/100\n",
307 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6072 - val_loss: 0.6214\n",
308 | "Epoch 63/100\n",
309 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6073 - val_loss: 0.6214\n",
310 | "Epoch 64/100\n",
311 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6072 - val_loss: 0.6214\n",
312 | "Epoch 65/100\n",
313 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6072 - val_loss: 0.6215\n",
314 | "Epoch 66/100\n",
315 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6072 - val_loss: 0.6214\n",
316 | "Epoch 67/100\n",
317 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6072 - val_loss: 0.6214\n",
318 | "Epoch 68/100\n",
319 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6072 - val_loss: 0.6214\n",
320 | "\n",
321 | "Epoch 00068: ReduceLROnPlateau reducing learning rate to 1.0000000656873453e-06.\n",
322 | "Epoch 69/100\n",
323 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6072 - val_loss: 0.6214\n",
324 | "Epoch 70/100\n",
325 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6072 - val_loss: 0.6214\n",
326 | "Epoch 71/100\n",
327 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6072 - val_loss: 0.6214\n",
328 | "Epoch 72/100\n",
329 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6072 - val_loss: 0.6214\n",
330 | "Epoch 73/100\n",
331 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6072 - val_loss: 0.6213\n",
332 | "Epoch 74/100\n",
333 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6072 - val_loss: 0.6214\n"
334 | ]
335 | },
336 | {
337 | "name": "stdout",
338 | "output_type": "stream",
339 | "text": [
340 | "Epoch 75/100\n",
341 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6072 - val_loss: 0.6214\n",
342 | "\n",
343 | "Epoch 00075: ReduceLROnPlateau reducing learning rate to 1.0000001111620805e-07.\n",
344 | "Epoch 76/100\n",
345 | "95233/95233 [==============================] - 3s 28us/step - loss: 0.6072 - val_loss: 0.6215\n",
346 | "Epoch 77/100\n",
347 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6072 - val_loss: 0.6214\n",
348 | "Epoch 78/100\n",
349 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6072 - val_loss: 0.6214\n",
350 | "Epoch 79/100\n",
351 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6072 - val_loss: 0.6214\n",
352 | "Epoch 80/100\n",
353 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6072 - val_loss: 0.6214\n",
354 | "Epoch 81/100\n",
355 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6072 - val_loss: 0.6214\n",
356 | "Epoch 82/100\n",
357 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6072 - val_loss: 0.6214\n",
358 | "\n",
359 | "Epoch 00082: ReduceLROnPlateau reducing learning rate to 1.000000082740371e-08.\n",
360 | "Epoch 83/100\n",
361 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6072 - val_loss: 0.6214\n",
362 | "Epoch 84/100\n",
363 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6072 - val_loss: 0.6215\n",
364 | "Epoch 85/100\n",
365 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6072 - val_loss: 0.6214\n",
366 | "Epoch 86/100\n",
367 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6072 - val_loss: 0.6214\n",
368 | "Epoch 87/100\n",
369 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6072 - val_loss: 0.6214\n",
370 | "Epoch 88/100\n",
371 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6072 - val_loss: 0.6214\n",
372 | "Epoch 89/100\n",
373 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6071 - val_loss: 0.6214\n",
374 | "\n",
375 | "Epoch 00089: ReduceLROnPlateau reducing learning rate to 1.000000082740371e-09.\n",
376 | "Epoch 90/100\n",
377 | "95233/95233 [==============================] - 3s 29us/step - loss: 0.6072 - val_loss: 0.6215\n",
378 | "Epoch 00090: early stopping\n"
379 | ]
380 | },
381 | {
382 | "data": {
383 | "text/plain": [
384 | ""
385 | ]
386 | },
387 | "execution_count": 25,
388 | "metadata": {},
389 | "output_type": "execute_result"
390 | }
391 | ],
392 | "source": [
393 | "autoencoder = create_autoencoder_model()\n",
394 | "\n",
395 | "early_stop = EarlyStopping(monitor='val_loss',\n",
396 | " patience=35,\n",
397 | " verbose=1,\n",
398 | " min_delta=1e-4)\n",
399 | "\n",
400 | "reduce_lr = ReduceLROnPlateau(monitor='val_loss',\n",
401 | " factor=0.1,\n",
402 | " patience=5,\n",
403 | " cooldown=2,\n",
404 | " verbose=1)\n",
405 | "\n",
406 | "autoencoder.fit(train, train,\n",
407 | " epochs=100,\n",
408 | " batch_size=512,\n",
409 | " validation_data=(test, test), callbacks = [early_stop, reduce_lr])"
410 | ]
411 | },
412 | {
413 | "cell_type": "code",
414 | "execution_count": 26,
415 | "metadata": {},
416 | "outputs": [
417 | {
418 | "name": "stdout",
419 | "output_type": "stream",
420 | "text": [
421 | "95233/95233 [==============================] - 2s 26us/step\n",
422 | "23809/23809 [==============================] - 1s 25us/step\n"
423 | ]
424 | }
425 | ],
426 | "source": [
427 | "model_bn = Model(autoencoder.input, autoencoder.layers[3].output)\n",
428 | "decompose_train = model_bn.predict(train, verbose = 1)\n",
429 | "decompose_test = model_bn.predict(test, verbose = 1)\n"
430 | ]
431 | },
432 | {
433 | "cell_type": "code",
434 | "execution_count": 27,
435 | "metadata": {},
436 | "outputs": [
437 | {
438 | "name": "stdout",
439 | "output_type": "stream",
440 | "text": [
441 | "ROC-AUC score on kNN:\n",
442 | "0.547487388360222\n",
443 | "ROC-AUC score on SVM:\n",
444 | "0.5264453710504442\n",
445 | "ROC-AUC score on Decision tree:\n",
446 | "0.5499497836853388\n"
447 | ]
448 | }
449 | ],
450 | "source": [
451 | "print('ROC-AUC score on kNN:')\n",
452 | "fit_knn(decompose_train, decompose_test, y_train, y_test)\n",
453 | "print('ROC-AUC score on SVM:')\n",
454 | "fit_svm(decompose_train, decompose_test, y_train, y_test)\n",
455 | "print('ROC-AUC score on Decision tree:')\n",
456 | "fit_tree(decompose_train, decompose_test, y_train, y_test)"
457 | ]
458 | },
459 | {
460 | "cell_type": "code",
461 | "execution_count": 28,
462 | "metadata": {},
463 | "outputs": [
464 | {
465 | "name": "stdout",
466 | "output_type": "stream",
467 | "text": [
468 | "ROC-AUC score on kNN:\n",
469 | "0.5609669115700139\n",
470 | "ROC-AUC score on SVM:\n",
471 | "0.4533384754593952\n",
472 | "ROC-AUC score on Decision tree:\n",
473 | "0.5525507861696763\n"
474 | ]
475 | }
476 | ],
477 | "source": [
478 | "print('ROC-AUC score on kNN:')\n",
479 | "fit_knn(train, test, y_train, y_test)\n",
480 | "print('ROC-AUC score on SVM:')\n",
481 | "fit_svm(train, test, y_train, y_test)\n",
482 | "print('ROC-AUC score on Decision tree:')\n",
483 | "fit_tree(train, test, y_train, y_test)"
484 | ]
485 | }
486 | ],
487 | "metadata": {
488 | "kernelspec": {
489 | "display_name": "Python 3",
490 | "language": "python",
491 | "name": "python3"
492 | },
493 | "language_info": {
494 | "codemirror_mode": {
495 | "name": "ipython",
496 | "version": 3
497 | },
498 | "file_extension": ".py",
499 | "mimetype": "text/x-python",
500 | "name": "python",
501 | "nbconvert_exporter": "python",
502 | "pygments_lexer": "ipython3",
503 | "version": "3.6.1"
504 | }
505 | },
506 | "nbformat": 4,
507 | "nbformat_minor": 2
508 | }
509 |
--------------------------------------------------------------------------------
/Section 3/3.1 Random Forest for classification.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {
7 | "collapsed": true
8 | },
9 | "outputs": [],
10 | "source": [
11 | "import pandas as pd\n",
12 | "import numpy as np\n",
13 | "import matplotlib.pyplot as plt\n",
14 | "%matplotlib inline\n",
15 | "from sklearn.preprocessing import StandardScaler\n",
16 | "from sklearn.linear_model import LogisticRegression\n",
17 | "from sklearn.ensemble import RandomForestClassifier\n",
18 | "from sklearn.model_selection import train_test_split\n",
19 | "from sklearn.metrics import accuracy_score\n",
20 | "from sklearn.preprocessing import PolynomialFeatures\n",
21 | "import warnings\n",
22 | "warnings.filterwarnings(\"ignore\")\n",
23 | "np.random.seed(42)"
24 | ]
25 | },
26 | {
27 | "cell_type": "code",
28 | "execution_count": 2,
29 | "metadata": {
30 | "collapsed": true
31 | },
32 | "outputs": [],
33 | "source": [
34 | "df = pd.read_csv(\"http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv\", sep = ';')\n",
35 | "y = df.pop('quality')\n",
36 | "for i in df.columns:\n",
37 | " df[i] = df[i].fillna(np.mean(df[i]))\n",
38 | "train, test, y_train, y_test = train_test_split(df, y, test_size = 0.2)"
39 | ]
40 | },
41 | {
42 | "cell_type": "code",
43 | "execution_count": 3,
44 | "metadata": {},
45 | "outputs": [
46 | {
47 | "name": "stdout",
48 | "output_type": "stream",
49 | "text": [
50 | "Accuracy score baseline: 0.5142857142857142\n"
51 | ]
52 | }
53 | ],
54 | "source": [
55 | "lr = LogisticRegression()\n",
56 | "lr.fit(train, y_train)\n",
57 | "y_pred = lr.predict(test)\n",
58 | "print('Accuracy score baseline:', accuracy_score(y_test, y_pred))"
59 | ]
60 | },
61 | {
62 | "cell_type": "code",
63 | "execution_count": 4,
64 | "metadata": {
65 | "collapsed": true
66 | },
67 | "outputs": [],
68 | "source": [
69 | "def fit_predict(train, test, y_train, y_test, max_depth = None , \n",
70 | " n_estimators = 10, max_features = 'auto', min_samples_split = 2,scaler = None):\n",
71 | " if scaler:\n",
72 | " train = scaler.fit_transform(train)\n",
73 | " test = scaler.transform(test) \n",
74 | " RF = RandomForestClassifier(n_estimators = n_estimators, max_depth=max_depth, \n",
75 | " random_state = 42, max_features = max_features,\n",
76 | " min_samples_split = min_samples_split)\n",
77 | " RF.fit(train, y_train)\n",
78 | " y_pred = RF.predict(test)\n",
79 | " print(accuracy_score(y_test, y_pred))"
80 | ]
81 | },
82 | {
83 | "cell_type": "code",
84 | "execution_count": 5,
85 | "metadata": {},
86 | "outputs": [
87 | {
88 | "name": "stdout",
89 | "output_type": "stream",
90 | "text": [
91 | "baseline accuracy score: 0.6428571428571429\n",
92 | "baseline accuracy score with scaler: 0.6418367346938776\n"
93 | ]
94 | }
95 | ],
96 | "source": [
97 | "print('baseline accuracy score', end = ': ')\n",
98 | "fit_predict(train,test,y_train,y_test)\n",
99 | "print('baseline accuracy score with scaler', end = ': ')\n",
100 | "fit_predict(train,test,y_train,y_test,scaler=StandardScaler())"
101 | ]
102 | },
103 | {
104 | "cell_type": "code",
105 | "execution_count": 6,
106 | "metadata": {},
107 | "outputs": [
108 | {
109 | "name": "stdout",
110 | "output_type": "stream",
111 | "text": [
112 | "Accuracy score using n_estimators = 20: 0.6591836734693878\n",
113 | "Accuracy score using n_estimators = 40: 0.6744897959183673\n",
114 | "Accuracy score using n_estimators = 60: 0.6816326530612244\n",
115 | "Accuracy score using n_estimators = 80: 0.6877551020408164\n",
116 | "Accuracy score using n_estimators = 100: 0.6908163265306122\n",
117 | "Accuracy score using n_estimators = 120: 0.6979591836734694\n",
118 | "Accuracy score using n_estimators = 140: 0.6908163265306122\n",
119 | "Accuracy score using n_estimators = 160: 0.6959183673469388\n",
120 | "Accuracy score using n_estimators = 180: 0.6948979591836735\n"
121 | ]
122 | }
123 | ],
124 | "source": [
125 | "for n_estimators in range(20,200,20):\n",
126 | " print('Accuracy score using n_estimators =', n_estimators, end = ': ')\n",
127 | " fit_predict(train,test,y_train,y_test,n_estimators = n_estimators)\n"
128 | ]
129 | },
130 | {
131 | "cell_type": "code",
132 | "execution_count": 7,
133 | "metadata": {},
134 | "outputs": [
135 | {
136 | "name": "stdout",
137 | "output_type": "stream",
138 | "text": [
139 | "Accuracy score using max_depth = 1: 0.44081632653061226\n",
140 | "Accuracy score using max_depth = 2: 0.4897959183673469\n",
141 | "Accuracy score using max_depth = 3: 0.49387755102040815\n",
142 | "Accuracy score using max_depth = 4: 0.5051020408163265\n",
143 | "Accuracy score using max_depth = 5: 0.5244897959183673\n",
144 | "Accuracy score using max_depth = 6: 0.5357142857142857\n",
145 | "Accuracy score using max_depth = 7: 0.563265306122449\n",
146 | "Accuracy score using max_depth = 8: 0.5826530612244898\n",
147 | "Accuracy score using max_depth = 9: 0.5959183673469388\n",
148 | "Accuracy score using max_depth = 10: 0.6091836734693877\n",
149 | "Accuracy score using max_depth = 11: 0.6469387755102041\n",
150 | "Accuracy score using max_depth = 12: 0.6744897959183673\n",
151 | "Accuracy score using max_depth = 13: 0.6795918367346939\n",
152 | "Accuracy score using max_depth = 14: 0.6979591836734694\n",
153 | "Accuracy score using max_depth = 15: 0.7010204081632653\n",
154 | "Accuracy score using max_depth = 16: 0.6959183673469388\n",
155 | "Accuracy score using max_depth = 17: 0.6969387755102041\n",
156 | "Accuracy score using max_depth = 18: 0.7020408163265306\n",
157 | "Accuracy score using max_depth = 19: 0.7010204081632653\n"
158 | ]
159 | }
160 | ],
161 | "source": [
162 | "for max_depth in range(1,20):\n",
163 | " print('Accuracy score using max_depth =', max_depth, end = ': ')\n",
164 | " fit_predict(train,test,y_train,y_test,n_estimators = 160,max_depth = max_depth)\n"
165 | ]
166 | },
167 | {
168 | "cell_type": "code",
169 | "execution_count": 8,
170 | "metadata": {},
171 | "outputs": [
172 | {
173 | "name": "stdout",
174 | "output_type": "stream",
175 | "text": [
176 | "Accuracy score using max_features = 0.1: 0.6969387755102041\n",
177 | "Accuracy score using max_features = 0.2: 0.7040816326530612\n",
178 | "Accuracy score using max_features = 0.30000000000000004: 0.7020408163265306\n",
179 | "Accuracy score using max_features = 0.4: 0.6948979591836735\n",
180 | "Accuracy score using max_features = 0.5: 0.6969387755102041\n",
181 | "Accuracy score using max_features = 0.6: 0.6908163265306122\n",
182 | "Accuracy score using max_features = 0.7000000000000001: 0.6969387755102041\n",
183 | "Accuracy score using max_features = 0.8: 0.6989795918367347\n",
184 | "Accuracy score using max_features = 0.9: 0.6918367346938775\n",
185 | "Accuracy score using max_features = 1.0: 0.7020408163265306\n"
186 | ]
187 | }
188 | ],
189 | "source": [
190 | "for max_features in np.linspace(0.1,1,10):\n",
191 | " print('Accuracy score using max_features =', max_features, end = ': ')\n",
192 | " fit_predict(train,test,y_train,y_test,n_estimators = 160,max_features = max_features,max_depth = 18)\n"
193 | ]
194 | },
195 | {
196 | "cell_type": "code",
197 | "execution_count": 9,
198 | "metadata": {},
199 | "outputs": [
200 | {
201 | "name": "stdout",
202 | "output_type": "stream",
203 | "text": [
204 | "Accuracy score using min_samples_split = 2: 0.7040816326530612\n",
205 | "Accuracy score using min_samples_split = 3: 0.7193877551020408\n",
206 | "Accuracy score using min_samples_split = 4: 0.7040816326530612\n",
207 | "Accuracy score using min_samples_split = 5: 0.6938775510204082\n",
208 | "Accuracy score using min_samples_split = 6: 0.6938775510204082\n",
209 | "Accuracy score using min_samples_split = 7: 0.6857142857142857\n",
210 | "Accuracy score using min_samples_split = 8: 0.6806122448979591\n",
211 | "Accuracy score using min_samples_split = 9: 0.6714285714285714\n"
212 | ]
213 | }
214 | ],
215 | "source": [
216 | "for min_samples_split in range(2,10):\n",
217 | " print('Accuracy score using min_samples_split =', min_samples_split, end = ': ')\n",
218 | " fit_predict(train,test,y_train,y_test,n_estimators = 160,max_features = 0.2,min_samples_split=min_samples_split\n",
219 | " ,max_depth = 18)\n"
220 | ]
221 | },
222 | {
223 | "cell_type": "code",
224 | "execution_count": 10,
225 | "metadata": {},
226 | "outputs": [
227 | {
228 | "name": "stdout",
229 | "output_type": "stream",
230 | "text": [
231 | "tuned accuracy score: 0.7193877551020408\n",
232 | "tuned accuracy score with scaler: 0.7173469387755103\n"
233 | ]
234 | }
235 | ],
236 | "source": [
237 | "print('tuned accuracy score', end = ': ')\n",
238 | "fit_predict(train,test,y_train,y_test,n_estimators = 160,max_features = 0.2,min_samples_split=3,max_depth = 18)\n",
239 | "print('tuned accuracy score with scaler', end = ': ')\n",
240 | "\n",
241 | "fit_predict(train,test,y_train,y_test,n_estimators = 160,max_features = 0.2,min_samples_split=3,\n",
242 | " max_depth = 18,scaler=StandardScaler())"
243 | ]
244 | },
245 | {
246 | "cell_type": "code",
247 | "execution_count": 11,
248 | "metadata": {},
249 | "outputs": [
250 | {
251 | "name": "stdout",
252 | "output_type": "stream",
253 | "text": [
254 | "overall improvement is 39.88 %\n"
255 | ]
256 | }
257 | ],
258 | "source": [
259 | "original_score = 0.514285714286\n",
260 | "best_score = 0.7193877551020408\n",
261 | "improvement = np.abs(np.round(100*(original_score - best_score)/original_score,2))\n",
262 | "print('overall improvement is {} %'.format(improvement))"
263 | ]
264 | },
265 | {
266 | "cell_type": "code",
267 | "execution_count": 12,
268 | "metadata": {},
269 | "outputs": [
270 | {
271 | "name": "stdout",
272 | "output_type": "stream",
273 | "text": [
274 | "overall improvement compare to non tuned model is 11.9 %\n"
275 | ]
276 | }
277 | ],
278 | "source": [
279 | "original_score = 0.6428571428571429\n",
280 | "best_score = 0.7193877551020408\n",
281 | "improvement = np.abs(np.round(100*(original_score - best_score)/original_score,2))\n",
282 | "print('overall improvement compare to non tuned model is {} %'.format(improvement))"
283 | ]
284 | }
285 | ],
286 | "metadata": {
287 | "kernelspec": {
288 | "display_name": "Python 3",
289 | "language": "python",
290 | "name": "python3"
291 | },
292 | "language_info": {
293 | "codemirror_mode": {
294 | "name": "ipython",
295 | "version": 3
296 | },
297 | "file_extension": ".py",
298 | "mimetype": "text/x-python",
299 | "name": "python",
300 | "nbconvert_exporter": "python",
301 | "pygments_lexer": "ipython3",
302 | "version": "3.6.1"
303 | }
304 | },
305 | "nbformat": 4,
306 | "nbformat_minor": 2
307 | }
308 |
--------------------------------------------------------------------------------
/Section 3/3.5 stacking.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {
7 | "collapsed": true
8 | },
9 | "outputs": [],
10 | "source": [
11 | "import pandas as pd\n",
12 | "import numpy as np\n",
13 | "import lightgbm as lgb\n",
14 | "from catboost import CatBoostClassifier\n",
15 | "from sklearn.model_selection import train_test_split\n",
16 | "from sklearn.metrics import accuracy_score\n",
17 | "from sklearn.linear_model import LogisticRegression\n",
18 | "import warnings\n",
19 | "from sklearn import preprocessing\n",
20 | "from sklearn.neighbors import KNeighborsClassifier\n",
21 | "from sklearn.preprocessing import StandardScaler\n",
22 | "warnings.filterwarnings(\"ignore\")\n",
23 | "np.random.seed(42)"
24 | ]
25 | },
26 | {
27 | "cell_type": "code",
28 | "execution_count": 2,
29 | "metadata": {},
30 | "outputs": [
31 | {
32 | "data": {
33 | "text/html": [
34 | "\n",
35 | "\n",
48 | "
\n",
49 | " \n",
50 | " \n",
51 | " | \n",
52 | " v1 | \n",
53 | " v2 | \n",
54 | " v3 | \n",
55 | " v4 | \n",
56 | " v5 | \n",
57 | " v6 | \n",
58 | " v7 | \n",
59 | " v8 | \n",
60 | " v9 | \n",
61 | " v10 | \n",
62 | " ... | \n",
63 | " v122 | \n",
64 | " v123 | \n",
65 | " v124 | \n",
66 | " v125 | \n",
67 | " v126 | \n",
68 | " v127 | \n",
69 | " v128 | \n",
70 | " v129 | \n",
71 | " v130 | \n",
72 | " v131 | \n",
73 | "
\n",
74 | " \n",
75 | " \n",
76 | " \n",
77 | " 0 | \n",
78 | " 1.335739 | \n",
79 | " 8.727474 | \n",
80 | " C | \n",
81 | " 3.921026 | \n",
82 | " 7.915266 | \n",
83 | " 2.599278 | \n",
84 | " 3.176895 | \n",
85 | " 0.012941 | \n",
86 | " 9.999999 | \n",
87 | " 0.503281 | \n",
88 | " ... | \n",
89 | " 8.000000 | \n",
90 | " 1.989780 | \n",
91 | " 0.035754 | \n",
92 | " AU | \n",
93 | " 1.804126 | \n",
94 | " 3.113719 | \n",
95 | " 2.024285 | \n",
96 | " 0 | \n",
97 | " 0.636365 | \n",
98 | " 2.857144 | \n",
99 | "
\n",
100 | " \n",
101 | " 1 | \n",
102 | " NaN | \n",
103 | " NaN | \n",
104 | " C | \n",
105 | " NaN | \n",
106 | " 9.191265 | \n",
107 | " NaN | \n",
108 | " NaN | \n",
109 | " 2.301630 | \n",
110 | " NaN | \n",
111 | " 1.312910 | \n",
112 | " ... | \n",
113 | " NaN | \n",
114 | " NaN | \n",
115 | " 0.598896 | \n",
116 | " AF | \n",
117 | " NaN | \n",
118 | " NaN | \n",
119 | " 1.957825 | \n",
120 | " 0 | \n",
121 | " NaN | \n",
122 | " NaN | \n",
123 | "
\n",
124 | " \n",
125 | " 2 | \n",
126 | " 0.943877 | \n",
127 | " 5.310079 | \n",
128 | " C | \n",
129 | " 4.410969 | \n",
130 | " 5.326159 | \n",
131 | " 3.979592 | \n",
132 | " 3.928571 | \n",
133 | " 0.019645 | \n",
134 | " 12.666667 | \n",
135 | " 0.765864 | \n",
136 | " ... | \n",
137 | " 9.333333 | \n",
138 | " 2.477596 | \n",
139 | " 0.013452 | \n",
140 | " AE | \n",
141 | " 1.773709 | \n",
142 | " 3.922193 | \n",
143 | " 1.120468 | \n",
144 | " 2 | \n",
145 | " 0.883118 | \n",
146 | " 1.176472 | \n",
147 | "
\n",
148 | " \n",
149 | " 3 | \n",
150 | " 0.797415 | \n",
151 | " 8.304757 | \n",
152 | " C | \n",
153 | " 4.225930 | \n",
154 | " 11.627438 | \n",
155 | " 2.097700 | \n",
156 | " 1.987549 | \n",
157 | " 0.171947 | \n",
158 | " 8.965516 | \n",
159 | " 6.542669 | \n",
160 | " ... | \n",
161 | " 7.018256 | \n",
162 | " 1.812795 | \n",
163 | " 0.002267 | \n",
164 | " CJ | \n",
165 | " 1.415230 | \n",
166 | " 2.954381 | \n",
167 | " 1.990847 | \n",
168 | " 1 | \n",
169 | " 1.677108 | \n",
170 | " 1.034483 | \n",
171 | "
\n",
172 | " \n",
173 | " 4 | \n",
174 | " NaN | \n",
175 | " NaN | \n",
176 | " C | \n",
177 | " NaN | \n",
178 | " NaN | \n",
179 | " NaN | \n",
180 | " NaN | \n",
181 | " NaN | \n",
182 | " NaN | \n",
183 | " 1.050328 | \n",
184 | " ... | \n",
185 | " NaN | \n",
186 | " NaN | \n",
187 | " NaN | \n",
188 | " Z | \n",
189 | " NaN | \n",
190 | " NaN | \n",
191 | " NaN | \n",
192 | " 0 | \n",
193 | " NaN | \n",
194 | " NaN | \n",
195 | "
\n",
196 | " \n",
197 | "
\n",
198 | "
5 rows × 131 columns
\n",
199 | "
"
200 | ],
201 | "text/plain": [
202 | " v1 v2 v3 v4 v5 v6 v7 v8 \\\n",
203 | "0 1.335739 8.727474 C 3.921026 7.915266 2.599278 3.176895 0.012941 \n",
204 | "1 NaN NaN C NaN 9.191265 NaN NaN 2.301630 \n",
205 | "2 0.943877 5.310079 C 4.410969 5.326159 3.979592 3.928571 0.019645 \n",
206 | "3 0.797415 8.304757 C 4.225930 11.627438 2.097700 1.987549 0.171947 \n",
207 | "4 NaN NaN C NaN NaN NaN NaN NaN \n",
208 | "\n",
209 | " v9 v10 ... v122 v123 v124 v125 \\\n",
210 | "0 9.999999 0.503281 ... 8.000000 1.989780 0.035754 AU \n",
211 | "1 NaN 1.312910 ... NaN NaN 0.598896 AF \n",
212 | "2 12.666667 0.765864 ... 9.333333 2.477596 0.013452 AE \n",
213 | "3 8.965516 6.542669 ... 7.018256 1.812795 0.002267 CJ \n",
214 | "4 NaN 1.050328 ... NaN NaN NaN Z \n",
215 | "\n",
216 | " v126 v127 v128 v129 v130 v131 \n",
217 | "0 1.804126 3.113719 2.024285 0 0.636365 2.857144 \n",
218 | "1 NaN NaN 1.957825 0 NaN NaN \n",
219 | "2 1.773709 3.922193 1.120468 2 0.883118 1.176472 \n",
220 | "3 1.415230 2.954381 1.990847 1 1.677108 1.034483 \n",
221 | "4 NaN NaN NaN 0 NaN NaN \n",
222 | "\n",
223 | "[5 rows x 131 columns]"
224 | ]
225 | },
226 | "execution_count": 2,
227 | "metadata": {},
228 | "output_type": "execute_result"
229 | }
230 | ],
231 | "source": [
232 | "df = pd.read_csv('train.csv')\n",
233 | "y = df.target\n",
234 | "\n",
235 | "df.drop(['ID', 'target'], axis=1, inplace=True)\n",
236 | "df.head()"
237 | ]
238 | },
239 | {
240 | "cell_type": "code",
241 | "execution_count": 3,
242 | "metadata": {
243 | "collapsed": true
244 | },
245 | "outputs": [],
246 | "source": [
247 | "string_type = []\n",
248 | "for column in df.columns:\n",
249 | " if type(df[column].values[0]) == str:\n",
250 | " string_type.append(column)\n",
251 | "string_type.append('v113')\n",
252 | " \n",
253 | "df[string_type] = df[string_type].fillna('zero')\n",
254 | "\n",
255 | "df.fillna(-9999, inplace=True)\n",
256 | "cat_features_ids = np.where(df.apply(pd.Series.nunique) < 10000)[0].tolist()"
257 | ]
258 | },
259 | {
260 | "cell_type": "code",
261 | "execution_count": 4,
262 | "metadata": {},
263 | "outputs": [
264 | {
265 | "data": {
266 | "text/html": [
267 | "\n",
268 | "\n",
281 | "
\n",
282 | " \n",
283 | " \n",
284 | " | \n",
285 | " v1 | \n",
286 | " v2 | \n",
287 | " v3 | \n",
288 | " v4 | \n",
289 | " v5 | \n",
290 | " v6 | \n",
291 | " v7 | \n",
292 | " v8 | \n",
293 | " v9 | \n",
294 | " v10 | \n",
295 | " ... | \n",
296 | " v122 | \n",
297 | " v123 | \n",
298 | " v124 | \n",
299 | " v125 | \n",
300 | " v126 | \n",
301 | " v127 | \n",
302 | " v128 | \n",
303 | " v129 | \n",
304 | " v130 | \n",
305 | " v131 | \n",
306 | "
\n",
307 | " \n",
308 | " \n",
309 | " \n",
310 | " 0 | \n",
311 | " 1.335739 | \n",
312 | " 8.727474 | \n",
313 | " C | \n",
314 | " 3.921026 | \n",
315 | " 7.915266 | \n",
316 | " 2.599278 | \n",
317 | " 3.176895 | \n",
318 | " 0.012941 | \n",
319 | " 9.999999 | \n",
320 | " 0.503281 | \n",
321 | " ... | \n",
322 | " 8.000000 | \n",
323 | " 1.989780 | \n",
324 | " 0.035754 | \n",
325 | " AU | \n",
326 | " 1.804126 | \n",
327 | " 3.113719 | \n",
328 | " 2.024285 | \n",
329 | " 0 | \n",
330 | " 0.636365 | \n",
331 | " 2.857144 | \n",
332 | "
\n",
333 | " \n",
334 | " 1 | \n",
335 | " -9999.000000 | \n",
336 | " -9999.000000 | \n",
337 | " C | \n",
338 | " -9999.000000 | \n",
339 | " 9.191265 | \n",
340 | " -9999.000000 | \n",
341 | " -9999.000000 | \n",
342 | " 2.301630 | \n",
343 | " -9999.000000 | \n",
344 | " 1.312910 | \n",
345 | " ... | \n",
346 | " -9999.000000 | \n",
347 | " -9999.000000 | \n",
348 | " 0.598896 | \n",
349 | " AF | \n",
350 | " -9999.000000 | \n",
351 | " -9999.000000 | \n",
352 | " 1.957825 | \n",
353 | " 0 | \n",
354 | " -9999.000000 | \n",
355 | " -9999.000000 | \n",
356 | "
\n",
357 | " \n",
358 | " 2 | \n",
359 | " 0.943877 | \n",
360 | " 5.310079 | \n",
361 | " C | \n",
362 | " 4.410969 | \n",
363 | " 5.326159 | \n",
364 | " 3.979592 | \n",
365 | " 3.928571 | \n",
366 | " 0.019645 | \n",
367 | " 12.666667 | \n",
368 | " 0.765864 | \n",
369 | " ... | \n",
370 | " 9.333333 | \n",
371 | " 2.477596 | \n",
372 | " 0.013452 | \n",
373 | " AE | \n",
374 | " 1.773709 | \n",
375 | " 3.922193 | \n",
376 | " 1.120468 | \n",
377 | " 2 | \n",
378 | " 0.883118 | \n",
379 | " 1.176472 | \n",
380 | "
\n",
381 | " \n",
382 | " 3 | \n",
383 | " 0.797415 | \n",
384 | " 8.304757 | \n",
385 | " C | \n",
386 | " 4.225930 | \n",
387 | " 11.627438 | \n",
388 | " 2.097700 | \n",
389 | " 1.987549 | \n",
390 | " 0.171947 | \n",
391 | " 8.965516 | \n",
392 | " 6.542669 | \n",
393 | " ... | \n",
394 | " 7.018256 | \n",
395 | " 1.812795 | \n",
396 | " 0.002267 | \n",
397 | " CJ | \n",
398 | " 1.415230 | \n",
399 | " 2.954381 | \n",
400 | " 1.990847 | \n",
401 | " 1 | \n",
402 | " 1.677108 | \n",
403 | " 1.034483 | \n",
404 | "
\n",
405 | " \n",
406 | " 4 | \n",
407 | " -9999.000000 | \n",
408 | " -9999.000000 | \n",
409 | " C | \n",
410 | " -9999.000000 | \n",
411 | " -9999.000000 | \n",
412 | " -9999.000000 | \n",
413 | " -9999.000000 | \n",
414 | " -9999.000000 | \n",
415 | " -9999.000000 | \n",
416 | " 1.050328 | \n",
417 | " ... | \n",
418 | " -9999.000000 | \n",
419 | " -9999.000000 | \n",
420 | " -9999.000000 | \n",
421 | " Z | \n",
422 | " -9999.000000 | \n",
423 | " -9999.000000 | \n",
424 | " -9999.000000 | \n",
425 | " 0 | \n",
426 | " -9999.000000 | \n",
427 | " -9999.000000 | \n",
428 | "
\n",
429 | " \n",
430 | "
\n",
431 | "
5 rows × 131 columns
\n",
432 | "
"
433 | ],
434 | "text/plain": [
435 | " v1 v2 v3 v4 v5 v6 \\\n",
436 | "0 1.335739 8.727474 C 3.921026 7.915266 2.599278 \n",
437 | "1 -9999.000000 -9999.000000 C -9999.000000 9.191265 -9999.000000 \n",
438 | "2 0.943877 5.310079 C 4.410969 5.326159 3.979592 \n",
439 | "3 0.797415 8.304757 C 4.225930 11.627438 2.097700 \n",
440 | "4 -9999.000000 -9999.000000 C -9999.000000 -9999.000000 -9999.000000 \n",
441 | "\n",
442 | " v7 v8 v9 v10 ... v122 \\\n",
443 | "0 3.176895 0.012941 9.999999 0.503281 ... 8.000000 \n",
444 | "1 -9999.000000 2.301630 -9999.000000 1.312910 ... -9999.000000 \n",
445 | "2 3.928571 0.019645 12.666667 0.765864 ... 9.333333 \n",
446 | "3 1.987549 0.171947 8.965516 6.542669 ... 7.018256 \n",
447 | "4 -9999.000000 -9999.000000 -9999.000000 1.050328 ... -9999.000000 \n",
448 | "\n",
449 | " v123 v124 v125 v126 v127 v128 \\\n",
450 | "0 1.989780 0.035754 AU 1.804126 3.113719 2.024285 \n",
451 | "1 -9999.000000 0.598896 AF -9999.000000 -9999.000000 1.957825 \n",
452 | "2 2.477596 0.013452 AE 1.773709 3.922193 1.120468 \n",
453 | "3 1.812795 0.002267 CJ 1.415230 2.954381 1.990847 \n",
454 | "4 -9999.000000 -9999.000000 Z -9999.000000 -9999.000000 -9999.000000 \n",
455 | "\n",
456 | " v129 v130 v131 \n",
457 | "0 0 0.636365 2.857144 \n",
458 | "1 0 -9999.000000 -9999.000000 \n",
459 | "2 2 0.883118 1.176472 \n",
460 | "3 1 1.677108 1.034483 \n",
461 | "4 0 -9999.000000 -9999.000000 \n",
462 | "\n",
463 | "[5 rows x 131 columns]"
464 | ]
465 | },
466 | "execution_count": 4,
467 | "metadata": {},
468 | "output_type": "execute_result"
469 | }
470 | ],
471 | "source": [
472 | "df.head()"
473 | ]
474 | },
475 | {
476 | "cell_type": "code",
477 | "execution_count": 5,
478 | "metadata": {
479 | "collapsed": true
480 | },
481 | "outputs": [],
482 | "source": [
483 | "le = preprocessing.LabelEncoder()\n",
484 | "for column in string_type:\n",
485 | " df[column] = le.fit_transform(df[column])"
486 | ]
487 | },
488 | {
489 | "cell_type": "code",
490 | "execution_count": 6,
491 | "metadata": {
492 | "collapsed": true
493 | },
494 | "outputs": [],
495 | "source": [
496 | "train, test, y_train, y_test = train_test_split(df, y, test_size = 0.2)"
497 | ]
498 | },
499 | {
500 | "cell_type": "code",
501 | "execution_count": 7,
502 | "metadata": {},
503 | "outputs": [
504 | {
505 | "name": "stdout",
506 | "output_type": "stream",
507 | "text": [
508 | "Accuracy score baseline: 0.7616881696916685\n"
509 | ]
510 | }
511 | ],
512 | "source": [
513 | "lr = LogisticRegression()\n",
514 | "lr.fit(train, y_train)\n",
515 | "y_pred = lr.predict(test)\n",
516 | "print('Accuracy score baseline:', accuracy_score(y_test, y_pred))"
517 | ]
518 | },
519 | {
520 | "cell_type": "code",
521 | "execution_count": 8,
522 | "metadata": {
523 | "collapsed": true
524 | },
525 | "outputs": [],
526 | "source": [
527 | "folds = pd.DataFrame(list(range(len(train))))\n",
528 | "folds[0] = folds.values % 3\n",
529 | "folds.rename(columns={0:'fold'},inplace=True)"
530 | ]
531 | },
532 | {
533 | "cell_type": "markdown",
534 | "metadata": {},
535 | "source": [
536 | "## Catboost"
537 | ]
538 | },
539 | {
540 | "cell_type": "code",
541 | "execution_count": 9,
542 | "metadata": {
543 | "collapsed": true
544 | },
545 | "outputs": [],
546 | "source": [
547 | "dict_OOF_predict_cbst = {}\n",
548 | "dict_test_predict_cbst = {}\n",
549 | "for fold in [0,1,2]:\n",
550 | "\n",
551 | " \n",
552 | " clf = CatBoostClassifier(learning_rate=0.1, iterations=500, random_seed=42, logging_level='Silent')\n",
553 | " clf.fit(train.values[folds.fold != fold], y_train.values[folds.fold != fold], cat_features=cat_features_ids)\n",
554 | " \n",
555 | " predicts = clf.predict_proba(test)[:,1]\n",
556 | " dict_test_predict_cbst[fold] = predicts\n",
557 | "\n",
558 | " predicts_OOF = clf.predict_proba(train.values[folds.fold == fold])[:,1]\n",
559 | " dict_OOF_predict_cbst[fold] = predicts_OOF\n",
560 | " del clf\n",
561 | " \n"
562 | ]
563 | },
564 | {
565 | "cell_type": "code",
566 | "execution_count": 20,
567 | "metadata": {
568 | "collapsed": true
569 | },
570 | "outputs": [],
571 | "source": [
572 | " clf = CatBoostClassifier(learning_rate=0.1, iterations=500, random_seed=42, logging_level='Silent')\n",
573 | " clf.fit()"
574 | ]
575 | },
576 | {
577 | "cell_type": "code",
578 | "execution_count": 10,
579 | "metadata": {},
580 | "outputs": [
581 | {
582 | "name": "stdout",
583 | "output_type": "stream",
584 | "text": [
585 | "OOF Catboost accuracy 0.7817748425472358\n"
586 | ]
587 | }
588 | ],
589 | "source": [
590 | "OOF_X_cbst = np.zeros_like(y_train)\n",
591 | "OOF_X_cbst = pd.DataFrame(OOF_X_cbst)\n",
592 | "for fold in dict_OOF_predict_cbst.keys():\n",
593 | " OOF_X_cbst[folds.fold == fold] = dict_OOF_predict_cbst[fold].reshape((dict_OOF_predict_cbst[fold].shape[0],1))\n",
594 | "print('OOF Catboost accuracy',accuracy_score(y_train,np.round(OOF_X_cbst)))"
595 | ]
596 | },
597 | {
598 | "cell_type": "code",
599 | "execution_count": 11,
600 | "metadata": {},
601 | "outputs": [
602 | {
603 | "name": "stdout",
604 | "output_type": "stream",
605 | "text": [
606 | "test Catboost accuracy 0.7865296304395364\n"
607 | ]
608 | }
609 | ],
610 | "source": [
611 | "f_0,f_1,f_2 = dict_test_predict_cbst.keys()\n",
612 | "sub_cbst = dict_test_predict_cbst[f_0] + dict_test_predict_cbst[f_1] + dict_test_predict_cbst[f_2]\n",
613 | "sub_cbst/=3\n",
614 | "print('test Catboost accuracy',accuracy_score(y_test,np.round(sub_cbst)))"
615 | ]
616 | },
617 | {
618 | "cell_type": "markdown",
619 | "metadata": {},
620 | "source": [
621 | "## KNN 2"
622 | ]
623 | },
624 | {
625 | "cell_type": "code",
626 | "execution_count": 12,
627 | "metadata": {},
628 | "outputs": [
629 | {
630 | "name": "stdout",
631 | "output_type": "stream",
632 | "text": [
633 | "OOF kNN 2 accuracy 0.6638820853743876\n",
634 | "test kNN 2 accuracy 0.7161163350098404\n"
635 | ]
636 | }
637 | ],
638 | "source": [
639 | "\n",
640 | "dict_OOF_predict_knn_2 = {}\n",
641 | "dict_test_predict_knn_2 = {}\n",
642 | "for fold in [0,1,2]:\n",
643 | "\n",
644 | " \n",
645 | " clf = KNeighborsClassifier(n_neighbors = 2,weights='distance',n_jobs=32)\n",
646 | " clf.fit(train.values[folds.fold != fold], y_train.values[folds.fold != fold])\n",
647 | " \n",
648 | " predicts = clf.predict_proba(test)[:,1]\n",
649 | " dict_test_predict_knn_2[fold] = predicts\n",
650 | "\n",
651 | " predicts_OOF = clf.predict_proba(train.values[folds.fold == fold])[:,1]\n",
652 | " dict_OOF_predict_knn_2[fold] = predicts_OOF\n",
653 | " del clf\n",
654 | " \n",
655 | "OOF_X_knn_2 = np.zeros_like(y_train)\n",
656 | "OOF_X_knn_2 = pd.DataFrame(OOF_X_knn_2)\n",
657 | "for fold in dict_OOF_predict_knn_2.keys():\n",
658 | " OOF_X_knn_2[folds.fold == fold] = dict_OOF_predict_knn_2[fold].reshape((dict_OOF_predict_knn_2[fold].shape[0],1))\n",
659 | "print('OOF kNN 2 accuracy',accuracy_score(y_train,np.round(OOF_X_knn_2)))\n",
660 | "\n",
661 | "f_0,f_1,f_2 = dict_test_predict_knn_2.keys()\n",
662 | "sub_knn_2 = dict_test_predict_knn_2[f_0] + dict_test_predict_knn_2[f_1] + dict_test_predict_knn_2[f_2]\n",
663 | "sub_knn_2/=3\n",
664 | "print('test kNN 2 accuracy',accuracy_score(y_test,np.round(sub_knn_2)))"
665 | ]
666 | },
667 | {
668 | "cell_type": "markdown",
669 | "metadata": {},
670 | "source": [
671 | "## KNN 4"
672 | ]
673 | },
674 | {
675 | "cell_type": "code",
676 | "execution_count": 13,
677 | "metadata": {},
678 | "outputs": [
679 | {
680 | "name": "stdout",
681 | "output_type": "stream",
682 | "text": [
683 | "OOF kNN 4 accuracy 0.7044808432470259\n",
684 | "test kNN 4 accuracy 0.7345287557402143\n"
685 | ]
686 | }
687 | ],
688 | "source": [
689 | "\n",
690 | "dict_OOF_predict_knn_4 = {}\n",
691 | "dict_test_predict_knn_4 = {}\n",
692 | "for fold in [0,1,2]:\n",
693 | "\n",
694 | " \n",
695 | " clf = KNeighborsClassifier(n_neighbors= 4,weights='distance',n_jobs=32)\n",
696 | " clf.fit(train.values[folds.fold != fold], y_train.values[folds.fold != fold])\n",
697 | " \n",
698 | " predicts = clf.predict_proba(test)[:,1]\n",
699 | " dict_test_predict_knn_4[fold] = predicts\n",
700 | "\n",
701 | " predicts_OOF = clf.predict_proba(train.values[folds.fold == fold])[:,1]\n",
702 | " dict_OOF_predict_knn_4[fold] = predicts_OOF\n",
703 | " del clf\n",
704 | " \n",
705 | "OOF_X_knn_4 = np.zeros_like(y_train)\n",
706 | "OOF_X_knn_4 = pd.DataFrame(OOF_X_knn_4)\n",
707 | "for fold in dict_OOF_predict_knn_4.keys():\n",
708 | " OOF_X_knn_4[folds.fold == fold] = dict_OOF_predict_knn_4[fold].reshape((dict_OOF_predict_knn_4[fold].shape[0],1))\n",
709 | "print('OOF kNN 4 accuracy',accuracy_score(y_train,np.round(OOF_X_knn_4)))\n",
710 | "\n",
711 | "f_0,f_1,f_2 = dict_test_predict_knn_4.keys()\n",
712 | "sub_knn_4 = dict_test_predict_knn_4[f_0] + dict_test_predict_knn_4[f_1] + dict_test_predict_knn_4[f_2]\n",
713 | "sub_knn_4/=3\n",
714 | "print('test kNN 4 accuracy',accuracy_score(y_test,np.round(sub_knn_4)))"
715 | ]
716 | },
717 | {
718 | "cell_type": "markdown",
719 | "metadata": {},
720 | "source": [
721 | "## KNN 8"
722 | ]
723 | },
724 | {
725 | "cell_type": "code",
726 | "execution_count": 14,
727 | "metadata": {},
728 | "outputs": [
729 | {
730 | "name": "stdout",
731 | "output_type": "stream",
732 | "text": [
733 | "OOF kNN 8 accuracy 0.7372397655703289\n",
734 | "test kNN 8 accuracy 0.7567461185217581\n"
735 | ]
736 | }
737 | ],
738 | "source": [
739 | "\n",
740 | "dict_OOF_predict_knn_8 = {}\n",
741 | "dict_test_predict_knn_8 = {}\n",
742 | "for fold in [0,1,2]:\n",
743 | "\n",
744 | " \n",
745 | " clf = KNeighborsClassifier(n_neighbors = 8,weights='distance',n_jobs=32)\n",
746 | " clf.fit(train.values[folds.fold != fold], y_train.values[folds.fold != fold])\n",
747 | " \n",
748 | " predicts = clf.predict_proba(test)[:,1]\n",
749 | " dict_test_predict_knn_8[fold] = predicts\n",
750 | "\n",
751 | " predicts_OOF = clf.predict_proba(train.values[folds.fold == fold])[:,1]\n",
752 | " dict_OOF_predict_knn_8[fold] = predicts_OOF\n",
753 | " del clf\n",
754 | " \n",
755 | "OOF_X_knn_8 = np.zeros_like(y_train)\n",
756 | "OOF_X_knn_8 = pd.DataFrame(OOF_X_knn_8)\n",
757 | "for fold in dict_OOF_predict_knn_8.keys():\n",
758 | " OOF_X_knn_8[folds.fold == fold] = dict_OOF_predict_knn_8[fold].reshape((dict_OOF_predict_knn_8[fold].shape[0],1))\n",
759 | "print('OOF kNN 8 accuracy',accuracy_score(y_train,np.round(OOF_X_knn_8)))\n",
760 | "\n",
761 | "f_0,f_1,f_2 = dict_test_predict_knn_8.keys()\n",
762 | "sub_knn_8 = dict_test_predict_knn_8[f_0] + dict_test_predict_knn_8[f_1] + dict_test_predict_knn_8[f_2]\n",
763 | "sub_knn_8/=3\n",
764 | "print('test kNN 8 accuracy',accuracy_score(y_test,np.round(sub_knn_8)))"
765 | ]
766 | },
767 | {
768 | "cell_type": "markdown",
769 | "metadata": {},
770 | "source": [
771 | "## KNN 16"
772 | ]
773 | },
774 | {
775 | "cell_type": "code",
776 | "execution_count": 15,
777 | "metadata": {},
778 | "outputs": [
779 | {
780 | "name": "stdout",
781 | "output_type": "stream",
782 | "text": [
783 | "OOF kNN 16 accuracy 0.75659333449965\n",
784 | "test kNN 16 accuracy 0.7642685326918872\n"
785 | ]
786 | }
787 | ],
788 | "source": [
789 | "\n",
790 | "dict_OOF_predict_knn_16 = {}\n",
791 | "dict_test_predict_knn_16 = {}\n",
792 | "for fold in [0,1,2]:\n",
793 | "\n",
794 | " \n",
795 | " clf = KNeighborsClassifier(n_neighbors = 16,weights='distance',n_jobs=32)\n",
796 | " clf.fit(train.values[folds.fold != fold], y_train.values[folds.fold != fold])\n",
797 | " \n",
798 | " predicts = clf.predict_proba(test)[:,1]\n",
799 | " dict_test_predict_knn_16[fold] = predicts\n",
800 | "\n",
801 | " predicts_OOF = clf.predict_proba(train.values[folds.fold == fold])[:,1]\n",
802 | " dict_OOF_predict_knn_16[fold] = predicts_OOF\n",
803 | " del clf\n",
804 | " \n",
805 | "OOF_X_knn_16 = np.zeros_like(y_train)\n",
806 | "OOF_X_knn_16 = pd.DataFrame(OOF_X_knn_16)\n",
807 | "for fold in dict_OOF_predict_knn_16.keys():\n",
808 | " OOF_X_knn_16[folds.fold == fold] = dict_OOF_predict_knn_16[fold].reshape((dict_OOF_predict_knn_16[fold].shape[0],1))\n",
809 | "print('OOF kNN 16 accuracy',accuracy_score(y_train,np.round(OOF_X_knn_16)))\n",
810 | "\n",
811 | "\n",
812 | "f_0,f_1,f_2 = dict_test_predict_knn_16.keys()\n",
813 | "sub_knn_16 = dict_test_predict_knn_16[f_0] + dict_test_predict_knn_16[f_1] + dict_test_predict_knn_16[f_2]\n",
814 | "sub_knn_16/=3\n",
815 | "print('test kNN 16 accuracy',accuracy_score(y_test,np.round(sub_knn_16)))"
816 | ]
817 | },
818 | {
819 | "cell_type": "markdown",
820 | "metadata": {},
821 | "source": [
822 | "## Stacking"
823 | ]
824 | },
825 | {
826 | "cell_type": "code",
827 | "execution_count": 16,
828 | "metadata": {
829 | "collapsed": true
830 | },
831 | "outputs": [],
832 | "source": [
833 | "stacked_data_set_train = pd.DataFrame(OOF_X_cbst)\n",
834 | "stacked_data_set_test = pd.DataFrame(sub_cbst)\n",
835 | "\n",
836 | "stacked_data_set_train[1] = OOF_X_knn_2\n",
837 | "stacked_data_set_test[1] = sub_knn_2\n",
838 | "\n",
839 | "stacked_data_set_train[2] = OOF_X_knn_4\n",
840 | "stacked_data_set_test[2] = sub_knn_4 \n",
841 | "\n",
842 | "stacked_data_set_train[3] = OOF_X_knn_8\n",
843 | "stacked_data_set_test[3] = sub_knn_8\n",
844 | "\n",
845 | "stacked_data_set_train[4] = OOF_X_knn_16\n",
846 | "stacked_data_set_test[4] = sub_knn_16 \n"
847 | ]
848 | },
849 | {
850 | "cell_type": "code",
851 | "execution_count": 17,
852 | "metadata": {},
853 | "outputs": [
854 | {
855 | "name": "stdout",
856 | "output_type": "stream",
857 | "text": [
858 | "Accuracy score baseline: 0.787797944456593\n"
859 | ]
860 | }
861 | ],
862 | "source": [
863 | "lr = LogisticRegression()\n",
864 | "lr.fit(stacked_data_set_train, y_train)\n",
865 | "y_pred = lr.predict(stacked_data_set_test)\n",
866 | "print('Accuracy score baseline:', accuracy_score(y_test, y_pred))"
867 | ]
868 | },
869 | {
870 | "cell_type": "code",
871 | "execution_count": 18,
872 | "metadata": {},
873 | "outputs": [
874 | {
875 | "name": "stdout",
876 | "output_type": "stream",
877 | "text": [
878 | "overall improvement with stacking is 0.16 %\n"
879 | ]
880 | }
881 | ],
882 | "source": [
883 | "original_score = 0.7865296304395364\n",
884 | "best_score = 0.787797944456593\n",
885 | "improvement = np.abs(np.round(100*(original_score - best_score)/original_score,2))\n",
886 | "print('overall improvement with stacking is {} %'.format(improvement))"
887 | ]
888 | },
889 | {
890 | "cell_type": "code",
891 | "execution_count": 19,
892 | "metadata": {},
893 | "outputs": [
894 | {
895 | "name": "stdout",
896 | "output_type": "stream",
897 | "text": [
898 | "additional value is 182.9136 samples\n"
899 | ]
900 | }
901 | ],
902 | "source": [
903 | "print('additional value is {} samples'.format(df.shape[0] * 0.16 / 100))"
904 | ]
905 | }
906 | ],
907 | "metadata": {
908 | "kernelspec": {
909 | "display_name": "Python 3",
910 | "language": "python",
911 | "name": "python3"
912 | },
913 | "language_info": {
914 | "codemirror_mode": {
915 | "name": "ipython",
916 | "version": 3
917 | },
918 | "file_extension": ".py",
919 | "mimetype": "text/x-python",
920 | "name": "python",
921 | "nbconvert_exporter": "python",
922 | "pygments_lexer": "ipython3",
923 | "version": "3.6.1"
924 | }
925 | },
926 | "nbformat": 4,
927 | "nbformat_minor": 2
928 | }
929 |
--------------------------------------------------------------------------------
/Section 4/4_1_Memory based collaborative filtering.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {
7 | "collapsed": true
8 | },
9 | "outputs": [],
10 | "source": [
11 | "import pandas as pd\n",
12 | "import numpy as np\n",
13 | "import scipy\n",
14 | "import json\n",
15 | "import os\n",
16 | "\n",
17 | "from tqdm import tqdm_notebook\n",
18 | "from sklearn.model_selection import train_test_split\n",
19 | "from sklearn.metrics.pairwise import pairwise_distances"
20 | ]
21 | },
22 | {
23 | "cell_type": "code",
24 | "execution_count": 2,
25 | "metadata": {
26 | "collapsed": true
27 | },
28 | "outputs": [],
29 | "source": [
30 | "def load_df(): #loading dataframe\n",
31 | " df = pd.DataFrame()\n",
32 | " with open('./mpd.slice.0-999.json') as data_file:\n",
33 | " data_string = data_file.read()\n",
34 | " try:\n",
35 | " data = json.loads(data_string)\n",
36 | " except ValueError:\n",
37 | " print('Failed:')\n",
38 | " print(repr(data_string))\n",
39 | " df = pd.concat([df, pd.DataFrame(data['playlists'])], ignore_index=True)\n",
40 | " return df"
41 | ]
42 | },
43 | {
44 | "cell_type": "code",
45 | "execution_count": 3,
46 | "metadata": {},
47 | "outputs": [
48 | {
49 | "data": {
50 | "application/vnd.jupyter.widget-view+json": {
51 | "model_id": "143c4d7efa9445b99273421fef710c52"
52 | }
53 | },
54 | "metadata": {},
55 | "output_type": "display_data"
56 | },
57 | {
58 | "name": "stdout",
59 | "output_type": "stream",
60 | "text": [
61 | "\n"
62 | ]
63 | }
64 | ],
65 | "source": [
66 | "df = load_df()\n",
67 | "df.drop(['description','name', 'pid', 'num_albums','num_artists', \n",
68 | " 'num_edits', 'num_followers', 'num_tracks', 'collaborative'], axis = 1, inplace = True) #dropping columns \n",
69 | " #that we are not going\n",
70 | " #to use\n",
71 | "\n",
72 | "artist_list = []\n",
73 | "vocab_artist = set()\n",
74 | "\n",
75 | "for row in tqdm_notebook(df.iterrows()): #iterating through df to get sequence of artists name \n",
76 | " #that are contained in playlist\n",
77 | " artists = [x['artist_name'] for x in row[1]['tracks']] #getting artists from playlist(json type)\n",
78 | " for x in row[1]['tracks']:\n",
79 | " vocab_artist.add(x['artist_name']) #creating set with unique artists name\n",
80 | " artist_list.append(artists) \n",
81 | "\n",
82 | "df['artist_list'] = artist_list \n",
83 | "\n",
84 | "w2x_artist = {artist:i for i, artist in enumerate(vocab_artist)} #artist name to index\n",
85 | "x2w_artist = {i:artist for i, artist in enumerate(vocab_artist)} #index to artist name\n",
86 | "\n",
87 | "df['artist_idx'] = df['artist_list'].apply(lambda x: [w2x_artist[a] for a in x]) #converting sequence of artist name \n",
88 | " #to sequence of artists idx\n",
89 | "\n",
90 | "\n",
91 | "df['train_seq_artist'] = df['artist_idx'].apply(lambda x: x[:-3]) #creating train sequence\n",
92 | "df['target_val_artist'] = df['artist_idx'].apply(lambda x: x[-3:]) #creating validation sequence"
93 | ]
94 | },
95 | {
96 | "cell_type": "code",
97 | "execution_count": 4,
98 | "metadata": {},
99 | "outputs": [
100 | {
101 | "data": {
102 | "application/vnd.jupyter.widget-view+json": {
103 | "model_id": "5312415c36e94a17aa363a0e0ace3b57"
104 | }
105 | },
106 | "metadata": {},
107 | "output_type": "display_data"
108 | },
109 | {
110 | "name": "stdout",
111 | "output_type": "stream",
112 | "text": [
113 | "\n"
114 | ]
115 | }
116 | ],
117 | "source": [
118 | "inds = df['train_seq_artist']\n",
119 | "playlist_artist = scipy.sparse.lil_matrix((df.shape[0], len(vocab_artist)), dtype=np.int8) #binary sparse artist-\n",
120 | " #-playlist matrix\n",
121 | "for i, row in tqdm_notebook(enumerate(inds)):\n",
122 | " playlist_artist[i, row] = 1 "
123 | ]
124 | },
125 | {
126 | "cell_type": "code",
127 | "execution_count": 5,
128 | "metadata": {},
129 | "outputs": [
130 | {
131 | "name": "stdout",
132 | "output_type": "stream",
133 | "text": [
134 | "Hit rate using most popular benchmark: 0.037\n",
135 | "Precision using most popular benchmark: 0.013\n"
136 | ]
137 | }
138 | ],
139 | "source": [
140 | "precision = []\n",
141 | "hr = []\n",
142 | "sum_artists = np.asarray(np.sum(playlist_artist, axis = 0)).reshape((9722, ))\n",
143 | "preds = np.argsort(sum_artists)[-3:]\n",
144 | "y_true = df['target_val_artist']\n",
145 | "for y in y_true:\n",
146 | " score = len(set(preds) & set(y))\n",
147 | " precision.append(score/3)\n",
148 | " hr.append(int(score > 0))\n",
149 | "print('Hit rate using most popular benchmark:', np.mean(hr))\n",
150 | "print('Precision using most popular benchmark:', np.mean(precision))"
151 | ]
152 | },
153 | {
154 | "cell_type": "code",
155 | "execution_count": 6,
156 | "metadata": {},
157 | "outputs": [
158 | {
159 | "data": {
160 | "application/vnd.jupyter.widget-view+json": {
161 | "model_id": "6f57dad00099406ab8ca4d1cc45ac0b9"
162 | }
163 | },
164 | "metadata": {},
165 | "output_type": "display_data"
166 | },
167 | {
168 | "name": "stdout",
169 | "output_type": "stream",
170 | "text": [
171 | "\n",
172 | "Hit rate using most popular benchmark: 0.013\n",
173 | "Precision using most popular benchmark: 0.004333333333333333\n"
174 | ]
175 | }
176 | ],
177 | "source": [
178 | "precision = []\n",
179 | "hr = []\n",
180 | "playlist_distances = pairwise_distances(playlist_artist, metric='cosine', n_jobs = 32)\n",
181 | "i = 0\n",
182 | "for row, y in tqdm_notebook(zip(playlist_artist, y_true)):\n",
183 | " distances = playlist_distances[i, :]\n",
184 | " most_similar = np.argsort(distances)[-3:].tolist()\n",
185 | " preds = []\n",
186 | "\n",
187 | " if i in most_similar:\n",
188 | " most_similar.remove(i)\n",
189 | " \n",
190 | " for user in most_similar:\n",
191 | " pred = df.loc[user, 'train_seq_artist']\n",
192 | " preds += pred\n",
193 | " \n",
194 | " preds = np.asarray(np.unique(preds))\n",
195 | " preds_ind = np.argsort(sum_artists[preds])[-3:]\n",
196 | " y_pred = preds[preds_ind]\n",
197 | " score = len(set(y_pred) & set(y))\n",
198 | " precision.append(score/3)\n",
199 | " hr.append(int(score > 0))\n",
200 | " i+=1\n",
201 | "print('Hit rate using most popular benchmark:', np.mean(hr))\n",
202 | "print('Precision using most popular benchmark:', np.mean(precision))"
203 | ]
204 | }
205 | ],
206 | "metadata": {
207 | "kernelspec": {
208 | "display_name": "Python 3",
209 | "language": "python",
210 | "name": "python3"
211 | },
212 | "language_info": {
213 | "codemirror_mode": {
214 | "name": "ipython",
215 | "version": 3
216 | },
217 | "file_extension": ".py",
218 | "mimetype": "text/x-python",
219 | "name": "python",
220 | "nbconvert_exporter": "python",
221 | "pygments_lexer": "ipython3",
222 | "version": "3.6.1"
223 | }
224 | },
225 | "nbformat": 4,
226 | "nbformat_minor": 2
227 | }
228 |
--------------------------------------------------------------------------------
/Section 4/4_2_Item to item recommendation with kNN.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {
7 | "collapsed": true
8 | },
9 | "outputs": [],
10 | "source": [
11 | "import pandas as pd\n",
12 | "import numpy as np\n",
13 | "import scipy\n",
14 | "import json\n",
15 | "import os\n",
16 | "\n",
17 | "from tqdm import tqdm_notebook\n",
18 | "from sklearn.model_selection import train_test_split"
19 | ]
20 | },
21 | {
22 | "cell_type": "code",
23 | "execution_count": 2,
24 | "metadata": {
25 | "collapsed": true
26 | },
27 | "outputs": [],
28 | "source": [
29 | "def load_df():\n",
30 | " df = pd.DataFrame()\n",
31 | " with open('./mpd.slice.0-999.json') as data_file:\n",
32 | " data_string = data_file.read()\n",
33 | " try:\n",
34 | " data = json.loads(data_string)\n",
35 | " except ValueError:\n",
36 | " print('Failed:')\n",
37 | " print(repr(data_string))\n",
38 | " df = pd.concat([df, pd.DataFrame(data['playlists'])], ignore_index=True)\n",
39 | " return df"
40 | ]
41 | },
42 | {
43 | "cell_type": "code",
44 | "execution_count": 3,
45 | "metadata": {},
46 | "outputs": [
47 | {
48 | "data": {
49 | "application/vnd.jupyter.widget-view+json": {
50 | "model_id": "6ed0a8815ac24965a293030c7042189d"
51 | }
52 | },
53 | "metadata": {},
54 | "output_type": "display_data"
55 | },
56 | {
57 | "name": "stdout",
58 | "output_type": "stream",
59 | "text": [
60 | "\n"
61 | ]
62 | }
63 | ],
64 | "source": [
65 | "df = load_df()\n",
66 | "df.drop(['description','name', 'pid', 'num_albums','num_artists', \n",
67 | " 'num_edits', 'num_followers', 'num_tracks', 'collaborative'], axis = 1, inplace = True) #dropping columns \n",
68 | " #that we are not going\n",
69 | " #to use\n",
70 | "\n",
71 | "artist_list = []\n",
72 | "vocab_artist = set()\n",
73 | "\n",
74 | "for row in tqdm_notebook(df.iterrows()): #iterating through df to get sequence of artists name \n",
75 | " #that are contained in playlist\n",
76 | " artists = [x['artist_name'] for x in row[1]['tracks']] #getting artists from playlist(json type)\n",
77 | " for x in row[1]['tracks']:\n",
78 | " vocab_artist.add(x['artist_name']) #creating set with unique artists name\n",
79 | " artist_list.append(artists) \n",
80 | "\n",
81 | "df['artist_list'] = artist_list \n",
82 | "\n",
83 | "w2x_artist = {artist:i for i, artist in enumerate(vocab_artist)} #artist name to index\n",
84 | "x2w_artist = {i:artist for i, artist in enumerate(vocab_artist)} #index to artist name\n",
85 | "\n",
86 | "df['artist_idx'] = df['artist_list'].apply(lambda x: [w2x_artist[a] for a in x]) #converting sequence of artist name \n",
87 | " #to sequence of artists idx\n",
88 | "\n",
89 | "\n",
90 | "df['train_seq_artist'] = df['artist_idx'].apply(lambda x: x[:-3]) #creating train sequence\n",
91 | "df['target_val_artist'] = df['artist_idx'].apply(lambda x: x[-3:]) #creating validation sequence"
92 | ]
93 | },
94 | {
95 | "cell_type": "code",
96 | "execution_count": 4,
97 | "metadata": {},
98 | "outputs": [
99 | {
100 | "data": {
101 | "application/vnd.jupyter.widget-view+json": {
102 | "model_id": "66b4d8229fe94f949d9ec1ad31ee1605"
103 | }
104 | },
105 | "metadata": {},
106 | "output_type": "display_data"
107 | },
108 | {
109 | "name": "stdout",
110 | "output_type": "stream",
111 | "text": [
112 | "\n"
113 | ]
114 | }
115 | ],
116 | "source": [
117 | "inds = df['train_seq_artist']\n",
118 | "playlist_artist = scipy.sparse.lil_matrix((df.shape[0], len(vocab_artist)), dtype=np.int8)\n",
119 | "for i, row in tqdm_notebook(enumerate(inds)):\n",
120 | " playlist_artist[i, row] = 1 "
121 | ]
122 | },
123 | {
124 | "cell_type": "code",
125 | "execution_count": 5,
126 | "metadata": {},
127 | "outputs": [
128 | {
129 | "name": "stdout",
130 | "output_type": "stream",
131 | "text": [
132 | "Hit rate using most popular benchmark: 0.037\n",
133 | "Precision using most popular benchmark: 0.013\n"
134 | ]
135 | }
136 | ],
137 | "source": [
138 | "precision = []\n",
139 | "hr = []\n",
140 | "sum_artists = np.asarray(np.sum(playlist_artist, axis = 0)).reshape((playlist_artist.shape[1], ))\n",
141 | "preds = np.argsort(sum_artists)[-3:]\n",
142 | "y_true = df['target_val_artist']\n",
143 | "for y in y_true:\n",
144 | " score = len(set(preds) & set(y))\n",
145 | " precision.append(score/3)\n",
146 | " hr.append(int(score > 0))\n",
147 | "print('Hit rate using most popular benchmark:', np.mean(hr))\n",
148 | "print('Precision using most popular benchmark:', np.mean(precision))"
149 | ]
150 | },
151 | {
152 | "cell_type": "code",
153 | "execution_count": 6,
154 | "metadata": {
155 | "collapsed": true
156 | },
157 | "outputs": [],
158 | "source": [
159 | "from sklearn.neighbors import NearestNeighbors"
160 | ]
161 | },
162 | {
163 | "cell_type": "code",
164 | "execution_count": 8,
165 | "metadata": {},
166 | "outputs": [
167 | {
168 | "data": {
169 | "application/vnd.jupyter.widget-view+json": {
170 | "model_id": "c7c51a403d4443ed87e1a00b381c5640"
171 | }
172 | },
173 | "metadata": {},
174 | "output_type": "display_data"
175 | },
176 | {
177 | "name": "stdout",
178 | "output_type": "stream",
179 | "text": [
180 | "\n",
181 | "Hit rate using item to item kNN: 0.077\n",
182 | "Precision using item to item kNN: 0.026666666666666665\n"
183 | ]
184 | }
185 | ],
186 | "source": [
187 | "precision = []\n",
188 | "hr = []\n",
189 | "nn = NearestNeighbors(n_jobs=32,n_neighbors=3)\n",
190 | "nn.fit(playlist_artist.T)\n",
191 | "distances = nn.kneighbors(playlist_artist.T)[1]\n",
192 | "for row, y in tqdm_notebook(zip(playlist_artist, y_true)):\n",
193 | " last_listened = np.nonzero(row)[1]\n",
194 | " preds = distances[last_listened[-1]]\n",
195 | " score = len(set(preds) & set(y))\n",
196 | " precision.append(score/3)\n",
197 | " hr.append(int(score > 0))\n",
198 | "print('Hit rate using item to item kNN:', np.mean(hr))\n",
199 | "print('Precision using item to item kNN:', np.mean(precision))"
200 | ]
201 | }
202 | ],
203 | "metadata": {
204 | "kernelspec": {
205 | "display_name": "Python 3",
206 | "language": "python",
207 | "name": "python3"
208 | },
209 | "language_info": {
210 | "codemirror_mode": {
211 | "name": "ipython",
212 | "version": 3
213 | },
214 | "file_extension": ".py",
215 | "mimetype": "text/x-python",
216 | "name": "python",
217 | "nbconvert_exporter": "python",
218 | "pygments_lexer": "ipython3",
219 | "version": "3.6.1"
220 | }
221 | },
222 | "nbformat": 4,
223 | "nbformat_minor": 2
224 | }
225 |
--------------------------------------------------------------------------------
/Section 4/4_3_Applying Matrix Factorization on dataset.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {
7 | "collapsed": true
8 | },
9 | "outputs": [],
10 | "source": [
11 | "import pandas as pd\n",
12 | "import numpy as np\n",
13 | "import scipy.sparse\n",
14 | "import json\n",
15 | "import os\n",
16 | "\n",
17 | "from tqdm import tqdm_notebook"
18 | ]
19 | },
20 | {
21 | "cell_type": "code",
22 | "execution_count": 2,
23 | "metadata": {
24 | "collapsed": true
25 | },
26 | "outputs": [],
27 | "source": [
28 | "def load_df():\n",
29 | " df = pd.DataFrame()\n",
30 | " with open('./mpd.slice.0-999.json') as data_file:\n",
31 | " data_string = data_file.read()\n",
32 | " try:\n",
33 | " data = json.loads(data_string)\n",
34 | " except ValueError:\n",
35 | " print('Failed:')\n",
36 | " print(repr(data_string))\n",
37 | " df = pd.concat([df, pd.DataFrame(data['playlists'])], ignore_index=True)\n",
38 | " return df"
39 | ]
40 | },
41 | {
42 | "cell_type": "code",
43 | "execution_count": 3,
44 | "metadata": {},
45 | "outputs": [
46 | {
47 | "data": {
48 | "application/vnd.jupyter.widget-view+json": {
49 | "model_id": "021cd30c4ab948929279de7ce40ccbed"
50 | }
51 | },
52 | "metadata": {},
53 | "output_type": "display_data"
54 | },
55 | {
56 | "name": "stdout",
57 | "output_type": "stream",
58 | "text": [
59 | "\n"
60 | ]
61 | }
62 | ],
63 | "source": [
64 | "df = load_df()\n",
65 | "df.drop(['description','name', 'pid', 'num_albums','num_artists', \n",
66 | " 'num_edits', 'num_followers', 'num_tracks', 'collaborative'], axis = 1, inplace = True) #dropping columns \n",
67 | " #that we are not going\n",
68 | " #to use\n",
69 | "\n",
70 | "artist_list = []\n",
71 | "vocab_artist = set()\n",
72 | "\n",
73 | "for row in tqdm_notebook(df.iterrows()): #iterating through df to get sequence of artists name \n",
74 | " #that are contained in playlist\n",
75 | " artists = [x['artist_name'] for x in row[1]['tracks']] #getting artists from playlist(json type)\n",
76 | " for x in row[1]['tracks']:\n",
77 | " vocab_artist.add(x['artist_name']) #creating set with unique artists name\n",
78 | " artist_list.append(artists) \n",
79 | "\n",
80 | "df['artist_list'] = artist_list \n",
81 | "\n",
82 | "w2x_artist = {artist:i for i, artist in enumerate(vocab_artist)} #artist name to index\n",
83 | "x2w_artist = {i:artist for i, artist in enumerate(vocab_artist)} #index to artist name\n",
84 | "\n",
85 | "df['artist_idx'] = df['artist_list'].apply(lambda x: [w2x_artist[a] for a in x]) #converting sequence of artist name \n",
86 | " #to sequence of artists idx\n",
87 | "\n",
88 | "\n",
89 | "df['train_seq_artist'] = df['artist_idx'].apply(lambda x: x[:-3]) #creating train sequence\n",
90 | "df['target_val_artist'] = df['artist_idx'].apply(lambda x: x[-3:]) #creating validation sequence"
91 | ]
92 | },
93 | {
94 | "cell_type": "code",
95 | "execution_count": 4,
96 | "metadata": {},
97 | "outputs": [
98 | {
99 | "data": {
100 | "application/vnd.jupyter.widget-view+json": {
101 | "model_id": "f2b51261211746bdb1a82ec4f4a5205a"
102 | }
103 | },
104 | "metadata": {},
105 | "output_type": "display_data"
106 | },
107 | {
108 | "name": "stdout",
109 | "output_type": "stream",
110 | "text": [
111 | "\n"
112 | ]
113 | }
114 | ],
115 | "source": [
116 | "inds = df['train_seq_artist']\n",
117 | "playlist_artist_train = scipy.sparse.lil_matrix((df.shape[0], len(vocab_artist)), dtype=np.int8) \n",
118 | "#creating binary playlist artist matrix for train\n",
119 | "for i, row in tqdm_notebook(enumerate(inds)):\n",
120 | " playlist_artist_train[i, row] = 1 "
121 | ]
122 | },
123 | {
124 | "cell_type": "code",
125 | "execution_count": 5,
126 | "metadata": {},
127 | "outputs": [
128 | {
129 | "data": {
130 | "application/vnd.jupyter.widget-view+json": {
131 | "model_id": "d69d1ef08794412d9c6785ab878d3429"
132 | }
133 | },
134 | "metadata": {},
135 | "output_type": "display_data"
136 | },
137 | {
138 | "name": "stdout",
139 | "output_type": "stream",
140 | "text": [
141 | "\n"
142 | ]
143 | }
144 | ],
145 | "source": [
146 | "inds = df['target_val_artist']\n",
147 | "playlist_artist_val = scipy.sparse.lil_matrix((df.shape[0], len(vocab_artist)), dtype=np.int8)\n",
148 | "#creating binary playlist artist matrix for validation\n",
149 | "for i, row in tqdm_notebook(enumerate(inds)):\n",
150 | " playlist_artist_val[i, row] = 1 "
151 | ]
152 | },
153 | {
154 | "cell_type": "code",
155 | "execution_count": 6,
156 | "metadata": {},
157 | "outputs": [
158 | {
159 | "name": "stdout",
160 | "output_type": "stream",
161 | "text": [
162 | "Hit rate using most popular benchmark: 0.037\n",
163 | "Precision using most popular benchmark: 0.013\n"
164 | ]
165 | }
166 | ],
167 | "source": [
168 | "precision = []\n",
169 | "hr = []\n",
170 | "sum_artists = np.asarray(np.sum(playlist_artist_train, axis = 0)).reshape((9722, ))\n",
171 | "preds = np.argsort(sum_artists)[-3:]\n",
172 | "y_true = df['target_val_artist']\n",
173 | "for y in y_true:\n",
174 | " score = len(set(preds) & set(y))\n",
175 | " precision.append(score/3)\n",
176 | " hr.append(int(score > 0))\n",
177 | "print('Hit rate using most popular benchmark:', np.mean(hr))\n",
178 | "print('Precision using most popular benchmark:', np.mean(precision))"
179 | ]
180 | },
181 | {
182 | "cell_type": "code",
183 | "execution_count": 7,
184 | "metadata": {
185 | "collapsed": true
186 | },
187 | "outputs": [],
188 | "source": [
189 | "def get_neg_candidates_train(i):\n",
190 | " #getting negative candidates for supervised learning algorithm for train\n",
191 | " np.random.seed(42)\n",
192 | " neg = np.where(playlist_artist_train.getrow(i).toarray()[0] == 0)[0]\n",
193 | " ind = np.random.randint(0, neg.shape[0], size = 3).tolist()\n",
194 | " return neg[ind].tolist()"
195 | ]
196 | },
197 | {
198 | "cell_type": "code",
199 | "execution_count": 8,
200 | "metadata": {},
201 | "outputs": [
202 | {
203 | "data": {
204 | "application/vnd.jupyter.widget-view+json": {
205 | "model_id": "75353c20f7f948ca9a4ddf7d95df6dfe"
206 | }
207 | },
208 | "metadata": {},
209 | "output_type": "display_data"
210 | },
211 | {
212 | "name": "stdout",
213 | "output_type": "stream",
214 | "text": [
215 | "\n"
216 | ]
217 | }
218 | ],
219 | "source": [
220 | "f = open('libfm-1.40.windows/train.txt', 'w')\n",
221 | "n_users = 1000\n",
222 | "for row in tqdm_notebook(enumerate(playlist_artist_train)): #converting train data for libfm format \n",
223 | " for j in np.nonzero(row[1].toarray())[1]: #writing down positive candidates for playlist №row\n",
224 | " f.write(str(1) + ' ')\n",
225 | " f.write(str(row[0]) + ':' + '1 ') \n",
226 | " f.write(str(n_users + j) + ':' + '1 ')\n",
227 | " f.write('\\n')\n",
228 | " neg_candidates = get_neg_candidates_train(row[0]) #writing down negative candidates for playlist №row\n",
229 | " for j in neg_candidates:\n",
230 | " f.write(str(0) + ' ')\n",
231 | " f.write(str(row[0]) + ':' + '1 ')\n",
232 | " f.write(str(j) + ':' + '1 ')\n",
233 | " f.write('\\n')\n",
234 | "f.close()"
235 | ]
236 | },
237 | {
238 | "cell_type": "code",
239 | "execution_count": 9,
240 | "metadata": {
241 | "collapsed": true
242 | },
243 | "outputs": [],
244 | "source": [
245 | "def get_neg_candidates_val(i):\n",
246 | " #getting negative candidates for supervised learning algorithm for evaluating MF algorithm\n",
247 | " np.random.seed(42)\n",
248 | " neg = np.where(playlist_artist_val.getrow(i).toarray()[0] == 0)[0]\n",
249 | " ind = np.random.randint(0, neg.shape[0], size = 3).tolist()\n",
250 | " return neg[ind].tolist()"
251 | ]
252 | },
253 | {
254 | "cell_type": "code",
255 | "execution_count": 10,
256 | "metadata": {},
257 | "outputs": [
258 | {
259 | "data": {
260 | "application/vnd.jupyter.widget-view+json": {
261 | "model_id": "b3e185ab54e54bb291e57cad8f925289"
262 | }
263 | },
264 | "metadata": {},
265 | "output_type": "display_data"
266 | },
267 | {
268 | "name": "stdout",
269 | "output_type": "stream",
270 | "text": [
271 | "\n"
272 | ]
273 | }
274 | ],
275 | "source": [
276 | "f = open('libfm-1.40.windows/val.txt', 'w')\n",
277 | "n_users = 1000\n",
278 | "answer_dict = {i:[] for i in range(playlist_artist_val.shape[0])}\n",
279 | "for row in tqdm_notebook(enumerate(playlist_artist_val)): #converting train data for libfm format \n",
280 | " positive_candidates = np.nonzero(row[1].toarray())[1] #writing down positive candidates for playlist №row\n",
281 | " for j in positive_candidates:\n",
282 | " f.write(str(1) + ' ')\n",
283 | " f.write(str(row[0]) + ':' + '1 ')\n",
284 | " f.write(str(n_users + j) + ':' + '1 ')\n",
285 | " f.write('\\n')\n",
286 | " neg_candidates = get_neg_candidates_val(row[0])\n",
287 | " answer_dict[row[0]] += positive_candidates.tolist() + neg_candidates #using dict playlist : pos + neg candidates\n",
288 | " for j in neg_candidates: #writing down negative candidates for playlist №row\n",
289 | " f.write(str(0) + ' ')\n",
290 | " f.write(str(row[0]) + ':' + '1 ')\n",
291 | " f.write(str(n_users + j) + ':' + '1 ')\n",
292 | " f.write('\\n')\n",
293 | "f.close()"
294 | ]
295 | },
296 | {
297 | "cell_type": "code",
298 | "execution_count": 11,
299 | "metadata": {
300 | "collapsed": true
301 | },
302 | "outputs": [],
303 | "source": [
304 | "import subprocess"
305 | ]
306 | },
307 | {
308 | "cell_type": "code",
309 | "execution_count": 12,
310 | "metadata": {},
311 | "outputs": [
312 | {
313 | "name": "stdout",
314 | "output_type": "stream",
315 | "text": [
316 | "Finished training\n"
317 | ]
318 | }
319 | ],
320 | "source": [
321 | "cmd = ' '.join(['libfm-1.40.windows/libFM', '-task', 'r', '-train', 'libfm-1.40.windows/train.txt', \n",
322 | " '-test', '../libfm-1.42.src/bin/val.txt', '-iter', '20', '-method', 'sgd',\n",
323 | " '-regular', '’3,3,15’', '-dim', '’1,1,4’', '-init_stdev',\n",
324 | " '0.1', '-out', 'output.txt', '-learn_rate', '0.001']) #hyperparameters (for mor info see \n",
325 | " #manual attached to the course)\n",
326 | "proc = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT) #starting subprocess \n",
327 | " #in console\n",
328 | "for line in iter(proc.stdout.readline, ''): #evaluating libfm\n",
329 | " if line == b'':\n",
330 | " print('Finished training')\n",
331 | " break"
332 | ]
333 | },
334 | {
335 | "cell_type": "code",
336 | "execution_count": 13,
337 | "metadata": {
338 | "collapsed": true
339 | },
340 | "outputs": [],
341 | "source": [
342 | "with open('./output.txt', 'r') as f:\n",
343 | " val_answers = [float(x.strip()) for x in f.readlines()] #opening file with answers"
344 | ]
345 | },
346 | {
347 | "cell_type": "code",
348 | "execution_count": 14,
349 | "metadata": {},
350 | "outputs": [
351 | {
352 | "data": {
353 | "application/vnd.jupyter.widget-view+json": {
354 | "model_id": "d30cbbacfbe24855a0291de0b6a3b0ca"
355 | }
356 | },
357 | "metadata": {},
358 | "output_type": "display_data"
359 | },
360 | {
361 | "name": "stdout",
362 | "output_type": "stream",
363 | "text": [
364 | "\n",
365 | "MF HR@3 score: 0.939\n",
366 | "MF precision@3 score: 0.569\n"
367 | ]
368 | }
369 | ],
370 | "source": [
371 | "num_read = 0\n",
372 | "precision = []\n",
373 | "hr = []\n",
374 | "for i in tqdm_notebook(range(playlist_artist_val.shape[0])): #calculating metric\n",
375 | " all_answers = np.asarray(answer_dict[i])\n",
376 | " y_true = all_answers[:-3] #true answers (first 3 elements in the array)\n",
377 | " mf_answers = val_answers[num_read:num_read + len(all_answers)] #answers from algorithm\n",
378 | " num_read += len(all_answers) #num of rows that were read from the val_answer\n",
379 | " y_pred_ind = np.argsort(mf_answers)[-3:] #top3 by probability\n",
380 | " y_pred = all_answers[y_pred_ind] #getting idx of these artists\n",
381 | " score = len(set(y_pred) & set(y_true)) #num of guessed artists\n",
382 | " precision.append(score/3)\n",
383 | " hr.append(int(score > 0))\n",
384 | "print('MF HR@3 score:', np.mean(hr))\n",
385 | "print('MF precision@3 score:', np.mean(precision))"
386 | ]
387 | },
388 | {
389 | "cell_type": "code",
390 | "execution_count": 15,
391 | "metadata": {},
392 | "outputs": [
393 | {
394 | "data": {
395 | "text/plain": [
396 | "[1325, 3410, 9260, 7272, 860, 5392]"
397 | ]
398 | },
399 | "execution_count": 15,
400 | "metadata": {},
401 | "output_type": "execute_result"
402 | }
403 | ],
404 | "source": [
405 | "answer_dict[0]"
406 | ]
407 | },
408 | {
409 | "cell_type": "code",
410 | "execution_count": null,
411 | "metadata": {
412 | "collapsed": true
413 | },
414 | "outputs": [],
415 | "source": []
416 | }
417 | ],
418 | "metadata": {
419 | "kernelspec": {
420 | "display_name": "Python 3",
421 | "language": "python",
422 | "name": "python3"
423 | },
424 | "language_info": {
425 | "codemirror_mode": {
426 | "name": "ipython",
427 | "version": 3
428 | },
429 | "file_extension": ".py",
430 | "mimetype": "text/x-python",
431 | "name": "python",
432 | "nbconvert_exporter": "python",
433 | "pygments_lexer": "ipython3",
434 | "version": "3.6.1"
435 | }
436 | },
437 | "nbformat": 4,
438 | "nbformat_minor": 2
439 | }
440 |
--------------------------------------------------------------------------------
/Section 4/4_4_Wordbach at use.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {
7 | "collapsed": true
8 | },
9 | "outputs": [],
10 | "source": [
11 | "from sklearn.feature_extraction.text import TfidfVectorizer\n",
12 | "from sklearn.model_selection import train_test_split\n",
13 | "from scipy.sparse import csr_matrix, hstack\n",
14 | "from wordbatch.models import FTRL, FM_FTRL\n",
15 | "from nltk.corpus import stopwords\n",
16 | "\n",
17 | "import re\n",
18 | "import wordbatch\n",
19 | "import pandas as pd\n",
20 | "import numpy as np\n"
21 | ]
22 | },
23 | {
24 | "cell_type": "code",
25 | "execution_count": 2,
26 | "metadata": {
27 | "collapsed": true
28 | },
29 | "outputs": [],
30 | "source": [
31 | "def rmsle(y, y0): #defining metric\n",
32 | " assert len(y) == len(y0)\n",
33 | " return np.sqrt(np.mean(np.power(np.log1p(y) - np.log1p(y0), 2))) "
34 | ]
35 | },
36 | {
37 | "cell_type": "code",
38 | "execution_count": 3,
39 | "metadata": {
40 | "collapsed": true
41 | },
42 | "outputs": [],
43 | "source": [
44 | "stopwords = {x: 1 for x in stopwords.words('english')}\n",
45 | "non_alphanums = re.compile(u'[^A-Za-z0-9]+') #using only numbers + english alphabet\n",
46 | "\n",
47 | "\n",
48 | "def normalize_text(text):\n",
49 | " return u\" \".join(\n",
50 | " [x for x in [y for y in non_alphanums.sub(' ', text).lower().strip().split(\" \")] \\\n",
51 | " if len(x) > 1 and x not in stopwords]) #removing stop words and using only numbers + english alphabet"
52 | ]
53 | },
54 | {
55 | "cell_type": "code",
56 | "execution_count": 4,
57 | "metadata": {
58 | "collapsed": true
59 | },
60 | "outputs": [],
61 | "source": [
62 | "def handle_missing_inplace(df): #filling all nans\n",
63 | " df['category_name'].fillna(value='missing/missing/missing', inplace=True)\n",
64 | " df['brand_name'].fillna(value='missing', inplace=True)\n",
65 | " df['item_description'].fillna(value='missing', inplace=True)\n",
66 | " return df"
67 | ]
68 | },
69 | {
70 | "cell_type": "code",
71 | "execution_count": 5,
72 | "metadata": {},
73 | "outputs": [
74 | {
75 | "data": {
76 | "text/html": [
77 | "\n",
78 | "\n",
91 | "
\n",
92 | " \n",
93 | " \n",
94 | " | \n",
95 | " train_id | \n",
96 | " name | \n",
97 | " item_condition_id | \n",
98 | " category_name | \n",
99 | " brand_name | \n",
100 | " price | \n",
101 | " shipping | \n",
102 | " item_description | \n",
103 | "
\n",
104 | " \n",
105 | " \n",
106 | " \n",
107 | " 0 | \n",
108 | " 0 | \n",
109 | " MLB Cincinnati Reds T Shirt Size XL | \n",
110 | " 3 | \n",
111 | " Men/Tops/T-shirts | \n",
112 | " NaN | \n",
113 | " 10.0 | \n",
114 | " 1 | \n",
115 | " No description yet | \n",
116 | "
\n",
117 | " \n",
118 | " 1 | \n",
119 | " 1 | \n",
120 | " Razer BlackWidow Chroma Keyboard | \n",
121 | " 3 | \n",
122 | " Electronics/Computers & Tablets/Components & P... | \n",
123 | " Razer | \n",
124 | " 52.0 | \n",
125 | " 0 | \n",
126 | " This keyboard is in great condition and works ... | \n",
127 | "
\n",
128 | " \n",
129 | " 2 | \n",
130 | " 2 | \n",
131 | " AVA-VIV Blouse | \n",
132 | " 1 | \n",
133 | " Women/Tops & Blouses/Blouse | \n",
134 | " Target | \n",
135 | " 10.0 | \n",
136 | " 1 | \n",
137 | " Adorable top with a hint of lace and a key hol... | \n",
138 | "
\n",
139 | " \n",
140 | " 3 | \n",
141 | " 3 | \n",
142 | " Leather Horse Statues | \n",
143 | " 1 | \n",
144 | " Home/Home Décor/Home Décor Accents | \n",
145 | " NaN | \n",
146 | " 35.0 | \n",
147 | " 1 | \n",
148 | " New with tags. Leather horses. Retail for [rm]... | \n",
149 | "
\n",
150 | " \n",
151 | " 4 | \n",
152 | " 4 | \n",
153 | " 24K GOLD plated rose | \n",
154 | " 1 | \n",
155 | " Women/Jewelry/Necklaces | \n",
156 | " NaN | \n",
157 | " 44.0 | \n",
158 | " 0 | \n",
159 | " Complete with certificate of authenticity | \n",
160 | "
\n",
161 | " \n",
162 | "
\n",
163 | "
"
164 | ],
165 | "text/plain": [
166 | " train_id name item_condition_id \\\n",
167 | "0 0 MLB Cincinnati Reds T Shirt Size XL 3 \n",
168 | "1 1 Razer BlackWidow Chroma Keyboard 3 \n",
169 | "2 2 AVA-VIV Blouse 1 \n",
170 | "3 3 Leather Horse Statues 1 \n",
171 | "4 4 24K GOLD plated rose 1 \n",
172 | "\n",
173 | " category_name brand_name price \\\n",
174 | "0 Men/Tops/T-shirts NaN 10.0 \n",
175 | "1 Electronics/Computers & Tablets/Components & P... Razer 52.0 \n",
176 | "2 Women/Tops & Blouses/Blouse Target 10.0 \n",
177 | "3 Home/Home Décor/Home Décor Accents NaN 35.0 \n",
178 | "4 Women/Jewelry/Necklaces NaN 44.0 \n",
179 | "\n",
180 | " shipping item_description \n",
181 | "0 1 No description yet \n",
182 | "1 0 This keyboard is in great condition and works ... \n",
183 | "2 1 Adorable top with a hint of lace and a key hol... \n",
184 | "3 1 New with tags. Leather horses. Retail for [rm]... \n",
185 | "4 0 Complete with certificate of authenticity "
186 | ]
187 | },
188 | "execution_count": 5,
189 | "metadata": {},
190 | "output_type": "execute_result"
191 | }
192 | ],
193 | "source": [
194 | "train = pd.read_csv('./train.tsv', sep = '\\t') #loading train\n",
195 | "train.head()"
196 | ]
197 | },
198 | {
199 | "cell_type": "code",
200 | "execution_count": 13,
201 | "metadata": {
202 | "collapsed": true
203 | },
204 | "outputs": [],
205 | "source": [
206 | "sample = train.sample(frac = 0.05, random_state = 42)#using 5% sample\n",
207 | "sample = handle_missing_inplace(sample) #filling all nans\n",
208 | "y = sample.pop('price')\n",
209 | "\n",
210 | "#splitting categories into 3 sub categories\n",
211 | "sample['cat1'] = sample['category_name'].apply(lambda x: x.split('/')[0])\n",
212 | "sample['cat2'] = sample['category_name'].apply(lambda x: x.split('/')[1])\n",
213 | "sample['cat3'] = sample['category_name'].apply(lambda x: x.split('/')[2])"
214 | ]
215 | },
216 | {
217 | "cell_type": "code",
218 | "execution_count": 16,
219 | "metadata": {},
220 | "outputs": [
221 | {
222 | "data": {
223 | "text/html": [
224 | "\n",
225 | "\n",
238 | "
\n",
239 | " \n",
240 | " \n",
241 | " | \n",
242 | " train_id | \n",
243 | " name | \n",
244 | " item_condition_id | \n",
245 | " category_name | \n",
246 | " brand_name | \n",
247 | " shipping | \n",
248 | " item_description | \n",
249 | " cat1 | \n",
250 | " cat2 | \n",
251 | " cat3 | \n",
252 | "
\n",
253 | " \n",
254 | " \n",
255 | " \n",
256 | " 525834 | \n",
257 | " 525834 | \n",
258 | " Under armor sweatpants | \n",
259 | " 4 | \n",
260 | " Women/Athletic Apparel/Pants, Tights, Leggings | \n",
261 | " Under Armour | \n",
262 | " 1 | \n",
263 | " Used condition size small black in color two p... | \n",
264 | " Women | \n",
265 | " Athletic Apparel | \n",
266 | " Pants, Tights, Leggings | \n",
267 | "
\n",
268 | " \n",
269 | " 149839 | \n",
270 | " 149839 | \n",
271 | " Men's watch | \n",
272 | " 3 | \n",
273 | " Men/Men's Accessories/Watches | \n",
274 | " Tommy Bahama | \n",
275 | " 0 | \n",
276 | " Tommy Bahama watch in good condition with new ... | \n",
277 | " Men | \n",
278 | " Men's Accessories | \n",
279 | " Watches | \n",
280 | "
\n",
281 | " \n",
282 | " 536234 | \n",
283 | " 536234 | \n",
284 | " Eileen Fisher gray Cardigan | \n",
285 | " 3 | \n",
286 | " Women/Sweaters/Cardigan | \n",
287 | " Eileen Fisher | \n",
288 | " 0 | \n",
289 | " Large but fits medium or small | \n",
290 | " Women | \n",
291 | " Sweaters | \n",
292 | " Cardigan | \n",
293 | "
\n",
294 | " \n",
295 | " 427908 | \n",
296 | " 427908 | \n",
297 | " Blue Patagonia | \n",
298 | " 2 | \n",
299 | " Men/Sweats & Hoodies/Sweatshirt, Pullover | \n",
300 | " Patagonia, Inc. | \n",
301 | " 0 | \n",
302 | " No description yet | \n",
303 | " Men | \n",
304 | " Sweats & Hoodies | \n",
305 | " Sweatshirt, Pullover | \n",
306 | "
\n",
307 | " \n",
308 | " 193641 | \n",
309 | " 193641 | \n",
310 | " ✨4 YMED NIKE PRO for Lindsay✨ | \n",
311 | " 1 | \n",
312 | " Kids/Girls (4+)/Other | \n",
313 | " Nike | \n",
314 | " 0 | \n",
315 | " 4 YMED NIKE PRO compression shorts All NWT | \n",
316 | " Kids | \n",
317 | " Girls (4+) | \n",
318 | " Other | \n",
319 | "
\n",
320 | " \n",
321 | "
\n",
322 | "
"
323 | ],
324 | "text/plain": [
325 | " train_id name item_condition_id \\\n",
326 | "525834 525834 Under armor sweatpants 4 \n",
327 | "149839 149839 Men's watch 3 \n",
328 | "536234 536234 Eileen Fisher gray Cardigan 3 \n",
329 | "427908 427908 Blue Patagonia 2 \n",
330 | "193641 193641 ✨4 YMED NIKE PRO for Lindsay✨ 1 \n",
331 | "\n",
332 | " category_name brand_name \\\n",
333 | "525834 Women/Athletic Apparel/Pants, Tights, Leggings Under Armour \n",
334 | "149839 Men/Men's Accessories/Watches Tommy Bahama \n",
335 | "536234 Women/Sweaters/Cardigan Eileen Fisher \n",
336 | "427908 Men/Sweats & Hoodies/Sweatshirt, Pullover Patagonia, Inc. \n",
337 | "193641 Kids/Girls (4+)/Other Nike \n",
338 | "\n",
339 | " shipping item_description cat1 \\\n",
340 | "525834 1 Used condition size small black in color two p... Women \n",
341 | "149839 0 Tommy Bahama watch in good condition with new ... Men \n",
342 | "536234 0 Large but fits medium or small Women \n",
343 | "427908 0 No description yet Men \n",
344 | "193641 0 4 YMED NIKE PRO compression shorts All NWT Kids \n",
345 | "\n",
346 | " cat2 cat3 \n",
347 | "525834 Athletic Apparel Pants, Tights, Leggings \n",
348 | "149839 Men's Accessories Watches \n",
349 | "536234 Sweaters Cardigan \n",
350 | "427908 Sweats & Hoodies Sweatshirt, Pullover \n",
351 | "193641 Girls (4+) Other "
352 | ]
353 | },
354 | "execution_count": 16,
355 | "metadata": {},
356 | "output_type": "execute_result"
357 | }
358 | ],
359 | "source": [
360 | "sample.head()"
361 | ]
362 | },
363 | {
364 | "cell_type": "code",
365 | "execution_count": 8,
366 | "metadata": {
367 | "collapsed": true
368 | },
369 | "outputs": [],
370 | "source": [
371 | "tf = TfidfVectorizer(max_features=10000,\n",
372 | " max_df = 0.95, min_df = 100) #using tf-idf preprocessing to convert text in numerical matrix"
373 | ]
374 | },
375 | {
376 | "cell_type": "code",
377 | "execution_count": 9,
378 | "metadata": {},
379 | "outputs": [
380 | {
381 | "name": "stdout",
382 | "output_type": "stream",
383 | "text": [
384 | "Working with name\n"
385 | ]
386 | },
387 | {
388 | "name": "stderr",
389 | "output_type": "stream",
390 | "text": [
391 | "C:\\Users\\Piboditheowl\\Anaconda3\\lib\\site-packages\\sklearn\\feature_extraction\\text.py:1059: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.\n",
392 | " if hasattr(X, 'dtype') and np.issubdtype(X.dtype, np.float):\n"
393 | ]
394 | },
395 | {
396 | "name": "stdout",
397 | "output_type": "stream",
398 | "text": [
399 | "-------\n",
400 | "Working with item_description\n",
401 | "-------\n",
402 | "Working with cat1\n",
403 | "-------\n",
404 | "Working with cat2\n",
405 | "-------\n",
406 | "Working with cat3\n",
407 | "-------\n"
408 | ]
409 | }
410 | ],
411 | "source": [
412 | "#Evaluating tf-idf (transformig text into matrix)\n",
413 | "\n",
414 | "print('Working with name')\n",
415 | "x_name = tf.fit_transform(sample['name'].values)\n",
416 | "print(7*'-')\n",
417 | "print('Working with item_description')\n",
418 | "x_description = tf.fit_transform(sample['item_description'].values)\n",
419 | "print(7*'-')\n",
420 | "print('Working with cat1')\n",
421 | "x_cat1 = tf.fit_transform(sample['cat1'].values)\n",
422 | "print(7*'-')\n",
423 | "print('Working with cat2')\n",
424 | "x_cat2 = tf.fit_transform(sample['cat2'].values)\n",
425 | "print(7*'-')\n",
426 | "print('Working with cat3')\n",
427 | "x_cat3 = tf.fit_transform(sample['cat3'].values)\n",
428 | "print(7*'-')"
429 | ]
430 | },
431 | {
432 | "cell_type": "code",
433 | "execution_count": 10,
434 | "metadata": {
435 | "collapsed": true
436 | },
437 | "outputs": [],
438 | "source": [
439 | "sample_preprocessed = hstack((x_name, x_description, x_cat1, x_cat2, x_cat3)).tocsr() #concatenating together and \n",
440 | " #using scipy sparse for low-memory\n",
441 | " #allocation of matrix \n",
442 | "mask = np.array(np.clip(sample_preprocessed.getnnz(axis=0) - 1, 0, 1), dtype=bool)\n",
443 | "sample_preprocessed = sample_preprocessed[:, mask]\n",
444 | "\n",
445 | "x_train, x_val, y_train, y_val = train_test_split(sample_preprocessed, y, test_size = 0.15) #splitting into test and train"
446 | ]
447 | },
448 | {
449 | "cell_type": "code",
450 | "execution_count": 11,
451 | "metadata": {
452 | "collapsed": true
453 | },
454 | "outputs": [],
455 | "source": [
456 | "model = FM_FTRL(alpha=0.01, beta=0.01, L1=0.00001, L2=0.1, D = sample_preprocessed.shape[1], alpha_fm=0.01, L2_fm=0.0, init_fm=0.01,\n",
457 | " D_fm=200, e_noise=0.0001, iters=15, inv_link=\"identity\", threads=16) #defining model"
458 | ]
459 | },
460 | {
461 | "cell_type": "code",
462 | "execution_count": 12,
463 | "metadata": {},
464 | "outputs": [
465 | {
466 | "name": "stdout",
467 | "output_type": "stream",
468 | "text": [
469 | "RMSLE score using FM_FTRL: 0.7428922496558461\n"
470 | ]
471 | }
472 | ],
473 | "source": [
474 | "model.fit(x_train, y_train) #training algorithm \n",
475 | "y_pred = model.predict(x_val)#evaluating algorithm \n",
476 | "print('RMSLE score using FM_FTRL:', rmsle(y_val, y_pred))"
477 | ]
478 | }
479 | ],
480 | "metadata": {
481 | "kernelspec": {
482 | "display_name": "Python 3",
483 | "language": "python",
484 | "name": "python3"
485 | },
486 | "language_info": {
487 | "codemirror_mode": {
488 | "name": "ipython",
489 | "version": 3
490 | },
491 | "file_extension": ".py",
492 | "mimetype": "text/x-python",
493 | "name": "python",
494 | "nbconvert_exporter": "python",
495 | "pygments_lexer": "ipython3",
496 | "version": "3.6.1"
497 | }
498 | },
499 | "nbformat": 4,
500 | "nbformat_minor": 2
501 | }
502 |
--------------------------------------------------------------------------------
/Section 5/5_1_Validation dataset tuning.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {
7 | "collapsed": true
8 | },
9 | "outputs": [],
10 | "source": [
11 | "import pandas as pd\n",
12 | "import numpy as np\n",
13 | "from itertools import combinations\n",
14 | "from catboost import CatBoostClassifier\n",
15 | "from sklearn.model_selection import train_test_split, KFold\n",
16 | "from sklearn.metrics import roc_auc_score\n",
17 | "import warnings\n",
18 | "warnings.filterwarnings(\"ignore\")\n",
19 | "np.random.seed(42)"
20 | ]
21 | },
22 | {
23 | "cell_type": "code",
24 | "execution_count": 2,
25 | "metadata": {
26 | "collapsed": true
27 | },
28 | "outputs": [],
29 | "source": [
30 | "df = pd.read_csv('train.csv')\n",
31 | "y = df.target\n",
32 | "\n",
33 | "df.drop(['ID', 'target'], axis=1, inplace=True)\n",
34 | "df.fillna(-9999, inplace=True)\n",
35 | "cat_features_ids = np.where(df.apply(pd.Series.nunique) < 30000)[0].tolist()"
36 | ]
37 | },
38 | {
39 | "cell_type": "code",
40 | "execution_count": 3,
41 | "metadata": {
42 | "collapsed": true
43 | },
44 | "outputs": [],
45 | "source": [
46 | "train, test, y_train, y_test = train_test_split(df, y, test_size = 0.1,random_state = 42)\n",
47 | "train, val, y_train, y_val = train_test_split(train, y_train, test_size = 0.25)"
48 | ]
49 | },
50 | {
51 | "cell_type": "code",
52 | "execution_count": 4,
53 | "metadata": {
54 | "scrolled": false
55 | },
56 | "outputs": [
57 | {
58 | "name": "stdout",
59 | "output_type": "stream",
60 | "text": [
61 | "Roc-auc score with Catboost: 0.7841281938499387\n"
62 | ]
63 | }
64 | ],
65 | "source": [
66 | "clf = CatBoostClassifier(learning_rate=0.1, iterations=100, random_seed=42, eval_metric='AUC', logging_level='Silent')\n",
67 | "clf.fit(train, y_train, cat_features=cat_features_ids, eval_set=(val, y_val))\n",
68 | "prediction = clf.predict_proba(test)\n",
69 | "print('Roc-auc score with Catboost:',roc_auc_score(y_test, prediction[:, 1]))"
70 | ]
71 | },
72 | {
73 | "cell_type": "code",
74 | "execution_count": 5,
75 | "metadata": {},
76 | "outputs": [
77 | {
78 | "name": "stdout",
79 | "output_type": "stream",
80 | "text": [
81 | "Roc-auc score with Catboost: 0.7930162585925847\n"
82 | ]
83 | }
84 | ],
85 | "source": [
86 | "kfold = KFold(n_splits=10)\n",
87 | "pred = []\n",
88 | "train, test, y_train, y_test = train_test_split(df, y, test_size = 0.1,random_state = 42)\n",
89 | "for train_ind, test_ind in kfold.split(train):\n",
90 | " train_val, test_val, y_train_val, y_test_val = train.iloc[train_ind, :], train.iloc[test_ind, :],\\\n",
91 | " y_train.iloc[train_ind], y_train.iloc[test_ind]\n",
92 | " clf.fit(train_val, y_train_val, cat_features=cat_features_ids, eval_set=(test_val, y_test_val))\n",
93 | " prediction = clf.predict_proba(test)\n",
94 | " pred.append(\n",
95 | " prediction[:, 1]\n",
96 | " )\n",
97 | " \n",
98 | "\n",
99 | "print('Roc-auc score with Catboost:',roc_auc_score(y_test, np.mean(pred, axis = 0)))"
100 | ]
101 | },
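102 | {
103 | "cell_type": "code",
104 | "execution_count": null,
105 | "metadata": {
106 | "collapsed": true
107 | },
108 | "outputs": [],
109 | "source": [
110 | "# Variant (minimal sketch): StratifiedKFold keeps the class ratio identical\n",
111 | "# in every fold, which usually gives a more stable estimate on imbalanced\n",
112 | "# targets. Assumes df, y, clf and cat_features_ids from the cells above.\n",
113 | "from sklearn.model_selection import StratifiedKFold\n",
114 | "\n",
115 | "skfold = StratifiedKFold(n_splits=10)\n",
116 | "pred = []\n",
117 | "train, test, y_train, y_test = train_test_split(df, y, test_size=0.1, random_state=42)\n",
118 | "for train_ind, val_ind in skfold.split(train, y_train):\n",
119 | "    clf.fit(train.iloc[train_ind], y_train.iloc[train_ind], cat_features=cat_features_ids,\n",
120 | "            eval_set=(train.iloc[val_ind], y_train.iloc[val_ind]))\n",
121 | "    pred.append(clf.predict_proba(test)[:, 1])\n",
122 | "print('Roc-auc score with stratified folds:', roc_auc_score(y_test, np.mean(pred, axis=0)))"
123 | ]
124 | }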
111 | ],
112 | "metadata": {
113 | "kernelspec": {
114 | "display_name": "Python 3",
115 | "language": "python",
116 | "name": "python3"
117 | },
118 | "language_info": {
119 | "codemirror_mode": {
120 | "name": "ipython",
121 | "version": 3
122 | },
123 | "file_extension": ".py",
124 | "mimetype": "text/x-python",
125 | "name": "python",
126 | "nbconvert_exporter": "python",
127 | "pygments_lexer": "ipython3",
128 | "version": "3.6.1"
129 | }
130 | },
131 | "nbformat": 4,
132 | "nbformat_minor": 2
133 | }
134 |
--------------------------------------------------------------------------------
/Section 5/5_2_Regularizing model to avoid overfitting.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {
7 | "collapsed": true
8 | },
9 | "outputs": [],
10 | "source": [
11 | "import pandas as pd\n",
12 | "import numpy as np\n",
13 | "from itertools import combinations\n",
14 | "from catboost import CatBoostClassifier\n",
15 | "from sklearn.model_selection import train_test_split, KFold\n",
16 | "from sklearn.metrics import roc_auc_score\n",
17 | "import warnings\n",
18 | "warnings.filterwarnings(\"ignore\")\n",
19 | "np.random.seed(42)"
20 | ]
21 | },
22 | {
23 | "cell_type": "code",
24 | "execution_count": 2,
25 | "metadata": {
26 | "collapsed": true
27 | },
28 | "outputs": [],
29 | "source": [
30 | "df = pd.read_csv('train.csv')\n",
31 | "y = df.target\n",
32 | "\n",
33 | "df.drop(['ID', 'target'], axis=1, inplace=True)\n",
34 | "df.fillna(-9999, inplace=True)\n",
35 | "cat_features_ids = np.where(df.apply(pd.Series.nunique) < 30000)[0].tolist()"
36 | ]
37 | },
38 | {
39 | "cell_type": "code",
40 | "execution_count": 3,
41 | "metadata": {
42 | "collapsed": true
43 | },
44 | "outputs": [],
45 | "source": [
46 | "train, test, y_train, y_test = train_test_split(df, y, test_size = 0.1)"
47 | ]
48 | },
49 | {
50 | "cell_type": "code",
51 | "execution_count": 4,
52 | "metadata": {},
53 | "outputs": [
54 | {
55 | "name": "stdout",
56 | "output_type": "stream",
57 | "text": [
58 | "Roc-auc score with Catboost without regularization: 0.7939610054617733\n",
59 | "Roc-auc score with Catboost with regularization: 0.7961023589633582\n"
60 | ]
61 | }
62 | ],
63 | "source": [
64 | "clf = CatBoostClassifier(learning_rate=0.1, iterations=100, random_seed=42, eval_metric='AUC', logging_level='Silent')\n",
65 | "clf.fit(train, y_train, cat_features=cat_features_ids)\n",
66 | "prediction = clf.predict_proba(test)\n",
67 | "print('Roc-auc score with Catboost without regularization:',roc_auc_score(y_test, prediction[:, 1]))\n",
68 | "\n",
69 | "clf = CatBoostClassifier(learning_rate=0.1, iterations=100, random_seed=42, \n",
70 | " eval_metric='AUC', logging_level='Silent', l2_leaf_reg=3, \n",
71 | " model_size_reg = 3)\n",
72 | "clf.fit(train, y_train, cat_features=cat_features_ids)\n",
73 | "prediction = clf.predict_proba(test)\n",
74 | "print('Roc-auc score with Catboost with regularization:',roc_auc_score(y_test, prediction[:, 1]))"
75 | ]
76 | },
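77 | {
78 | "cell_type": "code",
79 | "execution_count": null,
80 | "metadata": {
81 | "collapsed": true
82 | },
83 | "outputs": [],
84 | "source": [
85 | "# Tuning sketch: sweep a few l2_leaf_reg values to see where the AUC peaks.\n",
86 | "# A minimal grid, assuming train/test/y_train/y_test from the split above;\n",
87 | "# the candidate values are illustrative, not tuned.\n",
88 | "for reg in [1, 3, 10, 30]:\n",
89 | "    clf = CatBoostClassifier(learning_rate=0.1, iterations=100, random_seed=42,\n",
90 | "                             eval_metric='AUC', logging_level='Silent', l2_leaf_reg=reg)\n",
91 | "    clf.fit(train, y_train, cat_features=cat_features_ids)\n",
92 | "    print('l2_leaf_reg =', reg, '-> roc-auc:', roc_auc_score(y_test, clf.predict_proba(test)[:, 1]))"
93 | ]
94 | }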
77 | ],
78 | "metadata": {
79 | "kernelspec": {
80 | "display_name": "Python 3",
81 | "language": "python",
82 | "name": "python3"
83 | },
84 | "language_info": {
85 | "codemirror_mode": {
86 | "name": "ipython",
87 | "version": 3
88 | },
89 | "file_extension": ".py",
90 | "mimetype": "text/x-python",
91 | "name": "python",
92 | "nbconvert_exporter": "python",
93 | "pygments_lexer": "ipython3",
94 | "version": "3.6.1"
95 | }
96 | },
97 | "nbformat": 4,
98 | "nbformat_minor": 2
99 | }
100 |
--------------------------------------------------------------------------------
/Section 5/5_3_Adversarial Validation.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {
7 | "collapsed": true
8 | },
9 | "outputs": [],
10 | "source": [
11 | "import pandas as pd\n",
12 | "import numpy as np\n",
13 | "from itertools import combinations\n",
14 | "from catboost import CatBoostClassifier\n",
15 | "from sklearn.model_selection import train_test_split, KFold\n",
16 | "from sklearn.metrics import roc_auc_score\n",
17 | "import warnings\n",
18 | "warnings.filterwarnings(\"ignore\")\n",
19 | "np.random.seed(42)"
20 | ]
21 | },
22 | {
23 | "cell_type": "code",
24 | "execution_count": 2,
25 | "metadata": {
26 | "collapsed": true
27 | },
28 | "outputs": [],
29 | "source": [
30 | "train = pd.read_csv('train.csv')\n",
31 | "y = train.target\n",
32 | "test = pd.read_csv('./test.csv')\n",
33 | "train.drop(['ID', 'target'], axis=1, inplace=True)\n",
34 | "test.drop(['ID'], axis=1, inplace=True)\n",
35 | "train.fillna(-9999, inplace=True)\n",
36 | "test.fillna(-9999, inplace=True)\n",
37 | "cat_features_ids = np.where(train.apply(pd.Series.nunique) < 30000)[0].tolist()"
38 | ]
39 | },
40 | {
41 | "cell_type": "code",
42 | "execution_count": 3,
43 | "metadata": {},
44 | "outputs": [
45 | {
46 | "name": "stdout",
47 | "output_type": "stream",
48 | "text": [
49 | "Number of train samples from test distribution: 49142\n"
50 | ]
51 | }
52 | ],
53 | "source": [
54 | "y1 = np.ones_like(y)\n",
55 | "y2 = np.zeros((test.shape[0],))\n",
56 | "y_all = np.hstack([y1, y2])\n",
57 | "all_ = pd.concat([train, test])\n",
58 | "clf = CatBoostClassifier(learning_rate=0.1, iterations=100, random_seed=42, eval_metric='AUC', logging_level='Silent')\n",
59 | "clf.fit(all_, y_all, cat_features=cat_features_ids)\n",
60 | "prediction = clf.predict(train)\n",
61 | "best_val = train[prediction == 0]\n",
62 | "print('Number of train samples from test distribution:', best_val.shape[0])"
63 | ]
64 | },
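65 | {
66 | "cell_type": "code",
67 | "execution_count": null,
68 | "metadata": {
69 | "collapsed": true
70 | },
71 | "outputs": [],
72 | "source": [
73 | "# Variant (minimal sketch): instead of the hard 0/1 prediction above, rank\n",
74 | "# train rows by P(train) and keep the N most test-like ones for validation.\n",
75 | "# N = 30000 here is an arbitrary illustration, not a tuned value.\n",
76 | "proba_train = clf.predict_proba(train)[:, 1] # probability that a row comes from train\n",
77 | "most_test_like = np.argsort(proba_train)[:30000] # lowest P(train) = most test-like\n",
78 | "best_val_ranked = train.iloc[most_test_like]\n",
79 | "print('Ranked validation set:', best_val_ranked.shape[0], 'rows')"
80 | ]
81 | },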
65 | {
66 | "cell_type": "code",
67 | "execution_count": 4,
68 | "metadata": {},
69 | "outputs": [
70 | {
71 | "name": "stdout",
72 | "output_type": "stream",
73 | "text": [
74 | "Validation score: 0.7470119528903851\n"
75 | ]
76 | }
77 | ],
78 | "source": [
79 | "clf = CatBoostClassifier(learning_rate=0.1, iterations=100, random_seed=42, eval_metric='AUC', logging_level='Silent')\n",
80 | "clf.fit(train.loc[prediction != 0, :], y[prediction != 0], cat_features=cat_features_ids)\n",
81 | "prediction_val = clf.predict_proba(best_val)\n",
82 | "print('Validation score:', roc_auc_score(y[prediction == 0], prediction_val[:, 1]))"
83 | ]
84 | }
85 | ],
86 | "metadata": {
87 | "kernelspec": {
88 | "display_name": "Python 3",
89 | "language": "python",
90 | "name": "python3"
91 | },
92 | "language_info": {
93 | "codemirror_mode": {
94 | "name": "ipython",
95 | "version": 3
96 | },
97 | "file_extension": ".py",
98 | "mimetype": "text/x-python",
99 | "name": "python",
100 | "nbconvert_exporter": "python",
101 | "pygments_lexer": "ipython3",
102 | "version": "3.6.1"
103 | }
104 | },
105 | "nbformat": 4,
106 | "nbformat_minor": 2
107 | }
108 |
--------------------------------------------------------------------------------
/Section 5/5_4_Perform metric selection on real data.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 18,
6 | "metadata": {
7 | "collapsed": true
8 | },
9 | "outputs": [],
10 | "source": [
11 | "import pandas as pd\n",
12 | "import numpy as np\n",
13 | "from itertools import combinations\n",
14 | "from catboost import CatBoostClassifier, CatBoostRegressor\n",
15 | "from sklearn.model_selection import train_test_split, KFold\n",
16 | "from sklearn.metrics import mean_squared_error, accuracy_score, recall_score, precision_score, f1_score, roc_auc_score\n",
17 | "import warnings\n",
18 | "warnings.filterwarnings(\"ignore\")\n",
19 | "np.random.seed(42)"
20 | ]
21 | },
22 | {
23 | "cell_type": "code",
24 | "execution_count": 19,
25 | "metadata": {
26 | "collapsed": true
27 | },
28 | "outputs": [],
29 | "source": [
30 | "df = pd.read_csv('./train_merc.csv')\n",
31 | "y = df.y\n",
32 | "df.drop(['ID', 'y'], axis = 1, inplace=True)\n",
33 | "cat_features_ids = np.where(df.apply(pd.Series.nunique) < 30000)[0].tolist()"
34 | ]
35 | },
36 | {
37 | "cell_type": "code",
38 | "execution_count": 20,
39 | "metadata": {},
40 | "outputs": [
41 | {
42 | "name": "stdout",
43 | "output_type": "stream",
44 | "text": [
45 | "28.460498941515414\n",
46 | "27.65863337187866\n"
47 | ]
48 | }
49 | ],
50 | "source": [
51 | "pred = [10,10,10,10,10,10,10,10,10,10]\n",
52 | "y_real = [10,10,10,10,10,10,10,10,10,100]\n",
53 | "print(np.sqrt(mean_squared_error(pred, y_real)))\n",
54 | "\n",
55 | "pred = [25,25,25,25,25,25,25,25,25,25]\n",
56 | "y_real = [10,10,10,10,10,10,10,10,10,100]\n",
57 | "print(np.sqrt(mean_squared_error(pred, y_real)))"
58 | ]
59 | },
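60 | {
61 | "cell_type": "code",
62 | "execution_count": null,
63 | "metadata": {
64 | "collapsed": true
65 | },
66 | "outputs": [],
67 | "source": [
68 | "# Same toy example under RMSLE (minimal sketch): the log transform shrinks\n",
69 | "# the outlier's influence, so the honest constant 10 now beats the inflated 25.\n",
70 | "pred = [10, 10, 10, 10, 10, 10, 10, 10, 10, 10]\n",
71 | "y_real = [10, 10, 10, 10, 10, 10, 10, 10, 10, 100]\n",
72 | "print(np.sqrt(mean_squared_error(np.log1p(y_real), np.log1p(pred))))\n",
73 | "\n",
74 | "pred = [25, 25, 25, 25, 25, 25, 25, 25, 25, 25]\n",
75 | "print(np.sqrt(mean_squared_error(np.log1p(y_real), np.log1p(pred))))"
76 | ]
77 | },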
60 | {
61 | "cell_type": "code",
62 | "execution_count": 21,
63 | "metadata": {},
64 | "outputs": [
65 | {
66 | "name": "stdout",
67 | "output_type": "stream",
68 | "text": [
69 | "RMSE score: 7.373856282754891\n",
70 | "RMSLE score: 0.06776094386349317\n"
71 | ]
72 | }
73 | ],
74 | "source": [
75 | "train, test, y_train, y_test = train_test_split(df, y, test_size = 0.1)\n",
76 | "clf = CatBoostRegressor(learning_rate=0.1, iterations=100, random_seed=42, logging_level='Silent')\n",
77 | "clf.fit(train, y_train, cat_features=cat_features_ids)\n",
78 | "prediction = clf.predict(test)\n",
79 | "print('RMSE score:', np.sqrt(mean_squared_error(y_test, prediction)))\n",
80 | "print('RMSLE score:', np.sqrt(mean_squared_error(np.log1p(y_test), np.log1p(prediction))))"
81 | ]
82 | },
83 | {
84 | "cell_type": "code",
85 | "execution_count": 23,
86 | "metadata": {
87 | "collapsed": true
88 | },
89 | "outputs": [],
90 | "source": [
91 | "df = pd.read_csv('./train_sample.csv.zip')\n",
92 | "y = df.is_attributed\n",
93 | "df.drop(['click_time', 'attributed_time', 'is_attributed'], axis = 1, inplace=True)\n",
94 | "cat_features_ids = np.where(df.apply(pd.Series.nunique) < 30000)[0].tolist()"
95 | ]
96 | },
97 | {
98 | "cell_type": "code",
99 | "execution_count": 24,
100 | "metadata": {},
101 | "outputs": [
102 | {
103 | "name": "stdout",
104 | "output_type": "stream",
105 | "text": [
106 | "\t\t $a(x)$ = 1 \t\t\n",
107 | "\n",
108 | "Accuracy all positive: 0.00227\n",
109 | "Recall all positive: 1.0\n",
110 | "Precision all positive: 0.00227\n",
111 | "F1 score all positive: 0.004529717541181518\n",
112 | "Roc auc score all positive: 0.5\n",
113 | "\n",
114 | "\n",
115 | "\n",
116 | "\t\t $a(x)$ = 0 \t\t\n",
117 | "\n",
118 | "Accuracy all negative: 0.99773\n",
119 | "Recall all negative: 0.0\n",
120 | "Precision all negative: 0.0\n",
121 | "F1 score all negative: 0.0\n",
122 | "Roc auc score all positive: 0.5\n",
123 | "\n",
124 | "\n",
125 | "\n",
126 | "\t\t Catboost \t\t\n",
127 | "\n",
128 | "Accuracy using Catboost: 0.9986\n",
129 | "Recall using Catboost: 0.391304347826087\n",
130 | "Precision using Catboost: 1.0\n",
131 | "F1 score using Catboost: 0.5625\n",
132 | "Roc auc score using Catboost: 0.9189287535244104\n"
133 | ]
134 | }
135 | ],
136 | "source": [
137 | "y_positive = np.ones_like(y)\n",
138 | "y_negative = np.zeros_like(y)\n",
139 | "print('\\t\\t $a(x)$ = 1 \\t\\t\\n')\n",
140 | "print('Accuracy all positive:', accuracy_score(y, y_positive))\n",
141 | "print('Recall all positive:', recall_score(y, y_positive))\n",
142 | "print('Precision all positive:', precision_score(y, y_positive))\n",
143 | "print('F1 score all positive:', f1_score(y, y_positive))\n",
144 | "print('Roc auc score all positive:', roc_auc_score(y, y_positive))\n",
145 | "print('\\n\\n')\n",
146 | "print('\\t\\t $a(x)$ = 0 \\t\\t\\n')\n",
147 | "print('Accuracy all negative:', accuracy_score(y, y_negative))\n",
148 | "print('Recall all negative:', recall_score(y, y_negative))\n",
149 | "print('Precision all negative:', precision_score(y, y_negative))\n",
150 | "print('F1 score all negative:', f1_score(y, y_negative))\n",
151 | "print('Roc auc score all positive:', roc_auc_score(y, y_negative))\n",
152 | "\n",
153 | "print('\\n\\n')\n",
154 | "print('\\t\\t Catboost \\t\\t\\n')\n",
155 | "train, test, y_train, y_test = train_test_split(df, y, test_size = 0.1)\n",
156 | "\n",
157 | "clf = CatBoostClassifier(learning_rate=0.1, iterations=100, random_seed=42, \n",
158 | " eval_metric='AUC', logging_level='Silent', l2_leaf_reg=3, \n",
159 | " model_size_reg = 3)\n",
160 | "clf.fit(train, y_train, cat_features=cat_features_ids)\n",
161 | "prediction = clf.predict_proba(test)\n",
162 | "\n",
163 | "print('Accuracy using Catboost:', accuracy_score(y_test, prediction[:, 1] > 0.5))\n",
164 | "print('Recall using Catboost:', recall_score(y_test, prediction[:, 1] > 0.5))\n",
165 | "print('Precision using Catboost:', precision_score(y_test, prediction[:, 1] > 0.5))\n",
166 | "print('F1 score using Catboost:', f1_score(y_test, prediction[:, 1] > 0.5))\n",
167 | "print('Roc auc score using Catboost:', roc_auc_score(y_test, prediction[:, 1]))"
168 | ]
169 | },
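170 | {
171 | "cell_type": "code",
172 | "execution_count": null,
173 | "metadata": {
174 | "collapsed": true
175 | },
176 | "outputs": [],
177 | "source": [
178 | "# Threshold sketch: 0.5 is rarely the best cut-off on data this imbalanced;\n",
179 | "# sweep a grid and keep the threshold maximizing F1. Assumes prediction and\n",
180 | "# y_test from the Catboost cell above; the grid is illustrative.\n",
181 | "best_t, best_f1 = 0.5, 0.0\n",
182 | "for t in np.arange(0.05, 1.0, 0.05):\n",
183 | "    score = f1_score(y_test, prediction[:, 1] > t)\n",
184 | "    if score > best_f1:\n",
185 | "        best_t, best_f1 = t, score\n",
186 | "print('best threshold: %.2f, best F1: %.4f' % (best_t, best_f1))"
187 | ]
188 | }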
179 | ],
180 | "metadata": {
181 | "kernelspec": {
182 | "display_name": "Python 3",
183 | "language": "python",
184 | "name": "python3"
185 | },
186 | "language_info": {
187 | "codemirror_mode": {
188 | "name": "ipython",
189 | "version": 3
190 | },
191 | "file_extension": ".py",
192 | "mimetype": "text/x-python",
193 | "name": "python",
194 | "nbconvert_exporter": "python",
195 | "pygments_lexer": "ipython3",
196 | "version": "3.6.1"
197 | }
198 | },
199 | "nbformat": 4,
200 | "nbformat_minor": 2
201 | }
202 |
--------------------------------------------------------------------------------