"
17 | ]
18 | },
19 | {
20 | "cell_type": "markdown",
21 | "metadata": {
22 | "button": false,
23 | "deletable": true,
24 | "new_sheet": false,
25 | "run_control": {
26 | "read_only": false
27 | }
28 | },
29 | "source": [
30 | "In this lab exercise, you will learn a popular machine learning algorithm, Decision Tree. You will use this classification algorithm to build a model from historical data of patients, and their response to different medications. Then you use the trained decision tree to predict the class of a unknown patient, or to find a proper drug for a new patient."
31 | ]
32 | },
33 | {
34 | "cell_type": "markdown",
35 | "metadata": {},
36 | "source": [
37 | "
\n",
105 | " Imagine that you are a medical researcher compiling data for a study. You have collected data about a set of patients, all of whom suffered from the same illness. During their course of treatment, each patient responded to one of 5 medications, Drug A, Drug B, Drug c, Drug x and y. \n",
106 | " \n",
107 | " \n",
108 | " Part of your job is to build a model to find out which drug might be appropriate for a future patient with the same illness. The feature sets of this dataset are Age, Sex, Blood Pressure, and Cholesterol of patients, and the target is the drug that each patient responded to.\n",
109 | " \n",
110 | " \n",
111 | " It is a sample of binary classifier, and you can use the training part of the dataset \n",
112 | " to build a decision tree, and then use it to predict the class of a unknown patient, or to prescribe it to a new patient.\n",
113 | "
\n",
765 | " We will first create an instance of the DecisionTreeClassifier called drugTree. \n",
766 | " Inside of the classifier, specify criterion=\"entropy\" so we can see the information gain of each node.\n",
767 | "
\n",
1118 | "\n",
1119 | "IBM SPSS Modeler is a comprehensive analytics platform that has many machine learning algorithms. It has been designed to bring predictive intelligence to decisions made by individuals, by groups, by systems – by your enterprise as a whole. A free trial is available through this course, available here: SPSS Modeler\n",
1120 | "\n",
1121 | "Also, you can use Watson Studio to run these notebooks faster with bigger datasets. Watson Studio is IBM's leading cloud solution for data scientists, built by data scientists. With Jupyter notebooks, RStudio, Apache Spark and popular libraries pre-packaged in the cloud, Watson Studio enables data scientists to collaborate on their projects without having to install anything. Join the fast-growing community of Watson Studio users today with a free account at Watson Studio\n",
1122 | "\n",
1123 | "
Saeed Aghabozorgi, PhD is a Data Scientist in IBM with a track record of developing enterprise level applications that substantially increases clients’ ability to turn data into actionable knowledge. He is a researcher in data mining field and expert in developing advanced analytic methods like machine learning and statistical modelling on large datasets.
"
16 | ]
17 | },
18 | {
19 | "cell_type": "markdown",
20 | "metadata": {},
21 | "source": [
22 | "In this notebook, you will learn Logistic Regression, and then, you'll create a model for a telecommunication company, to predict when its customers will leave for a competitor, so that they can take some action to retain the customers."
23 | ]
24 | },
25 | {
26 | "cell_type": "markdown",
27 | "metadata": {},
28 | "source": [
29 | "
\n",
40 | " \n",
41 | ""
42 | ]
43 | },
44 | {
45 | "cell_type": "markdown",
46 | "metadata": {
47 | "button": false,
48 | "new_sheet": false,
49 | "run_control": {
50 | "read_only": false
51 | }
52 | },
53 | "source": [
54 | "\n",
55 | "## What is the difference between Linear and Logistic Regression?\n",
56 | "\n",
57 | "While Linear Regression is suited for estimating continuous values (e.g. estimating house price), it is not the best tool for predicting the class of an observed data point. In order to estimate the class of a data point, we need some sort of guidance on what would be the most probable class for that data point. For this, we use Logistic Regression.\n",
58 | "\n",
59 | "
\n",
60 | "Recall linear regression:\n",
61 | " \n",
62 | " \n",
63 | " As you know, Linear regression finds a function that relates a continuous dependent variable, y, to some predictors (independent variables $x_1$, $x_2$, etc.). For example, Simple linear regression assumes a function of the form:\n",
64 | "
\n",
65 | "$$\n",
66 | "y = \\theta_0 + \\theta_1 x_1 + \\theta_2 x_2 + \\cdots\n",
67 | "$$\n",
68 | " \n",
69 | "and finds the values of parameters $\\theta_0, \\theta_1, \\theta_2$, etc, where the term $\\theta_0$ is the \"intercept\". It can be generally shown as:\n",
70 | "
\n",
77 | "\n",
78 | "Logistic Regression is a variation of Linear Regression, useful when the observed dependent variable, y, is categorical. It produces a formula that predicts the probability of the class label as a function of the independent variables.\n",
79 | "\n",
80 | "Logistic regression fits a special s-shaped curve by taking the linear regression and transforming the numeric estimate into a probability with the following function, which is called sigmoid function 𝜎:\n",
81 | "\n",
82 | "$$\n",
83 | "ℎ_\\theta(𝑥) = \\sigma({\\theta^TX}) = \\frac {e^{(\\theta_0 + \\theta_1 x_1 + \\theta_2 x_2 +...)}}{1 + e^{(\\theta_0 + \\theta_1 x_1 + \\theta_2 x_2 +\\cdots)}}\n",
84 | "$$\n",
85 | "Or:\n",
86 | "$$\n",
87 | "ProbabilityOfaClass_1 = P(Y=1|X) = \\sigma({\\theta^TX}) = \\frac{e^{\\theta^TX}}{1+e^{\\theta^TX}} \n",
88 | "$$\n",
89 | "\n",
90 | "In this equation, ${\\theta^TX}$ is the regression result (the sum of the variables weighted by the coefficients), `exp` is the exponential function and $\\sigma(\\theta^TX)$ is the sigmoid or [logistic function](http://en.wikipedia.org/wiki/Logistic_function), also called logistic curve. It is a common \"S\" shape (sigmoid curve).\n",
91 | "\n",
92 | "So, briefly, Logistic Regression passes the input through the logistic/sigmoid but then treats the result as a probability:\n",
93 | "\n",
94 | "\n",
96 | "\n",
97 | "\n",
98 | "The objective of __Logistic Regression__ algorithm, is to find the best parameters θ, for $ℎ_\\theta(𝑥)$ = $\\sigma({\\theta^TX})$, in such a way that the model best predicts the class of each case."
99 | ]
100 | },
101 | {
102 | "cell_type": "markdown",
103 | "metadata": {},
104 | "source": [
105 | "### Customer churn with Logistic Regression\n",
106 | "A telecommunications company is concerned about the number of customers leaving their land-line business for cable competitors. They need to understand who is leaving. Imagine that you are an analyst at this company and you have to find out who is leaving and why."
107 | ]
108 | },
109 | {
110 | "cell_type": "markdown",
111 | "metadata": {
112 | "button": false,
113 | "new_sheet": false,
114 | "run_control": {
115 | "read_only": false
116 | }
117 | },
118 | "source": [
119 | "Lets first import required libraries:"
120 | ]
121 | },
122 | {
123 | "cell_type": "code",
124 | "execution_count": 2,
125 | "metadata": {
126 | "button": false,
127 | "new_sheet": false,
128 | "run_control": {
129 | "read_only": false
130 | }
131 | },
132 | "outputs": [],
133 | "source": [
134 | "import pandas as pd\n",
135 | "import pylab as pl\n",
136 | "import numpy as np\n",
137 | "import scipy.optimize as opt\n",
138 | "from sklearn import preprocessing\n",
139 | "%matplotlib inline \n",
140 | "import matplotlib.pyplot as plt"
141 | ]
142 | },
143 | {
144 | "cell_type": "markdown",
145 | "metadata": {
146 | "button": false,
147 | "new_sheet": false,
148 | "run_control": {
149 | "read_only": false
150 | }
151 | },
152 | "source": [
153 | "
About the dataset
\n",
154 | "We will use a telecommunications dataset for predicting customer churn. This is a historical customer dataset where each row represents one customer. The data is relatively easy to understand, and you may uncover insights you can use immediately. Typically it is less expensive to keep customers than acquire new ones, so the focus of this analysis is to predict the customers who will stay with the company. \n",
155 | "\n",
156 | "\n",
157 | "This data set provides information to help you predict what behavior will help you to retain customers. You can analyze all relevant customer data and develop focused customer retention programs.\n",
158 | "\n",
159 | "\n",
160 | "\n",
161 | "The dataset includes information about:\n",
162 | "\n",
163 | "- Customers who left within the last month – the column is called Churn\n",
164 | "- Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies\n",
165 | "- Customer account information – how long they had been a customer, contract, payment method, paperless billing, monthly charges, and total charges\n",
166 | "- Demographic info about customers – gender, age range, and if they have partners and dependents\n"
167 | ]
168 | },
169 | {
170 | "cell_type": "markdown",
171 | "metadata": {
172 | "button": false,
173 | "new_sheet": false,
174 | "run_control": {
175 | "read_only": false
176 | }
177 | },
178 | "source": [
179 | "### Load the Telco Churn data \n",
180 | "Telco Churn is a hypothetical data file that concerns a telecommunications company's efforts to reduce turnover in its customer base. Each case corresponds to a separate customer and it records various demographic and service usage information. Before you can work with the data, you must use the URL to get the ChurnData.csv.\n",
181 | "\n",
182 | "To download the data, we will use `!wget` to download it from IBM Object Storage."
183 | ]
184 | },
185 | {
186 | "cell_type": "code",
187 | "execution_count": 3,
188 | "metadata": {
189 | "button": false,
190 | "new_sheet": false,
191 | "run_control": {
192 | "read_only": false
193 | }
194 | },
195 | "outputs": [
196 | {
197 | "name": "stdout",
198 | "output_type": "stream",
199 | "text": [
200 | "--2019-07-11 02:13:17-- https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/ChurnData.csv\n",
201 | "Resolving s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)... 67.228.254.193\n",
202 | "Connecting to s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)|67.228.254.193|:443... connected.\n",
203 | "HTTP request sent, awaiting response... 200 OK\n",
204 | "Length: 36144 (35K) [text/csv]\n",
205 | "Saving to: ‘ChurnData.csv’\n",
206 | "\n",
207 | "ChurnData.csv 100%[===================>] 35.30K --.-KB/s in 0.02s \n",
208 | "\n",
209 | "2019-07-11 02:13:17 (1.63 MB/s) - ‘ChurnData.csv’ saved [36144/36144]\n",
210 | "\n"
211 | ]
212 | }
213 | ],
214 | "source": [
215 | "#Click here and press Shift+Enter\n",
216 | "!wget -O ChurnData.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/ChurnData.csv"
217 | ]
218 | },
219 | {
220 | "cell_type": "markdown",
221 | "metadata": {},
222 | "source": [
223 | "__Did you know?__ When it comes to Machine Learning, you will likely be working with large datasets. As a business, where can you host your data? IBM is offering a unique opportunity for businesses, with 10 Tb of IBM Cloud Object Storage: [Sign up now for free](http://cocl.us/ML0101EN-IBM-Offer-CC)"
224 | ]
225 | },
226 | {
227 | "cell_type": "markdown",
228 | "metadata": {
229 | "button": false,
230 | "new_sheet": false,
231 | "run_control": {
232 | "read_only": false
233 | }
234 | },
235 | "source": [
236 | "### Load Data From CSV File "
237 | ]
238 | },
239 | {
240 | "cell_type": "code",
241 | "execution_count": 4,
242 | "metadata": {
243 | "button": false,
244 | "new_sheet": false,
245 | "run_control": {
246 | "read_only": false
247 | }
248 | },
249 | "outputs": [
250 | {
251 | "data": {
252 | "text/html": [
253 | "
"
1001 | ]
1002 | },
1003 | {
1004 | "cell_type": "markdown",
1005 | "metadata": {},
1006 | "source": [
1007 | "### jaccard index\n",
1008 | "Lets try jaccard index for accuracy evaluation. we can define jaccard as the size of the intersection divided by the size of the union of two label sets. If the entire set of predicted labels for a sample strictly match with the true set of labels, then the subset accuracy is 1.0; otherwise it is 0.0.\n",
1009 | "\n"
1010 | ]
1011 | },
1012 | {
1013 | "cell_type": "code",
1014 | "execution_count": null,
1015 | "metadata": {},
1016 | "outputs": [],
1017 | "source": [
1018 | "from sklearn.metrics import jaccard_similarity_score\n",
1019 | "jaccard_similarity_score(y_test, yhat)"
1020 | ]
1021 | },
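{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note: `jaccard_similarity_score` was deprecated and later removed from scikit-learn (in 0.23). In newer versions, a close replacement for binary labels is `jaccard_score` (a sketch; `pos_label` selects which class is treated as positive):\n",
"\n",
"```python\n",
"from sklearn.metrics import jaccard_score\n",
"jaccard_score(y_test, yhat, pos_label=1)\n",
"```"
]
},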
1022 | {
1023 | "cell_type": "markdown",
1024 | "metadata": {},
1025 | "source": [
1026 | "### confusion matrix\n",
1027 | "Another way of looking at accuracy of classifier is to look at __confusion matrix__."
1028 | ]
1029 | },
1030 | {
1031 | "cell_type": "code",
1032 | "execution_count": null,
1033 | "metadata": {},
1034 | "outputs": [],
1035 | "source": [
1036 | "from sklearn.metrics import classification_report, confusion_matrix\n",
1037 | "import itertools\n",
1038 | "def plot_confusion_matrix(cm, classes,\n",
1039 | " normalize=False,\n",
1040 | " title='Confusion matrix',\n",
1041 | " cmap=plt.cm.Blues):\n",
1042 | " \"\"\"\n",
1043 | " This function prints and plots the confusion matrix.\n",
1044 | " Normalization can be applied by setting `normalize=True`.\n",
1045 | " \"\"\"\n",
1046 | " if normalize:\n",
1047 | " cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]\n",
1048 | " print(\"Normalized confusion matrix\")\n",
1049 | " else:\n",
1050 | " print('Confusion matrix, without normalization')\n",
1051 | "\n",
1052 | " print(cm)\n",
1053 | "\n",
1054 | " plt.imshow(cm, interpolation='nearest', cmap=cmap)\n",
1055 | " plt.title(title)\n",
1056 | " plt.colorbar()\n",
1057 | " tick_marks = np.arange(len(classes))\n",
1058 | " plt.xticks(tick_marks, classes, rotation=45)\n",
1059 | " plt.yticks(tick_marks, classes)\n",
1060 | "\n",
1061 | " fmt = '.2f' if normalize else 'd'\n",
1062 | " thresh = cm.max() / 2.\n",
1063 | " for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):\n",
1064 | " plt.text(j, i, format(cm[i, j], fmt),\n",
1065 | " horizontalalignment=\"center\",\n",
1066 | " color=\"white\" if cm[i, j] > thresh else \"black\")\n",
1067 | "\n",
1068 | " plt.tight_layout()\n",
1069 | " plt.ylabel('True label')\n",
1070 | " plt.xlabel('Predicted label')\n",
1071 | "print(confusion_matrix(y_test, yhat, labels=[1,0]))"
1072 | ]
1073 | },
1074 | {
1075 | "cell_type": "code",
1076 | "execution_count": null,
1077 | "metadata": {},
1078 | "outputs": [],
1079 | "source": [
1080 | "# Compute confusion matrix\n",
1081 | "cnf_matrix = confusion_matrix(y_test, yhat, labels=[1,0])\n",
1082 | "np.set_printoptions(precision=2)\n",
1083 | "\n",
1084 | "\n",
1085 | "# Plot non-normalized confusion matrix\n",
1086 | "plt.figure()\n",
1087 | "plot_confusion_matrix(cnf_matrix, classes=['churn=1','churn=0'],normalize= False, title='Confusion matrix')"
1088 | ]
1089 | },
1090 | {
1091 | "cell_type": "markdown",
1092 | "metadata": {},
1093 | "source": [
1094 | "Look at first row. The first row is for customers whose actual churn value in test set is 1.\n",
1095 | "As you can calculate, out of 40 customers, the churn value of 15 of them is 1. \n",
1096 | "And out of these 15, the classifier correctly predicted 6 of them as 1, and 9 of them as 0. \n",
1097 | "\n",
1098 | "It means, for 6 customers, the actual churn value were 1 in test set, and classifier also correctly predicted those as 1. However, while the actual label of 9 customers were 1, the classifier predicted those as 0, which is not very good. We can consider it as error of the model for first row.\n",
1099 | "\n",
1100 | "What about the customers with churn value 0? Lets look at the second row.\n",
1101 | "It looks like there were 25 customers whom their churn value were 0. \n",
1102 | "\n",
1103 | "\n",
1104 | "The classifier correctly predicted 24 of them as 0, and one of them wrongly as 1. So, it has done a good job in predicting the customers with churn value 0. A good thing about confusion matrix is that shows the model’s ability to correctly predict or separate the classes. In specific case of binary classifier, such as this example, we can interpret these numbers as the count of true positives, false positives, true negatives, and false negatives. "
1105 | ]
1106 | },
1107 | {
1108 | "cell_type": "code",
1109 | "execution_count": null,
1110 | "metadata": {},
1111 | "outputs": [],
1112 | "source": [
1113 | "print (classification_report(y_test, yhat))\n"
1114 | ]
1115 | },
1116 | {
1117 | "cell_type": "markdown",
1118 | "metadata": {},
1119 | "source": [
1120 | "Based on the count of each section, we can calculate precision and recall of each label:\n",
1121 | "\n",
1122 | "\n",
1123 | "- __Precision__ is a measure of the accuracy provided that a class label has been predicted. It is defined by: precision = TP / (TP + FP)\n",
1124 | "\n",
1125 | "- __Recall__ is true positive rate. It is defined as: Recall = TP / (TP + FN)\n",
1126 | "\n",
1127 | " \n",
1128 | "So, we can calculate precision and recall of each class.\n",
1129 | "\n",
1130 | "__F1 score:__\n",
1131 | "Now we are in the position to calculate the F1 scores for each label based on the precision and recall of that label. \n",
1132 | "\n",
1133 | "The F1 score is the harmonic average of the precision and recall, where an F1 score reaches its best value at 1 (perfect precision and recall) and worst at 0. It is a good way to show that a classifer has a good value for both recall and precision.\n",
1134 | "\n",
1135 | "\n",
1136 | "And finally, we can tell the average accuracy for this classifier is the average of the F1-score for both labels, which is 0.72 in our case."
1137 | ]
1138 | },
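{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a worked check using the counts read off the confusion matrix above (with churn=1 as the positive class: TP=6, FN=9, FP=1, TN=24):\n",
"\n",
"- Class 1: precision = 6/(6+1) ≈ 0.86, recall = 6/(6+9) = 0.40, F1 ≈ 0.55\n",
"- Class 0: precision = 24/(24+9) ≈ 0.73, recall = 24/(24+1) = 0.96, F1 ≈ 0.83\n",
"\n",
"The weighted average F1 is then (15 × 0.55 + 25 × 0.83) / 40 ≈ 0.72, which is the value quoted above."
]
},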
1139 | {
1140 | "cell_type": "markdown",
1141 | "metadata": {},
1142 | "source": [
1143 | "### log loss\n",
1144 | "Now, lets try __log loss__ for evaluation. In logistic regression, the output can be the probability of customer churn is yes (or equals to 1). This probability is a value between 0 and 1.\n",
1145 | "Log loss( Logarithmic loss) measures the performance of a classifier where the predicted output is a probability value between 0 and 1. \n"
1146 | ]
1147 | },
1148 | {
1149 | "cell_type": "code",
1150 | "execution_count": 1,
1151 | "metadata": {},
1152 | "outputs": [
1153 | {
1154 | "ename": "NameError",
1155 | "evalue": "name 'y_test' is not defined",
1156 | "output_type": "error",
1157 | "traceback": [
1158 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
1159 | "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
1160 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0msklearn\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmetrics\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mlog_loss\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mlog_loss\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0my_test\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0myhat_prob\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
1161 | "\u001b[0;31mNameError\u001b[0m: name 'y_test' is not defined"
1162 | ]
1163 | }
1164 | ],
1165 | "source": [
1166 | "from sklearn.metrics import log_loss\n",
1167 | "log_loss(y_test, yhat_prob)"
1168 | ]
1169 | },
1170 | {
1171 | "cell_type": "markdown",
1172 | "metadata": {},
1173 | "source": [
1174 | "
Practice
\n",
1175 | "Try to build Logistic Regression model again for the same dataset, but this time, use different __solver__ and __regularization__ values? What is new __logLoss__ value?"
1176 | ]
1177 | },
1178 | {
1179 | "cell_type": "code",
1180 | "execution_count": null,
1181 | "metadata": {},
1182 | "outputs": [],
1183 | "source": [
1184 | "# write your code here\n",
1185 | "\n"
1186 | ]
1187 | },
1188 | {
1189 | "cell_type": "markdown",
1190 | "metadata": {},
1191 | "source": [
1192 | "Double-click __here__ for the solution.\n",
1193 | "\n",
1194 | ""
1201 | ]
1202 | },
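{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible approach (a sketch, not the official solution; the `solver` and `C` values are illustrative, and `LogisticRegression`, `log_loss`, and the train/test split are assumed from earlier in the notebook):\n",
"\n",
"```python\n",
"LR2 = LogisticRegression(C=0.01, solver='sag').fit(X_train, y_train)\n",
"yhat_prob2 = LR2.predict_proba(X_test)\n",
"print(\"LogLoss: %.2f\" % log_loss(y_test, yhat_prob2))\n",
"```"
]
},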
1203 | {
1204 | "cell_type": "markdown",
1205 | "metadata": {
1206 | "button": false,
1207 | "new_sheet": false,
1208 | "run_control": {
1209 | "read_only": false
1210 | }
1211 | },
1212 | "source": [
1213 | "
Want to learn more?
\n",
1214 | "\n",
1215 | "IBM SPSS Modeler is a comprehensive analytics platform that has many machine learning algorithms. It has been designed to bring predictive intelligence to decisions made by individuals, by groups, by systems – by your enterprise as a whole. A free trial is available through this course, available here: SPSS Modeler\n",
1216 | "\n",
1217 | "Also, you can use Watson Studio to run these notebooks faster with bigger datasets. Watson Studio is IBM's leading cloud solution for data scientists, built by data scientists. With Jupyter notebooks, RStudio, Apache Spark and popular libraries pre-packaged in the cloud, Watson Studio enables data scientists to collaborate on their projects without having to install anything. Join the fast-growing community of Watson Studio users today with a free account at Watson Studio\n",
1218 | "\n",
1219 | "
Saeed Aghabozorgi, PhD is a Data Scientist in IBM with a track record of developing enterprise level applications that substantially increases clients’ ability to turn data into actionable knowledge. He is a researcher in data mining field and expert in developing advanced analytic methods like machine learning and statistical modelling on large datasets.
"
10 | ]
11 | },
12 | {
13 | "cell_type": "markdown",
14 | "metadata": {},
15 | "source": [
16 | "In this notebook, you will use SVM (Support Vector Machines) to build and train a model using human cell records, and classify cells to whether the samples are benign or malignant.\n",
17 | "\n",
18 | "SVM works by mapping data to a high-dimensional feature space so that data points can be categorized, even when the data are not otherwise linearly separable. A separator between the categories is found, then the data is transformed in such a way that the separator could be drawn as a hyperplane. Following this, characteristics of new data can be used to predict the group to which a new record should belong."
19 | ]
20 | },
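{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of this idea (illustrative only; it assumes a train/test split of the cell-records features, which the notebook builds later):\n",
"\n",
"```python\n",
"from sklearn import svm\n",
"\n",
"clf = svm.SVC(kernel='rbf')   # the RBF kernel implicitly maps data to a higher-dimensional space\n",
"clf.fit(X_train, y_train)\n",
"yhat = clf.predict(X_test)\n",
"```"
]
},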
21 | {
22 | "cell_type": "markdown",
23 | "metadata": {},
24 | "source": [
25 | "
\n",
669 | "Can you rebuild the model, but this time with a __linear__ kernel? You can use __kernel='linear'__ option, when you define the svm. How the accuracy changes with the new kernel function?"
670 | ]
671 | },
672 | {
673 | "cell_type": "code",
674 | "execution_count": null,
675 | "metadata": {},
676 | "outputs": [],
677 | "source": [
678 | "# write your code here\n"
679 | ]
680 | },
681 | {
682 | "cell_type": "markdown",
683 | "metadata": {},
684 | "source": [
685 | "Double-click __here__ for the solution.\n",
686 | "\n",
687 | ""
696 | ]
697 | },
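{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible approach (a sketch, not the official solution; it assumes the `svm` module, `f1_score`, and the `X_train`/`X_test`/`y_train`/`y_test` split from earlier in the notebook):\n",
"\n",
"```python\n",
"clf2 = svm.SVC(kernel='linear')\n",
"clf2.fit(X_train, y_train)\n",
"yhat2 = clf2.predict(X_test)\n",
"print(\"Avg F1-score: %.4f\" % f1_score(y_test, yhat2, average='weighted'))\n",
"```"
]
},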
698 | {
699 | "cell_type": "markdown",
700 | "metadata": {
701 | "button": false,
702 | "new_sheet": false,
703 | "run_control": {
704 | "read_only": false
705 | }
706 | },
707 | "source": [
708 | "
Want to learn more?
\n",
709 | "\n",
710 | "IBM SPSS Modeler is a comprehensive analytics platform that has many machine learning algorithms. It has been designed to bring predictive intelligence to decisions made by individuals, by groups, by systems – by your enterprise as a whole. A free trial is available through this course, available here: SPSS Modeler\n",
711 | "\n",
712 | "Also, you can use Watson Studio to run these notebooks faster with bigger datasets. Watson Studio is IBM's leading cloud solution for data scientists, built by data scientists. With Jupyter notebooks, RStudio, Apache Spark and popular libraries pre-packaged in the cloud, Watson Studio enables data scientists to collaborate on their projects without having to install anything. Join the fast-growing community of Watson Studio users today with a free account at Watson Studio\n",
713 | "\n",
714 | "
Saeed Aghabozorgi, PhD is a Data Scientist in IBM with a track record of developing enterprise level applications that substantially increases clients’ ability to turn data into actionable knowledge. He is a researcher in data mining field and expert in developing advanced analytic methods like machine learning and statistical modelling on large datasets.
"
10 | ]
11 | },
12 | {
13 | "cell_type": "markdown",
14 | "metadata": {},
15 | "source": [
16 | "If the data shows a curvy trend, then linear regression will not produce very accurate results when compared to a non-linear regression because, as the name implies, linear regression presumes that the data is linear. \n",
17 | "Let's learn about non linear regressions and apply an example on python. In this notebook, we fit a non-linear model to the datapoints corrensponding to China's GDP from 1960 to 2014."
18 | ]
19 | },
20 | {
21 | "cell_type": "markdown",
22 | "metadata": {},
23 | "source": [
24 | "
Importing required libraries
"
25 | ]
26 | },
27 | {
28 | "cell_type": "code",
29 | "execution_count": null,
30 | "metadata": {
31 | "collapsed": false,
32 | "jupyter": {
33 | "outputs_hidden": false
34 | }
35 | },
36 | "outputs": [],
37 | "source": [
38 | "import numpy as np\n",
39 | "import matplotlib.pyplot as plt\n",
40 | "%matplotlib inline"
41 | ]
42 | },
43 | {
44 | "cell_type": "markdown",
45 | "metadata": {},
46 | "source": [
47 | "Though Linear regression is very good to solve many problems, it cannot be used for all datasets. First recall how linear regression, could model a dataset. It models a linear relation between a dependent variable y and independent variable x. It had a simple equation, of degree 1, for example y = $2x$ + 3."
48 | ]
49 | },
50 | {
51 | "cell_type": "code",
52 | "execution_count": null,
53 | "metadata": {},
54 | "outputs": [],
55 | "source": [
56 | "x = np.arange(-5.0, 5.0, 0.1)\n",
57 | "\n",
58 | "##You can adjust the slope and intercept to verify the changes in the graph\n",
59 | "y = 2*(x) + 3\n",
60 | "y_noise = 2 * np.random.normal(size=x.size)\n",
61 | "ydata = y + y_noise\n",
62 | "#plt.figure(figsize=(8,6))\n",
63 | "plt.plot(x, ydata, 'bo')\n",
64 | "plt.plot(x,y, 'r') \n",
65 | "plt.ylabel('Dependent Variable')\n",
66 | "plt.xlabel('Indepdendent Variable')\n",
67 | "plt.show()"
68 | ]
69 | },
70 | {
71 | "cell_type": "markdown",
72 | "metadata": {},
73 | "source": [
74 | "Non-linear regressions are a relationship between independent variables $x$ and a dependent variable $y$ which result in a non-linear function modeled data. Essentially any relationship that is not linear can be termed as non-linear, and is usually represented by the polynomial of $k$ degrees (maximum power of $x$). \n",
75 | "\n",
76 | "$$ \\ y = a x^3 + b x^2 + c x + d \\ $$\n",
77 | "\n",
78 | "Non-linear functions can have elements like exponentials, logarithms, fractions, and others. For example: $$ y = \\log(x)$$\n",
79 | " \n",
80 | "Or even, more complicated such as :\n",
81 | "$$ y = \\log(a x^3 + b x^2 + c x + d)$$"
82 | ]
83 | },
84 | {
85 | "cell_type": "markdown",
86 | "metadata": {},
87 | "source": [
88 | "Let's take a look at a cubic function's graph."
89 | ]
90 | },
91 | {
92 | "cell_type": "code",
93 | "execution_count": null,
94 | "metadata": {
95 | "collapsed": false,
96 | "jupyter": {
97 | "outputs_hidden": false
98 | }
99 | },
100 | "outputs": [],
101 | "source": [
102 | "x = np.arange(-5.0, 5.0, 0.1)\n",
103 | "\n",
104 | "##You can adjust the slope and intercept to verify the changes in the graph\n",
105 | "y = 1*(x**3) + 1*(x**2) + 1*x + 3\n",
106 | "y_noise = 20 * np.random.normal(size=x.size)\n",
107 | "ydata = y + y_noise\n",
108 | "plt.plot(x, ydata, 'bo')\n",
109 | "plt.plot(x,y, 'r') \n",
110 | "plt.ylabel('Dependent Variable')\n",
111 | "plt.xlabel('Indepdendent Variable')\n",
112 | "plt.show()"
113 | ]
114 | },
115 | {
116 | "cell_type": "markdown",
117 | "metadata": {},
118 | "source": [
119 | "As you can see, this function has $x^3$ and $x^2$ as independent variables. Also, the graphic of this function is not a straight line over the 2D plane. So this is a non-linear function."
120 | ]
121 | },
122 | {
123 | "cell_type": "markdown",
124 | "metadata": {},
125 | "source": [
126 | "Some other types of non-linear functions are:"
127 | ]
128 | },
129 | {
130 | "cell_type": "markdown",
131 | "metadata": {},
132 | "source": [
133 | "### Quadratic"
134 | ]
135 | },
136 | {
137 | "cell_type": "markdown",
138 | "metadata": {},
139 | "source": [
140 | "$$ Y = X^2 $$"
141 | ]
142 | },
143 | {
144 | "cell_type": "code",
145 | "execution_count": null,
146 | "metadata": {
147 | "collapsed": false,
148 | "jupyter": {
149 | "outputs_hidden": false
150 | }
151 | },
152 | "outputs": [],
153 | "source": [
154 | "x = np.arange(-5.0, 5.0, 0.1)\n",
155 | "\n",
156 | "##You can adjust the slope and intercept to verify the changes in the graph\n",
157 | "\n",
158 | "y = np.power(x,2)\n",
159 | "y_noise = 2 * np.random.normal(size=x.size)\n",
160 | "ydata = y + y_noise\n",
161 | "plt.plot(x, ydata, 'bo')\n",
162 | "plt.plot(x,y, 'r') \n",
163 | "plt.ylabel('Dependent Variable')\n",
164 | "plt.xlabel('Indepdendent Variable')\n",
165 | "plt.show()"
166 | ]
167 | },
168 | {
169 | "cell_type": "markdown",
170 | "metadata": {},
171 | "source": [
172 | "### Exponential"
173 | ]
174 | },
175 | {
176 | "cell_type": "markdown",
177 | "metadata": {},
178 | "source": [
179 | "An exponential function with base c is defined by $$ Y = a + b c^X$$ where b ≠0, c > 0 , c ≠1, and x is any real number. The base, c, is constant and the exponent, x, is a variable. \n",
180 | "\n"
181 | ]
182 | },
183 | {
184 | "cell_type": "code",
185 | "execution_count": null,
186 | "metadata": {
187 | "collapsed": false,
188 | "jupyter": {
189 | "outputs_hidden": false
190 | }
191 | },
192 | "outputs": [],
193 | "source": [
194 | "X = np.arange(-5.0, 5.0, 0.1)\n",
195 | "\n",
196 | "##You can adjust the slope and intercept to verify the changes in the graph\n",
197 | "\n",
198 | "Y= np.exp(X)\n",
199 | "\n",
200 | "plt.plot(X,Y) \n",
201 | "plt.ylabel('Dependent Variable')\n",
202 | "plt.xlabel('Indepdendent Variable')\n",
203 | "plt.show()"
204 | ]
205 | },
206 | {
207 | "cell_type": "markdown",
208 | "metadata": {},
209 | "source": [
210 | "### Logarithmic\n",
211 | "\n",
212 | "The response $y$ is a results of applying logarithmic map from input $x$'s to output variable $y$. It is one of the simplest form of __log()__: i.e. $$ y = \\log(x)$$\n",
213 | "\n",
214 | "Please consider that instead of $x$, we can use $X$, which can be polynomial representation of the $x$'s. In general form it would be written as \n",
215 | "\\begin{equation}\n",
216 | "y = \\log(X)\n",
217 | "\\end{equation}"
218 | ]
219 | },
220 | {
221 | "cell_type": "code",
222 | "execution_count": null,
223 | "metadata": {
224 | "collapsed": false,
225 | "jupyter": {
226 | "outputs_hidden": false
227 | }
228 | },
229 | "outputs": [],
230 | "source": [
231 | "X = np.arange(-5.0, 5.0, 0.1)\n",
232 | "\n",
233 | "Y = np.log(X)\n",
234 | "\n",
235 | "plt.plot(X,Y) \n",
236 | "plt.ylabel('Dependent Variable')\n",
237 | "plt.xlabel('Indepdendent Variable')\n",
238 | "plt.show()"
239 | ]
240 | },
241 | {
242 | "cell_type": "markdown",
243 | "metadata": {},
244 | "source": [
245 | "### Sigmoidal/Logistic"
246 | ]
247 | },
248 | {
249 | "cell_type": "markdown",
250 | "metadata": {},
251 | "source": [
252 | "$$ Y = a + \\frac{b}{1+ c^{(X-d)}}$$"
253 | ]
254 | },
255 | {
256 | "cell_type": "code",
257 | "execution_count": null,
258 | "metadata": {},
259 | "outputs": [],
260 | "source": [
261 | "X = np.arange(-5.0, 5.0, 0.1)\n",
262 | "\n",
263 | "\n",
264 | "Y = 1-4/(1+np.power(3, X-2))\n",
265 | "\n",
266 | "plt.plot(X,Y) \n",
267 | "plt.ylabel('Dependent Variable')\n",
268 | "plt.xlabel('Indepdendent Variable')\n",
269 | "plt.show()"
270 | ]
271 | },
272 | {
273 | "cell_type": "markdown",
274 | "metadata": {},
275 | "source": [
276 | "\n",
277 | "# Non-Linear Regression example"
278 | ]
279 | },
280 | {
281 | "cell_type": "markdown",
282 | "metadata": {},
283 | "source": [
284 | "For an example, we're going to try and fit a non-linear model to the datapoints corresponding to China's GDP from 1960 to 2014. We download a dataset with two columns, the first, a year between 1960 and 2014, the second, China's corresponding annual gross domestic income in US dollars for that year. "
285 | ]
286 | },
287 | {
288 | "cell_type": "code",
289 | "execution_count": null,
290 | "metadata": {
291 | "collapsed": false,
292 | "jupyter": {
293 | "outputs_hidden": false
294 | }
295 | },
296 | "outputs": [],
297 | "source": [
298 | "import numpy as np\n",
299 | "import pandas as pd\n",
300 | "\n",
301 | "#downloading dataset\n",
302 | "!wget -nv -O china_gdp.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/china_gdp.csv\n",
303 | " \n",
304 | "df = pd.read_csv(\"china_gdp.csv\")\n",
305 | "df.head(10)"
306 | ]
307 | },
308 | {
309 | "cell_type": "markdown",
310 | "metadata": {},
311 | "source": [
312 | "__Did you know?__ When it comes to Machine Learning, you will likely be working with large datasets. As a business, where can you host your data? IBM is offering a unique opportunity for businesses, with 10 Tb of IBM Cloud Object Storage: [Sign up now for free](http://cocl.us/ML0101EN-IBM-Offer-CC)"
313 | ]
314 | },
315 | {
316 | "cell_type": "markdown",
317 | "metadata": {},
318 | "source": [
319 | "### Plotting the Dataset ###\n",
320 | "This is what the datapoints look like. It kind of looks like an either logistic or exponential function. The growth starts off slow, then from 2005 on forward, the growth is very significant. And finally, it decelerate slightly in the 2010s."
321 | ]
322 | },
323 | {
324 | "cell_type": "code",
325 | "execution_count": null,
326 | "metadata": {
327 | "collapsed": false,
328 | "jupyter": {
329 | "outputs_hidden": false
330 | }
331 | },
332 | "outputs": [],
333 | "source": [
334 | "plt.figure(figsize=(8,5))\n",
335 | "x_data, y_data = (df[\"Year\"].values, df[\"Value\"].values)\n",
336 | "plt.plot(x_data, y_data, 'ro')\n",
337 | "plt.ylabel('GDP')\n",
338 | "plt.xlabel('Year')\n",
339 | "plt.show()"
340 | ]
341 | },
342 | {
343 | "cell_type": "markdown",
344 | "metadata": {},
345 | "source": [
346 | "### Choosing a model ###\n",
347 | "\n",
348 | "From an initial look at the plot, we determine that the logistic function could be a good approximation,\n",
349 | "since it has the property of starting with a slow growth, increasing growth in the middle, and then decreasing again at the end; as illustrated below:"
350 | ]
351 | },
352 | {
353 | "cell_type": "code",
354 | "execution_count": null,
355 | "metadata": {
356 | "collapsed": false,
357 | "jupyter": {
358 | "outputs_hidden": false
359 | }
360 | },
361 | "outputs": [],
362 | "source": [
363 | "X = np.arange(-5.0, 5.0, 0.1)\n",
364 | "Y = 1.0 / (1.0 + np.exp(-X))\n",
365 | "\n",
366 | "plt.plot(X,Y) \n",
367 | "plt.ylabel('Dependent Variable')\n",
368 | "plt.xlabel('Indepdendent Variable')\n",
369 | "plt.show()"
370 | ]
371 | },
372 | {
373 | "cell_type": "markdown",
374 | "metadata": {},
375 | "source": [
376 | "\n",
377 | "\n",
378 | "The formula for the logistic function is the following:\n",
379 | "\n",
380 | "$$ \\hat{Y} = \\frac1{1+e^{\\beta_1(X-\\beta_2)}}$$\n",
381 | "\n",
382 | "$\\beta_1$: Controls the curve's steepness,\n",
383 | "\n",
384 | "$\\beta_2$: Slides the curve on the x-axis."
385 | ]
386 | },
387 | {
388 | "cell_type": "markdown",
389 | "metadata": {},
390 | "source": [
391 | "### Building The Model ###\n",
392 | "Now, let's build our regression model and initialize its parameters. "
393 | ]
394 | },
395 | {
396 | "cell_type": "code",
397 | "execution_count": null,
398 | "metadata": {},
399 | "outputs": [],
400 | "source": [
401 | "def sigmoid(x, Beta_1, Beta_2):\n",
402 | " y = 1 / (1 + np.exp(-Beta_1*(x-Beta_2)))\n",
403 | " return y"
404 | ]
405 | },
406 | {
407 | "cell_type": "markdown",
408 | "metadata": {},
409 | "source": [
410 | "Lets look at a sample sigmoid line that might fit with the data:"
411 | ]
412 | },
413 | {
414 | "cell_type": "code",
415 | "execution_count": null,
416 | "metadata": {
417 | "collapsed": false,
418 | "jupyter": {
419 | "outputs_hidden": false
420 | }
421 | },
422 | "outputs": [],
423 | "source": [
424 | "beta_1 = 0.10\n",
425 | "beta_2 = 1990.0\n",
426 | "\n",
427 | "#logistic function\n",
428 | "Y_pred = sigmoid(x_data, beta_1 , beta_2)\n",
429 | "\n",
430 | "#plot initial prediction against datapoints\n",
431 | "plt.plot(x_data, Y_pred*15000000000000.)\n",
432 | "plt.plot(x_data, y_data, 'ro')"
433 | ]
434 | },
435 | {
436 | "cell_type": "markdown",
437 | "metadata": {},
438 | "source": [
439 | "Our task here is to find the best parameters for our model. Lets first normalize our x and y:"
440 | ]
441 | },
442 | {
443 | "cell_type": "code",
444 | "execution_count": null,
445 | "metadata": {},
446 | "outputs": [],
447 | "source": [
448 | "# Lets normalize our data\n",
449 | "xdata =x_data/max(x_data)\n",
450 | "ydata =y_data/max(y_data)"
451 | ]
452 | },
453 | {
454 | "cell_type": "markdown",
455 | "metadata": {},
456 | "source": [
457 | "#### How we find the best parameters for our fit line?\n",
458 | "we can use __curve_fit__ which uses non-linear least squares to fit our sigmoid function, to data. Optimal values for the parameters so that the sum of the squared residuals of sigmoid(xdata, *popt) - ydata is minimized.\n",
459 | "\n",
460 | "popt are our optimized parameters."
461 | ]
462 | },
463 | {
464 | "cell_type": "code",
465 | "execution_count": null,
466 | "metadata": {},
467 | "outputs": [],
468 | "source": [
469 | "from scipy.optimize import curve_fit\n",
470 | "popt, pcov = curve_fit(sigmoid, xdata, ydata)\n",
471 | "#print the final parameters\n",
472 | "print(\" beta_1 = %f, beta_2 = %f\" % (popt[0], popt[1]))"
473 | ]
474 | },
475 | {
476 | "cell_type": "markdown",
477 | "metadata": {},
478 | "source": [
479 | "Now we plot our resulting regression model."
480 | ]
481 | },
482 | {
483 | "cell_type": "code",
484 | "execution_count": null,
485 | "metadata": {},
486 | "outputs": [],
487 | "source": [
488 | "x = np.linspace(1960, 2015, 55)\n",
489 | "x = x/max(x)\n",
490 | "plt.figure(figsize=(8,5))\n",
491 | "y = sigmoid(x, *popt)\n",
492 | "plt.plot(xdata, ydata, 'ro', label='data')\n",
493 | "plt.plot(x,y, linewidth=3.0, label='fit')\n",
494 | "plt.legend(loc='best')\n",
495 | "plt.ylabel('GDP')\n",
496 | "plt.xlabel('Year')\n",
497 | "plt.show()"
498 | ]
499 | },
500 | {
501 | "cell_type": "markdown",
502 | "metadata": {},
503 | "source": [
504 | "## Practice\n",
505 | "Can you calculate what is the accuracy of our model?"
506 | ]
507 | },
508 | {
509 | "cell_type": "code",
510 | "execution_count": null,
511 | "metadata": {},
512 | "outputs": [],
513 | "source": [
514 | "# write your code here\n",
515 | "\n",
516 | "\n"
517 | ]
518 | },
519 | {
520 | "cell_type": "markdown",
521 | "metadata": {},
522 | "source": [
523 | "Double-click __here__ for the solution.\n",
524 | "\n",
525 | ""
547 | ]
548 | },
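{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible approach (a sketch, not the official solution; it reuses the normalized `xdata`/`ydata` and the `sigmoid` function defined above):\n",
"\n",
"```python\n",
"from sklearn.metrics import r2_score\n",
"\n",
"# split the normalized data into train/test sets\n",
"msk = np.random.rand(len(df)) < 0.8\n",
"train_x, test_x = xdata[msk], xdata[~msk]\n",
"train_y, test_y = ydata[msk], ydata[~msk]\n",
"\n",
"# fit on the training set, then predict on the test set\n",
"popt, pcov = curve_fit(sigmoid, train_x, train_y)\n",
"y_hat = sigmoid(test_x, *popt)\n",
"\n",
"print(\"Mean absolute error: %.2f\" % np.mean(np.absolute(y_hat - test_y)))\n",
"print(\"Mean squared error (MSE): %.2f\" % np.mean((y_hat - test_y) ** 2))\n",
"print(\"R2-score: %.2f\" % r2_score(test_y, y_hat))\n",
"```"
]
},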
549 | {
550 | "cell_type": "markdown",
551 | "metadata": {},
552 | "source": [
553 | "
Want to learn more?
\n",
554 | "\n",
555 | "IBM SPSS Modeler is a comprehensive analytics platform that has many machine learning algorithms. It has been designed to bring predictive intelligence to decisions made by individuals, by groups, by systems – by your enterprise as a whole. A free trial is available through this course, available here: SPSS Modeler\n",
556 | "\n",
557 | "Also, you can use Watson Studio to run these notebooks faster with bigger datasets. Watson Studio is IBM's leading cloud solution for data scientists, built by data scientists. With Jupyter notebooks, RStudio, Apache Spark and popular libraries pre-packaged in the cloud, Watson Studio enables data scientists to collaborate on their projects without having to install anything. Join the fast-growing community of Watson Studio users today with a free account at Watson Studio\n",
558 | "\n",
559 | "
Saeed Aghabozorgi, PhD is a Data Scientist in IBM with a track record of developing enterprise level applications that substantially increases clients’ ability to turn data into actionable knowledge. He is a researcher in data mining field and expert in developing advanced analytic methods like machine learning and statistical modelling on large datasets.
"
567 | ]
568 | }
569 | ],
570 | "metadata": {
571 | "kernelspec": {
572 | "display_name": "Python 3",
573 | "language": "python",
574 | "name": "python3"
575 | },
576 | "language_info": {
577 | "codemirror_mode": {
578 | "name": "ipython",
579 | "version": 3
580 | },
581 | "file_extension": ".py",
582 | "mimetype": "text/x-python",
583 | "name": "python",
584 | "nbconvert_exporter": "python",
585 | "pygments_lexer": "ipython3",
586 | "version": "3.6.7"
587 | }
588 | },
589 | "nbformat": 4,
590 | "nbformat_minor": 4
591 | }
592 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | ## Assignment Instructions:
2 |
3 | Now that you have been equipped with the skills to use different Machine Learning algorithms, over the course of five weeks, you will have the opportunity to practice and apply them to a dataset. In this project, you will complete a notebook where you will build a classifier to predict whether a loan case will be paid off or not.
4 |
5 | You will load a historical dataset from previous loan applications, clean the data, and apply different classification algorithms to the data. You are expected to use the following algorithms to build your models:
6 |
7 | * k-Nearest Neighbour
8 | * Decision Tree
9 | * Support Vector Machine
10 | * Logistic Regression
11 |
12 | The results are reported as the accuracy of each classifier, using the following metrics when they are applicable:
13 |
14 | * Jaccard index
15 | * F1-score
16 | * LogLoss
17 | ------------
18 | ## Setup Instructions:
19 | ### A-Create an account in Watson Studio if you don't have one (If you already have it, jump to step B).
20 |
21 | * Browse into https://www.ibm.com/cloud/watson-studio
22 | * Click on 'Start your free trial'
23 | * Enter your email, and click 'Next'
24 | * Enter your Name, and choose a Password. Then click on 'Create Account'
25 | * Go to your email, and confirm your account.
26 | * Click on 'Proceed'
27 | * In "Select Organization and Space" form, leave everything as default, and click on 'Continue'
28 | * It is done. Click on 'Get started!'
29 |
30 | ### B-Sign in into Watson Studio and import your notebook
31 |
32 | * Sign in into https://www.ibm.com/cloud/watson-studio
33 | * Click on 'New Project'
34 | * Select 'Data Science' as type of project.
35 | * Give a name to your project, and a description for your reference, then set up your project as follows and click "Create".
36 |
37 | > Notice 1: because you are going to share this project with your peer for evaluation, please make sure you have unchecked `Restrict who can be a collaborator`
38 |
39 | > Notice 2: You have to create an IBM Object Storage instance if you don't have one (you can use the free Lite plan)
40 |
41 | * From the top-right, click on 'Add to project', and then select 'Notebook'.
42 |
43 | * In the 'New notebook' form, click on 'From URL', and enter the Notebook URL: https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/ML0101EN-Proj-Loan-py-v1.ipynb
44 |
45 | * Give the notebook a proper name and description and click on `Create Notebook` to initialize the notebook
46 |
47 | ### C-Complete the notebook
48 |
49 | * Start running the notebook
50 | * Complete the notebook based on the description in the notebook.
51 |
--------------------------------------------------------------------------------
/Recommender System/ML0101EN-RecSys-Collaborative-Filtering-movies-py-v1.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {
6 | "button": false,
7 | "deletable": true,
8 | "new_sheet": false,
9 | "run_control": {
10 | "read_only": false
11 | }
12 | },
13 | "source": [
14 | "\n",
15 | "\n",
16 | "
COLLABORATIVE FILTERING
"
17 | ]
18 | },
19 | {
20 | "cell_type": "markdown",
21 | "metadata": {
22 | "button": false,
23 | "deletable": true,
24 | "new_sheet": false,
25 | "run_control": {
26 | "read_only": false
27 | }
28 | },
29 | "source": [
30 | "Recommendation systems are a collection of algorithms used to recommend items to users based on information taken from the user. These systems have become ubiquitous can be commonly seen in online stores, movies databases and job finders. In this notebook, we will explore recommendation systems based on Collaborative Filtering and implement simple version of one using Python and the Pandas library."
31 | ]
32 | },
33 | {
34 | "cell_type": "markdown",
35 | "metadata": {
36 | "button": false,
37 | "deletable": true,
38 | "new_sheet": false,
39 | "run_control": {
40 | "read_only": false
41 | }
42 | },
43 | "source": [
44 | "
\n",
53 | " \n",
54 | ""
55 | ]
56 | },
57 | {
58 | "cell_type": "markdown",
59 | "metadata": {
60 | "button": false,
61 | "deletable": true,
62 | "new_sheet": false,
63 | "run_control": {
64 | "read_only": false
65 | }
66 | },
67 | "source": [
68 | "\n",
69 | "\n",
70 | "\n",
71 | "# Acquiring the Data"
72 | ]
73 | },
74 | {
75 | "cell_type": "markdown",
76 | "metadata": {
77 | "button": false,
78 | "deletable": true,
79 | "new_sheet": false,
80 | "run_control": {
81 | "read_only": false
82 | }
83 | },
84 | "source": [
85 | "To acquire and extract the data, simply run the following Bash scripts: \n",
86 | "Dataset acquired from [GroupLens](http://grouplens.org/datasets/movielens/). Lets download the dataset. To download the data, we will use **`!wget`** to download it from IBM Object Storage. \n",
87 | "__Did you know?__ When it comes to Machine Learning, you will likely be working with large datasets. As a business, where can you host your data? IBM is offering a unique opportunity for businesses, with 10 Tb of IBM Cloud Object Storage: [Sign up now for free](http://cocl.us/ML0101EN-IBM-Offer-CC)"
88 | ]
89 | },
90 | {
91 | "cell_type": "code",
92 | "execution_count": null,
93 | "metadata": {
94 | "button": false,
95 | "collapsed": false,
96 | "deletable": true,
97 | "jupyter": {
98 | "outputs_hidden": false
99 | },
100 | "new_sheet": false,
101 | "run_control": {
102 | "read_only": false
103 | }
104 | },
105 | "outputs": [],
106 | "source": [
107 | "!wget -O moviedataset.zip https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/moviedataset.zip\n",
108 | "print('unziping ...')\n",
109 | "!unzip -o -j moviedataset.zip "
110 | ]
111 | },
112 | {
113 | "cell_type": "markdown",
114 | "metadata": {
115 | "button": false,
116 | "deletable": true,
117 | "new_sheet": false,
118 | "run_control": {
119 | "read_only": false
120 | }
121 | },
122 | "source": [
123 | "Now you're ready to start working with the data!"
124 | ]
125 | },
126 | {
127 | "cell_type": "markdown",
128 | "metadata": {
129 | "button": false,
130 | "deletable": true,
131 | "new_sheet": false,
132 | "run_control": {
133 | "read_only": false
134 | }
135 | },
136 | "source": [
137 | "\n",
138 | "\n",
139 | "\n",
140 | "# Preprocessing"
141 | ]
142 | },
143 | {
144 | "cell_type": "markdown",
145 | "metadata": {
146 | "button": false,
147 | "deletable": true,
148 | "new_sheet": false,
149 | "run_control": {
150 | "read_only": false
151 | }
152 | },
153 | "source": [
154 | "First, let's get all of the imports out of the way:"
155 | ]
156 | },
157 | {
158 | "cell_type": "code",
159 | "execution_count": null,
160 | "metadata": {
161 | "button": false,
162 | "collapsed": false,
163 | "deletable": true,
164 | "jupyter": {
165 | "outputs_hidden": false
166 | },
167 | "new_sheet": false,
168 | "run_control": {
169 | "read_only": false
170 | }
171 | },
172 | "outputs": [],
173 | "source": [
174 | "#Dataframe manipulation library\n",
175 | "import pandas as pd\n",
176 | "#Math functions, we'll only need the sqrt function so let's import only that\n",
177 | "from math import sqrt\n",
178 | "import numpy as np\n",
179 | "import matplotlib.pyplot as plt\n",
180 | "%matplotlib inline"
181 | ]
182 | },
183 | {
184 | "cell_type": "markdown",
185 | "metadata": {
186 | "button": false,
187 | "deletable": true,
188 | "new_sheet": false,
189 | "run_control": {
190 | "read_only": false
191 | }
192 | },
193 | "source": [
194 | "Now let's read each file into their Dataframes:"
195 | ]
196 | },
197 | {
198 | "cell_type": "code",
199 | "execution_count": null,
200 | "metadata": {
201 | "button": false,
202 | "collapsed": false,
203 | "deletable": true,
204 | "jupyter": {
205 | "outputs_hidden": false
206 | },
207 | "new_sheet": false,
208 | "run_control": {
209 | "read_only": false
210 | }
211 | },
212 | "outputs": [],
213 | "source": [
214 | "#Storing the movie information into a pandas dataframe\n",
215 | "movies_df = pd.read_csv('movies.csv')\n",
216 | "#Storing the user information into a pandas dataframe\n",
217 | "ratings_df = pd.read_csv('ratings.csv')"
218 | ]
219 | },
220 | {
221 | "cell_type": "markdown",
222 | "metadata": {
223 | "button": false,
224 | "deletable": true,
225 | "new_sheet": false,
226 | "run_control": {
227 | "read_only": false
228 | }
229 | },
230 | "source": [
231 | "Let's also take a peek at how each of them are organized:"
232 | ]
233 | },
234 | {
235 | "cell_type": "code",
236 | "execution_count": null,
237 | "metadata": {
238 | "button": false,
239 | "collapsed": false,
240 | "deletable": true,
241 | "jupyter": {
242 | "outputs_hidden": false
243 | },
244 | "new_sheet": false,
245 | "run_control": {
246 | "read_only": false
247 | }
248 | },
249 | "outputs": [],
250 | "source": [
251 | "#Head is a function that gets the first N rows of a dataframe. N's default is 5.\n",
252 | "movies_df.head()"
253 | ]
254 | },
255 | {
256 | "cell_type": "markdown",
257 | "metadata": {
258 | "button": false,
259 | "deletable": true,
260 | "new_sheet": false,
261 | "run_control": {
262 | "read_only": false
263 | }
264 | },
265 | "source": [
266 | "So each movie has a unique ID, a title with its release year along with it (Which may contain unicode characters) and several different genres in the same field. Let's remove the year from the title column and place it into its own one by using the handy [extract](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.extract.html#pandas.Series.str.extract) function that Pandas has."
267 | ]
268 | },
269 | {
270 | "cell_type": "markdown",
271 | "metadata": {
272 | "button": false,
273 | "deletable": true,
274 | "new_sheet": false,
275 | "run_control": {
276 | "read_only": false
277 | }
278 | },
279 | "source": [
280 | "Let's remove the year from the __title__ column by using pandas' replace function and store in a new __year__ column."
281 | ]
282 | },
283 | {
284 | "cell_type": "code",
285 | "execution_count": null,
286 | "metadata": {
287 | "button": false,
288 | "collapsed": false,
289 | "deletable": true,
290 | "jupyter": {
291 | "outputs_hidden": false
292 | },
293 | "new_sheet": false,
294 | "run_control": {
295 | "read_only": false
296 | }
297 | },
298 | "outputs": [],
299 | "source": [
300 | "#Using regular expressions to find a year stored between parentheses\n",
301 | "#We specify the parantheses so we don't conflict with movies that have years in their titles\n",
302 | "movies_df['year'] = movies_df.title.str.extract('(\\(\\d\\d\\d\\d\\))',expand=False)\n",
303 | "#Removing the parentheses\n",
304 | "movies_df['year'] = movies_df.year.str.extract('(\\d\\d\\d\\d)',expand=False)\n",
305 | "#Removing the years from the 'title' column\n",
306 | "movies_df['title'] = movies_df.title.str.replace('(\\(\\d\\d\\d\\d\\))', '')\n",
307 | "#Applying the strip function to get rid of any ending whitespace characters that may have appeared\n",
308 | "movies_df['title'] = movies_df['title'].apply(lambda x: x.strip())"
309 | ]
310 | },
311 | {
312 | "cell_type": "markdown",
313 | "metadata": {
314 | "button": false,
315 | "deletable": true,
316 | "new_sheet": false,
317 | "run_control": {
318 | "read_only": false
319 | }
320 | },
321 | "source": [
322 | "Let's look at the result!"
323 | ]
324 | },
325 | {
326 | "cell_type": "code",
327 | "execution_count": null,
328 | "metadata": {
329 | "button": false,
330 | "collapsed": false,
331 | "deletable": true,
332 | "jupyter": {
333 | "outputs_hidden": false
334 | },
335 | "new_sheet": false,
336 | "run_control": {
337 | "read_only": false
338 | }
339 | },
340 | "outputs": [],
341 | "source": [
342 | "movies_df.head()"
343 | ]
344 | },
345 | {
346 | "cell_type": "markdown",
347 | "metadata": {
348 | "button": false,
349 | "deletable": true,
350 | "new_sheet": false,
351 | "run_control": {
352 | "read_only": false
353 | }
354 | },
355 | "source": [
356 | "With that, let's also drop the genres column since we won't need it for this particular recommendation system."
357 | ]
358 | },
359 | {
360 | "cell_type": "code",
361 | "execution_count": null,
362 | "metadata": {
363 | "button": false,
364 | "collapsed": false,
365 | "deletable": true,
366 | "jupyter": {
367 | "outputs_hidden": false
368 | },
369 | "new_sheet": false,
370 | "run_control": {
371 | "read_only": false
372 | }
373 | },
374 | "outputs": [],
375 | "source": [
376 | "#Dropping the genres column\n",
377 | "movies_df = movies_df.drop('genres', 1)"
378 | ]
379 | },
380 | {
381 | "cell_type": "markdown",
382 | "metadata": {
383 | "button": false,
384 | "deletable": true,
385 | "new_sheet": false,
386 | "run_control": {
387 | "read_only": false
388 | }
389 | },
390 | "source": [
391 | "Here's the final movies dataframe:"
392 | ]
393 | },
394 | {
395 | "cell_type": "code",
396 | "execution_count": null,
397 | "metadata": {
398 | "button": false,
399 | "collapsed": false,
400 | "deletable": true,
401 | "jupyter": {
402 | "outputs_hidden": false
403 | },
404 | "new_sheet": false,
405 | "run_control": {
406 | "read_only": false
407 | }
408 | },
409 | "outputs": [],
410 | "source": [
411 | "movies_df.head()"
412 | ]
413 | },
414 | {
415 | "cell_type": "markdown",
416 | "metadata": {
417 | "button": false,
418 | "deletable": true,
419 | "new_sheet": false,
420 | "run_control": {
421 | "read_only": false
422 | }
423 | },
424 | "source": [
425 | " "
426 | ]
427 | },
428 | {
429 | "cell_type": "markdown",
430 | "metadata": {
431 | "button": false,
432 | "deletable": true,
433 | "new_sheet": false,
434 | "run_control": {
435 | "read_only": false
436 | }
437 | },
438 | "source": [
439 | "Next, let's look at the ratings dataframe."
440 | ]
441 | },
442 | {
443 | "cell_type": "code",
444 | "execution_count": null,
445 | "metadata": {
446 | "button": false,
447 | "collapsed": false,
448 | "deletable": true,
449 | "jupyter": {
450 | "outputs_hidden": false
451 | },
452 | "new_sheet": false,
453 | "run_control": {
454 | "read_only": false
455 | }
456 | },
457 | "outputs": [],
458 | "source": [
459 | "ratings_df.head()"
460 | ]
461 | },
462 | {
463 | "cell_type": "markdown",
464 | "metadata": {
465 | "button": false,
466 | "deletable": true,
467 | "new_sheet": false,
468 | "run_control": {
469 | "read_only": false
470 | }
471 | },
472 | "source": [
473 | "Every row in the ratings dataframe has a user id associated with at least one movie, a rating and a timestamp showing when they reviewed it. We won't be needing the timestamp column, so let's drop it to save on memory."
474 | ]
475 | },
476 | {
477 | "cell_type": "code",
478 | "execution_count": null,
479 | "metadata": {
480 | "button": false,
481 | "collapsed": false,
482 | "deletable": true,
483 | "jupyter": {
484 | "outputs_hidden": false
485 | },
486 | "new_sheet": false,
487 | "run_control": {
488 | "read_only": false
489 | }
490 | },
491 | "outputs": [],
492 | "source": [
493 | "#Drop removes a specified row or column from a dataframe\n",
494 | "ratings_df = ratings_df.drop('timestamp', 1)"
495 | ]
496 | },
497 | {
498 | "cell_type": "markdown",
499 | "metadata": {
500 | "button": false,
501 | "deletable": true,
502 | "new_sheet": false,
503 | "run_control": {
504 | "read_only": false
505 | }
506 | },
507 | "source": [
508 | "Here's how the final ratings Dataframe looks like:"
509 | ]
510 | },
511 | {
512 | "cell_type": "code",
513 | "execution_count": null,
514 | "metadata": {
515 | "button": false,
516 | "collapsed": false,
517 | "deletable": true,
518 | "jupyter": {
519 | "outputs_hidden": false
520 | },
521 | "new_sheet": false,
522 | "run_control": {
523 | "read_only": false
524 | },
525 | "scrolled": true
526 | },
527 | "outputs": [],
528 | "source": [
529 | "ratings_df.head()"
530 | ]
531 | },
532 | {
533 | "cell_type": "markdown",
534 | "metadata": {
535 | "button": false,
536 | "deletable": true,
537 | "new_sheet": false,
538 | "run_control": {
539 | "read_only": false
540 | }
541 | },
542 | "source": [
543 | "\n",
544 | "\n",
545 | "\n",
546 | "# Collaborative Filtering"
547 | ]
548 | },
549 | {
550 | "cell_type": "markdown",
551 | "metadata": {
552 | "button": false,
553 | "deletable": true,
554 | "new_sheet": false,
555 | "run_control": {
556 | "read_only": false
557 | }
558 | },
559 | "source": [
560 | "Now, time to start our work on recommendation systems. \n",
561 | "\n",
562 | "The first technique we're going to take a look at is called __Collaborative Filtering__, which is also known as __User-User Filtering__. As hinted by its alternate name, this technique uses other users to recommend items to the input user. It attempts to find users that have similar preferences and opinions as the input and then recommends items that they have liked to the input. There are several methods of finding similar users (Even some making use of Machine Learning), and the one we will be using here is going to be based on the __Pearson Correlation Function__.\n",
563 | "\n",
564 | "\n",
565 | "\n",
566 | "\n",
567 | "The process for creating a User Based recommendation system is as follows:\n",
568 | "- Select a user with the movies the user has watched\n",
569 | "- Based on his rating to movies, find the top X neighbours \n",
570 | "- Get the watched movie record of the user for each neighbour.\n",
571 | "- Calculate a similarity score using some formula\n",
572 | "- Recommend the items with the highest score\n",
573 | "\n",
574 | "\n",
575 | "Let's begin by creating an input user to recommend movies to:\n",
576 | "\n",
577 | "Notice: To add more movies, simply increase the amount of elements in the userInput. Feel free to add more in! Just be sure to write it in with capital letters and if a movie starts with a \"The\", like \"The Matrix\" then write it in like this: 'Matrix, The' ."
578 | ]
579 | },
580 | {
581 | "cell_type": "code",
582 | "execution_count": null,
583 | "metadata": {
584 | "button": false,
585 | "collapsed": false,
586 | "deletable": true,
587 | "jupyter": {
588 | "outputs_hidden": false
589 | },
590 | "new_sheet": false,
591 | "run_control": {
592 | "read_only": false
593 | }
594 | },
595 | "outputs": [],
596 | "source": [
597 | "userInput = [\n",
598 | " {'title':'Breakfast Club, The', 'rating':5},\n",
599 | " {'title':'Toy Story', 'rating':3.5},\n",
600 | " {'title':'Jumanji', 'rating':2},\n",
601 | " {'title':\"Pulp Fiction\", 'rating':5},\n",
602 | " {'title':'Akira', 'rating':4.5}\n",
603 | " ] \n",
604 | "inputMovies = pd.DataFrame(userInput)\n",
605 | "inputMovies"
606 | ]
607 | },
608 | {
609 | "cell_type": "markdown",
610 | "metadata": {
611 | "button": false,
612 | "deletable": true,
613 | "new_sheet": false,
614 | "run_control": {
615 | "read_only": false
616 | }
617 | },
618 | "source": [
619 | "#### Add movieId to input user\n",
620 | "With the input complete, let's extract the input movies's ID's from the movies dataframe and add them into it.\n",
621 | "\n",
622 | "We can achieve this by first filtering out the rows that contain the input movies' title and then merging this subset with the input dataframe. We also drop unnecessary columns for the input to save memory space."
623 | ]
624 | },
625 | {
626 | "cell_type": "code",
627 | "execution_count": null,
628 | "metadata": {
629 | "button": false,
630 | "collapsed": false,
631 | "deletable": true,
632 | "jupyter": {
633 | "outputs_hidden": false
634 | },
635 | "new_sheet": false,
636 | "run_control": {
637 | "read_only": false
638 | },
639 | "scrolled": true
640 | },
641 | "outputs": [],
642 | "source": [
643 | "#Filtering out the movies by title\n",
644 | "inputId = movies_df[movies_df['title'].isin(inputMovies['title'].tolist())]\n",
645 | "#Then merging it so we can get the movieId. It's implicitly merging it by title.\n",
646 | "inputMovies = pd.merge(inputId, inputMovies)\n",
647 | "#Dropping information we won't use from the input dataframe\n",
648 | "inputMovies = inputMovies.drop('year', 1)\n",
649 | "#Final input dataframe\n",
650 | "#If a movie you added in above isn't here, then it might not be in the original \n",
651 | "#dataframe or it might spelled differently, please check capitalisation.\n",
652 | "inputMovies"
653 | ]
654 | },
655 | {
656 | "cell_type": "markdown",
657 | "metadata": {
658 | "button": false,
659 | "deletable": true,
660 | "new_sheet": false,
661 | "run_control": {
662 | "read_only": false
663 | }
664 | },
665 | "source": [
666 | "#### The users who has seen the same movies\n",
667 | "Now with the movie ID's in our input, we can now get the subset of users that have watched and reviewed the movies in our input.\n"
668 | ]
669 | },
670 | {
671 | "cell_type": "code",
672 | "execution_count": null,
673 | "metadata": {
674 | "button": false,
675 | "collapsed": false,
676 | "deletable": true,
677 | "jupyter": {
678 | "outputs_hidden": false
679 | },
680 | "new_sheet": false,
681 | "run_control": {
682 | "read_only": false
683 | }
684 | },
685 | "outputs": [],
686 | "source": [
687 | "#Filtering out users that have watched movies that the input has watched and storing it\n",
688 | "userSubset = ratings_df[ratings_df['movieId'].isin(inputMovies['movieId'].tolist())]\n",
689 | "userSubset.head()"
690 | ]
691 | },
692 | {
693 | "cell_type": "markdown",
694 | "metadata": {
695 | "button": false,
696 | "deletable": true,
697 | "new_sheet": false,
698 | "run_control": {
699 | "read_only": false
700 | }
701 | },
702 | "source": [
703 | "We now group up the rows by user ID."
704 | ]
705 | },
706 | {
707 | "cell_type": "code",
708 | "execution_count": null,
709 | "metadata": {
710 | "button": false,
711 | "collapsed": false,
712 | "deletable": true,
713 | "jupyter": {
714 | "outputs_hidden": false
715 | },
716 | "new_sheet": false,
717 | "run_control": {
718 | "read_only": false
719 | }
720 | },
721 | "outputs": [],
722 | "source": [
723 | "#Groupby creates several sub dataframes where they all have the same value in the column specified as the parameter\n",
724 | "userSubsetGroup = userSubset.groupby(['userId'])"
725 | ]
726 | },
727 | {
728 | "cell_type": "markdown",
729 | "metadata": {
730 | "button": false,
731 | "deletable": true,
732 | "new_sheet": false,
733 | "run_control": {
734 | "read_only": false
735 | }
736 | },
737 | "source": [
738 | "lets look at one of the users, e.g. the one with userID=1130"
739 | ]
740 | },
741 | {
742 | "cell_type": "code",
743 | "execution_count": null,
744 | "metadata": {
745 | "button": false,
746 | "collapsed": false,
747 | "deletable": true,
748 | "jupyter": {
749 | "outputs_hidden": false
750 | },
751 | "new_sheet": false,
752 | "run_control": {
753 | "read_only": false
754 | }
755 | },
756 | "outputs": [],
757 | "source": [
758 | "userSubsetGroup.get_group(1130)"
759 | ]
760 | },
761 | {
762 | "cell_type": "markdown",
763 | "metadata": {
764 | "button": false,
765 | "deletable": true,
766 | "new_sheet": false,
767 | "run_control": {
768 | "read_only": false
769 | }
770 | },
771 | "source": [
772 | "Let's also sort these groups so the users that share the most movies in common with the input have higher priority. This provides a richer recommendation since we won't go through every single user."
773 | ]
774 | },
775 | {
776 | "cell_type": "code",
777 | "execution_count": null,
778 | "metadata": {
779 | "button": false,
780 | "collapsed": false,
781 | "deletable": true,
782 | "jupyter": {
783 | "outputs_hidden": false
784 | },
785 | "new_sheet": false,
786 | "run_control": {
787 | "read_only": false
788 | }
789 | },
790 | "outputs": [],
791 | "source": [
792 | "#Sorting it so users with movie most in common with the input will have priority\n",
793 | "userSubsetGroup = sorted(userSubsetGroup, key=lambda x: len(x[1]), reverse=True)"
794 | ]
795 | },
796 | {
797 | "cell_type": "markdown",
798 | "metadata": {
799 | "button": false,
800 | "deletable": true,
801 | "new_sheet": false,
802 | "run_control": {
803 | "read_only": false
804 | }
805 | },
806 | "source": [
807 | "Now lets look at the first user"
808 | ]
809 | },
810 | {
811 | "cell_type": "code",
812 | "execution_count": null,
813 | "metadata": {
814 | "button": false,
815 | "collapsed": false,
816 | "deletable": true,
817 | "jupyter": {
818 | "outputs_hidden": false
819 | },
820 | "new_sheet": false,
821 | "run_control": {
822 | "read_only": false
823 | }
824 | },
825 | "outputs": [],
826 | "source": [
827 | "userSubsetGroup[0:3]"
828 | ]
829 | },
830 | {
831 | "cell_type": "markdown",
832 | "metadata": {
833 | "button": false,
834 | "deletable": true,
835 | "new_sheet": false,
836 | "run_control": {
837 | "read_only": false
838 | }
839 | },
840 | "source": [
841 | "#### Similarity of users to input user\n",
842 | "Next, we are going to compare all users (not really all !!!) to our specified user and find the one that is most similar. \n",
843 | "we're going to find out how similar each user is to the input through the __Pearson Correlation Coefficient__. It is used to measure the strength of a linear association between two variables. The formula for finding this coefficient between sets X and Y with N values can be seen in the image below. \n",
844 | "\n",
845 | "Why Pearson Correlation?\n",
846 | "\n",
847 | "Pearson correlation is invariant to scaling, i.e. multiplying all elements by a nonzero constant or adding any constant to all elements. For example, if you have two vectors X and Y,then, pearson(X, Y) == pearson(X, 2 * Y + 3). This is a pretty important property in recommendation systems because for example two users might rate two series of items totally different in terms of absolute rates, but they would be similar users (i.e. with similar ideas) with similar rates in various scales .\n",
848 | "\n",
849 | "\n",
850 | "\n",
851 | "The values given by the formula vary from r = -1 to r = 1, where 1 forms a direct correlation between the two entities (it means a perfect positive correlation) and -1 forms a perfect negative correlation. \n",
852 | "\n",
853 | "In our case, a 1 means that the two users have similar tastes while a -1 means the opposite."
854 | ]
855 | },
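{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick check of the scale-invariance claim above, here is a minimal sketch using only Python's standard library and made-up rating vectors (the numbers are illustrative, not from the dataset). It uses the same Sxx/Syy/Sxy form of the coefficient that the main loop further below uses:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from math import sqrt\n",
"\n",
"def pearson(x, y):\n",
"    #Same computational form as the main loop below: Sxy / sqrt(Sxx * Syy)\n",
"    n = float(len(x))\n",
"    Sxx = sum(i**2 for i in x) - sum(x)**2 / n\n",
"    Syy = sum(i**2 for i in y) - sum(y)**2 / n\n",
"    Sxy = sum(i*j for i, j in zip(x, y)) - sum(x)*sum(y) / n\n",
"    return Sxy / sqrt(Sxx*Syy) if Sxx != 0 and Syy != 0 else 0\n",
"\n",
"x = [5, 3.5, 2, 5, 4.5]    #hypothetical ratings from one user\n",
"y = [2*i + 3 for i in x]   #the same opinions on a stretched, shifted scale\n",
"print(pearson(x, y))       #prints 1.0: r is unchanged by positive scaling and shifting"
]
},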
856 | {
857 | "cell_type": "markdown",
858 | "metadata": {
859 | "button": false,
860 | "deletable": true,
861 | "new_sheet": false,
862 | "run_control": {
863 | "read_only": false
864 | }
865 | },
866 | "source": [
867 | "We will select a subset of users to iterate through. This limit is imposed because we don't want to waste too much time going through every single user."
868 | ]
869 | },
870 | {
871 | "cell_type": "code",
872 | "execution_count": null,
873 | "metadata": {
874 | "button": false,
875 | "collapsed": false,
876 | "deletable": true,
877 | "jupyter": {
878 | "outputs_hidden": false
879 | },
880 | "new_sheet": false,
881 | "run_control": {
882 | "read_only": false
883 | }
884 | },
885 | "outputs": [],
886 | "source": [
887 | "userSubsetGroup = userSubsetGroup[0:100]"
888 | ]
889 | },
890 | {
891 | "cell_type": "markdown",
892 | "metadata": {
893 | "button": false,
894 | "deletable": true,
895 | "new_sheet": false,
896 | "run_control": {
897 | "read_only": false
898 | }
899 | },
900 | "source": [
901 | "Now, we calculate the Pearson Correlation between input user and subset group, and store it in a dictionary, where the key is the user Id and the value is the coefficient\n"
902 | ]
903 | },
904 | {
905 | "cell_type": "code",
906 | "execution_count": null,
907 | "metadata": {
908 | "button": false,
909 | "collapsed": false,
910 | "deletable": true,
911 | "jupyter": {
912 | "outputs_hidden": false
913 | },
914 | "new_sheet": false,
915 | "run_control": {
916 | "read_only": false
917 | },
918 | "scrolled": true
919 | },
920 | "outputs": [],
921 | "source": [
922 | "#Store the Pearson Correlation in a dictionary, where the key is the user Id and the value is the coefficient\n",
923 | "pearsonCorrelationDict = {}\n",
924 | "\n",
925 | "#For every user group in our subset\n",
926 | "for name, group in userSubsetGroup:\n",
927 | " #Let's start by sorting the input and current user group so the values aren't mixed up later on\n",
928 | " group = group.sort_values(by='movieId')\n",
929 | " inputMovies = inputMovies.sort_values(by='movieId')\n",
930 | " #Get the N for the formula\n",
931 | " nRatings = len(group)\n",
932 | " #Get the review scores for the movies that they both have in common\n",
933 | " temp_df = inputMovies[inputMovies['movieId'].isin(group['movieId'].tolist())]\n",
934 | " #And then store them in a temporary buffer variable in a list format to facilitate future calculations\n",
935 | " tempRatingList = temp_df['rating'].tolist()\n",
936 | " #Let's also put the current user group reviews in a list format\n",
937 | " tempGroupList = group['rating'].tolist()\n",
938 | " #Now let's calculate the pearson correlation between two users, so called, x and y\n",
939 | " Sxx = sum([i**2 for i in tempRatingList]) - pow(sum(tempRatingList),2)/float(nRatings)\n",
940 | " Syy = sum([i**2 for i in tempGroupList]) - pow(sum(tempGroupList),2)/float(nRatings)\n",
941 | " Sxy = sum( i*j for i, j in zip(tempRatingList, tempGroupList)) - sum(tempRatingList)*sum(tempGroupList)/float(nRatings)\n",
942 | " \n",
943 | " #If the denominator is different than zero, then divide, else, 0 correlation.\n",
944 | " if Sxx != 0 and Syy != 0:\n",
945 | " pearsonCorrelationDict[name] = Sxy/sqrt(Sxx*Syy)\n",
946 | " else:\n",
947 | " pearsonCorrelationDict[name] = 0\n"
948 | ]
949 | },
950 | {
951 | "cell_type": "code",
952 | "execution_count": null,
953 | "metadata": {},
954 | "outputs": [],
955 | "source": [
956 | "pearsonCorrelationDict.items()"
957 | ]
958 | },
959 | {
960 | "cell_type": "code",
961 | "execution_count": null,
962 | "metadata": {},
963 | "outputs": [],
964 | "source": [
965 | "pearsonDF = pd.DataFrame.from_dict(pearsonCorrelationDict, orient='index')\n",
966 | "pearsonDF.columns = ['similarityIndex']\n",
967 | "pearsonDF['userId'] = pearsonDF.index\n",
968 | "pearsonDF.index = range(len(pearsonDF))\n",
969 | "pearsonDF.head()"
970 | ]
971 | },
972 | {
973 | "cell_type": "markdown",
974 | "metadata": {
975 | "button": false,
976 | "deletable": true,
977 | "new_sheet": false,
978 | "run_control": {
979 | "read_only": false
980 | }
981 | },
982 | "source": [
983 | "#### The top x similar users to input user\n",
984 | "Now let's get the top 50 users that are most similar to the input."
985 | ]
986 | },
987 | {
988 | "cell_type": "code",
989 | "execution_count": null,
990 | "metadata": {
991 | "button": false,
992 | "collapsed": false,
993 | "deletable": true,
994 | "jupyter": {
995 | "outputs_hidden": false
996 | },
997 | "new_sheet": false,
998 | "run_control": {
999 | "read_only": false
1000 | }
1001 | },
1002 | "outputs": [],
1003 | "source": [
1004 | "topUsers=pearsonDF.sort_values(by='similarityIndex', ascending=False)[0:50]\n",
1005 | "topUsers.head()"
1006 | ]
1007 | },
1008 | {
1009 | "cell_type": "markdown",
1010 | "metadata": {
1011 | "button": false,
1012 | "deletable": true,
1013 | "new_sheet": false,
1014 | "run_control": {
1015 | "read_only": false
1016 | }
1017 | },
1018 | "source": [
1019 | "Now, let's start recommending movies to the input user.\n",
1020 | "\n",
1021 | "#### Rating of selected users to all movies\n",
1022 | "We're going to do this by taking the weighted average of the ratings of the movies using the Pearson Correlation as the weight. But to do this, we first need to get the movies watched by the users in our __pearsonDF__ from the ratings dataframe and then store their correlation in a new column called _similarityIndex\". This is achieved below by merging of these two tables."
1023 | ]
1024 | },
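{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before the merge, to make the weighting concrete, here is a tiny hypothetical example (the similarities and ratings are made up, not taken from the dataset): two similar users with similarities 0.9 and 0.5 rated the same candidate movie 4.0 and 3.0, so its score is (0.9 × 4.0 + 0.5 × 3.0) / (0.9 + 0.5) ≈ 3.64."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Hypothetical similarities and ratings for one candidate movie (made-up numbers)\n",
"sims = [0.9, 0.5]\n",
"ratings = [4.0, 3.0]\n",
"#Similarity-weighted average: sum(sim * rating) / sum(sim)\n",
"score = sum(s*r for s, r in zip(sims, ratings)) / sum(sims)\n",
"print(round(score, 2))   #3.64"
]
},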
1025 | {
1026 | "cell_type": "code",
1027 | "execution_count": null,
1028 | "metadata": {
1029 | "button": false,
1030 | "collapsed": false,
1031 | "deletable": true,
1032 | "jupyter": {
1033 | "outputs_hidden": false
1034 | },
1035 | "new_sheet": false,
1036 | "run_control": {
1037 | "read_only": false
1038 | },
1039 | "scrolled": true
1040 | },
1041 | "outputs": [],
1042 | "source": [
1043 | "topUsersRating=topUsers.merge(ratings_df, left_on='userId', right_on='userId', how='inner')\n",
1044 | "topUsersRating.head()"
1045 | ]
1046 | },
1047 | {
1048 | "cell_type": "markdown",
1049 | "metadata": {
1050 | "button": false,
1051 | "deletable": true,
1052 | "new_sheet": false,
1053 | "run_control": {
1054 | "read_only": false
1055 | }
1056 | },
1057 | "source": [
1058 | "Now all we need to do is simply multiply the movie rating by its weight (The similarity index), then sum up the new ratings and divide it by the sum of the weights.\n",
1059 | "\n",
1060 | "We can easily do this by simply multiplying two columns, then grouping up the dataframe by movieId and then dividing two columns:\n",
1061 | "\n",
1062 | "It shows the idea of all similar users to candidate movies for the input user:"
1063 | ]
1064 | },
1065 | {
1066 | "cell_type": "code",
1067 | "execution_count": null,
1068 | "metadata": {
1069 | "button": false,
1070 | "collapsed": false,
1071 | "deletable": true,
1072 | "jupyter": {
1073 | "outputs_hidden": false
1074 | },
1075 | "new_sheet": false,
1076 | "run_control": {
1077 | "read_only": false
1078 | }
1079 | },
1080 | "outputs": [],
1081 | "source": [
1082 | "#Multiplies the similarity by the user's ratings\n",
1083 | "topUsersRating['weightedRating'] = topUsersRating['similarityIndex']*topUsersRating['rating']\n",
1084 | "topUsersRating.head()"
1085 | ]
1086 | },
1087 | {
1088 | "cell_type": "code",
1089 | "execution_count": null,
1090 | "metadata": {
1091 | "button": false,
1092 | "collapsed": false,
1093 | "deletable": true,
1094 | "jupyter": {
1095 | "outputs_hidden": false
1096 | },
1097 | "new_sheet": false,
1098 | "run_control": {
1099 | "read_only": false
1100 | }
1101 | },
1102 | "outputs": [],
1103 | "source": [
1104 | "#Applies a sum to the topUsers after grouping it up by userId\n",
1105 | "tempTopUsersRating = topUsersRating.groupby('movieId').sum()[['similarityIndex','weightedRating']]\n",
1106 | "tempTopUsersRating.columns = ['sum_similarityIndex','sum_weightedRating']\n",
1107 | "tempTopUsersRating.head()"
1108 | ]
1109 | },
1110 | {
1111 | "cell_type": "code",
1112 | "execution_count": null,
1113 | "metadata": {
1114 | "button": false,
1115 | "collapsed": false,
1116 | "deletable": true,
1117 | "jupyter": {
1118 | "outputs_hidden": false
1119 | },
1120 | "new_sheet": false,
1121 | "run_control": {
1122 | "read_only": false
1123 | }
1124 | },
1125 | "outputs": [],
1126 | "source": [
1127 | "#Creates an empty dataframe\n",
1128 | "recommendation_df = pd.DataFrame()\n",
1129 | "#Now we take the weighted average\n",
1130 | "recommendation_df['weighted average recommendation score'] = tempTopUsersRating['sum_weightedRating']/tempTopUsersRating['sum_similarityIndex']\n",
1131 | "recommendation_df['movieId'] = tempTopUsersRating.index\n",
1132 | "recommendation_df.head()"
1133 | ]
1134 | },
1135 | {
1136 | "cell_type": "markdown",
1137 | "metadata": {
1138 | "button": false,
1139 | "deletable": true,
1140 | "new_sheet": false,
1141 | "run_control": {
1142 | "read_only": false
1143 | }
1144 | },
1145 | "source": [
1146 | "Now let's sort it and see the top 20 movies that the algorithm recommended!"
1147 | ]
1148 | },
1149 | {
1150 | "cell_type": "code",
1151 | "execution_count": null,
1152 | "metadata": {
1153 | "button": false,
1154 | "collapsed": false,
1155 | "deletable": true,
1156 | "jupyter": {
1157 | "outputs_hidden": false
1158 | },
1159 | "new_sheet": false,
1160 | "run_control": {
1161 | "read_only": false
1162 | }
1163 | },
1164 | "outputs": [],
1165 | "source": [
1166 | "recommendation_df = recommendation_df.sort_values(by='weighted average recommendation score', ascending=False)\n",
1167 | "recommendation_df.head(10)"
1168 | ]
1169 | },
1170 | {
1171 | "cell_type": "code",
1172 | "execution_count": null,
1173 | "metadata": {
1174 | "button": false,
1175 | "collapsed": false,
1176 | "deletable": true,
1177 | "jupyter": {
1178 | "outputs_hidden": false
1179 | },
1180 | "new_sheet": false,
1181 | "run_control": {
1182 | "read_only": false
1183 | },
1184 | "scrolled": true
1185 | },
1186 | "outputs": [],
1187 | "source": [
1188 | "movies_df.loc[movies_df['movieId'].isin(recommendation_df.head(10)['movieId'].tolist())]"
1189 | ]
1190 | },
1191 | {
1192 | "cell_type": "markdown",
1193 | "metadata": {
1194 | "button": false,
1195 | "deletable": true,
1196 | "new_sheet": false,
1197 | "run_control": {
1198 | "read_only": false
1199 | }
1200 | },
1201 | "source": [
1202 | "### Advantages and Disadvantages of Collaborative Filtering\n",
1203 | "\n",
1204 | "##### Advantages\n",
1205 | "* Takes other user's ratings into consideration\n",
1206 | "* Doesn't need to study or extract information from the recommended item\n",
1207 | "* Adapts to the user's interests which might change over time\n",
1208 | "\n",
1209 | "##### Disadvantages\n",
1210 | "* Approximation function can be slow\n",
1211 | "* There might be a low of amount of users to approximate\n",
1212 | "* Privacy issues when trying to learn the user's preferences"
1213 | ]
1214 | },
1215 | {
1216 | "cell_type": "markdown",
1217 | "metadata": {
1218 | "button": false,
1219 | "deletable": true,
1220 | "new_sheet": false,
1221 | "run_control": {
1222 | "read_only": false
1223 | }
1224 | },
1225 | "source": [
1226 | "
Want to learn more?
\n",
1227 | "\n",
1228 | "IBM SPSS Modeler is a comprehensive analytics platform that has many machine learning algorithms. It has been designed to bring predictive intelligence to decisions made by individuals, by groups, by systems – by your enterprise as a whole. A free trial is available through this course, available here: SPSS Modeler\n",
1229 | "\n",
1230 | "Also, you can use Watson Studio to run these notebooks faster with bigger datasets. Watson Studio is IBM's leading cloud solution for data scientists, built by data scientists. With Jupyter notebooks, RStudio, Apache Spark and popular libraries pre-packaged in the cloud, Watson Studio enables data scientists to collaborate on their projects without having to install anything. Join the fast-growing community of Watson Studio users today with a free account at Watson Studio\n",
1231 | "\n",
1232 | "
Saeed Aghabozorgi, PhD is a Data Scientist in IBM with a track record of developing enterprise level applications that substantially increases clients’ ability to turn data into actionable knowledge. He is a researcher in data mining field and expert in developing advanced analytic methods like machine learning and statistical modelling on large datasets.
"
10 | ]
11 | },
12 | {
13 | "cell_type": "markdown",
14 | "metadata": {},
15 | "source": [
16 | "Recommendation systems are a collection of algorithms used to recommend items to users based on information taken from the user. These systems have become ubiquitous, and can be commonly seen in online stores, movies databases and job finders. In this notebook, we will explore Content-based recommendation systems and implement a simple version of one using Python and the Pandas library."
17 | ]
18 | },
19 | {
20 | "cell_type": "markdown",
21 | "metadata": {},
22 | "source": [
23 | "### Table of contents\n",
24 | "\n",
25 | "
\n",
32 | " "
33 | ]
34 | },
35 | {
36 | "cell_type": "markdown",
37 | "metadata": {},
38 | "source": [
39 | "\n",
40 | "# Acquiring the Data"
41 | ]
42 | },
43 | {
44 | "cell_type": "markdown",
45 | "metadata": {},
46 | "source": [
47 | "To acquire and extract the data, simply run the following Bash scripts: \n",
48 | "Dataset acquired from [GroupLens](http://grouplens.org/datasets/movielens/). Lets download the dataset. To download the data, we will use **`!wget`** to download it from IBM Object Storage. \n",
49 | "__Did you know?__ When it comes to Machine Learning, you will likely be working with large datasets. As a business, where can you host your data? IBM is offering a unique opportunity for businesses, with 10 Tb of IBM Cloud Object Storage: [Sign up now for free](http://cocl.us/ML0101EN-IBM-Offer-CC)"
50 | ]
51 | },
52 | {
53 | "cell_type": "code",
54 | "execution_count": 1,
55 | "metadata": {
56 | "collapsed": false,
57 | "jupyter": {
58 | "outputs_hidden": false
59 | }
60 | },
61 | "outputs": [
62 | {
63 | "name": "stdout",
64 | "output_type": "stream",
65 | "text": [
66 | "--2019-07-11 16:36:32-- https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/moviedataset.zip\n",
67 | "Resolving s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)... 67.228.254.193\n",
68 | "Connecting to s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)|67.228.254.193|:443... connected.\n",
69 | "HTTP request sent, awaiting response... 200 OK\n",
70 | "Length: 160301210 (153M) [application/zip]\n",
71 | "Saving to: ‘moviedataset.zip’\n",
72 | "\n",
73 | "moviedataset.zip 100%[===================>] 152.88M 19.4MB/s in 8.0s \n",
74 | "\n",
75 | "2019-07-11 16:36:41 (19.2 MB/s) - ‘moviedataset.zip’ saved [160301210/160301210]\n",
76 | "\n",
77 | "unziping ...\n",
78 | "Archive: moviedataset.zip\n",
79 | " inflating: links.csv \n",
80 | " inflating: movies.csv \n",
81 | " inflating: ratings.csv \n",
82 | " inflating: README.txt \n",
83 | " inflating: tags.csv \n"
84 | ]
85 | }
86 | ],
87 | "source": [
88 | "!wget -O moviedataset.zip https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/moviedataset.zip\n",
89 | "print('unziping ...')\n",
90 | "!unzip -o -j moviedataset.zip "
91 | ]
92 | },
93 | {
94 | "cell_type": "markdown",
95 | "metadata": {},
96 | "source": [
97 | "Now you're ready to start working with the data!"
98 | ]
99 | },
100 | {
101 | "cell_type": "markdown",
102 | "metadata": {},
103 | "source": [
104 | "\n",
105 | "# Preprocessing"
106 | ]
107 | },
108 | {
109 | "cell_type": "markdown",
110 | "metadata": {},
111 | "source": [
112 | "First, let's get all of the imports out of the way:"
113 | ]
114 | },
115 | {
116 | "cell_type": "code",
117 | "execution_count": 7,
118 | "metadata": {
119 | "collapsed": false,
120 | "jupyter": {
121 | "outputs_hidden": false
122 | }
123 | },
124 | "outputs": [],
125 | "source": [
126 | "#Dataframe manipulation library\n",
127 | "import pandas as pd\n",
128 | "#Math functions, we'll only need the sqrt function so let's import only that\n",
129 | "from math import sqrt\n",
130 | "import numpy as np\n",
131 | "import matplotlib.pyplot as plt\n",
132 | "%matplotlib inline"
133 | ]
134 | },
135 | {
136 | "cell_type": "markdown",
137 | "metadata": {},
138 | "source": [
139 | "Now let's read each file into their Dataframes:"
140 | ]
141 | },
142 | {
143 | "cell_type": "code",
144 | "execution_count": 14,
145 | "metadata": {
146 | "collapsed": false,
147 | "jupyter": {
148 | "outputs_hidden": false
149 | }
150 | },
151 | "outputs": [
152 | {
153 | "data": {
154 | "text/html": [
155 | "
\n",
156 | "\n",
169 | "
\n",
170 | " \n",
171 | "
\n",
172 | "
\n",
173 | "
movieId
\n",
174 | "
title
\n",
175 | "
genres
\n",
176 | "
\n",
177 | " \n",
178 | " \n",
179 | "
\n",
180 | "
0
\n",
181 | "
1
\n",
182 | "
Toy Story (1995)
\n",
183 | "
Adventure|Animation|Children|Comedy|Fantasy
\n",
184 | "
\n",
185 | "
\n",
186 | "
1
\n",
187 | "
2
\n",
188 | "
Jumanji (1995)
\n",
189 | "
Adventure|Children|Fantasy
\n",
190 | "
\n",
191 | "
\n",
192 | "
2
\n",
193 | "
3
\n",
194 | "
Grumpier Old Men (1995)
\n",
195 | "
Comedy|Romance
\n",
196 | "
\n",
197 | "
\n",
198 | "
3
\n",
199 | "
4
\n",
200 | "
Waiting to Exhale (1995)
\n",
201 | "
Comedy|Drama|Romance
\n",
202 | "
\n",
203 | "
\n",
204 | "
4
\n",
205 | "
5
\n",
206 | "
Father of the Bride Part II (1995)
\n",
207 | "
Comedy
\n",
208 | "
\n",
209 | " \n",
210 | "
\n",
211 | "
"
212 | ],
213 | "text/plain": [
214 | " movieId title \\\n",
215 | "0 1 Toy Story (1995) \n",
216 | "1 2 Jumanji (1995) \n",
217 | "2 3 Grumpier Old Men (1995) \n",
218 | "3 4 Waiting to Exhale (1995) \n",
219 | "4 5 Father of the Bride Part II (1995) \n",
220 | "\n",
221 | " genres \n",
222 | "0 Adventure|Animation|Children|Comedy|Fantasy \n",
223 | "1 Adventure|Children|Fantasy \n",
224 | "2 Comedy|Romance \n",
225 | "3 Comedy|Drama|Romance \n",
226 | "4 Comedy "
227 | ]
228 | },
229 | "execution_count": 14,
230 | "metadata": {},
231 | "output_type": "execute_result"
232 | }
233 | ],
234 | "source": [
235 | "#Storing the movie information into a pandas dataframe\n",
236 | "movies_df = pd.read_csv('movies.csv')\n",
237 | "#Storing the user information into a pandas dataframe\n",
238 | "ratings_df = pd.read_csv('ratings.csv')\n",
239 | "#Head is a function that gets the first N rows of a dataframe. N's default is 5.\n",
240 | "movies_df.head()"
241 | ]
242 | },
243 | {
244 | "cell_type": "markdown",
245 | "metadata": {},
246 | "source": [
247 | "Let's also remove the year from the __title__ column by using pandas' replace function and store in a new __year__ column."
248 | ]
249 | },
250 | {
251 | "cell_type": "code",
252 | "execution_count": 15,
253 | "metadata": {
254 | "collapsed": false,
255 | "jupyter": {
256 | "outputs_hidden": false
257 | }
258 | },
259 | "outputs": [
260 | {
261 | "data": {
262 | "text/html": [
263 | "
\n",
264 | "\n",
277 | "
\n",
278 | " \n",
279 | "
\n",
280 | "
\n",
281 | "
movieId
\n",
282 | "
title
\n",
283 | "
genres
\n",
284 | "
year
\n",
285 | "
\n",
286 | " \n",
287 | " \n",
288 | "
\n",
289 | "
0
\n",
290 | "
1
\n",
291 | "
Toy Story
\n",
292 | "
Adventure|Animation|Children|Comedy|Fantasy
\n",
293 | "
1995
\n",
294 | "
\n",
295 | "
\n",
296 | "
1
\n",
297 | "
2
\n",
298 | "
Jumanji
\n",
299 | "
Adventure|Children|Fantasy
\n",
300 | "
1995
\n",
301 | "
\n",
302 | "
\n",
303 | "
2
\n",
304 | "
3
\n",
305 | "
Grumpier Old Men
\n",
306 | "
Comedy|Romance
\n",
307 | "
1995
\n",
308 | "
\n",
309 | "
\n",
310 | "
3
\n",
311 | "
4
\n",
312 | "
Waiting to Exhale
\n",
313 | "
Comedy|Drama|Romance
\n",
314 | "
1995
\n",
315 | "
\n",
316 | "
\n",
317 | "
4
\n",
318 | "
5
\n",
319 | "
Father of the Bride Part II
\n",
320 | "
Comedy
\n",
321 | "
1995
\n",
322 | "
\n",
323 | " \n",
324 | "
\n",
325 | "
"
326 | ],
327 | "text/plain": [
328 | " movieId title \\\n",
329 | "0 1 Toy Story \n",
330 | "1 2 Jumanji \n",
331 | "2 3 Grumpier Old Men \n",
332 | "3 4 Waiting to Exhale \n",
333 | "4 5 Father of the Bride Part II \n",
334 | "\n",
335 | " genres year \n",
336 | "0 Adventure|Animation|Children|Comedy|Fantasy 1995 \n",
337 | "1 Adventure|Children|Fantasy 1995 \n",
338 | "2 Comedy|Romance 1995 \n",
339 | "3 Comedy|Drama|Romance 1995 \n",
340 | "4 Comedy 1995 "
341 | ]
342 | },
343 | "execution_count": 15,
344 | "metadata": {},
345 | "output_type": "execute_result"
346 | }
347 | ],
348 | "source": [
349 | "#Using regular expressions to find a year stored between parentheses\n",
350 | "#We specify the parantheses so we don't conflict with movies that have years in their titles\n",
351 | "movies_df['year'] = movies_df.title.str.extract('(\\(\\d\\d\\d\\d\\))',expand=False)\n",
352 | "#Removing the parentheses\n",
353 | "movies_df['year'] = movies_df.year.str.extract('(\\d\\d\\d\\d)',expand=False)\n",
354 | "#Removing the years from the 'title' column\n",
355 | "movies_df['title'] = movies_df.title.str.replace('(\\(\\d\\d\\d\\d\\))', '')\n",
356 | "#Applying the strip function to get rid of any ending whitespace characters that may have appeared\n",
357 | "movies_df['title'] = movies_df['title'].apply(lambda x: x.strip())\n",
358 | "movies_df.head()"
359 | ]
360 | },
361 | {
362 | "cell_type": "markdown",
363 | "metadata": {},
364 | "source": [
365 | "With that, let's also split the values in the __Genres__ column into a __list of Genres__ to simplify future use. This can be achieved by applying Python's split string function on the correct column."
366 | ]
367 | },
368 | {
369 | "cell_type": "code",
370 | "execution_count": 16,
371 | "metadata": {
372 | "collapsed": false,
373 | "jupyter": {
374 | "outputs_hidden": false
375 | }
376 | },
377 | "outputs": [
378 | {
379 | "data": {
380 | "text/html": [
381 | "
\n",
382 | "\n",
395 | "
\n",
396 | " \n",
397 | "
\n",
398 | "
\n",
399 | "
movieId
\n",
400 | "
title
\n",
401 | "
genres
\n",
402 | "
year
\n",
403 | "
\n",
404 | " \n",
405 | " \n",
406 | "
\n",
407 | "
0
\n",
408 | "
1
\n",
409 | "
Toy Story
\n",
410 | "
[Adventure, Animation, Children, Comedy, Fantasy]
\n",
411 | "
1995
\n",
412 | "
\n",
413 | "
\n",
414 | "
1
\n",
415 | "
2
\n",
416 | "
Jumanji
\n",
417 | "
[Adventure, Children, Fantasy]
\n",
418 | "
1995
\n",
419 | "
\n",
420 | "
\n",
421 | "
2
\n",
422 | "
3
\n",
423 | "
Grumpier Old Men
\n",
424 | "
[Comedy, Romance]
\n",
425 | "
1995
\n",
426 | "
\n",
427 | "
\n",
428 | "
3
\n",
429 | "
4
\n",
430 | "
Waiting to Exhale
\n",
431 | "
[Comedy, Drama, Romance]
\n",
432 | "
1995
\n",
433 | "
\n",
434 | "
\n",
435 | "
4
\n",
436 | "
5
\n",
437 | "
Father of the Bride Part II
\n",
438 | "
[Comedy]
\n",
439 | "
1995
\n",
440 | "
\n",
441 | " \n",
442 | "
\n",
443 | "
"
444 | ],
445 | "text/plain": [
446 | " movieId title \\\n",
447 | "0 1 Toy Story \n",
448 | "1 2 Jumanji \n",
449 | "2 3 Grumpier Old Men \n",
450 | "3 4 Waiting to Exhale \n",
451 | "4 5 Father of the Bride Part II \n",
452 | "\n",
453 | " genres year \n",
454 | "0 [Adventure, Animation, Children, Comedy, Fantasy] 1995 \n",
455 | "1 [Adventure, Children, Fantasy] 1995 \n",
456 | "2 [Comedy, Romance] 1995 \n",
457 | "3 [Comedy, Drama, Romance] 1995 \n",
458 | "4 [Comedy] 1995 "
459 | ]
460 | },
461 | "execution_count": 16,
462 | "metadata": {},
463 | "output_type": "execute_result"
464 | }
465 | ],
466 | "source": [
467 | "#Every genre is separated by a | so we simply have to call the split function on |\n",
468 | "movies_df['genres'] = movies_df.genres.str.split('|')\n",
469 | "movies_df.head()"
470 | ]
471 | },
472 | {
473 | "cell_type": "markdown",
474 | "metadata": {},
475 | "source": [
476 | "Since keeping genres in a list format isn't optimal for the content-based recommendation system technique, we will use the One Hot Encoding technique to convert the list of genres to a vector where each column corresponds to one possible value of the feature. This encoding is needed for feeding categorical data. In this case, we store every different genre in columns that contain either 1 or 0. 1 shows that a movie has that genre and 0 shows that it doesn't. Let's also store this dataframe in another variable since genres won't be important for our first recommendation system."
477 | ]
478 | },
479 | {
480 | "cell_type": "code",
481 | "execution_count": 18,
482 | "metadata": {
483 | "collapsed": false,
484 | "jupyter": {
485 | "outputs_hidden": false
486 | }
487 | },
488 | "outputs": [
489 | {
490 | "data": {
491 | "text/html": [
492 | "
\n",
493 | "\n",
506 | "
\n",
507 | " \n",
508 | "
\n",
509 | "
\n",
510 | "
movieId
\n",
511 | "
title
\n",
512 | "
genres
\n",
513 | "
year
\n",
514 | "
Adventure
\n",
515 | "
Animation
\n",
516 | "
Children
\n",
517 | "
Comedy
\n",
518 | "
Fantasy
\n",
519 | "
Romance
\n",
520 | "
...
\n",
521 | "
Horror
\n",
522 | "
Mystery
\n",
523 | "
Sci-Fi
\n",
524 | "
IMAX
\n",
525 | "
Documentary
\n",
526 | "
War
\n",
527 | "
Musical
\n",
528 | "
Western
\n",
529 | "
Film-Noir
\n",
530 | "
(no genres listed)
\n",
531 | "
\n",
532 | " \n",
533 | " \n",
534 | "
\n",
535 | "
0
\n",
536 | "
1
\n",
537 | "
Toy Story
\n",
538 | "
[Adventure, Animation, Children, Comedy, Fantasy]
\n",
539 | "
1995
\n",
540 | "
1.0
\n",
541 | "
1.0
\n",
542 | "
1.0
\n",
543 | "
1.0
\n",
544 | "
1.0
\n",
545 | "
0.0
\n",
546 | "
...
\n",
547 | "
0.0
\n",
548 | "
0.0
\n",
549 | "
0.0
\n",
550 | "
0.0
\n",
551 | "
0.0
\n",
552 | "
0.0
\n",
553 | "
0.0
\n",
554 | "
0.0
\n",
555 | "
0.0
\n",
556 | "
0.0
\n",
557 | "
\n",
558 | "
\n",
559 | "
1
\n",
560 | "
2
\n",
561 | "
Jumanji
\n",
562 | "
[Adventure, Children, Fantasy]
\n",
563 | "
1995
\n",
564 | "
1.0
\n",
565 | "
0.0
\n",
566 | "
1.0
\n",
567 | "
0.0
\n",
568 | "
1.0
\n",
569 | "
0.0
\n",
570 | "
...
\n",
571 | "
0.0
\n",
572 | "
0.0
\n",
573 | "
0.0
\n",
574 | "
0.0
\n",
575 | "
0.0
\n",
576 | "
0.0
\n",
577 | "
0.0
\n",
578 | "
0.0
\n",
579 | "
0.0
\n",
580 | "
0.0
\n",
581 | "
\n",
582 | "
\n",
583 | "
2
\n",
584 | "
3
\n",
585 | "
Grumpier Old Men
\n",
586 | "
[Comedy, Romance]
\n",
587 | "
1995
\n",
588 | "
0.0
\n",
589 | "
0.0
\n",
590 | "
0.0
\n",
591 | "
1.0
\n",
592 | "
0.0
\n",
593 | "
1.0
\n",
594 | "
...
\n",
595 | "
0.0
\n",
596 | "
0.0
\n",
597 | "
0.0
\n",
598 | "
0.0
\n",
599 | "
0.0
\n",
600 | "
0.0
\n",
601 | "
0.0
\n",
602 | "
0.0
\n",
603 | "
0.0
\n",
604 | "
0.0
\n",
605 | "
\n",
606 | "
\n",
607 | "
3
\n",
608 | "
4
\n",
609 | "
Waiting to Exhale
\n",
610 | "
[Comedy, Drama, Romance]
\n",
611 | "
1995
\n",
612 | "
0.0
\n",
613 | "
0.0
\n",
614 | "
0.0
\n",
615 | "
1.0
\n",
616 | "
0.0
\n",
617 | "
1.0
\n",
618 | "
...
\n",
619 | "
0.0
\n",
620 | "
0.0
\n",
621 | "
0.0
\n",
622 | "
0.0
\n",
623 | "
0.0
\n",
624 | "
0.0
\n",
625 | "
0.0
\n",
626 | "
0.0
\n",
627 | "
0.0
\n",
628 | "
0.0
\n",
629 | "
\n",
630 | "
\n",
631 | "
4
\n",
632 | "
5
\n",
633 | "
Father of the Bride Part II
\n",
634 | "
[Comedy]
\n",
635 | "
1995
\n",
636 | "
0.0
\n",
637 | "
0.0
\n",
638 | "
0.0
\n",
639 | "
1.0
\n",
640 | "
0.0
\n",
641 | "
0.0
\n",
642 | "
...
\n",
643 | "
0.0
\n",
644 | "
0.0
\n",
645 | "
0.0
\n",
646 | "
0.0
\n",
647 | "
0.0
\n",
648 | "
0.0
\n",
649 | "
0.0
\n",
650 | "
0.0
\n",
651 | "
0.0
\n",
652 | "
0.0
\n",
653 | "
\n",
654 | " \n",
655 | "
\n",
656 | "
5 rows × 24 columns
\n",
657 | "
"
658 | ],
659 | "text/plain": [
660 | " movieId title \\\n",
661 | "0 1 Toy Story \n",
662 | "1 2 Jumanji \n",
663 | "2 3 Grumpier Old Men \n",
664 | "3 4 Waiting to Exhale \n",
665 | "4 5 Father of the Bride Part II \n",
666 | "\n",
667 | " genres year Adventure \\\n",
668 | "0 [Adventure, Animation, Children, Comedy, Fantasy] 1995 1.0 \n",
669 | "1 [Adventure, Children, Fantasy] 1995 1.0 \n",
670 | "2 [Comedy, Romance] 1995 0.0 \n",
671 | "3 [Comedy, Drama, Romance] 1995 0.0 \n",
672 | "4 [Comedy] 1995 0.0 \n",
673 | "\n",
674 | " Animation Children Comedy Fantasy Romance ... Horror Mystery \\\n",
675 | "0 1.0 1.0 1.0 1.0 0.0 ... 0.0 0.0 \n",
676 | "1 0.0 1.0 0.0 1.0 0.0 ... 0.0 0.0 \n",
677 | "2 0.0 0.0 1.0 0.0 1.0 ... 0.0 0.0 \n",
678 | "3 0.0 0.0 1.0 0.0 1.0 ... 0.0 0.0 \n",
679 | "4 0.0 0.0 1.0 0.0 0.0 ... 0.0 0.0 \n",
680 | "\n",
681 | " Sci-Fi IMAX Documentary War Musical Western Film-Noir \\\n",
682 | "0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
683 | "1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
684 | "2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
685 | "3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
686 | "4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
687 | "\n",
688 | " (no genres listed) \n",
689 | "0 0.0 \n",
690 | "1 0.0 \n",
691 | "2 0.0 \n",
692 | "3 0.0 \n",
693 | "4 0.0 \n",
694 | "\n",
695 | "[5 rows x 24 columns]"
696 | ]
697 | },
698 | "execution_count": 18,
699 | "metadata": {},
700 | "output_type": "execute_result"
701 | }
702 | ],
703 | "source": [
704 | "#Copying the movie dataframe into a new one since we won't need to use the genre information in our first case.\n",
705 | "moviesWithGenres_df = movies_df.copy()\n",
706 | "\n",
707 | "#For every row in the dataframe, iterate through the list of genres and place a 1 into the corresponding column\n",
708 | "for index, row in movies_df.iterrows():\n",
709 | " for genre in row['genres']:\n",
710 | " moviesWithGenres_df.at[index, genre] = 1\n",
711 | "#Filling in the NaN values with 0 to show that a movie doesn't have that column's genre\n",
712 | "moviesWithGenres_df = moviesWithGenres_df.fillna(0)\n",
713 | "moviesWithGenres_df.head()"
714 | ]
715 | },
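{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an aside, the same 0/1 genre matrix can usually be built without an explicit loop. The sketch below is an alternative, not the notebook's original approach, and assumes a pandas version that provides Series.explode (0.25 or newer); the result should match the table above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Alternative one-hot encoding: explode the genre lists into one row per genre,\n",
"#get dummy columns, then collapse back to one row per movie with max()\n",
"genre_dummies = pd.get_dummies(movies_df['genres'].explode()).groupby(level=0).max()\n",
"moviesWithGenres_alt = movies_df.join(genre_dummies)\n",
"moviesWithGenres_alt.head()"
]
},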
716 | {
717 | "cell_type": "markdown",
718 | "metadata": {},
719 | "source": [
720 | "Next, let's look at the ratings dataframe."
721 | ]
722 | },
723 | {
724 | "cell_type": "code",
725 | "execution_count": null,
726 | "metadata": {
727 | "collapsed": false,
728 | "jupyter": {
729 | "outputs_hidden": false
730 | }
731 | },
732 | "outputs": [],
733 | "source": [
734 | "ratings_df.head()"
735 | ]
736 | },
737 | {
738 | "cell_type": "markdown",
739 | "metadata": {},
740 | "source": [
741 | "Every row in the ratings dataframe has a user id associated with at least one movie, a rating and a timestamp showing when they reviewed it. We won't be needing the timestamp column, so let's drop it to save on memory."
742 | ]
743 | },
744 | {
745 | "cell_type": "code",
746 | "execution_count": null,
747 | "metadata": {
748 | "collapsed": false,
749 | "jupyter": {
750 | "outputs_hidden": false
751 | }
752 | },
753 | "outputs": [],
754 | "source": [
755 | "#Drop removes a specified row or column from a dataframe\n",
756 | "ratings_df = ratings_df.drop('timestamp', 1)\n",
757 | "ratings_df.head()"
758 | ]
759 | },
760 | {
761 | "cell_type": "markdown",
762 | "metadata": {},
763 | "source": [
764 | "\n",
765 | "# Content-Based recommendation system"
766 | ]
767 | },
768 | {
769 | "cell_type": "markdown",
770 | "metadata": {},
771 | "source": [
772 | "Now, let's take a look at how to implement __Content-Based__ or __Item-Item recommendation systems__. This technique attempts to figure out what a user's favourite aspects of an item is, and then recommends items that present those aspects. In our case, we're going to try to figure out the input's favorite genres from the movies and ratings given.\n",
773 | "\n",
774 | "Let's begin by creating an input user to recommend movies to:\n",
775 | "\n",
776 | "Notice: To add more movies, simply increase the amount of elements in the __userInput__. Feel free to add more in! Just be sure to write it in with capital letters and if a movie starts with a \"The\", like \"The Matrix\" then write it in like this: 'Matrix, The' ."
777 | ]
778 | },
779 | {
780 | "cell_type": "code",
781 | "execution_count": null,
782 | "metadata": {
783 | "collapsed": false,
784 | "jupyter": {
785 | "outputs_hidden": false
786 | }
787 | },
788 | "outputs": [],
789 | "source": [
790 | "userInput = [\n",
791 | " {'title':'Breakfast Club, The', 'rating':5},\n",
792 | " {'title':'Toy Story', 'rating':3.5},\n",
793 | " {'title':'Jumanji', 'rating':2},\n",
794 | " {'title':\"Pulp Fiction\", 'rating':5},\n",
795 | " {'title':'Akira', 'rating':4.5}\n",
796 | " ] \n",
797 | "inputMovies = pd.DataFrame(userInput)\n",
798 | "inputMovies"
799 | ]
800 | },
801 | {
802 | "cell_type": "markdown",
803 | "metadata": {},
804 | "source": [
805 | "#### Add movieId to input user\n",
806 | "With the input complete, let's extract the input movie's ID's from the movies dataframe and add them into it.\n",
807 | "\n",
808 | "We can achieve this by first filtering out the rows that contain the input movie's title and then merging this subset with the input dataframe. We also drop unnecessary columns for the input to save memory space."
809 | ]
810 | },
811 | {
812 | "cell_type": "code",
813 | "execution_count": null,
814 | "metadata": {
815 | "collapsed": false,
816 | "jupyter": {
817 | "outputs_hidden": false
818 | },
819 | "scrolled": true
820 | },
821 | "outputs": [],
822 | "source": [
823 | "#Filtering out the movies by title\n",
824 | "inputId = movies_df[movies_df['title'].isin(inputMovies['title'].tolist())]\n",
825 | "#Then merging it so we can get the movieId. It's implicitly merging it by title.\n",
826 | "inputMovies = pd.merge(inputId, inputMovies)\n",
827 | "#Dropping information we won't use from the input dataframe\n",
828 | "inputMovies = inputMovies.drop('genres', 1).drop('year', 1)\n",
829 | "#Final input dataframe\n",
830 | "#If a movie you added in above isn't here, then it might not be in the original \n",
831 | "#dataframe or it might spelled differently, please check capitalisation.\n",
832 | "inputMovies"
833 | ]
834 | },
835 | {
836 | "cell_type": "markdown",
837 | "metadata": {},
838 | "source": [
839 | "We're going to start by learning the input's preferences, so let's get the subset of movies that the input has watched from the Dataframe containing genres defined with binary values."
840 | ]
841 | },
842 | {
843 | "cell_type": "code",
844 | "execution_count": null,
845 | "metadata": {
846 | "collapsed": false,
847 | "jupyter": {
848 | "outputs_hidden": false
849 | }
850 | },
851 | "outputs": [],
852 | "source": [
853 | "#Filtering out the movies from the input\n",
854 | "userMovies = moviesWithGenres_df[moviesWithGenres_df['movieId'].isin(inputMovies['movieId'].tolist())]\n",
855 | "userMovies"
856 | ]
857 | },
858 | {
859 | "cell_type": "markdown",
860 | "metadata": {},
861 | "source": [
862 | "We'll only need the actual genre table, so let's clean this up a bit by resetting the index and dropping the movieId, title, genres and year columns."
863 | ]
864 | },
865 | {
866 | "cell_type": "code",
867 | "execution_count": null,
868 | "metadata": {
869 | "collapsed": false,
870 | "jupyter": {
871 | "outputs_hidden": false
872 | }
873 | },
874 | "outputs": [],
875 | "source": [
876 | "#Resetting the index to avoid future issues\n",
877 | "userMovies = userMovies.reset_index(drop=True)\n",
878 | "#Dropping unnecessary issues due to save memory and to avoid issues\n",
879 | "userGenreTable = userMovies.drop('movieId', 1).drop('title', 1).drop('genres', 1).drop('year', 1)\n",
880 | "userGenreTable"
881 | ]
882 | },
883 | {
884 | "cell_type": "markdown",
885 | "metadata": {},
886 | "source": [
887 | "Now we're ready to start learning the input's preferences!\n",
888 | "\n",
889 | "To do this, we're going to turn each genre into weights. We can do this by using the input's reviews and multiplying them into the input's genre table and then summing up the resulting table by column. This operation is actually a dot product between a matrix and a vector, so we can simply accomplish by calling Pandas's \"dot\" function."
890 | ]
891 | },
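{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see what this dot product does, consider a hypothetical two-movie, three-genre example (the numbers are made up): if the user rated an Adventure|Comedy movie 5 and a Comedy-only movie 3, the Comedy weight becomes 1·5 + 1·3 = 8 while Adventure gets 1·5 + 0·3 = 5. A minimal sketch:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd  #already imported above; repeated so this cell is self-contained\n",
"\n",
"#Toy genre table (rows: movies, columns: genres) and toy ratings; all values made up\n",
"toyGenres = pd.DataFrame({'Adventure': [1, 0], 'Comedy': [1, 1], 'Drama': [0, 0]})\n",
"toyRatings = pd.Series([5, 3])\n",
"#Transpose so genres become rows, then dot with the ratings vector\n",
"toyGenres.transpose().dot(toyRatings)   #Adventure 5, Comedy 8, Drama 0"
]
},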
892 | {
893 | "cell_type": "code",
894 | "execution_count": null,
895 | "metadata": {
896 | "collapsed": false,
897 | "jupyter": {
898 | "outputs_hidden": false
899 | }
900 | },
901 | "outputs": [],
902 | "source": [
903 | "inputMovies['rating']"
904 | ]
905 | },
906 | {
907 | "cell_type": "code",
908 | "execution_count": null,
909 | "metadata": {
910 | "collapsed": false,
911 | "jupyter": {
912 | "outputs_hidden": false
913 | }
914 | },
915 | "outputs": [],
916 | "source": [
917 | "#Dot produt to get weights\n",
918 | "userProfile = userGenreTable.transpose().dot(inputMovies['rating'])\n",
919 | "#The user profile\n",
920 | "userProfile"
921 | ]
922 | },
923 | {
924 | "cell_type": "markdown",
925 | "metadata": {},
926 | "source": [
927 | "Now, we have the weights for every of the user's preferences. This is known as the User Profile. Using this, we can recommend movies that satisfy the user's preferences."
928 | ]
929 | },
930 | {
931 | "cell_type": "markdown",
932 | "metadata": {},
933 | "source": [
934 | "Let's start by extracting the genre table from the original dataframe:"
935 | ]
936 | },
937 | {
938 | "cell_type": "code",
939 | "execution_count": null,
940 | "metadata": {
941 | "collapsed": false,
942 | "jupyter": {
943 | "outputs_hidden": false
944 | }
945 | },
946 | "outputs": [],
947 | "source": [
948 | "#Now let's get the genres of every movie in our original dataframe\n",
949 | "genreTable = moviesWithGenres_df.set_index(moviesWithGenres_df['movieId'])\n",
950 | "#And drop the unnecessary information\n",
951 | "genreTable = genreTable.drop('movieId', 1).drop('title', 1).drop('genres', 1).drop('year', 1)\n",
952 | "genreTable.head()"
953 | ]
954 | },
955 | {
956 | "cell_type": "code",
957 | "execution_count": null,
958 | "metadata": {
959 | "collapsed": false,
960 | "jupyter": {
961 | "outputs_hidden": false
962 | }
963 | },
964 | "outputs": [],
965 | "source": [
966 | "genreTable.shape"
967 | ]
968 | },
969 | {
970 | "cell_type": "markdown",
971 | "metadata": {},
972 | "source": [
973 | "With the input's profile and the complete list of movies and their genres in hand, we're going to take the weighted average of every movie based on the input profile and recommend the top twenty movies that most satisfy it."
974 | ]
975 | },
976 | {
977 | "cell_type": "code",
978 | "execution_count": null,
979 | "metadata": {
980 | "collapsed": false,
981 | "jupyter": {
982 | "outputs_hidden": false
983 | }
984 | },
985 | "outputs": [],
986 | "source": [
987 | "#Multiply the genres by the weights and then take the weighted average\n",
988 | "recommendationTable_df = ((genreTable*userProfile).sum(axis=1))/(userProfile.sum())\n",
989 | "recommendationTable_df.head()"
990 | ]
991 | },
992 | {
993 | "cell_type": "code",
994 | "execution_count": null,
995 | "metadata": {
996 | "collapsed": false,
997 | "jupyter": {
998 | "outputs_hidden": false
999 | }
1000 | },
1001 | "outputs": [],
1002 | "source": [
1003 | "#Sort our recommendations in descending order\n",
1004 | "recommendationTable_df = recommendationTable_df.sort_values(ascending=False)\n",
1005 | "#Just a peek at the values\n",
1006 | "recommendationTable_df.head()"
1007 | ]
1008 | },
1009 | {
1010 | "cell_type": "markdown",
1011 | "metadata": {},
1012 | "source": [
1013 | "Now here's the recommendation table!"
1014 | ]
1015 | },
1016 | {
1017 | "cell_type": "code",
1018 | "execution_count": null,
1019 | "metadata": {
1020 | "collapsed": false,
1021 | "jupyter": {
1022 | "outputs_hidden": false
1023 | },
1024 | "scrolled": true
1025 | },
1026 | "outputs": [],
1027 | "source": [
1028 | "#The final recommendation table\n",
1029 | "movies_df.loc[movies_df['movieId'].isin(recommendationTable_df.head(20).keys())]"
1030 | ]
1031 | },
1032 | {
1033 | "cell_type": "markdown",
1034 | "metadata": {},
1035 | "source": [
1036 | "### Advantages and Disadvantages of Content-Based Filtering\n",
1037 | "\n",
1038 | "##### Advantages\n",
1039 | "* Learns user's preferences\n",
1040 | "* Highly personalized for the user\n",
1041 | "\n",
1042 | "##### Disadvantages\n",
1043 | "* Doesn't take into account what others think of the item, so low quality item recommendations might happen\n",
1044 | "* Extracting data is not always intuitive\n",
1045 | "* Determining what characteristics of the item the user dislikes or likes is not always obvious"
1046 | ]
1047 | },
1048 | {
1049 | "cell_type": "markdown",
1050 | "metadata": {},
1051 | "source": [
1052 | "
Want to learn more?
\n",
1053 | "\n",
1054 | "IBM SPSS Modeler is a comprehensive analytics platform that has many machine learning algorithms. It has been designed to bring predictive intelligence to decisions made by individuals, by groups, by systems – by your enterprise as a whole. A free trial is available through this course, available here: SPSS Modeler\n",
1055 | "\n",
1056 | "Also, you can use Watson Studio to run these notebooks faster with bigger datasets. Watson Studio is IBM's leading cloud solution for data scientists, built by data scientists. With Jupyter notebooks, RStudio, Apache Spark and popular libraries pre-packaged in the cloud, Watson Studio enables data scientists to collaborate on their projects without having to install anything. Join the fast-growing community of Watson Studio users today with a free account at Watson Studio\n",
1057 | "\n",
1058 | "
Saeed Aghabozorgi, PhD is a Data Scientist in IBM with a track record of developing enterprise level applications that substantially increases clients’ ability to turn data into actionable knowledge. He is a researcher in data mining field and expert in developing advanced analytic methods like machine learning and statistical modelling on large datasets.