├── Classification Models
│   ├── ML0101EN-Clas-Decision-Trees-drug-py-v1.ipynb
│   ├── ML0101EN-Clas-K-Nearest-neighbors-CustCat-py-v1.ipynb
│   ├── ML0101EN-Clas-Logistic-Reg-churn-py-v1.ipynb
│   ├── ML0101EN-Clas-SVM-cancer-py-v1.ipynb
│   └── ML0101EN-Reg-NoneLinearRegression-py-v1.ipynb
├── Final Project.ipynb
├── README.md
└── Recommender System
    ├── ML0101EN-RecSys-Collaborative-Filtering-movies-py-v1.ipynb
    └── ML0101EN-RecSys-Content-Based-movies-py-v1.ipynb
/Classification Models/ML0101EN-Clas-Decision-Trees-drug-py-v1.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "button": false, 7 | "deletable": true, 8 | "new_sheet": false, 9 | "run_control": { 10 | "read_only": false 11 | } 12 | }, 13 | "source": [ 14 | "\n", 15 | "\n", 16 | "

# Decision Trees

" 17 | ] 18 | }, 19 | { 20 | "cell_type": "markdown", 21 | "metadata": { 22 | "button": false, 23 | "deletable": true, 24 | "new_sheet": false, 25 | "run_control": { 26 | "read_only": false 27 | } 28 | }, 29 | "source": [ 30 | "In this lab exercise, you will learn a popular machine learning algorithm, Decision Tree. You will use this classification algorithm to build a model from historical data of patients, and their response to different medications. Then you use the trained decision tree to predict the class of a unknown patient, or to find a proper drug for a new patient." 31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "metadata": {}, 36 | "source": [ 37 | "

## Table of contents
\n", 38 | "\n", 39 | "1. About the dataset\n", 40 | "2. Downloading the Data\n", 41 | "3. Pre-processing\n", 42 | "4. Setting up the Decision Tree\n", 43 | "5. Modeling\n", 44 | "6. Prediction\n", 45 | "7. Evaluation\n", 46 | "8. Visualization\n", 47 | "
" 53 | ] 54 | }, 55 | { 56 | "cell_type": "markdown", 57 | "metadata": { 58 | "button": false, 59 | "deletable": true, 60 | "new_sheet": false, 61 | "run_control": { 62 | "read_only": false 63 | } 64 | }, 65 | "source": [ 66 | "Import the Following Libraries:\n", 67 | "" 72 | ] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "execution_count": 1, 77 | "metadata": { 78 | "button": false, 79 | "deletable": true, 80 | "new_sheet": false, 81 | "run_control": { 82 | "read_only": false 83 | } 84 | }, 85 | "outputs": [], 86 | "source": [ 87 | "import numpy as np \n", 88 | "import pandas as pd\n", 89 | "from sklearn.tree import DecisionTreeClassifier" 90 | ] 91 | }, 92 | { 93 | "cell_type": "markdown", 94 | "metadata": { 95 | "button": false, 96 | "deletable": true, 97 | "new_sheet": false, 98 | "run_control": { 99 | "read_only": false 100 | } 101 | }, 102 | "source": [ 103 | "
\n", 104 | "

About the dataset

\n", 105 | " Imagine that you are a medical researcher compiling data for a study. You have collected data about a set of patients, all of whom suffered from the same illness. During their course of treatment, each patient responded to one of 5 medications, Drug A, Drug B, Drug c, Drug x and y. \n", 106 | "
\n", 107 | "
\n", 108 | " Part of your job is to build a model to find out which drug might be appropriate for a future patient with the same illness. The feature sets of this dataset are Age, Sex, Blood Pressure, and Cholesterol of patients, and the target is the drug that each patient responded to.\n", 109 | "
\n", 110 | "
\n", 111 | " It is a sample of binary classifier, and you can use the training part of the dataset \n", 112 | " to build a decision tree, and then use it to predict the class of a unknown patient, or to prescribe it to a new patient.\n", 113 | "
\n" 114 | ] 115 | }, 116 | { 117 | "cell_type": "markdown", 118 | "metadata": { 119 | "button": false, 120 | "deletable": true, 121 | "new_sheet": false, 122 | "run_control": { 123 | "read_only": false 124 | } 125 | }, 126 | "source": [ 127 | "
\n", 128 | "

Downloading the Data

\n", 129 | " To download the data, we will use !wget to download it from IBM Object Storage.\n", 130 | "
" 131 | ] 132 | }, 133 | { 134 | "cell_type": "code", 135 | "execution_count": 2, 136 | "metadata": {}, 137 | "outputs": [ 138 | { 139 | "name": "stdout", 140 | "output_type": "stream", 141 | "text": [ 142 | "--2019-07-10 23:57:20-- https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/drug200.csv\n", 143 | "Resolving s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)... 67.228.254.193\n", 144 | "Connecting to s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)|67.228.254.193|:443... connected.\n", 145 | "HTTP request sent, awaiting response... 200 OK\n", 146 | "Length: 6027 (5.9K) [text/csv]\n", 147 | "Saving to: ‘drug200.csv’\n", 148 | "\n", 149 | "drug200.csv 100%[===================>] 5.89K --.-KB/s in 0s \n", 150 | "\n", 151 | "2019-07-10 23:57:21 (89.2 MB/s) - ‘drug200.csv’ saved [6027/6027]\n", 152 | "\n" 153 | ] 154 | } 155 | ], 156 | "source": [ 157 | "!wget -O drug200.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/drug200.csv" 158 | ] 159 | }, 160 | { 161 | "cell_type": "markdown", 162 | "metadata": {}, 163 | "source": [ 164 | "__Did you know?__ When it comes to Machine Learning, you will likely be working with large datasets. As a business, where can you host your data? IBM is offering a unique opportunity for businesses, with 10 Tb of IBM Cloud Object Storage: [Sign up now for free](http://cocl.us/ML0101EN-IBM-Offer-CC)" 165 | ] 166 | }, 167 | { 168 | "cell_type": "markdown", 169 | "metadata": {}, 170 | "source": [ 171 | "now, read data using pandas dataframe:" 172 | ] 173 | }, 174 | { 175 | "cell_type": "code", 176 | "execution_count": 7, 177 | "metadata": { 178 | "button": false, 179 | "deletable": true, 180 | "new_sheet": false, 181 | "run_control": { 182 | "read_only": false 183 | } 184 | }, 185 | "outputs": [ 186 | { 187 | "data": { 188 | "text/html": [ 189 | "
\n", 190 | "\n", 203 | "\n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 228 | " \n", 229 | " \n", 230 | " \n", 231 | " \n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | " \n", 237 | " \n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | "
AgeSexBPCholesterolNa_to_KDrug
023FHIGHHIGH25.355drugY
147MLOWHIGH13.093drugC
247MLOWHIGH10.114drugC
328FNORMALHIGH7.798drugX
461FLOWHIGH18.043drugY
\n", 263 | "
" 264 | ], 265 | "text/plain": [ 266 | " Age Sex BP Cholesterol Na_to_K Drug\n", 267 | "0 23 F HIGH HIGH 25.355 drugY\n", 268 | "1 47 M LOW HIGH 13.093 drugC\n", 269 | "2 47 M LOW HIGH 10.114 drugC\n", 270 | "3 28 F NORMAL HIGH 7.798 drugX\n", 271 | "4 61 F LOW HIGH 18.043 drugY" 272 | ] 273 | }, 274 | "execution_count": 7, 275 | "metadata": {}, 276 | "output_type": "execute_result" 277 | } 278 | ], 279 | "source": [ 280 | "my_data = pd.read_csv(\"drug200.csv\", delimiter=\",\")\n", 281 | "my_data[0:5]" 282 | ] 283 | }, 284 | { 285 | "cell_type": "markdown", 286 | "metadata": { 287 | "button": false, 288 | "deletable": true, 289 | "new_sheet": false, 290 | "run_control": { 291 | "read_only": false 292 | } 293 | }, 294 | "source": [ 295 | "
\n", 296 | "

Practice

\n", 297 | " What is the size of data? \n", 298 | "
" 299 | ] 300 | }, 301 | { 302 | "cell_type": "code", 303 | "execution_count": 11, 304 | "metadata": { 305 | "button": false, 306 | "deletable": true, 307 | "new_sheet": false, 308 | "run_control": { 309 | "read_only": false 310 | } 311 | }, 312 | "outputs": [ 313 | { 314 | "data": { 315 | "text/plain": [ 316 | "(200, 6)" 317 | ] 318 | }, 319 | "execution_count": 11, 320 | "metadata": {}, 321 | "output_type": "execute_result" 322 | } 323 | ], 324 | "source": [ 325 | "# write your code here\n", 326 | "my_data.shape\n", 327 | "\n", 328 | "\n" 329 | ] 330 | }, 331 | { 332 | "cell_type": "markdown", 333 | "metadata": {}, 334 | "source": [ 335 | "
\n", 336 | "

Pre-processing

\n", 337 | "
" 338 | ] 339 | }, 340 | { 341 | "cell_type": "markdown", 342 | "metadata": { 343 | "button": false, 344 | "deletable": true, 345 | "new_sheet": false, 346 | "run_control": { 347 | "read_only": false 348 | } 349 | }, 350 | "source": [ 351 | "Using my_data as the Drug.csv data read by pandas, declare the following variables:
\n", 352 | "\n", 353 | "" 357 | ] 358 | }, 359 | { 360 | "cell_type": "markdown", 361 | "metadata": { 362 | "button": false, 363 | "deletable": true, 364 | "new_sheet": false, 365 | "run_control": { 366 | "read_only": false 367 | } 368 | }, 369 | "source": [ 370 | "Remove the column containing the target name since it doesn't contain numeric values." 371 | ] 372 | }, 373 | { 374 | "cell_type": "code", 375 | "execution_count": 18, 376 | "metadata": {}, 377 | "outputs": [ 378 | { 379 | "data": { 380 | "text/html": [ 381 | "
\n", 382 | "\n", 395 | "\n", 396 | " \n", 397 | " \n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | " \n", 429 | " \n", 430 | " \n", 431 | " \n", 432 | " \n", 433 | " \n", 434 | " \n", 435 | " \n", 436 | " \n", 437 | " \n", 438 | " \n", 439 | " \n", 440 | " \n", 441 | " \n", 442 | " \n", 443 | " \n", 444 | " \n", 445 | " \n", 446 | " \n", 447 | " \n", 448 | " \n", 449 | " \n", 450 | " \n", 451 | " \n", 452 | " \n", 453 | " \n", 454 | " \n", 455 | " \n", 456 | " \n", 457 | " \n", 458 | " \n", 459 | " \n", 460 | " \n", 461 | " \n", 462 | " \n", 463 | " \n", 464 | " \n", 465 | " \n", 466 | " \n", 467 | " \n", 468 | " \n", 469 | " \n", 470 | " \n", 471 | " \n", 472 | "
AgeSexBPCholesterolNa_to_KDrugHIGHLOWNORMAL
0231HIGHHIGH25.355drugY100
1470LOWHIGH13.093drugC010
2470LOWHIGH10.114drugC010
3281NORMALHIGH7.798drugX001
4611LOWHIGH18.043drugY010
\n", 473 | "
" 474 | ], 475 | "text/plain": [ 476 | " Age Sex BP Cholesterol Na_to_K Drug HIGH LOW NORMAL\n", 477 | "0 23 1 HIGH HIGH 25.355 drugY 1 0 0\n", 478 | "1 47 0 LOW HIGH 13.093 drugC 0 1 0\n", 479 | "2 47 0 LOW HIGH 10.114 drugC 0 1 0\n", 480 | "3 28 1 NORMAL HIGH 7.798 drugX 0 0 1\n", 481 | "4 61 1 LOW HIGH 18.043 drugY 0 1 0" 482 | ] 483 | }, 484 | "execution_count": 18, 485 | "metadata": {}, 486 | "output_type": "execute_result" 487 | } 488 | ], 489 | "source": [] 490 | }, 491 | { 492 | "cell_type": "code", 493 | "execution_count": 8, 494 | "metadata": {}, 495 | "outputs": [ 496 | { 497 | "data": { 498 | "text/plain": [ 499 | "array([[23, 'F', 'HIGH', 'HIGH', 25.355],\n", 500 | " [47, 'M', 'LOW', 'HIGH', 13.093],\n", 501 | " [47, 'M', 'LOW', 'HIGH', 10.113999999999999],\n", 502 | " [28, 'F', 'NORMAL', 'HIGH', 7.797999999999999],\n", 503 | " [61, 'F', 'LOW', 'HIGH', 18.043]], dtype=object)" 504 | ] 505 | }, 506 | "execution_count": 8, 507 | "metadata": {}, 508 | "output_type": "execute_result" 509 | } 510 | ], 511 | "source": [ 512 | "X = my_data[['Age', 'Sex', 'BP', 'Cholesterol', 'Na_to_K']].values\n", 513 | "X[0:5]" 514 | ] 515 | }, 516 | { 517 | "cell_type": "markdown", 518 | "metadata": {}, 519 | "source": [ 520 | "As you may figure out, some features in this dataset are categorical such as __Sex__ or __BP__. Unfortunately, Sklearn Decision Trees do not handle categorical variables. But still we can convert these features to numerical values. __pandas.get_dummies()__\n", 521 | "Convert categorical variable into dummy/indicator variables." 522 | ] 523 | }, 524 | { 525 | "cell_type": "code", 526 | "execution_count": null, 527 | "metadata": {}, 528 | "outputs": [], 529 | "source": [] 530 | }, 531 | { 532 | "cell_type": "code", 533 | "execution_count": 9, 534 | "metadata": {}, 535 | "outputs": [ 536 | { 537 | "data": { 538 | "text/plain": [ 539 | "array([[23, 0, 0, 0, 25.355],\n", 540 | " [47, 1, 1, 0, 13.093],\n", 541 | " [47, 1, 1, 0, 10.113999999999999],\n", 542 | " [28, 0, 2, 0, 7.797999999999999],\n", 543 | " [61, 0, 1, 0, 18.043]], dtype=object)" 544 | ] 545 | }, 546 | "execution_count": 9, 547 | "metadata": {}, 548 | "output_type": "execute_result" 549 | } 550 | ], 551 | "source": [ 552 | "from sklearn import preprocessing\n", 553 | "le_sex = preprocessing.LabelEncoder()\n", 554 | "le_sex.fit(['F','M'])\n", 555 | "X[:,1] = le_sex.transform(X[:,1]) \n", 556 | "\n", 557 | "\n", 558 | "le_BP = preprocessing.LabelEncoder()\n", 559 | "le_BP.fit([ 'LOW', 'NORMAL', 'HIGH'])\n", 560 | "X[:,2] = le_BP.transform(X[:,2])\n", 561 | "\n", 562 | "\n", 563 | "le_Chol = preprocessing.LabelEncoder()\n", 564 | "le_Chol.fit([ 'NORMAL', 'HIGH'])\n", 565 | "X[:,3] = le_Chol.transform(X[:,3]) \n", 566 | "\n", 567 | "X[0:5]" 568 | ] 569 | }, 570 | { 571 | "cell_type": "markdown", 572 | "metadata": {}, 573 | "source": [ 574 | "Now we can fill the target variable." 
575 | ] 576 | }, 577 | { 578 | "cell_type": "code", 579 | "execution_count": 10, 580 | "metadata": { 581 | "button": false, 582 | "deletable": true, 583 | "new_sheet": false, 584 | "run_control": { 585 | "read_only": false 586 | } 587 | }, 588 | "outputs": [ 589 | { 590 | "data": { 591 | "text/plain": [ 592 | "0 drugY\n", 593 | "1 drugC\n", 594 | "2 drugC\n", 595 | "3 drugX\n", 596 | "4 drugY\n", 597 | "Name: Drug, dtype: object" 598 | ] 599 | }, 600 | "execution_count": 10, 601 | "metadata": {}, 602 | "output_type": "execute_result" 603 | } 604 | ], 605 | "source": [ 606 | "y = my_data[\"Drug\"]\n", 607 | "y[0:5]" 608 | ] 609 | }, 610 | { 611 | "cell_type": "markdown", 612 | "metadata": { 613 | "button": false, 614 | "deletable": true, 615 | "new_sheet": false, 616 | "run_control": { 617 | "read_only": false 618 | } 619 | }, 620 | "source": [ 621 | "
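The pre-processing cell above mentions pandas.get_dummies() but then encodes the categories with LabelEncoder. For comparison, here is a minimal sketch of the get_dummies approach (assuming drug200.csv has been downloaded as above):

```python
import pandas as pd

# One-hot encode the BP column: each category becomes its own 0/1 column.
my_data = pd.read_csv("drug200.csv", delimiter=",")
bp_dummies = pd.get_dummies(my_data["BP"], prefix="BP")
print(bp_dummies.head())

# Unlike LabelEncoder, get_dummies does not impose an artificial ordering
# (e.g. HIGH < LOW < NORMAL) on the categories.
```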
\n", 622 | "\n", 623 | "
\n", 624 | "

Setting up the Decision Tree

\n", 625 | " We will be using train/test split on our decision tree. Let's import train_test_split from sklearn.cross_validation.\n", 626 | "
" 627 | ] 628 | }, 629 | { 630 | "cell_type": "code", 631 | "execution_count": 19, 632 | "metadata": { 633 | "button": false, 634 | "deletable": true, 635 | "new_sheet": false, 636 | "run_control": { 637 | "read_only": false 638 | } 639 | }, 640 | "outputs": [], 641 | "source": [ 642 | "from sklearn.model_selection import train_test_split" 643 | ] 644 | }, 645 | { 646 | "cell_type": "markdown", 647 | "metadata": { 648 | "button": false, 649 | "deletable": true, 650 | "new_sheet": false, 651 | "run_control": { 652 | "read_only": false 653 | } 654 | }, 655 | "source": [ 656 | "Now train_test_split will return 4 different parameters. We will name them:
\n", 657 | "X_trainset, X_testset, y_trainset, y_testset

\n", 658 | "The train_test_split will need the parameters:
\n", 659 | "X, y, test_size=0.3, and random_state=3.

\n", 660 | "The X and y are the arrays required before the split, the test_size represents the ratio of the testing dataset, and the random_state ensures that we obtain the same splits." 661 | ] 662 | }, 663 | { 664 | "cell_type": "code", 665 | "execution_count": 20, 666 | "metadata": { 667 | "button": false, 668 | "deletable": true, 669 | "new_sheet": false, 670 | "run_control": { 671 | "read_only": false 672 | } 673 | }, 674 | "outputs": [], 675 | "source": [ 676 | "X_trainset, X_testset, y_trainset, y_testset = train_test_split(X, y, test_size=0.3, random_state=3)" 677 | ] 678 | }, 679 | { 680 | "cell_type": "markdown", 681 | "metadata": { 682 | "button": false, 683 | "deletable": true, 684 | "new_sheet": false, 685 | "run_control": { 686 | "read_only": false 687 | } 688 | }, 689 | "source": [ 690 | "

## Practice

\n", 691 | "Print the shape of X_trainset and y_trainset. Ensure that the dimensions match" 692 | ] 693 | }, 694 | { 695 | "cell_type": "code", 696 | "execution_count": null, 697 | "metadata": { 698 | "button": false, 699 | "collapsed": true, 700 | "deletable": true, 701 | "jupyter": { 702 | "outputs_hidden": true 703 | }, 704 | "new_sheet": false, 705 | "run_control": { 706 | "read_only": false 707 | } 708 | }, 709 | "outputs": [], 710 | "source": [ 711 | "# your code\n", 712 | "\n" 713 | ] 714 | }, 715 | { 716 | "cell_type": "markdown", 717 | "metadata": { 718 | "button": false, 719 | "deletable": true, 720 | "new_sheet": false, 721 | "run_control": { 722 | "read_only": false 723 | } 724 | }, 725 | "source": [ 726 | "Print the shape of X_testset and y_testset. Ensure that the dimensions match" 727 | ] 728 | }, 729 | { 730 | "cell_type": "code", 731 | "execution_count": null, 732 | "metadata": { 733 | "button": false, 734 | "collapsed": true, 735 | "deletable": true, 736 | "jupyter": { 737 | "outputs_hidden": true 738 | }, 739 | "new_sheet": false, 740 | "run_control": { 741 | "read_only": false 742 | } 743 | }, 744 | "outputs": [], 745 | "source": [ 746 | "# your code\n", 747 | "\n" 748 | ] 749 | }, 750 | { 751 | "cell_type": "markdown", 752 | "metadata": { 753 | "button": false, 754 | "deletable": true, 755 | "new_sheet": false, 756 | "run_control": { 757 | "read_only": false 758 | } 759 | }, 760 | "source": [ 761 | "
\n", 762 | "\n", 763 | "
\n", 764 | "

Modeling

\n", 765 | " We will first create an instance of the DecisionTreeClassifier called drugTree.
\n", 766 | " Inside of the classifier, specify criterion=\"entropy\" so we can see the information gain of each node.\n", 767 | "
" 768 | ] 769 | }, 770 | { 771 | "cell_type": "code", 772 | "execution_count": 21, 773 | "metadata": { 774 | "button": false, 775 | "deletable": true, 776 | "new_sheet": false, 777 | "run_control": { 778 | "read_only": false 779 | } 780 | }, 781 | "outputs": [ 782 | { 783 | "data": { 784 | "text/plain": [ 785 | "DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=4,\n", 786 | " max_features=None, max_leaf_nodes=None,\n", 787 | " min_impurity_decrease=0.0, min_impurity_split=None,\n", 788 | " min_samples_leaf=1, min_samples_split=2,\n", 789 | " min_weight_fraction_leaf=0.0, presort=False, random_state=None,\n", 790 | " splitter='best')" 791 | ] 792 | }, 793 | "execution_count": 21, 794 | "metadata": {}, 795 | "output_type": "execute_result" 796 | } 797 | ], 798 | "source": [ 799 | "drugTree = DecisionTreeClassifier(criterion=\"entropy\", max_depth = 4)\n", 800 | "drugTree # it shows the default parameters" 801 | ] 802 | }, 803 | { 804 | "cell_type": "markdown", 805 | "metadata": { 806 | "button": false, 807 | "deletable": true, 808 | "new_sheet": false, 809 | "run_control": { 810 | "read_only": false 811 | } 812 | }, 813 | "source": [ 814 | "Next, we will fit the data with the training feature matrix X_trainset and training response vector y_trainset " 815 | ] 816 | }, 817 | { 818 | "cell_type": "code", 819 | "execution_count": 22, 820 | "metadata": { 821 | "button": false, 822 | "deletable": true, 823 | "new_sheet": false, 824 | "run_control": { 825 | "read_only": false 826 | } 827 | }, 828 | "outputs": [ 829 | { 830 | "data": { 831 | "text/plain": [ 832 | "DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=4,\n", 833 | " max_features=None, max_leaf_nodes=None,\n", 834 | " min_impurity_decrease=0.0, min_impurity_split=None,\n", 835 | " min_samples_leaf=1, min_samples_split=2,\n", 836 | " min_weight_fraction_leaf=0.0, presort=False, random_state=None,\n", 837 | " splitter='best')" 838 | ] 839 | }, 840 | "execution_count": 22, 841 | "metadata": {}, 842 | "output_type": "execute_result" 843 | } 844 | ], 845 | "source": [ 846 | "drugTree.fit(X_trainset,y_trainset)" 847 | ] 848 | }, 849 | { 850 | "cell_type": "markdown", 851 | "metadata": { 852 | "button": false, 853 | "deletable": true, 854 | "new_sheet": false, 855 | "run_control": { 856 | "read_only": false 857 | } 858 | }, 859 | "source": [ 860 | "
\n", 861 | "\n", 862 | "
\n", 863 | "

Prediction

\n", 864 | " Let's make some predictions on the testing dataset and store it into a variable called predTree.\n", 865 | "
" 866 | ] 867 | }, 868 | { 869 | "cell_type": "code", 870 | "execution_count": 24, 871 | "metadata": { 872 | "button": false, 873 | "deletable": true, 874 | "new_sheet": false, 875 | "run_control": { 876 | "read_only": false 877 | } 878 | }, 879 | "outputs": [], 880 | "source": [ 881 | "predTree = drugTree.predict(X_testset)" 882 | ] 883 | }, 884 | { 885 | "cell_type": "markdown", 886 | "metadata": { 887 | "button": false, 888 | "deletable": true, 889 | "new_sheet": false, 890 | "run_control": { 891 | "read_only": false 892 | } 893 | }, 894 | "source": [ 895 | "You can print out predTree and y_testset if you want to visually compare the prediction to the actual values." 896 | ] 897 | }, 898 | { 899 | "cell_type": "code", 900 | "execution_count": 25, 901 | "metadata": { 902 | "button": false, 903 | "deletable": true, 904 | "new_sheet": false, 905 | "run_control": { 906 | "read_only": false 907 | }, 908 | "scrolled": true 909 | }, 910 | "outputs": [ 911 | { 912 | "name": "stdout", 913 | "output_type": "stream", 914 | "text": [ 915 | "['drugY' 'drugX' 'drugX' 'drugX' 'drugX']\n", 916 | "40 drugY\n", 917 | "51 drugX\n", 918 | "139 drugX\n", 919 | "197 drugX\n", 920 | "170 drugX\n", 921 | "Name: Drug, dtype: object\n" 922 | ] 923 | } 924 | ], 925 | "source": [ 926 | "print (predTree [0:5])\n", 927 | "print (y_testset [0:5])\n" 928 | ] 929 | }, 930 | { 931 | "cell_type": "markdown", 932 | "metadata": { 933 | "button": false, 934 | "deletable": true, 935 | "new_sheet": false, 936 | "run_control": { 937 | "read_only": false 938 | } 939 | }, 940 | "source": [ 941 | "
\n", 942 | "\n", 943 | "
\n", 944 | "

Evaluation

\n", 945 | " Next, let's import metrics from sklearn and check the accuracy of our model.\n", 946 | "
" 947 | ] 948 | }, 949 | { 950 | "cell_type": "code", 951 | "execution_count": 1, 952 | "metadata": { 953 | "button": false, 954 | "deletable": true, 955 | "new_sheet": false, 956 | "run_control": { 957 | "read_only": false 958 | } 959 | }, 960 | "outputs": [ 961 | { 962 | "ename": "NameError", 963 | "evalue": "name 'y_testset' is not defined", 964 | "output_type": "error", 965 | "traceback": [ 966 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", 967 | "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)", 968 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0msklearn\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mmetrics\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mmatplotlib\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpyplot\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0mplt\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 3\u001b[0;31m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"DecisionTrees's Accuracy: \"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmetrics\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0maccuracy_score\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0my_testset\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mpredTree\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", 969 | "\u001b[0;31mNameError\u001b[0m: name 'y_testset' is not defined" 970 | ] 971 | } 972 | ], 973 | "source": [ 974 | "from sklearn import metrics\n", 975 | "import matplotlib.pyplot as plt\n", 976 | "print(\"DecisionTrees's Accuracy: \", metrics.accuracy_score(y_testset, predTree))" 977 | ] 978 | }, 979 | { 980 | "cell_type": "markdown", 981 | "metadata": { 982 | "button": false, 983 | "deletable": true, 984 | "new_sheet": false, 985 | "run_control": { 986 | "read_only": false 987 | } 988 | }, 989 | "source": [ 990 | "__Accuracy classification score__ computes subset accuracy: the set of labels predicted for a sample must exactly match the corresponding set of labels in y_true. \n", 991 | "\n", 992 | "In multilabel classification, the function returns the subset accuracy. If the entire set of predicted labels for a sample strictly match with the true set of labels, then the subset accuracy is 1.0; otherwise it is 0.0.\n" 993 | ] 994 | }, 995 | { 996 | "cell_type": "markdown", 997 | "metadata": { 998 | "button": false, 999 | "deletable": true, 1000 | "new_sheet": false, 1001 | "run_control": { 1002 | "read_only": false 1003 | } 1004 | }, 1005 | "source": [ 1006 | "## Practice \n", 1007 | "Can you calculate the accuracy score without sklearn ?" 1008 | ] 1009 | }, 1010 | { 1011 | "cell_type": "code", 1012 | "execution_count": null, 1013 | "metadata": { 1014 | "button": false, 1015 | "collapsed": true, 1016 | "deletable": true, 1017 | "jupyter": { 1018 | "outputs_hidden": true 1019 | }, 1020 | "new_sheet": false, 1021 | "run_control": { 1022 | "read_only": false 1023 | } 1024 | }, 1025 | "outputs": [], 1026 | "source": [ 1027 | "# your code here\n" 1028 | ] 1029 | }, 1030 | { 1031 | "cell_type": "markdown", 1032 | "metadata": {}, 1033 | "source": [ 1034 | "
\n", 1035 | "\n", 1036 | "
\n", 1037 | "

Visualization

\n", 1038 | " Lets visualize the tree\n", 1039 | "
" 1040 | ] 1041 | }, 1042 | { 1043 | "cell_type": "code", 1044 | "execution_count": null, 1045 | "metadata": {}, 1046 | "outputs": [], 1047 | "source": [ 1048 | "# Notice: You might need to uncomment and install the pydotplus and graphviz libraries if you have not installed these before\n", 1049 | "# !conda install -c conda-forge pydotplus -y\n", 1050 | "# !conda install -c conda-forge python-graphviz -y" 1051 | ] 1052 | }, 1053 | { 1054 | "cell_type": "code", 1055 | "execution_count": null, 1056 | "metadata": { 1057 | "button": false, 1058 | "collapsed": true, 1059 | "deletable": true, 1060 | "jupyter": { 1061 | "outputs_hidden": true 1062 | }, 1063 | "new_sheet": false, 1064 | "run_control": { 1065 | "read_only": false 1066 | } 1067 | }, 1068 | "outputs": [], 1069 | "source": [ 1070 | "from sklearn.externals.six import StringIO\n", 1071 | "import pydotplus\n", 1072 | "import matplotlib.image as mpimg\n", 1073 | "from sklearn import tree\n", 1074 | "%matplotlib inline " 1075 | ] 1076 | }, 1077 | { 1078 | "cell_type": "code", 1079 | "execution_count": null, 1080 | "metadata": { 1081 | "button": false, 1082 | "collapsed": true, 1083 | "deletable": true, 1084 | "jupyter": { 1085 | "outputs_hidden": true 1086 | }, 1087 | "new_sheet": false, 1088 | "run_control": { 1089 | "read_only": false 1090 | } 1091 | }, 1092 | "outputs": [], 1093 | "source": [ 1094 | "dot_data = StringIO()\n", 1095 | "filename = \"drugtree.png\"\n", 1096 | "featureNames = my_data.columns[0:5]\n", 1097 | "targetNames = my_data[\"Drug\"].unique().tolist()\n", 1098 | "out=tree.export_graphviz(drugTree,feature_names=featureNames, out_file=dot_data, class_names= np.unique(y_trainset), filled=True, special_characters=True,rotate=False) \n", 1099 | "graph = pydotplus.graph_from_dot_data(dot_data.getvalue()) \n", 1100 | "graph.write_png(filename)\n", 1101 | "img = mpimg.imread(filename)\n", 1102 | "plt.figure(figsize=(100, 200))\n", 1103 | "plt.imshow(img,interpolation='nearest')" 1104 | ] 1105 | }, 1106 | { 1107 | "cell_type": "markdown", 1108 | "metadata": { 1109 | "button": false, 1110 | "deletable": true, 1111 | "new_sheet": false, 1112 | "run_control": { 1113 | "read_only": false 1114 | } 1115 | }, 1116 | "source": [ 1117 | "

## Want to learn more?

\n", 1118 | "\n", 1119 | "IBM SPSS Modeler is a comprehensive analytics platform that has many machine learning algorithms. It has been designed to bring predictive intelligence to decisions made by individuals, by groups, by systems – by your enterprise as a whole. A free trial is available through this course, available here: SPSS Modeler\n", 1120 | "\n", 1121 | "Also, you can use Watson Studio to run these notebooks faster with bigger datasets. Watson Studio is IBM's leading cloud solution for data scientists, built by data scientists. With Jupyter notebooks, RStudio, Apache Spark and popular libraries pre-packaged in the cloud, Watson Studio enables data scientists to collaborate on their projects without having to install anything. Join the fast-growing community of Watson Studio users today with a free account at Watson Studio\n", 1122 | "\n", 1123 | "

### Thanks for completing this lesson!

\n", 1124 | "\n", 1125 | "

### Author: Saeed Aghabozorgi

\n", 1126 | "

Saeed Aghabozorgi, PhD, is a Data Scientist at IBM with a track record of developing enterprise-level applications that substantially increase clients’ ability to turn data into actionable knowledge. He is a researcher in the data mining field and an expert in developing advanced analytic methods like machine learning and statistical modelling on large datasets.

\n", 1127 | "\n", 1128 | "
\n", 1129 | "\n", 1130 | "

Copyright © 2018 Cognitive Class. This notebook and its source code are released under the terms of the MIT License.

" 1131 | ] 1132 | } 1133 | ], 1134 | "metadata": { 1135 | "anaconda-cloud": {}, 1136 | "kernelspec": { 1137 | "display_name": "Python 3", 1138 | "language": "python", 1139 | "name": "python3" 1140 | }, 1141 | "language_info": { 1142 | "codemirror_mode": { 1143 | "name": "ipython", 1144 | "version": 3 1145 | }, 1146 | "file_extension": ".py", 1147 | "mimetype": "text/x-python", 1148 | "name": "python", 1149 | "nbconvert_exporter": "python", 1150 | "pygments_lexer": "ipython3", 1151 | "version": "3.6.7" 1152 | }, 1153 | "widgets": { 1154 | "state": {}, 1155 | "version": "1.1.2" 1156 | } 1157 | }, 1158 | "nbformat": 4, 1159 | "nbformat_minor": 4 1160 | } 1161 | -------------------------------------------------------------------------------- /Classification Models/ML0101EN-Clas-Logistic-Reg-churn-py-v1.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "button": false, 7 | "new_sheet": false, 8 | "run_control": { 9 | "read_only": false 10 | } 11 | }, 12 | "source": [ 13 | "\n", 14 | "\n", 15 | "

# Logistic Regression with Python

" 16 | ] 17 | }, 18 | { 19 | "cell_type": "markdown", 20 | "metadata": {}, 21 | "source": [ 22 | "In this notebook, you will learn Logistic Regression, and then, you'll create a model for a telecommunication company, to predict when its customers will leave for a competitor, so that they can take some action to retain the customers." 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": {}, 28 | "source": [ 29 | "

## Table of contents
\n", 30 | "\n", 31 | "1. About the dataset\n", 32 | "2. Data pre-processing and selection\n", 33 | "3. Modeling (Logistic Regression with Scikit-learn)\n", 34 | "4. Evaluation\n", 35 | "5. Practice\n", 36 | "
" 42 | ] 43 | }, 44 | { 45 | "cell_type": "markdown", 46 | "metadata": { 47 | "button": false, 48 | "new_sheet": false, 49 | "run_control": { 50 | "read_only": false 51 | } 52 | }, 53 | "source": [ 54 | "\n", 55 | "## What is the difference between Linear and Logistic Regression?\n", 56 | "\n", 57 | "While Linear Regression is suited for estimating continuous values (e.g. estimating house price), it is not the best tool for predicting the class of an observed data point. In order to estimate the class of a data point, we need some sort of guidance on what would be the most probable class for that data point. For this, we use Logistic Regression.\n", 58 | "\n", 59 | "
\n", 60 | "Recall linear regression:\n", 61 | "
\n", 62 | "
\n", 63 | " As you know, Linear regression finds a function that relates a continuous dependent variable, y, to some predictors (independent variables $x_1$, $x_2$, etc.). For example, Simple linear regression assumes a function of the form:\n", 64 | "

\n", 65 | "$$\n", 66 | "y = \\theta_0 + \\theta_1 x_1 + \\theta_2 x_2 + \\cdots\n", 67 | "$$\n", 68 | "
\n", 69 | "and finds the values of parameters $\\theta_0, \\theta_1, \\theta_2$, etc, where the term $\\theta_0$ is the \"intercept\". It can be generally shown as:\n", 70 | "

\n", 71 | "$$\n", 72 | "ℎ_\\theta(𝑥) = \\theta^TX\n", 73 | "$$\n", 74 | "

\n", 75 | "\n", 76 | "
\n", 77 | "\n", 78 | "Logistic Regression is a variation of Linear Regression, useful when the observed dependent variable, y, is categorical. It produces a formula that predicts the probability of the class label as a function of the independent variables.\n", 79 | "\n", 80 | "Logistic regression fits a special s-shaped curve by taking the linear regression and transforming the numeric estimate into a probability with the following function, which is called sigmoid function 𝜎:\n", 81 | "\n", 82 | "$$\n", 83 | "ℎ_\\theta(𝑥) = \\sigma({\\theta^TX}) = \\frac {e^{(\\theta_0 + \\theta_1 x_1 + \\theta_2 x_2 +...)}}{1 + e^{(\\theta_0 + \\theta_1 x_1 + \\theta_2 x_2 +\\cdots)}}\n", 84 | "$$\n", 85 | "Or:\n", 86 | "$$\n", 87 | "ProbabilityOfaClass_1 = P(Y=1|X) = \\sigma({\\theta^TX}) = \\frac{e^{\\theta^TX}}{1+e^{\\theta^TX}} \n", 88 | "$$\n", 89 | "\n", 90 | "In this equation, ${\\theta^TX}$ is the regression result (the sum of the variables weighted by the coefficients), `exp` is the exponential function and $\\sigma(\\theta^TX)$ is the sigmoid or [logistic function](http://en.wikipedia.org/wiki/Logistic_function), also called logistic curve. It is a common \"S\" shape (sigmoid curve).\n", 91 | "\n", 92 | "So, briefly, Logistic Regression passes the input through the logistic/sigmoid but then treats the result as a probability:\n", 93 | "\n", 94 | "\n", 96 | "\n", 97 | "\n", 98 | "The objective of __Logistic Regression__ algorithm, is to find the best parameters θ, for $ℎ_\\theta(𝑥)$ = $\\sigma({\\theta^TX})$, in such a way that the model best predicts the class of each case." 99 | ] 100 | }, 101 | { 102 | "cell_type": "markdown", 103 | "metadata": {}, 104 | "source": [ 105 | "### Customer churn with Logistic Regression\n", 106 | "A telecommunications company is concerned about the number of customers leaving their land-line business for cable competitors. They need to understand who is leaving. Imagine that you are an analyst at this company and you have to find out who is leaving and why." 107 | ] 108 | }, 109 | { 110 | "cell_type": "markdown", 111 | "metadata": { 112 | "button": false, 113 | "new_sheet": false, 114 | "run_control": { 115 | "read_only": false 116 | } 117 | }, 118 | "source": [ 119 | "Lets first import required libraries:" 120 | ] 121 | }, 122 | { 123 | "cell_type": "code", 124 | "execution_count": 2, 125 | "metadata": { 126 | "button": false, 127 | "new_sheet": false, 128 | "run_control": { 129 | "read_only": false 130 | } 131 | }, 132 | "outputs": [], 133 | "source": [ 134 | "import pandas as pd\n", 135 | "import pylab as pl\n", 136 | "import numpy as np\n", 137 | "import scipy.optimize as opt\n", 138 | "from sklearn import preprocessing\n", 139 | "%matplotlib inline \n", 140 | "import matplotlib.pyplot as plt" 141 | ] 142 | }, 143 | { 144 | "cell_type": "markdown", 145 | "metadata": { 146 | "button": false, 147 | "new_sheet": false, 148 | "run_control": { 149 | "read_only": false 150 | } 151 | }, 152 | "source": [ 153 | "

## About the dataset

\n", 154 | "We will use a telecommunications dataset for predicting customer churn. This is a historical customer dataset where each row represents one customer. The data is relatively easy to understand, and you may uncover insights you can use immediately. Typically it is less expensive to keep customers than acquire new ones, so the focus of this analysis is to predict the customers who will stay with the company. \n", 155 | "\n", 156 | "\n", 157 | "This data set provides information to help you predict what behavior will help you to retain customers. You can analyze all relevant customer data and develop focused customer retention programs.\n", 158 | "\n", 159 | "\n", 160 | "\n", 161 | "The dataset includes information about:\n", 162 | "\n", 163 | "- Customers who left within the last month – the column is called Churn\n", 164 | "- Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies\n", 165 | "- Customer account information – how long they had been a customer, contract, payment method, paperless billing, monthly charges, and total charges\n", 166 | "- Demographic info about customers – gender, age range, and if they have partners and dependents\n" 167 | ] 168 | }, 169 | { 170 | "cell_type": "markdown", 171 | "metadata": { 172 | "button": false, 173 | "new_sheet": false, 174 | "run_control": { 175 | "read_only": false 176 | } 177 | }, 178 | "source": [ 179 | "### Load the Telco Churn data \n", 180 | "Telco Churn is a hypothetical data file that concerns a telecommunications company's efforts to reduce turnover in its customer base. Each case corresponds to a separate customer and it records various demographic and service usage information. Before you can work with the data, you must use the URL to get the ChurnData.csv.\n", 181 | "\n", 182 | "To download the data, we will use `!wget` to download it from IBM Object Storage." 183 | ] 184 | }, 185 | { 186 | "cell_type": "code", 187 | "execution_count": 3, 188 | "metadata": { 189 | "button": false, 190 | "new_sheet": false, 191 | "run_control": { 192 | "read_only": false 193 | } 194 | }, 195 | "outputs": [ 196 | { 197 | "name": "stdout", 198 | "output_type": "stream", 199 | "text": [ 200 | "--2019-07-11 02:13:17-- https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/ChurnData.csv\n", 201 | "Resolving s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)... 67.228.254.193\n", 202 | "Connecting to s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)|67.228.254.193|:443... connected.\n", 203 | "HTTP request sent, awaiting response... 200 OK\n", 204 | "Length: 36144 (35K) [text/csv]\n", 205 | "Saving to: ‘ChurnData.csv’\n", 206 | "\n", 207 | "ChurnData.csv 100%[===================>] 35.30K --.-KB/s in 0.02s \n", 208 | "\n", 209 | "2019-07-11 02:13:17 (1.63 MB/s) - ‘ChurnData.csv’ saved [36144/36144]\n", 210 | "\n" 211 | ] 212 | } 213 | ], 214 | "source": [ 215 | "#Click here and press Shift+Enter\n", 216 | "!wget -O ChurnData.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/ChurnData.csv" 217 | ] 218 | }, 219 | { 220 | "cell_type": "markdown", 221 | "metadata": {}, 222 | "source": [ 223 | "__Did you know?__ When it comes to Machine Learning, you will likely be working with large datasets. As a business, where can you host your data? 
IBM is offering a unique opportunity for businesses, with 10 Tb of IBM Cloud Object Storage: [Sign up now for free](http://cocl.us/ML0101EN-IBM-Offer-CC)" 224 | ] 225 | }, 226 | { 227 | "cell_type": "markdown", 228 | "metadata": { 229 | "button": false, 230 | "new_sheet": false, 231 | "run_control": { 232 | "read_only": false 233 | } 234 | }, 235 | "source": [ 236 | "### Load Data From CSV File " 237 | ] 238 | }, 239 | { 240 | "cell_type": "code", 241 | "execution_count": 4, 242 | "metadata": { 243 | "button": false, 244 | "new_sheet": false, 245 | "run_control": { 246 | "read_only": false 247 | } 248 | }, 249 | "outputs": [ 250 | { 251 | "data": { 252 | "text/html": [ 253 | "
\n", 254 | "\n", 267 | "\n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " \n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | " \n", 326 | " \n", 327 | " \n", 328 | " \n", 329 | " \n", 330 | " \n", 331 | " \n", 332 | " \n", 333 | " \n", 334 | " \n", 335 | " \n", 336 | " \n", 337 | " \n", 338 | " \n", 339 | " \n", 340 | " \n", 341 | " \n", 342 | " \n", 343 | " \n", 344 | " \n", 345 | " \n", 346 | " \n", 347 | " \n", 348 | " \n", 349 | " \n", 350 | " \n", 351 | " \n", 352 | " \n", 353 | " \n", 354 | " \n", 355 | " \n", 356 | " \n", 357 | " \n", 358 | " \n", 359 | " \n", 360 | " \n", 361 | " \n", 362 | " \n", 363 | " \n", 364 | " \n", 365 | " \n", 366 | " \n", 367 | " \n", 368 | " \n", 369 | " \n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | " \n", 374 | " \n", 375 | " \n", 376 | " \n", 377 | " \n", 378 | " \n", 379 | " \n", 380 | " \n", 381 | " \n", 382 | " \n", 383 | " \n", 384 | " \n", 385 | " \n", 386 | " \n", 387 | " \n", 388 | " \n", 389 | " \n", 390 | " \n", 391 | " \n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | " \n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | "
tenureageaddressincomeedemployequipcallcardwirelesslongmon...pagerinternetcallwaitconferebillloglonglogtolllninccustcatchurn
011.033.07.0136.05.05.00.01.01.04.40...1.00.01.01.00.01.4823.0334.9134.01.0
133.033.012.033.02.00.00.00.00.09.45...0.00.00.00.00.02.2463.2403.4971.01.0
223.030.09.030.01.02.00.00.00.06.30...0.00.00.01.00.01.8413.2403.4013.00.0
338.035.05.076.02.010.01.01.01.06.05...1.01.01.01.01.01.8003.8074.3314.00.0
47.035.014.080.02.015.00.01.00.07.10...0.00.01.01.00.01.9603.0914.3823.00.0
\n", 417 | "

5 rows × 28 columns

\n", 418 | "
" 419 | ], 420 | "text/plain": [ 421 | " tenure age address income ed employ equip callcard wireless \\\n", 422 | "0 11.0 33.0 7.0 136.0 5.0 5.0 0.0 1.0 1.0 \n", 423 | "1 33.0 33.0 12.0 33.0 2.0 0.0 0.0 0.0 0.0 \n", 424 | "2 23.0 30.0 9.0 30.0 1.0 2.0 0.0 0.0 0.0 \n", 425 | "3 38.0 35.0 5.0 76.0 2.0 10.0 1.0 1.0 1.0 \n", 426 | "4 7.0 35.0 14.0 80.0 2.0 15.0 0.0 1.0 0.0 \n", 427 | "\n", 428 | " longmon ... pager internet callwait confer ebill loglong logtoll \\\n", 429 | "0 4.40 ... 1.0 0.0 1.0 1.0 0.0 1.482 3.033 \n", 430 | "1 9.45 ... 0.0 0.0 0.0 0.0 0.0 2.246 3.240 \n", 431 | "2 6.30 ... 0.0 0.0 0.0 1.0 0.0 1.841 3.240 \n", 432 | "3 6.05 ... 1.0 1.0 1.0 1.0 1.0 1.800 3.807 \n", 433 | "4 7.10 ... 0.0 0.0 1.0 1.0 0.0 1.960 3.091 \n", 434 | "\n", 435 | " lninc custcat churn \n", 436 | "0 4.913 4.0 1.0 \n", 437 | "1 3.497 1.0 1.0 \n", 438 | "2 3.401 3.0 0.0 \n", 439 | "3 4.331 4.0 0.0 \n", 440 | "4 4.382 3.0 0.0 \n", 441 | "\n", 442 | "[5 rows x 28 columns]" 443 | ] 444 | }, 445 | "execution_count": 4, 446 | "metadata": {}, 447 | "output_type": "execute_result" 448 | } 449 | ], 450 | "source": [ 451 | "churn_df = pd.read_csv(\"ChurnData.csv\")\n", 452 | "churn_df.head()" 453 | ] 454 | }, 455 | { 456 | "cell_type": "markdown", 457 | "metadata": {}, 458 | "source": [ 459 | "

## Data pre-processing and selection

" 460 | ] 461 | }, 462 | { 463 | "cell_type": "markdown", 464 | "metadata": {}, 465 | "source": [ 466 | "Lets select some features for the modeling. Also we change the target data type to be integer, as it is a requirement by the skitlearn algorithm:" 467 | ] 468 | }, 469 | { 470 | "cell_type": "code", 471 | "execution_count": 6, 472 | "metadata": {}, 473 | "outputs": [ 474 | { 475 | "data": { 476 | "text/html": [ 477 | "
\n", 478 | "\n", 491 | "\n", 492 | " \n", 493 | " \n", 494 | " \n", 495 | " \n", 496 | " \n", 497 | " \n", 498 | " \n", 499 | " \n", 500 | " \n", 501 | " \n", 502 | " \n", 503 | " \n", 504 | " \n", 505 | " \n", 506 | " \n", 507 | " \n", 508 | " \n", 509 | " \n", 510 | " \n", 511 | " \n", 512 | " \n", 513 | " \n", 514 | " \n", 515 | " \n", 516 | " \n", 517 | " \n", 518 | " \n", 519 | " \n", 520 | " \n", 521 | " \n", 522 | " \n", 523 | " \n", 524 | " \n", 525 | " \n", 526 | " \n", 527 | " \n", 528 | " \n", 529 | " \n", 530 | " \n", 531 | " \n", 532 | " \n", 533 | " \n", 534 | " \n", 535 | " \n", 536 | " \n", 537 | " \n", 538 | " \n", 539 | " \n", 540 | " \n", 541 | " \n", 542 | " \n", 543 | " \n", 544 | " \n", 545 | " \n", 546 | " \n", 547 | " \n", 548 | " \n", 549 | " \n", 550 | " \n", 551 | " \n", 552 | " \n", 553 | " \n", 554 | " \n", 555 | " \n", 556 | " \n", 557 | " \n", 558 | " \n", 559 | " \n", 560 | " \n", 561 | " \n", 562 | " \n", 563 | " \n", 564 | " \n", 565 | " \n", 566 | " \n", 567 | " \n", 568 | " \n", 569 | " \n", 570 | " \n", 571 | " \n", 572 | " \n", 573 | " \n", 574 | "
tenureageaddressincomeedemployequipcallcardwirelesschurn
011.033.07.0136.05.05.00.01.01.01
133.033.012.033.02.00.00.00.00.01
223.030.09.030.01.02.00.00.00.00
338.035.05.076.02.010.01.01.01.00
47.035.014.080.02.015.00.01.00.00
\n", 575 | "
" 576 | ], 577 | "text/plain": [ 578 | " tenure age address income ed employ equip callcard wireless \\\n", 579 | "0 11.0 33.0 7.0 136.0 5.0 5.0 0.0 1.0 1.0 \n", 580 | "1 33.0 33.0 12.0 33.0 2.0 0.0 0.0 0.0 0.0 \n", 581 | "2 23.0 30.0 9.0 30.0 1.0 2.0 0.0 0.0 0.0 \n", 582 | "3 38.0 35.0 5.0 76.0 2.0 10.0 1.0 1.0 1.0 \n", 583 | "4 7.0 35.0 14.0 80.0 2.0 15.0 0.0 1.0 0.0 \n", 584 | "\n", 585 | " churn \n", 586 | "0 1 \n", 587 | "1 1 \n", 588 | "2 0 \n", 589 | "3 0 \n", 590 | "4 0 " 591 | ] 592 | }, 593 | "execution_count": 6, 594 | "metadata": {}, 595 | "output_type": "execute_result" 596 | } 597 | ], 598 | "source": [ 599 | "churn_df = churn_df[['tenure', 'age', 'address', 'income', 'ed', 'employ', 'equip', 'callcard', 'wireless','churn']]\n", 600 | "churn_df['churn'] = churn_df['churn'].astype('int')\n", 601 | "churn_df.head()" 602 | ] 603 | }, 604 | { 605 | "cell_type": "markdown", 606 | "metadata": { 607 | "button": true, 608 | "new_sheet": true, 609 | "run_control": { 610 | "read_only": false 611 | } 612 | }, 613 | "source": [ 614 | "## Practice\n", 615 | "How many rows and columns are in this dataset in total? What are the name of columns?" 616 | ] 617 | }, 618 | { 619 | "cell_type": "code", 620 | "execution_count": 5, 621 | "metadata": { 622 | "button": false, 623 | "new_sheet": false, 624 | "run_control": { 625 | "read_only": false 626 | } 627 | }, 628 | "outputs": [ 629 | { 630 | "data": { 631 | "text/plain": [ 632 | "(200, 28)" 633 | ] 634 | }, 635 | "execution_count": 5, 636 | "metadata": {}, 637 | "output_type": "execute_result" 638 | } 639 | ], 640 | "source": [ 641 | "# write your code here\n", 642 | "churn_df.shape\n", 643 | "\n" 644 | ] 645 | }, 646 | { 647 | "cell_type": "markdown", 648 | "metadata": {}, 649 | "source": [ 650 | "Lets define X, and y for our dataset:" 651 | ] 652 | }, 653 | { 654 | "cell_type": "code", 655 | "execution_count": 11, 656 | "metadata": {}, 657 | "outputs": [ 658 | { 659 | "data": { 660 | "text/html": [ 661 | "
\n", 662 | "\n", 675 | "\n", 676 | " \n", 677 | " \n", 678 | " \n", 679 | " \n", 680 | " \n", 681 | " \n", 682 | " \n", 683 | " \n", 684 | " \n", 685 | " \n", 686 | " \n", 687 | " \n", 688 | " \n", 689 | " \n", 690 | " \n", 691 | " \n", 692 | " \n", 693 | " \n", 694 | " \n", 695 | " \n", 696 | " \n", 697 | " \n", 698 | " \n", 699 | " \n", 700 | " \n", 701 | " \n", 702 | " \n", 703 | " \n", 704 | " \n", 705 | " \n", 706 | " \n", 707 | " \n", 708 | " \n", 709 | " \n", 710 | " \n", 711 | " \n", 712 | " \n", 713 | " \n", 714 | " \n", 715 | " \n", 716 | " \n", 717 | " \n", 718 | " \n", 719 | " \n", 720 | " \n", 721 | " \n", 722 | " \n", 723 | " \n", 724 | " \n", 725 | " \n", 726 | " \n", 727 | " \n", 728 | " \n", 729 | " \n", 730 | " \n", 731 | " \n", 732 | " \n", 733 | " \n", 734 | " \n", 735 | " \n", 736 | " \n", 737 | " \n", 738 | " \n", 739 | " \n", 740 | "
tenureageaddressincomeedemployequip
011.033.07.0136.05.05.00.0
133.033.012.033.02.00.00.0
223.030.09.030.01.02.00.0
338.035.05.076.02.010.01.0
47.035.014.080.02.015.00.0
\n", 741 | "
" 742 | ], 743 | "text/plain": [ 744 | " tenure age address income ed employ equip\n", 745 | "0 11.0 33.0 7.0 136.0 5.0 5.0 0.0\n", 746 | "1 33.0 33.0 12.0 33.0 2.0 0.0 0.0\n", 747 | "2 23.0 30.0 9.0 30.0 1.0 2.0 0.0\n", 748 | "3 38.0 35.0 5.0 76.0 2.0 10.0 1.0\n", 749 | "4 7.0 35.0 14.0 80.0 2.0 15.0 0.0" 750 | ] 751 | }, 752 | "execution_count": 11, 753 | "metadata": {}, 754 | "output_type": "execute_result" 755 | } 756 | ], 757 | "source": [ 758 | "X = churn_df[['tenure', 'age', 'address', 'income', 'ed', 'employ', 'equip']]\n", 759 | "X[0:5]" 760 | ] 761 | }, 762 | { 763 | "cell_type": "code", 764 | "execution_count": 10, 765 | "metadata": {}, 766 | "outputs": [ 767 | { 768 | "data": { 769 | "text/plain": [ 770 | "array([1, 1, 0, 0, 0])" 771 | ] 772 | }, 773 | "execution_count": 10, 774 | "metadata": {}, 775 | "output_type": "execute_result" 776 | } 777 | ], 778 | "source": [ 779 | "y = churn_df['churn'].values\n", 780 | "y [0:5]" 781 | ] 782 | }, 783 | { 784 | "cell_type": "markdown", 785 | "metadata": {}, 786 | "source": [ 787 | "Also, we normalize the dataset:" 788 | ] 789 | }, 790 | { 791 | "cell_type": "code", 792 | "execution_count": 12, 793 | "metadata": {}, 794 | "outputs": [ 795 | { 796 | "data": { 797 | "text/plain": [ 798 | "array([[-1.13518441, -0.62595491, -0.4588971 , 0.4751423 , 1.6961288 ,\n", 799 | " -0.58477841, -0.85972695],\n", 800 | " [-0.11604313, -0.62595491, 0.03454064, -0.32886061, -0.6433592 ,\n", 801 | " -1.14437497, -0.85972695],\n", 802 | " [-0.57928917, -0.85594447, -0.261522 , -0.35227817, -1.42318853,\n", 803 | " -0.92053635, -0.85972695],\n", 804 | " [ 0.11557989, -0.47262854, -0.65627219, 0.00679109, -0.6433592 ,\n", 805 | " -0.02518185, 1.16316 ],\n", 806 | " [-1.32048283, -0.47262854, 0.23191574, 0.03801451, -0.6433592 ,\n", 807 | " 0.53441472, -0.85972695]])" 808 | ] 809 | }, 810 | "execution_count": 12, 811 | "metadata": {}, 812 | "output_type": "execute_result" 813 | } 814 | ], 815 | "source": [ 816 | "from sklearn import preprocessing\n", 817 | "X = preprocessing.StandardScaler().fit(X).transform(X)\n", 818 | "X[0:5]" 819 | ] 820 | }, 821 | { 822 | "cell_type": "markdown", 823 | "metadata": {}, 824 | "source": [ 825 | "## Train/Test dataset" 826 | ] 827 | }, 828 | { 829 | "cell_type": "markdown", 830 | "metadata": {}, 831 | "source": [ 832 | "Okay, we split our dataset into train and test set:" 833 | ] 834 | }, 835 | { 836 | "cell_type": "code", 837 | "execution_count": 14, 838 | "metadata": {}, 839 | "outputs": [ 840 | { 841 | "name": "stdout", 842 | "output_type": "stream", 843 | "text": [ 844 | "Train set: (160, 7) (160,)\n", 845 | "Test set: (40, 7) (40,)\n" 846 | ] 847 | } 848 | ], 849 | "source": [ 850 | "from sklearn.model_selection import train_test_split\n", 851 | "X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=4)\n", 852 | "print ('Train set:', X_train.shape, y_train.shape)\n", 853 | "print ('Test set:', X_test.shape, y_test.shape)" 854 | ] 855 | }, 856 | { 857 | "cell_type": "markdown", 858 | "metadata": {}, 859 | "source": [ 860 | "

## Modeling (Logistic Regression with Scikit-learn)

" 861 | ] 862 | }, 863 | { 864 | "cell_type": "markdown", 865 | "metadata": {}, 866 | "source": [ 867 | "Lets build our model using __LogisticRegression__ from Scikit-learn package. This function implements logistic regression and can use different numerical optimizers to find parameters, including ‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, ‘saga’ solvers. You can find extensive information about the pros and cons of these optimizers if you search it in internet.\n", 868 | "\n", 869 | "The version of Logistic Regression in Scikit-learn, support regularization. Regularization is a technique used to solve the overfitting problem in machine learning models.\n", 870 | "__C__ parameter indicates __inverse of regularization strength__ which must be a positive float. Smaller values specify stronger regularization. \n", 871 | "Now lets fit our model with train set:" 872 | ] 873 | }, 874 | { 875 | "cell_type": "code", 876 | "execution_count": 16, 877 | "metadata": {}, 878 | "outputs": [ 879 | { 880 | "data": { 881 | "text/plain": [ 882 | "LogisticRegression(C=0.01, class_weight=None, dual=False, fit_intercept=True,\n", 883 | " intercept_scaling=1, max_iter=100, multi_class='warn',\n", 884 | " n_jobs=None, penalty='l2', random_state=None, solver='liblinear',\n", 885 | " tol=0.0001, verbose=0, warm_start=False)" 886 | ] 887 | }, 888 | "execution_count": 16, 889 | "metadata": {}, 890 | "output_type": "execute_result" 891 | } 892 | ], 893 | "source": [ 894 | "from sklearn.linear_model import LogisticRegression\n", 895 | "from sklearn.metrics import confusion_matrix\n", 896 | "LR = LogisticRegression(C=0.01, solver='liblinear').fit(X_train,y_train)\n", 897 | "LR" 898 | ] 899 | }, 900 | { 901 | "cell_type": "markdown", 902 | "metadata": {}, 903 | "source": [ 904 | "Now we can predict using our test set:" 905 | ] 906 | }, 907 | { 908 | "cell_type": "code", 909 | "execution_count": 17, 910 | "metadata": {}, 911 | "outputs": [ 912 | { 913 | "data": { 914 | "text/plain": [ 915 | "array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0,\n", 916 | " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0])" 917 | ] 918 | }, 919 | "execution_count": 17, 920 | "metadata": {}, 921 | "output_type": "execute_result" 922 | } 923 | ], 924 | "source": [ 925 | "yhat = LR.predict(X_test)\n", 926 | "yhat" 927 | ] 928 | }, 929 | { 930 | "cell_type": "markdown", 931 | "metadata": {}, 932 | "source": [ 933 | "__predict_proba__ returns estimates for all classes, ordered by the label of classes. 
So, the first column is the probability of class 1, P(Y=1|X), and second column is probability of class 0, P(Y=0|X):" 934 | ] 935 | }, 936 | { 937 | "cell_type": "code", 938 | "execution_count": 18, 939 | "metadata": {}, 940 | "outputs": [ 941 | { 942 | "data": { 943 | "text/plain": [ 944 | "array([[0.54132919, 0.45867081],\n", 945 | " [0.60593357, 0.39406643],\n", 946 | " [0.56277713, 0.43722287],\n", 947 | " [0.63432489, 0.36567511],\n", 948 | " [0.56431839, 0.43568161],\n", 949 | " [0.55386646, 0.44613354],\n", 950 | " [0.52237207, 0.47762793],\n", 951 | " [0.60514349, 0.39485651],\n", 952 | " [0.41069572, 0.58930428],\n", 953 | " [0.6333873 , 0.3666127 ],\n", 954 | " [0.58068791, 0.41931209],\n", 955 | " [0.62768628, 0.37231372],\n", 956 | " [0.47559883, 0.52440117],\n", 957 | " [0.4267593 , 0.5732407 ],\n", 958 | " [0.66172417, 0.33827583],\n", 959 | " [0.55092315, 0.44907685],\n", 960 | " [0.51749946, 0.48250054],\n", 961 | " [0.485743 , 0.514257 ],\n", 962 | " [0.49011451, 0.50988549],\n", 963 | " [0.52423349, 0.47576651],\n", 964 | " [0.61619519, 0.38380481],\n", 965 | " [0.52696302, 0.47303698],\n", 966 | " [0.63957168, 0.36042832],\n", 967 | " [0.52205164, 0.47794836],\n", 968 | " [0.50572852, 0.49427148],\n", 969 | " [0.70706202, 0.29293798],\n", 970 | " [0.55266286, 0.44733714],\n", 971 | " [0.52271594, 0.47728406],\n", 972 | " [0.51638863, 0.48361137],\n", 973 | " [0.71331391, 0.28668609],\n", 974 | " [0.67862111, 0.32137889],\n", 975 | " [0.50896403, 0.49103597],\n", 976 | " [0.42348082, 0.57651918],\n", 977 | " [0.71495838, 0.28504162],\n", 978 | " [0.59711064, 0.40288936],\n", 979 | " [0.63808839, 0.36191161],\n", 980 | " [0.39957895, 0.60042105],\n", 981 | " [0.52127638, 0.47872362],\n", 982 | " [0.65975464, 0.34024536],\n", 983 | " [0.5114172 , 0.4885828 ]])" 984 | ] 985 | }, 986 | "execution_count": 18, 987 | "metadata": {}, 988 | "output_type": "execute_result" 989 | } 990 | ], 991 | "source": [ 992 | "yhat_prob = LR.predict_proba(X_test)\n", 993 | "yhat_prob" 994 | ] 995 | }, 996 | { 997 | "cell_type": "markdown", 998 | "metadata": {}, 999 | "source": [ 1000 | "

## Evaluation

" 1001 | ] 1002 | }, 1003 | { 1004 | "cell_type": "markdown", 1005 | "metadata": {}, 1006 | "source": [ 1007 | "### jaccard index\n", 1008 | "Lets try jaccard index for accuracy evaluation. we can define jaccard as the size of the intersection divided by the size of the union of two label sets. If the entire set of predicted labels for a sample strictly match with the true set of labels, then the subset accuracy is 1.0; otherwise it is 0.0.\n", 1009 | "\n" 1010 | ] 1011 | }, 1012 | { 1013 | "cell_type": "code", 1014 | "execution_count": null, 1015 | "metadata": {}, 1016 | "outputs": [], 1017 | "source": [ 1018 | "from sklearn.metrics import jaccard_similarity_score\n", 1019 | "jaccard_similarity_score(y_test, yhat)" 1020 | ] 1021 | }, 1022 | { 1023 | "cell_type": "markdown", 1024 | "metadata": {}, 1025 | "source": [ 1026 | "### confusion matrix\n", 1027 | "Another way of looking at accuracy of classifier is to look at __confusion matrix__." 1028 | ] 1029 | }, 1030 | { 1031 | "cell_type": "code", 1032 | "execution_count": null, 1033 | "metadata": {}, 1034 | "outputs": [], 1035 | "source": [ 1036 | "from sklearn.metrics import classification_report, confusion_matrix\n", 1037 | "import itertools\n", 1038 | "def plot_confusion_matrix(cm, classes,\n", 1039 | " normalize=False,\n", 1040 | " title='Confusion matrix',\n", 1041 | " cmap=plt.cm.Blues):\n", 1042 | " \"\"\"\n", 1043 | " This function prints and plots the confusion matrix.\n", 1044 | " Normalization can be applied by setting `normalize=True`.\n", 1045 | " \"\"\"\n", 1046 | " if normalize:\n", 1047 | " cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]\n", 1048 | " print(\"Normalized confusion matrix\")\n", 1049 | " else:\n", 1050 | " print('Confusion matrix, without normalization')\n", 1051 | "\n", 1052 | " print(cm)\n", 1053 | "\n", 1054 | " plt.imshow(cm, interpolation='nearest', cmap=cmap)\n", 1055 | " plt.title(title)\n", 1056 | " plt.colorbar()\n", 1057 | " tick_marks = np.arange(len(classes))\n", 1058 | " plt.xticks(tick_marks, classes, rotation=45)\n", 1059 | " plt.yticks(tick_marks, classes)\n", 1060 | "\n", 1061 | " fmt = '.2f' if normalize else 'd'\n", 1062 | " thresh = cm.max() / 2.\n", 1063 | " for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):\n", 1064 | " plt.text(j, i, format(cm[i, j], fmt),\n", 1065 | " horizontalalignment=\"center\",\n", 1066 | " color=\"white\" if cm[i, j] > thresh else \"black\")\n", 1067 | "\n", 1068 | " plt.tight_layout()\n", 1069 | " plt.ylabel('True label')\n", 1070 | " plt.xlabel('Predicted label')\n", 1071 | "print(confusion_matrix(y_test, yhat, labels=[1,0]))" 1072 | ] 1073 | }, 1074 | { 1075 | "cell_type": "code", 1076 | "execution_count": null, 1077 | "metadata": {}, 1078 | "outputs": [], 1079 | "source": [ 1080 | "# Compute confusion matrix\n", 1081 | "cnf_matrix = confusion_matrix(y_test, yhat, labels=[1,0])\n", 1082 | "np.set_printoptions(precision=2)\n", 1083 | "\n", 1084 | "\n", 1085 | "# Plot non-normalized confusion matrix\n", 1086 | "plt.figure()\n", 1087 | "plot_confusion_matrix(cnf_matrix, classes=['churn=1','churn=0'],normalize= False, title='Confusion matrix')" 1088 | ] 1089 | }, 1090 | { 1091 | "cell_type": "markdown", 1092 | "metadata": {}, 1093 | "source": [ 1094 | "Look at first row. The first row is for customers whose actual churn value in test set is 1.\n", 1095 | "As you can calculate, out of 40 customers, the churn value of 15 of them is 1. \n", 1096 | "And out of these 15, the classifier correctly predicted 6 of them as 1, and 9 of them as 0. 
\n", 1097 | "\n", 1098 | "It means, for 6 customers, the actual churn value were 1 in test set, and classifier also correctly predicted those as 1. However, while the actual label of 9 customers were 1, the classifier predicted those as 0, which is not very good. We can consider it as error of the model for first row.\n", 1099 | "\n", 1100 | "What about the customers with churn value 0? Lets look at the second row.\n", 1101 | "It looks like there were 25 customers whom their churn value were 0. \n", 1102 | "\n", 1103 | "\n", 1104 | "The classifier correctly predicted 24 of them as 0, and one of them wrongly as 1. So, it has done a good job in predicting the customers with churn value 0. A good thing about confusion matrix is that shows the model’s ability to correctly predict or separate the classes. In specific case of binary classifier, such as this example, we can interpret these numbers as the count of true positives, false positives, true negatives, and false negatives. " 1105 | ] 1106 | }, 1107 | { 1108 | "cell_type": "code", 1109 | "execution_count": null, 1110 | "metadata": {}, 1111 | "outputs": [], 1112 | "source": [ 1113 | "print (classification_report(y_test, yhat))\n" 1114 | ] 1115 | }, 1116 | { 1117 | "cell_type": "markdown", 1118 | "metadata": {}, 1119 | "source": [ 1120 | "Based on the count of each section, we can calculate precision and recall of each label:\n", 1121 | "\n", 1122 | "\n", 1123 | "- __Precision__ is a measure of the accuracy provided that a class label has been predicted. It is defined by: precision = TP / (TP + FP)\n", 1124 | "\n", 1125 | "- __Recall__ is true positive rate. It is defined as: Recall =  TP / (TP + FN)\n", 1126 | "\n", 1127 | " \n", 1128 | "So, we can calculate precision and recall of each class.\n", 1129 | "\n", 1130 | "__F1 score:__\n", 1131 | "Now we are in the position to calculate the F1 scores for each label based on the precision and recall of that label. \n", 1132 | "\n", 1133 | "The F1 score is the harmonic average of the precision and recall, where an F1 score reaches its best value at 1 (perfect precision and recall) and worst at 0. It is a good way to show that a classifer has a good value for both recall and precision.\n", 1134 | "\n", 1135 | "\n", 1136 | "And finally, we can tell the average accuracy for this classifier is the average of the F1-score for both labels, which is 0.72 in our case." 1137 | ] 1138 | }, 1139 | { 1140 | "cell_type": "markdown", 1141 | "metadata": {}, 1142 | "source": [ 1143 | "### log loss\n", 1144 | "Now, lets try __log loss__ for evaluation. In logistic regression, the output can be the probability of customer churn is yes (or equals to 1). This probability is a value between 0 and 1.\n", 1145 | "Log loss( Logarithmic loss) measures the performance of a classifier where the predicted output is a probability value between 0 and 1. 
\n" 1146 | ] 1147 | }, 1148 | { 1149 | "cell_type": "code", 1150 | "execution_count": 1, 1151 | "metadata": {}, 1152 | "outputs": [ 1153 | { 1154 | "ename": "NameError", 1155 | "evalue": "name 'y_test' is not defined", 1156 | "output_type": "error", 1157 | "traceback": [ 1158 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", 1159 | "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)", 1160 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0msklearn\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmetrics\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mlog_loss\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mlog_loss\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0my_test\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0myhat_prob\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", 1161 | "\u001b[0;31mNameError\u001b[0m: name 'y_test' is not defined" 1162 | ] 1163 | } 1164 | ], 1165 | "source": [ 1166 | "from sklearn.metrics import log_loss\n", 1167 | "log_loss(y_test, yhat_prob)" 1168 | ] 1169 | }, 1170 | { 1171 | "cell_type": "markdown", 1172 | "metadata": {}, 1173 | "source": [ 1174 | "

Practice

\n", 1175 | "Try to build Logistic Regression model again for the same dataset, but this time, use different __solver__ and __regularization__ values? What is new __logLoss__ value?" 1176 | ] 1177 | }, 1178 | { 1179 | "cell_type": "code", 1180 | "execution_count": null, 1181 | "metadata": {}, 1182 | "outputs": [], 1183 | "source": [ 1184 | "# write your code here\n", 1185 | "\n" 1186 | ] 1187 | }, 1188 | { 1189 | "cell_type": "markdown", 1190 | "metadata": {}, 1191 | "source": [ 1192 | "Double-click __here__ for the solution.\n", 1193 | "\n", 1194 | "" 1201 | ] 1202 | }, 1203 | { 1204 | "cell_type": "markdown", 1205 | "metadata": { 1206 | "button": false, 1207 | "new_sheet": false, 1208 | "run_control": { 1209 | "read_only": false 1210 | } 1211 | }, 1212 | "source": [ 1213 | "

Want to learn more?

\n", 1214 | "\n", 1215 | "IBM SPSS Modeler is a comprehensive analytics platform that has many machine learning algorithms. It has been designed to bring predictive intelligence to decisions made by individuals, by groups, by systems – by your enterprise as a whole. A free trial is available through this course, available here: SPSS Modeler\n", 1216 | "\n", 1217 | "Also, you can use Watson Studio to run these notebooks faster with bigger datasets. Watson Studio is IBM's leading cloud solution for data scientists, built by data scientists. With Jupyter notebooks, RStudio, Apache Spark and popular libraries pre-packaged in the cloud, Watson Studio enables data scientists to collaborate on their projects without having to install anything. Join the fast-growing community of Watson Studio users today with a free account at Watson Studio\n", 1218 | "\n", 1219 | "

Thanks for completing this lesson!

\n", 1220 | "\n", 1221 | "

Author: Saeed Aghabozorgi

\n", 1222 | "

Saeed Aghabozorgi, PhD, is a Data Scientist at IBM with a track record of developing enterprise-level applications that substantially increase clients’ ability to turn data into actionable knowledge. He is a researcher in the data mining field and an expert in developing advanced analytic methods such as machine learning and statistical modelling on large datasets.

\n", 1223 | "\n", 1224 | "
\n", 1225 | "\n", 1226 | "

Copyright © 2018 Cognitive Class. This notebook and its source code are released under the terms of the MIT License.

" 1227 | ] 1228 | } 1229 | ], 1230 | "metadata": { 1231 | "kernelspec": { 1232 | "display_name": "Python 3", 1233 | "language": "python", 1234 | "name": "python3" 1235 | }, 1236 | "language_info": { 1237 | "codemirror_mode": { 1238 | "name": "ipython", 1239 | "version": 3 1240 | }, 1241 | "file_extension": ".py", 1242 | "mimetype": "text/x-python", 1243 | "name": "python", 1244 | "nbconvert_exporter": "python", 1245 | "pygments_lexer": "ipython3", 1246 | "version": "3.6.7" 1247 | }, 1248 | "widgets": { 1249 | "state": {}, 1250 | "version": "1.1.2" 1251 | } 1252 | }, 1253 | "nbformat": 4, 1254 | "nbformat_minor": 4 1255 | } 1256 | -------------------------------------------------------------------------------- /Classification Models/ML0101EN-Clas-SVM-cancer-py-v1.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "\n", 8 | "\n", 9 | "

SVM (Support Vector Machines)

" 10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "metadata": {}, 15 | "source": [ 16 | "In this notebook, you will use SVM (Support Vector Machines) to build and train a model using human cell records, and classify cells to whether the samples are benign or malignant.\n", 17 | "\n", 18 | "SVM works by mapping data to a high-dimensional feature space so that data points can be categorized, even when the data are not otherwise linearly separable. A separator between the categories is found, then the data is transformed in such a way that the separator could be drawn as a hyperplane. Following this, characteristics of new data can be used to predict the group to which a new record should belong." 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "metadata": {}, 24 | "source": [ 25 | "

Table of contents

\n", 26 | "\n", 27 | "
\n", 28 | "
    \n", 29 | "
  1. Load the Cancer data\n", 30 | "
  2. Modeling\n", 31 | "
  3. Evaluation\n", 32 | "
  4. Practice\n", 33 | "
\n", 34 | "
\n", 35 | "
\n", 36 | "
" 37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": 2, 42 | "metadata": {}, 43 | "outputs": [], 44 | "source": [ 45 | "import pandas as pd\n", 46 | "import pylab as pl\n", 47 | "import numpy as np\n", 48 | "import scipy.optimize as opt\n", 49 | "from sklearn import preprocessing\n", 50 | "from sklearn.model_selection import train_test_split\n", 51 | "%matplotlib inline \n", 52 | "import matplotlib.pyplot as plt" 53 | ] 54 | }, 55 | { 56 | "cell_type": "markdown", 57 | "metadata": { 58 | "button": false, 59 | "new_sheet": false, 60 | "run_control": { 61 | "read_only": false 62 | } 63 | }, 64 | "source": [ 65 | "

Load the Cancer data

\n", 66 | "The example is based on a dataset that is publicly available from the UCI Machine Learning Repository (Asuncion and Newman, 2007)[http://mlearn.ics.uci.edu/MLRepository.html]. The dataset consists of several hundred human cell sample records, each of which contains the values of a set of cell characteristics. The fields in each record are:\n", 67 | "\n", 68 | "|Field name|Description|\n", 69 | "|--- |--- |\n", 70 | "|ID|Clump thickness|\n", 71 | "|Clump|Clump thickness|\n", 72 | "|UnifSize|Uniformity of cell size|\n", 73 | "|UnifShape|Uniformity of cell shape|\n", 74 | "|MargAdh|Marginal adhesion|\n", 75 | "|SingEpiSize|Single epithelial cell size|\n", 76 | "|BareNuc|Bare nuclei|\n", 77 | "|BlandChrom|Bland chromatin|\n", 78 | "|NormNucl|Normal nucleoli|\n", 79 | "|Mit|Mitoses|\n", 80 | "|Class|Benign or malignant|\n", 81 | "\n", 82 | "
\n", 83 | "
\n", 84 | "\n", 85 | "For the purposes of this example, we're using a dataset that has a relatively small number of predictors in each record. To download the data, we will use `!wget` to download it from IBM Object Storage. \n", 86 | "__Did you know?__ When it comes to Machine Learning, you will likely be working with large datasets. As a business, where can you host your data? IBM is offering a unique opportunity for businesses, with 10 Tb of IBM Cloud Object Storage: [Sign up now for free](http://cocl.us/ML0101EN-IBM-Offer-CC)" 87 | ] 88 | }, 89 | { 90 | "cell_type": "code", 91 | "execution_count": 3, 92 | "metadata": { 93 | "button": false, 94 | "new_sheet": false, 95 | "run_control": { 96 | "read_only": false 97 | } 98 | }, 99 | "outputs": [ 100 | { 101 | "name": "stdout", 102 | "output_type": "stream", 103 | "text": [ 104 | "--2019-07-11 02:35:27-- https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/cell_samples.csv\n", 105 | "Resolving s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)... 67.228.254.193\n", 106 | "Connecting to s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)|67.228.254.193|:443... connected.\n", 107 | "HTTP request sent, awaiting response... 200 OK\n", 108 | "Length: 20675 (20K) [text/csv]\n", 109 | "Saving to: ‘cell_samples.csv’\n", 110 | "\n", 111 | "cell_samples.csv 100%[===================>] 20.19K --.-KB/s in 0.02s \n", 112 | "\n", 113 | "2019-07-11 02:35:27 (966 KB/s) - ‘cell_samples.csv’ saved [20675/20675]\n", 114 | "\n" 115 | ] 116 | } 117 | ], 118 | "source": [ 119 | "#Click here and press Shift+Enter\n", 120 | "!wget -O cell_samples.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/cell_samples.csv" 121 | ] 122 | }, 123 | { 124 | "cell_type": "markdown", 125 | "metadata": { 126 | "button": false, 127 | "new_sheet": false, 128 | "run_control": { 129 | "read_only": false 130 | } 131 | }, 132 | "source": [ 133 | "### Load Data From CSV File " 134 | ] 135 | }, 136 | { 137 | "cell_type": "code", 138 | "execution_count": 4, 139 | "metadata": { 140 | "button": false, 141 | "new_sheet": false, 142 | "run_control": { 143 | "read_only": false 144 | } 145 | }, 146 | "outputs": [ 147 | { 148 | "data": { 149 | "text/html": [ 150 | "
\n", 151 | "\n", 164 | "\n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " \n", 194 | " \n", 195 | " \n", 196 | " \n", 197 | " \n", 198 | " \n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 228 | " \n", 229 | " \n", 230 | " \n", 231 | " \n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | " \n", 237 | " \n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | "
IDClumpUnifSizeUnifShapeMargAdhSingEpiSizeBareNucBlandChromNormNuclMitClass
0  1000025  5  1  1  1  2   1  3  1  1  2
1  1002945  5  4  4  5  7  10  3  2  1  2
2  1015425  3  1  1  1  2   2  3  1  1  2
3  1016277  6  8  8  1  3   4  3  7  1  2
4  1017023  4  1  1  3  2   1  3  1  1  2
\n", 254 | "
" 255 | ], 256 | "text/plain": [ 257 | " ID Clump UnifSize UnifShape MargAdh SingEpiSize BareNuc \\\n", 258 | "0 1000025 5 1 1 1 2 1 \n", 259 | "1 1002945 5 4 4 5 7 10 \n", 260 | "2 1015425 3 1 1 1 2 2 \n", 261 | "3 1016277 6 8 8 1 3 4 \n", 262 | "4 1017023 4 1 1 3 2 1 \n", 263 | "\n", 264 | " BlandChrom NormNucl Mit Class \n", 265 | "0 3 1 1 2 \n", 266 | "1 3 2 1 2 \n", 267 | "2 3 1 1 2 \n", 268 | "3 3 7 1 2 \n", 269 | "4 3 1 1 2 " 270 | ] 271 | }, 272 | "execution_count": 4, 273 | "metadata": {}, 274 | "output_type": "execute_result" 275 | } 276 | ], 277 | "source": [ 278 | "cell_df = pd.read_csv(\"cell_samples.csv\")\n", 279 | "cell_df.head()" 280 | ] 281 | }, 282 | { 283 | "cell_type": "markdown", 284 | "metadata": {}, 285 | "source": [ 286 | "The ID field contains the patient identifiers. The characteristics of the cell samples from each patient are contained in fields Clump to Mit. The values are graded from 1 to 10, with 1 being the closest to benign.\n", 287 | "\n", 288 | "The Class field contains the diagnosis, as confirmed by separate medical procedures, as to whether the samples are benign (value = 2) or malignant (value = 4).\n", 289 | "\n", 290 | "Lets look at the distribution of the classes based on Clump thickness and Uniformity of cell size:" 291 | ] 292 | }, 293 | { 294 | "cell_type": "code", 295 | "execution_count": 5, 296 | "metadata": {}, 297 | "outputs": [ 298 | { 299 | "data": { 300 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAX4AAAEGCAYAAABiq/5QAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjAsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+17YcXAAAgAElEQVR4nO3dfXRU9b3v8fcXkpSJkmgh9nLEm6G9UsODRohZHKFHVJCuK1Xrsr2leq/SKF2tQVtrq21Xfeg6p8vj8bb2aG/vpY3IaUu0Bx9LfUB6dFWtbQhCFTIHrHVATrmHIXpzqolNQr73j5k8AoY8zN472Z/XWqw988tk7y+/2fnkl9/M7J+5OyIiEh8Twi5ARESCpeAXEYkZBb+ISMwo+EVEYkbBLyISMwVhF3Aspk6d6slkMuwyRETGlK1btx5097KB7WMi+JPJJI2NjWGXISIyppjZniO1a6pHRCRmFPwiIjGj4BcRiZkxMcd/JB0dHezbt4/33nsv7FLGjUmTJjF9+nQKCwvDLkVE8mjMBv++ffuYPHkyyWQSMwu7nDHP3Wlubmbfvn3MmDEj7HJEJI/yNtVjZveZ2QEz29Gn7YNm9oyZvZbbnjjc/b/33ntMmTJFoT9KzIwpU6bE6i+oTKaVLVv2k8m0hlpHKtXMunU7SKWaQ60jCqLSF1E4NzZufJ2rr36ajRtfH/V953PEfz9wL/BPfdpuBn7l7neY2c25+zcN9wAK/dEVp/6sr09RU/M0RUUTaG/voq5uGStWVARex+rVm7n33u0992trK7nnniWB1xEFUemLKJwbc+euZceO7C+/urpXmTt3Cq+8snLU9p+3Eb+7/xp4a0DzxcC63O11wCX5Or7I0WQyrdTUPE1bWyctLe20tXVSU/N04KO7VKq5X9AB3Hvv9tBHu2GISl9E4dzYuPH1ntDv9uqrzaM68g/6XT0fcvf9ALntSUd7oJmtMrNGM2vMZDKBFRiU5557juXLlwPw+OOPc8cddwR27O3bt/PEE08EdryoSadbKCrqf+oXFk4gnW4JtI6Ghv1Dah/PotIXUTg3Hn30D0NqH47Ivp3T3de4e5W7V5WVHfaJ43Hloosu4uabbw7seHEP/mSylPb2rn5tHR1dJJOlgdZRXT1tSO3jWVT6IgrnxiWX/JchtQ9H0MH/72Y2DSC3PRDkwUf7BZt0Os1pp53G1VdfzZw5c7j88svZvHkzCxcu5NRTT6WhoYGGhgbOPvtszjzzTM4++2x27dp12H7uv/9+amtrAXj99ddZsGABZ511FrfccgvHH388kP0LYfHixVx22WWcdtppXH755XSvnvbtb3+bs846izlz5rBq1aqe9sWLF3PTTTdRXV3NzJkzef7552lvb+eWW27hwQcfpLKykgcffHBU+mIsKSsrpq5uGYlEASUlRSQSBdTVLaOsrDjQOioqplBbW9mvrba2koqKKYHWEQVR6YsonBvLl3+EuXP7/7/nzp3C8uUfGb2DuHve/gFJYEef+/8A3Jy7fTNw57HsZ/78+T5QU1PTYW3vZ/36Jk8kvuelpd/3ROJ7vn790L7/SN544w2fOHGiv/LKK37o0CGfN2+er1y50ru6uvzRRx/1iy++2FtaWryjo8Pd3Z955hm/9NJL3d392Wef9QsvvNDd3deuXevXXnutu7tfeOGFvn79end3/+EPf+jHHXdcz+NLSkr8zTff9EOHDvmCBQv8+eefd3f35ubmnpquuOIKf/zxx93d/ZxzzvEbbrjB3d1/+ctf+vnnn3/Y8QYaar+OZQcOvOsNDX/yAwfeDbWOpqaDfv/9r3pT08FQ64iCqPRFFM6NX/ziD15T85T/4hd/GPY+gEY/Qqbm7V09ZlYPLAammtk+4FbgDuDnZlYD7AU+la/j99X3BZu2tmxbTc3TLFlSPuLf5DNmzGDu3LkAzJ49m/PPPx8zY+7cuaTTaVpaWrjyyit57bXXMDM6Ojred38vvfQSjz76KACf/exnufHGG3u+Vl1dzfTp0wGorKwknU6zaNEinn32We68805aW1t56623mD17Np/4xCcAuPTSSwGYP38+6XR6RP/X8aasrDj
wUf6RVFRMieUo/0ii0hdRODeWL//I6I7y+8hb8Lv7iqN86fx8HfNoul+w6Q596H3BZqRP7gc+8IGe2xMmTOi5P2HCBDo7O/nWt77FueeeyyOPPEI6nWbx4sWjcqyJEyfS2dnJe++9xxe/+EUaGxs55ZRTuO222/q9F7/7e7ofLyIS2Rd3R1OYL9i0tLRw8sknA9m5/MEsWLCAhx56CIAHHnhg0Md3h/zUqVN555132LBhw6DfM3nyZP785z8P+jgRGZ9iEfxhvmDzta99ja9//essXLiQQ4cODfr4u+++m+9+97tUV1ezf/9+Skvf/5fTCSecwDXXXMPcuXO55JJLOOusswY9xrnnnktTU1NsX9wViTvz3DtAoqyqqsoHLsSSSqWoqBjap+kymVbS6RaSydLQ5++OprW1lUQigZnxwAMPUF9fz2OPPRbY8YfTryISTWa21d2rBraP2Yu0DUcUXrAZzNatW6mtrcXdOeGEE7jvvvvCLklExplYBf9Y8LGPfYzf//73YZchIuNYLOb4RUSkl4JfRCRmFPwiIjGj4BcRiRkF/wik02nmzJkz4v00NjZy3XXXjUJFIiKD07t6IqCqqoqqqsPeaisikhcxG/FngC257ejo7Ozkyiuv5PTTT+eyyy6jtbWVrVu3cs455zB//nyWLVvG/v3ZxSSOdJlk6L8oSyaTYenSpcybN4/Pf/7zlJeXc/DgQdLpNBUVFVxzzTXMnj2bCy64gLa+Fx8SETlGMQr+eqAcWJrb1o/KXnft2sWqVat45ZVXKCkp4Qc/+AGrV69mw4YNbN26lc997nN885vf7Hl8Z2cnDQ0N3H333dx+++2H7e/222/nvPPO4+WXX+aTn/wke/fu7fnaa6+9xrXXXsvOnTs54YQTeq7pIyIyFDGZ6skANUBb7h+5+0uAka3udcopp7Bw4UIArrjiCr7zne+wY8cOli5dCsChQ4eYNq13FaHBLpP8wgsv8MgjjwDw8Y9/nBNPPLHnazNmzKCysvJ9v19EZDAxCf40UERv6AMU5tpHFvxm1u/+5MmTmT17Ni+99NIRHz/YZZLf79pJAy/LrKkeERmOmEz1JIH2AW0dufaR2bt3b0/I19fXs2DBAjKZTE9bR0cHO3fuPOb9LVq0iJ///OcAbNq0ibfffnvENYqI9BWT4C8D6oAEUJLb1jHS0T5ARUUF69at4/TTT+ett97qmd+/6aabOOOMM6isrOQ3v/nNMe/v1ltvZdOmTcybN48nn3ySadOmMXny5BHXKSLSLVaXZc7O9afJjvRHHvr58Je//IWJEydSUFDASy+9xBe+8AW2b98e2PF1WWaR8UOXZQayYR/NwO+2d+9ePv3pT9PV1UVRURE/+tGPwi5JRMaZmAV/9J166qls27Yt7DJEZBwb03P8Y2GaaixRf4rEw5gN/kmTJtHc3KywGiXuTnNzM5MmTQq7FBHJszE71TN9+nT27dtHJjN6l1+Iu0mTJjF9+vSwyxCRPBuzwV9YWMiMGTPCLkNEZMwZs1M9IiIyPAp+EZGYUfCLiMSMgl9EJGYU/CIiMaPgFxGJGQW/iEjMKPhFRGJGwS8iEjMKfhGRmFHwi4jETCjBb2ZfNrOdZrbDzOrNTJeElBjLAFty25AqyLSyZct+MpnW0GqQ4AQe/GZ2MnAdUOXuc4CJwGeCrkMkGuqBcmBpblsffAX1KcrL17B06T9TXr6G+vpU4DVIsMKa6ikAEmZWABQDfwqpDpEQZYAaoA1oyW1rCHLkn8m0UlPzNG1tnbS0tNPW1klNzdMa+Y9zgQe/u/8bcBewF9gPtLj7poGPM7NVZtZoZo265r6MT2mgaEBbYa49oArSLRQV9Y+BwsIJpNMtgdUgwQtjqudE4GJgBvBXwHFmdsXAx7n7GnevcveqsrJoL5AuMjxJoH1AW0euPaAKkqW0t3f1r6Cji2SyNLAaJHhhTPUsAd5w94y7dwAPA2eHUIdIyMqAOiABlOS2dbn2gCooK6aubhmJRAElJUUkEgXU1S2jrKw4sBokeGGswLUXWGBmxWQnNc8HGkOoQyQCVpAdC6XJjvSD/+t2xYoKliwpJ51uIZksVejHQODB7+6/M7MNwMtAJ7ANWBN0HSLRUUYYgd+vgrJiBX6MhLLmrrvfCtwaxrFFROJOn9wVEYkZBb+ISMwo+EVEYkbBLyISMwp+EZGYUfCLiMSMgl9EJGYU/CIiMaPgFxGJGQW/iEjMKPhFRGJGwS8iEjMKfhGRmFHwx0wm08qWLfu1pqpEks7PXqlUM+vW7SCVah71fYdyWWYJR319ipqapykqmkB7exd1dctYsaIi7LJEAJ2ffa1evZl7793ec7+2tpJ77lkyavs3dx+1neVLVVWVNzZqka6RyGRaKS9fQ1tbZ09bIlHAnj2rtACHhE7nZ69UqplZs9Ye1t7UtJKKiilD2peZbXX3qoHtmuqJiXS6haKi/k93YeEE0umWkCoS6aXzs1dDw/4htQ+Hgj8mkslS2tu7+rV1dHSRTJaGVJFIL52fvaqrpw2pfTgU/DFRVlZMXd0yEokCSkqKSCQKqKtbFrs/oyWadH72qqiYQm1tZb+22trKIU/zvB/N8cdMJtNKOt1CMlkayx8qiTadn71SqWYaGvZTXT1t2KF/tDl+vasnZsrKimP/AyXRpfOzV0XFlFEd5felqR4RkZhR8IuIxIyCX0QkZhT8IiIxo+AXEYkZBb+ISMwo+EVEYkbBLyISMwp+EZGYUfCLiMTMkILfzI7LVyEiIhKMYwp+MzvbzJqAVO7+GWb2v/JamYiI5MWxjvi/BywDmgHc/ffA3+SrKBERyZ9jnupx9zcHNB0a7kHN7AQz22Bm/2pmKTP76+HuS4ZGi1n3ik5fZIAtua1EQRTOjSgstv6mmZ0NuJkVAdeRm/YZpu8DT7n7Zbn96TqsAdBi1r2i0xf1QA1QBLQDdcCKEOqQblE4NyKx2LqZTSUb1ksAAzYB17v7kH8VmVkJ8Hvgw36Mq8BoIZaR02LWvaLTFxmgHGjr05YA9gBlAdYh3aJwbkRpsfWEu1/u7h9y95Pc/QqgcEgV9Pow2TN+rZltM7MfH+ndQma2yswazawxk9GfwCOlxax7Racv0mRH+v0qybVLGKJwbkRpsfU3zKzezBJ92p4Y5jELgHnAD939TOBd4OaBD3L3Ne5e5e5VZWUa/YyUFrPuFZ2+SJKd3ulXSa5dwhCFcyNKi62/CjwPvGBmH8m12TCPuQ/Y5+6/y93fQPYXgeSRFrPuFZ2+KCM7p58ASnLbOjTNE54onBuRWWzdzF5293lmthD4EXATcLu7Dyuwzex54Gp332VmtwHHuftXj/Z4zfGPHi1m3Ss6fZEhO72TRKEfDVE4N/K52PqxBv+23LQMZjYNeBCocvdh9YiZVQI/JjvB+Udgpbu/fbTHK/hFRIbuaMF/rG/n/K/dN9x9v5mdB5w93GLcfTtwWDEiIpJ/7xv8ZnaFu/8UWGF2xCn9X+elKhERyZvBRvzdb7OcnO9CREQkGO8b/O7+f3Lb24MpR0RE8u19385pZt
eY2am522Zm95lZi5m9YmZnBlOiiIiMpsHex389vR8jXAGcQfaTtzcA/5i/skREJF8GC/5Od+/I3V4O/JO7N7v7Znrn/0VEZAwZLPi7zGyamU0Czgc29/la4ijfIyIiETbYu3q+BTQCE4HH3X0ngJmdQ/aDVyIiMsYMFvzFZK8be7q7v9ynvRH4b3mrSkRE8mawqZ6vu3sn2csr9HD3d939nfyVJSIi+TLYiL/ZzJ4FZpjZ4wO/6O4X5acsERHJl8GC/0Kyl0z+CfA/81+OiIjk22Cf3G0HfmtmZ7u7lsESERkHBrtI293u/iXgPjM77PrNmuoZmihc4zsKNUSF+qKX+qK/0bgWfpRrGGyq5ye57V2jetQYqq9PUVPzNEVFE2hv76KubhkrVlTEroaoUF/0Ul/0t3r1Zu69d3vP/draSu65Z8m4quGYFmIJ21hfiCWTaaW8fA1tbZ09bYlEAXv2rApsdBWFGqJCfdFLfdFfKtXMrFlrD2tvaloZ2Mh/NGs42kIsx7TmrpktNLNnzGy3mf3RzN4wM32A6xil0y0UFfXv6sLCCaTTLbGqISrUF73UF/01NOwfUvtYreFYV+CqA74MbAUOjdrRYyKZLKW9vatfW0dHF8lkaaxqiAr1RS/1RX/V1dOG1D5WazimET/Q4u5PuvuB3EXamt29edSqGOfKyoqpq1tGIlFASUkRiUQBdXXLAv1TOgo1RIX6opf6or+KiinU1lb2a6utrQz0Bd4gajjWxdbvIHu9noeBv3S3D7iMQ96M9Tn+blF450QUaogK9UUv9UV/4+VdPUeb4z/W4H82d7P7wQa4u583rGqGaLwEv4hIkI4W/IO9j/+G3M2Nua0DGeAFd39jdEsUEZEgDDbHPzn37/jcv8lAFfCkmX0mz7WJiEgeDHbJhiMusm5mHyS7KMsD+ShKRETy51jf1dOPu79Fdp5fRETGmGEFv5mdB7w9yrWIiEgABntx91V638nT7YPAn4D/ka+iREQkfwb75O7yAfcdaHb3d/NUj4iI5NlgL+7uCaoQEREJxrDm+EVEZOxS8IuIxIyCX0QkZhT8IiIxo+AXEYkZBb+ISMyEFvxmNtHMtpnZxsEfLTL6Dh7cQ1PTkxw8qHctZzKtbNmyn0ymVXVEpI5Uqpl163aQSo3+mldhjvivB1IhHl9i7MUX76K4+FROPvlSiotP5cUX7wq7pNDU16coL1/D0qX/THn5Gurrw/mxVB29Vq/ezKxZa7nqqqeYNWstq1dvHtX9H9NCLKPNzKYD64C/A25w94GfEO5HC7HIaDp4cA/FxadSXNzR09baWkhr62tMnVoeYmXBy2RaKS9fQ1tbZ09bIlHAnj2rAl2JS3X0SqWamTVr7WHtTU0rh7wS19EWYglrxH838DWg62gPMLNVZtZoZo2ZTCa4ymTcO3CgiY6Oif3aOjomcuBAU0gVhSedbqGoqH8MFBZOIJ1uUR0h1dHQsH9I7cMRePCb2XLggLtvfb/Hufsad69y96qysrKAqpM4OOmkWRQWHurXVlh4iJNOmhVSReFJJktpb+8//uro6CKZLFUdIdVRXT1tSO3DEcaIfyFwkZmlyS7kcp6Z/TSEOiSmpk4tZ9u279DaWkhLyyRaWwvZtu07sZvmASgrK6aubhmJRAElJUUkEgXU1S0LfMF11dGromIKtbWV/dpqaytHddH3UOb4ew5uthi4UXP8EoaDB/dw4EATJ500K5ah31cm00o63UIyWRp42KqOI0ulmmlo2E919bRhh/6wFlsXGc+mTi2PfeB3KysrDjVoVcfhKiqmjOoov69Qg9/dnwOeC7MGEZG40Sd3RURiRsEvIhIzCn4RkZhR8IuIxIyCX0QkZhT8IiIxo+AXEYkZBb+ISMwo+EVEYkbBLyISMwp+EZGYUfCLiMSMgl9EJGZiEvwZYEtuG2IVmVa2bNlPJtMa6xqiYtOmBv72b/+RTZsaQq0jCs9JKtXMunU7SKWaQ6sBotEXUZHX58TdI/9v/vz5Pnzr3T3h7qW57foR7GsEVaxv8kTie15a+n1PJL7n69c3xbKGqPjqV1f5u+8W+ttvT/J33y30r351VSh1ROE5qa19xuEfev7V1j4TeA3u0eiLqBit5wRo9CNkaqgrcB2r4a/AlQHKgbY+bQlgDxDcOr6ZTCvl5Wtoa+vsrSJRwJ49qwJb7CEKNUTFpk0NLFq0iOLijp621tZCXnjhBS64oDqwOqLwnKRSzcyatfaw9qamlXlbBORIotAXUTGaz8nRVuAa51M9aaBoQFthrj3AKtItFBX17+rCwgmk0y2xqiEqGhp+S3v7xH5tHR0TaGj4baB1ROE5aWjYP6T2fIlCX0RFEM/JOA/+JNA+oK0j1x5gFclS2tu7+lfR0UUyWRqrGqKiunoBRUWH+rUVFnZRXb0g0Dqi8JxUV08bUnu+RKEvoiKI52ScB38ZUEd2eqckt60jyGkeyK7fWVe3jESigJKSIhKJAurqlgX6J2wUaoiKCy6o5rbbVtLaWkhLywdobS3ktttWBjrNA9F4TioqplBbW9mvrba2MtBpHohGX0RFEM/JOJ/j75YhO72TJOjQ71dFppV0uoVksjS0EzoKNUTFpk0NNDT8lurqBYGHfl9ReE5SqWYaGvZTXT0t8NDvKwp9ERWj8ZwcbY4/JsEvIhI/MX1xV0REBlLwi4jEjIJfRCRmFPwiIjGj4BcRiRkFv4hIzCj4RURiRsEvIhIzCn4RkZhR8IuIxIyCX0QkZhT8IiIxo+AXEYmZwIPfzE4xs2fNLGVmO83s+qBrCE8UFn2PQg3RqGP37l1s3PgAu3fvCq2GrPD7Iiqisuj7eBfGiL8T+Iq7VwALgGvNbFYIdQSsnuz6v0tz2/qY1hCNOtau/QbTp89l0aKVTJ8+l7VrvxF4DVnh90VUrF69mVmz1nLVVU8xa9ZaVq/eHHZJ41bo1+M3s8eAe939maM9Zuxfjz8Ki75HoYZo1LF79y6mT5972GLr+/a9ysyZHw2khqzw+yIqorLo+3gTyevxm1kSOBP43RG+tsrMGs2sMZMZ638Cpwl/0fco1BCNOnbv3nbExdZ3794WWA1ZacLui6iIyqLvcRFa8JvZ8cBDwJfc/T8Gft3d17h7lbtXlZWN9dFPkvAXfY9CDdGoY+bMM4+42PrMmWcGVkNWkrD7Iiqisuh7XIQS/GZWSDb0f+buD4dRQ7CisOh7FGqIRh0zZ36UBx+8sd9i6w8+eGPA0zwQhb6Iiqgs+h4Xgc/xm5kB64C33P1Lx/I9Y3+Ov1sUFn2PQg3RqGP37l3s3r2NmTPPDCH0+wq/L6IiKou+jxeRWWzdzBYBzwOvAl255m+4+xNH+57xE/wiIsE5WvAXBF2Iu78AWNDHFRGRLH1yV0QkZhT8IiIxo+AXEYkZBb+ISMwo+EVEYkbBLyISMwp+EZGYUfCLiMSMgl9EJGYU/CIiMaPgFxGJGQW/iEjMKPhFRGImJsF/CXB8bhumr5BdY/UrIdZwJ9nVLu8MsYao1PEicGtuG6YMsCW3Fcm/0BdbPxYjux7/k
a4AHcb/eSK9yw903+8MuIbjgNYB998JuIao1HEB8MyA+08HXANAPVBDdu3ddrIrcK0IoQ4ZjyK52Hr+HW2EH/TI/yv0D32AQwQ78r+T/mEL8C7Bj7ijUMeL9A99gE0EP/LPkA39NqAlt61BI3/Jt3Ee/JuH2J4vG4bYng/1Q2zPlyjUsWmI7fmSJjvS76sw1y6SP+M8+JcMsT1fLhtiez4cbfog6GmFKNRxwRDb8yVJdnqnr45cu0j+aI4/MAVkp3e6hTHHfzzZaZVuYc3xR6GOZfQf4Yc9x19INvQ1xy+jJ6Zz/JAN+YvJhsvFhBP6kA35G4D/nNsGHfqQDde/Bypz2zBCPyp1PA28ANyS24YR+pAN+T1kpx/3oNCXIMRgxC8iEk8xHvGLiEhfCn4RkZhR8IuIxIyCX0QkZhT8IiIxo+AXEYkZBb+ISMyMiffxm1mG7KdbxoOpwMGwi4gI9UUv9UUv9UWvkfZFubuXDWwcE8E/nphZ45E+UBFH6ote6ote6ote+eoLTfWIiMSMgl9EJGYU/MFbE3YBEaK+6KW+6KW+6JWXvtAcv4hIzGjELyISMwp+EZGYUfAHwMxOMbNnzSxlZjvN7PqwawqbmU00s21mtjHsWsJkZieY2QYz+9fc+fHXYdcUJjP7cu5nZIeZ1ZvZpLBrCoqZ3WdmB8xsR5+2D5rZM2b2Wm574mgcS8EfjE7gK+5eASwArjWzWSHXFLbrgVTYRUTA94Gn3P004Axi3CdmdjJwHVDl7nPIrk/6mXCrCtT9wMcHtN0M/MrdTwV+lbs/Ygr+ALj7fnd/OXf7z2R/uE8Ot6rwmNl04ELgx2HXEiYzKwH+huxCu7h7u7v/v3CrCl0BkDCzAqAY+FPI9QTG3X8NvDWg+WJgXe72OuCS0TiWgj9gZpYEzgR+F24lobob+BrQFXYhIfswkAHW5qa9fmxmx4VdVFjc/d+Au4C9wH6gxd03hVtV6D7k7vshO4AEThqNnSr4A2RmxwMPAV9y9/8Iu54wmNly4IC7bw27lggoAOYBP3T3M4F3GaU/5cei3Pz1xcAM4K+A48zsinCrGp8U/AExs0Kyof8zd3847HpCtBC4yMzSwAPAeWb203BLCs0+YJ+7d//1t4HsL4K4WgK84e4Zd+8AHgbODrmmsP27mU0DyG0PjMZOFfwBMDMjO4+bcvfvhl1PmNz96+4+3d2TZF+4+xd3j+Wozt3/L/CmmX0013Q+0BRiSWHbCywws+Lcz8z5xPjF7pzHgStzt68EHhuNnRaMxk5kUAuB/w68ambbc23fcPcnQqxJomE18DMzKwL+CKwMuZ7QuPvvzGwD8DLZd8JtI0aXbzCzemAxMNXM9gG3AncAPzezGrK/GD81KsfSJRtEROJFUz0iIjGj4BcRiRkFv4hIzCj4RURiRsEvIhIzCn4RwMz+k5k9YGavm1mTmT1hZjP7XilRZLzQ+/gl9nIfFnoEWOfun8m1VQIfCrUwkTzRiF8EzgU63P1/dze4+3bgze77ZnaVmd3b5/5GM1ucu/2Omf29mW01s81mVm1mz5nZH83soj7f/5iZPWVmu8zs1sD+dyIDKPhFYA4wkovGHQc85+7zgT8DfwssBT4JfLvP46qBy4FK4FNmVjWCY4oMm6Z6REauHXgqd/tV4C/u3mFmrwLJPo97xt2bAczsYWAR0BhkoSKgEb8IwE5g/iCP6aT/z0vfJQE7vPfaJ13AXwDcvYv+g6uB10fR9VIkFAp+EfgX4ANmdk13g5mdBZT3eUwaqDSzCWZ2Ctlpm6FamltDNUF2JaUXR1CzyLAp+CX2cqP1T5IN5tfNbCdwG/2X/XsReIPsVM5dZK8gOVQvAD8BtgMPubumeSQUujqnSADM7Cqyi4jXhl2LiEb8IhNTEpYAAAAqSURBVCIxoxG/iEjMaMQvIhIzCn4RkZhR8IuIxIyCX0QkZhT8IiIx8/8BvpdIkMWkrKgAAAAASUVORK5CYII=\n", 301 | "text/plain": [ 302 | "
" 303 | ] 304 | }, 305 | "metadata": { 306 | "needs_background": "light" 307 | }, 308 | "output_type": "display_data" 309 | } 310 | ], 311 | "source": [ 312 | "ax = cell_df[cell_df['Class'] == 4][0:50].plot(kind='scatter', x='Clump', y='UnifSize', color='DarkBlue', label='malignant');\n", 313 | "cell_df[cell_df['Class'] == 2][0:50].plot(kind='scatter', x='Clump', y='UnifSize', color='Yellow', label='benign', ax=ax);\n", 314 | "plt.show()" 315 | ] 316 | }, 317 | { 318 | "cell_type": "markdown", 319 | "metadata": {}, 320 | "source": [ 321 | "## Data pre-processing and selection" 322 | ] 323 | }, 324 | { 325 | "cell_type": "markdown", 326 | "metadata": {}, 327 | "source": [ 328 | "Lets first look at columns data types:" 329 | ] 330 | }, 331 | { 332 | "cell_type": "code", 333 | "execution_count": 6, 334 | "metadata": {}, 335 | "outputs": [ 336 | { 337 | "data": { 338 | "text/plain": [ 339 | "ID int64\n", 340 | "Clump int64\n", 341 | "UnifSize int64\n", 342 | "UnifShape int64\n", 343 | "MargAdh int64\n", 344 | "SingEpiSize int64\n", 345 | "BareNuc object\n", 346 | "BlandChrom int64\n", 347 | "NormNucl int64\n", 348 | "Mit int64\n", 349 | "Class int64\n", 350 | "dtype: object" 351 | ] 352 | }, 353 | "execution_count": 6, 354 | "metadata": {}, 355 | "output_type": "execute_result" 356 | } 357 | ], 358 | "source": [ 359 | "cell_df.dtypes" 360 | ] 361 | }, 362 | { 363 | "cell_type": "markdown", 364 | "metadata": {}, 365 | "source": [ 366 | "It looks like the __BareNuc__ column includes some values that are not numerical. We can drop those rows:" 367 | ] 368 | }, 369 | { 370 | "cell_type": "code", 371 | "execution_count": 11, 372 | "metadata": {}, 373 | "outputs": [ 374 | { 375 | "data": { 376 | "text/plain": [ 377 | "ID int64\n", 378 | "Clump int64\n", 379 | "UnifSize int64\n", 380 | "UnifShape int64\n", 381 | "MargAdh int64\n", 382 | "SingEpiSize int64\n", 383 | "BareNuc int64\n", 384 | "BlandChrom int64\n", 385 | "NormNucl int64\n", 386 | "Mit int64\n", 387 | "Class int64\n", 388 | "dtype: object" 389 | ] 390 | }, 391 | "execution_count": 11, 392 | "metadata": {}, 393 | "output_type": "execute_result" 394 | } 395 | ], 396 | "source": [ 397 | "cell_df = cell_df[pd.to_numeric(cell_df['BareNuc'], errors='coerce').notnull()]\n", 398 | "cell_df['BareNuc'] = cell_df['BareNuc'].astype('int')\n", 399 | "cell_df.dtypes" 400 | ] 401 | }, 402 | { 403 | "cell_type": "code", 404 | "execution_count": 16, 405 | "metadata": {}, 406 | "outputs": [ 407 | { 408 | "data": { 409 | "text/plain": [ 410 | "array([[ 5, 1, 1, 1, 2, 1, 3, 1, 1],\n", 411 | " [ 5, 4, 4, 5, 7, 10, 3, 2, 1],\n", 412 | " [ 3, 1, 1, 1, 2, 2, 3, 1, 1],\n", 413 | " [ 6, 8, 8, 1, 3, 4, 3, 7, 1],\n", 414 | " [ 4, 1, 1, 3, 2, 1, 3, 1, 1]])" 415 | ] 416 | }, 417 | "execution_count": 16, 418 | "metadata": {}, 419 | "output_type": "execute_result" 420 | } 421 | ], 422 | "source": [ 423 | "feature_df = cell_df[['Clump', 'UnifSize', 'UnifShape', 'MargAdh', 'SingEpiSize', 'BareNuc', 'BlandChrom', 'NormNucl', 'Mit']]\n", 424 | "X = np.asarray(feature_df)\n", 425 | "X[0:5]" 426 | ] 427 | }, 428 | { 429 | "cell_type": "markdown", 430 | "metadata": {}, 431 | "source": [ 432 | "We want the model to predict the value of Class (that is, benign (=2) or malignant (=4)). As this field can have one of only two possible values, we need to change its measurement level to reflect this." 
433 | ] 434 | }, 435 | { 436 | "cell_type": "code", 437 | "execution_count": 15, 438 | "metadata": {}, 439 | "outputs": [ 440 | { 441 | "data": { 442 | "text/plain": [ 443 | "array([2, 2, 2, 2, 2])" 444 | ] 445 | }, 446 | "execution_count": 15, 447 | "metadata": {}, 448 | "output_type": "execute_result" 449 | } 450 | ], 451 | "source": [ 452 | "cell_df['Class'] = cell_df['Class'].astype('int')\n", 453 | "y = np.asarray(cell_df['Class'])\n", 454 | "y [0:5]" 455 | ] 456 | }, 457 | { 458 | "cell_type": "markdown", 459 | "metadata": {}, 460 | "source": [ 461 | "## Train/Test dataset" 462 | ] 463 | }, 464 | { 465 | "cell_type": "markdown", 466 | "metadata": {}, 467 | "source": [ 468 | "Okay, we split our dataset into train and test set:" 469 | ] 470 | }, 471 | { 472 | "cell_type": "code", 473 | "execution_count": 14, 474 | "metadata": {}, 475 | "outputs": [ 476 | { 477 | "name": "stdout", 478 | "output_type": "stream", 479 | "text": [ 480 | "Train set: (559, 9) (559,)\n", 481 | "Test set: (140, 9) (140,)\n" 482 | ] 483 | } 484 | ], 485 | "source": [ 486 | "X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=4)\n", 487 | "print ('Train set:', X_train.shape, y_train.shape)\n", 488 | "print ('Test set:', X_test.shape, y_test.shape)" 489 | ] 490 | }, 491 | { 492 | "cell_type": "markdown", 493 | "metadata": {}, 494 | "source": [ 495 | "

Modeling (SVM with Scikit-learn)

" 496 | ] 497 | }, 498 | { 499 | "cell_type": "markdown", 500 | "metadata": {}, 501 | "source": [ 502 | "The SVM algorithm offers a choice of kernel functions for performing its processing. Basically, mapping data into a higher dimensional space is called kernelling. The mathematical function used for the transformation is known as the kernel function, and can be of different types, such as:\n", 503 | "\n", 504 | " 1.Linear\n", 505 | " 2.Polynomial\n", 506 | " 3.Radial basis function (RBF)\n", 507 | " 4.Sigmoid\n", 508 | "Each of these functions has its characteristics, its pros and cons, and its equation, but as there's no easy way of knowing which function performs best with any given dataset, we usually choose different functions in turn and compare the results. Let's just use the default, RBF (Radial Basis Function) for this lab." 509 | ] 510 | }, 511 | { 512 | "cell_type": "code", 513 | "execution_count": 17, 514 | "metadata": {}, 515 | "outputs": [ 516 | { 517 | "ename": "ValueError", 518 | "evalue": "could not convert string to float: '?'", 519 | "output_type": "error", 520 | "traceback": [ 521 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", 522 | "\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)", 523 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0msklearn\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0msvm\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2\u001b[0m \u001b[0mclf\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0msvm\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mSVC\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkernel\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m'rbf'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 3\u001b[0;31m \u001b[0mclf\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX_train\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my_train\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", 524 | "\u001b[0;32m~/conda/lib/python3.6/site-packages/sklearn/svm/base.py\u001b[0m in \u001b[0;36mfit\u001b[0;34m(self, X, y, sample_weight)\u001b[0m\n\u001b[1;32m 147\u001b[0m X, y = check_X_y(X, y, dtype=np.float64,\n\u001b[1;32m 148\u001b[0m \u001b[0morder\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m'C'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0maccept_sparse\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m'csr'\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 149\u001b[0;31m accept_large_sparse=False)\n\u001b[0m\u001b[1;32m 150\u001b[0m \u001b[0my\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_validate_targets\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0my\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 151\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", 525 | "\u001b[0;32m~/conda/lib/python3.6/site-packages/sklearn/utils/validation.py\u001b[0m in \u001b[0;36mcheck_X_y\u001b[0;34m(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)\u001b[0m\n\u001b[1;32m 754\u001b[0m \u001b[0mensure_min_features\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mensure_min_features\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 
755\u001b[0m \u001b[0mwarn_on_dtype\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mwarn_on_dtype\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 756\u001b[0;31m estimator=estimator)\n\u001b[0m\u001b[1;32m 757\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mmulti_output\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 758\u001b[0m y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,\n", 526 | "\u001b[0;32m~/conda/lib/python3.6/site-packages/sklearn/utils/validation.py\u001b[0m in \u001b[0;36mcheck_array\u001b[0;34m(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)\u001b[0m\n\u001b[1;32m 525\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 526\u001b[0m \u001b[0mwarnings\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msimplefilter\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'error'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mComplexWarning\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 527\u001b[0;31m \u001b[0marray\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0masarray\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0marray\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mdtype\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0morder\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0morder\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 528\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mComplexWarning\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 529\u001b[0m raise ValueError(\"Complex data not supported\\n\"\n", 527 | "\u001b[0;32m~/conda/lib/python3.6/site-packages/numpy/core/numeric.py\u001b[0m in \u001b[0;36masarray\u001b[0;34m(a, dtype, order)\u001b[0m\n\u001b[1;32m 499\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 500\u001b[0m \"\"\"\n\u001b[0;32m--> 501\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0marray\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ma\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcopy\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mFalse\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0morder\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0morder\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 502\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 503\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", 528 | "\u001b[0;31mValueError\u001b[0m: could not convert string to float: '?'" 529 | ] 530 | } 531 | ], 532 | "source": [ 533 | "from sklearn import svm\n", 534 | "clf = svm.SVC(kernel='rbf')\n", 535 | "clf.fit(X_train, y_train) " 536 | ] 537 | }, 538 | { 539 | "cell_type": "markdown", 540 | "metadata": {}, 541 | "source": [ 542 | "After being fitted, the model can then be used to predict new values:" 543 | ] 544 | }, 545 | { 546 | "cell_type": "code", 547 | "execution_count": null, 548 | "metadata": {}, 549 | "outputs": [], 550 | "source": [ 551 | "yhat = clf.predict(X_test)\n", 552 | "yhat [0:5]" 553 | ] 554 | }, 555 | { 556 | "cell_type": "markdown", 557 | "metadata": {}, 558 | "source": [ 559 | "
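The `ValueError: could not convert string to float: '?'` in the recorded output is a cell-ordering problem, not a modeling one: the execution counts show that the train/test split cell ran before the cell that rebuilt `X` from the cleaned dataframe, so `X_train` still contained the raw `'?'` placeholders from __BareNuc__. A minimal sketch of the intended order, using only objects already defined earlier in this notebook:

```python
# Rebuild the feature matrix from the cleaned dataframe, re-split, then fit.
cols = ['Clump', 'UnifSize', 'UnifShape', 'MargAdh', 'SingEpiSize',
        'BareNuc', 'BlandChrom', 'NormNucl', 'Mit']
X = np.asarray(cell_df[cols])                  # cell_df already has '?' rows dropped
y = np.asarray(cell_df['Class'].astype('int'))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)

from sklearn import svm
clf = svm.SVC(kernel='rbf')
clf.fit(X_train, y_train)                      # now succeeds: every feature is numeric
yhat = clf.predict(X_test)
```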

Evaluation

" 560 | ] 561 | }, 562 | { 563 | "cell_type": "code", 564 | "execution_count": null, 565 | "metadata": {}, 566 | "outputs": [], 567 | "source": [ 568 | "from sklearn.metrics import classification_report, confusion_matrix\n", 569 | "import itertools" 570 | ] 571 | }, 572 | { 573 | "cell_type": "code", 574 | "execution_count": null, 575 | "metadata": {}, 576 | "outputs": [], 577 | "source": [ 578 | "def plot_confusion_matrix(cm, classes,\n", 579 | " normalize=False,\n", 580 | " title='Confusion matrix',\n", 581 | " cmap=plt.cm.Blues):\n", 582 | " \"\"\"\n", 583 | " This function prints and plots the confusion matrix.\n", 584 | " Normalization can be applied by setting `normalize=True`.\n", 585 | " \"\"\"\n", 586 | " if normalize:\n", 587 | " cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]\n", 588 | " print(\"Normalized confusion matrix\")\n", 589 | " else:\n", 590 | " print('Confusion matrix, without normalization')\n", 591 | "\n", 592 | " print(cm)\n", 593 | "\n", 594 | " plt.imshow(cm, interpolation='nearest', cmap=cmap)\n", 595 | " plt.title(title)\n", 596 | " plt.colorbar()\n", 597 | " tick_marks = np.arange(len(classes))\n", 598 | " plt.xticks(tick_marks, classes, rotation=45)\n", 599 | " plt.yticks(tick_marks, classes)\n", 600 | "\n", 601 | " fmt = '.2f' if normalize else 'd'\n", 602 | " thresh = cm.max() / 2.\n", 603 | " for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):\n", 604 | " plt.text(j, i, format(cm[i, j], fmt),\n", 605 | " horizontalalignment=\"center\",\n", 606 | " color=\"white\" if cm[i, j] > thresh else \"black\")\n", 607 | "\n", 608 | " plt.tight_layout()\n", 609 | " plt.ylabel('True label')\n", 610 | " plt.xlabel('Predicted label')" 611 | ] 612 | }, 613 | { 614 | "cell_type": "code", 615 | "execution_count": null, 616 | "metadata": {}, 617 | "outputs": [], 618 | "source": [ 619 | "# Compute confusion matrix\n", 620 | "cnf_matrix = confusion_matrix(y_test, yhat, labels=[2,4])\n", 621 | "np.set_printoptions(precision=2)\n", 622 | "\n", 623 | "print (classification_report(y_test, yhat))\n", 624 | "\n", 625 | "# Plot non-normalized confusion matrix\n", 626 | "plt.figure()\n", 627 | "plot_confusion_matrix(cnf_matrix, classes=['Benign(2)','Malignant(4)'],normalize= False, title='Confusion matrix')" 628 | ] 629 | }, 630 | { 631 | "cell_type": "markdown", 632 | "metadata": {}, 633 | "source": [ 634 | "You can also easily use the __f1_score__ from sklearn library:" 635 | ] 636 | }, 637 | { 638 | "cell_type": "code", 639 | "execution_count": null, 640 | "metadata": {}, 641 | "outputs": [], 642 | "source": [ 643 | "from sklearn.metrics import f1_score\n", 644 | "f1_score(y_test, yhat, average='weighted') " 645 | ] 646 | }, 647 | { 648 | "cell_type": "markdown", 649 | "metadata": {}, 650 | "source": [ 651 | "Lets try jaccard index for accuracy:" 652 | ] 653 | }, 654 | { 655 | "cell_type": "code", 656 | "execution_count": null, 657 | "metadata": {}, 658 | "outputs": [], 659 | "source": [ 660 | "from sklearn.metrics import jaccard_similarity_score\n", 661 | "jaccard_similarity_score(y_test, yhat)" 662 | ] 663 | }, 664 | { 665 | "cell_type": "markdown", 666 | "metadata": {}, 667 | "source": [ 668 | "

Practice

\n", 669 | "Can you rebuild the model, but this time with a __linear__ kernel? You can use __kernel='linear'__ option, when you define the svm. How the accuracy changes with the new kernel function?" 670 | ] 671 | }, 672 | { 673 | "cell_type": "code", 674 | "execution_count": null, 675 | "metadata": {}, 676 | "outputs": [], 677 | "source": [ 678 | "# write your code here\n" 679 | ] 680 | }, 681 | { 682 | "cell_type": "markdown", 683 | "metadata": {}, 684 | "source": [ 685 | "Double-click __here__ for the solution.\n", 686 | "\n", 687 | "" 696 | ] 697 | }, 698 | { 699 | "cell_type": "markdown", 700 | "metadata": { 701 | "button": false, 702 | "new_sheet": false, 703 | "run_control": { 704 | "read_only": false 705 | } 706 | }, 707 | "source": [ 708 | "

Want to learn more?

\n", 709 | "\n", 710 | "IBM SPSS Modeler is a comprehensive analytics platform that has many machine learning algorithms. It has been designed to bring predictive intelligence to decisions made by individuals, by groups, by systems – by your enterprise as a whole. A free trial is available through this course, available here: SPSS Modeler\n", 711 | "\n", 712 | "Also, you can use Watson Studio to run these notebooks faster with bigger datasets. Watson Studio is IBM's leading cloud solution for data scientists, built by data scientists. With Jupyter notebooks, RStudio, Apache Spark and popular libraries pre-packaged in the cloud, Watson Studio enables data scientists to collaborate on their projects without having to install anything. Join the fast-growing community of Watson Studio users today with a free account at Watson Studio\n", 713 | "\n", 714 | "

Thanks for completing this lesson!

\n", 715 | "\n", 716 | "

Author: Saeed Aghabozorgi

\n", 717 | "

Saeed Aghabozorgi, PhD, is a Data Scientist at IBM with a track record of developing enterprise-level applications that substantially increase clients’ ability to turn data into actionable knowledge. He is a researcher in the data mining field and an expert in developing advanced analytic methods such as machine learning and statistical modelling on large datasets.

\n", 718 | "\n", 719 | "
\n", 720 | "\n", 721 | "

Copyright © 2018 Cognitive Class. This notebook and its source code are released under the terms of the MIT License.

" 722 | ] 723 | } 724 | ], 725 | "metadata": { 726 | "kernelspec": { 727 | "display_name": "Python 3", 728 | "language": "python", 729 | "name": "python3" 730 | }, 731 | "language_info": { 732 | "codemirror_mode": { 733 | "name": "ipython", 734 | "version": 3 735 | }, 736 | "file_extension": ".py", 737 | "mimetype": "text/x-python", 738 | "name": "python", 739 | "nbconvert_exporter": "python", 740 | "pygments_lexer": "ipython3", 741 | "version": "3.6.7" 742 | } 743 | }, 744 | "nbformat": 4, 745 | "nbformat_minor": 4 746 | } 747 | -------------------------------------------------------------------------------- /Classification Models/ML0101EN-Reg-NoneLinearRegression-py-v1.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "\n", 8 | "\n", 9 | "

Non-Linear Regression Analysis

" 10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "metadata": {}, 15 | "source": [ 16 | "If the data shows a curvy trend, then linear regression will not produce very accurate results when compared to a non-linear regression because, as the name implies, linear regression presumes that the data is linear. \n", 17 | "Let's learn about non linear regressions and apply an example on python. In this notebook, we fit a non-linear model to the datapoints corrensponding to China's GDP from 1960 to 2014." 18 | ] 19 | }, 20 | { 21 | "cell_type": "markdown", 22 | "metadata": {}, 23 | "source": [ 24 | "

Importing required libraries

" 25 | ] 26 | }, 27 | { 28 | "cell_type": "code", 29 | "execution_count": null, 30 | "metadata": { 31 | "collapsed": false, 32 | "jupyter": { 33 | "outputs_hidden": false 34 | } 35 | }, 36 | "outputs": [], 37 | "source": [ 38 | "import numpy as np\n", 39 | "import matplotlib.pyplot as plt\n", 40 | "%matplotlib inline" 41 | ] 42 | }, 43 | { 44 | "cell_type": "markdown", 45 | "metadata": {}, 46 | "source": [ 47 | "Though Linear regression is very good to solve many problems, it cannot be used for all datasets. First recall how linear regression, could model a dataset. It models a linear relation between a dependent variable y and independent variable x. It had a simple equation, of degree 1, for example y = $2x$ + 3." 48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": null, 53 | "metadata": {}, 54 | "outputs": [], 55 | "source": [ 56 | "x = np.arange(-5.0, 5.0, 0.1)\n", 57 | "\n", 58 | "##You can adjust the slope and intercept to verify the changes in the graph\n", 59 | "y = 2*(x) + 3\n", 60 | "y_noise = 2 * np.random.normal(size=x.size)\n", 61 | "ydata = y + y_noise\n", 62 | "#plt.figure(figsize=(8,6))\n", 63 | "plt.plot(x, ydata, 'bo')\n", 64 | "plt.plot(x,y, 'r') \n", 65 | "plt.ylabel('Dependent Variable')\n", 66 | "plt.xlabel('Indepdendent Variable')\n", 67 | "plt.show()" 68 | ] 69 | }, 70 | { 71 | "cell_type": "markdown", 72 | "metadata": {}, 73 | "source": [ 74 | "Non-linear regressions are a relationship between independent variables $x$ and a dependent variable $y$ which result in a non-linear function modeled data. Essentially any relationship that is not linear can be termed as non-linear, and is usually represented by the polynomial of $k$ degrees (maximum power of $x$). \n", 75 | "\n", 76 | "$$ \\ y = a x^3 + b x^2 + c x + d \\ $$\n", 77 | "\n", 78 | "Non-linear functions can have elements like exponentials, logarithms, fractions, and others. For example: $$ y = \\log(x)$$\n", 79 | " \n", 80 | "Or even, more complicated such as :\n", 81 | "$$ y = \\log(a x^3 + b x^2 + c x + d)$$" 82 | ] 83 | }, 84 | { 85 | "cell_type": "markdown", 86 | "metadata": {}, 87 | "source": [ 88 | "Let's take a look at a cubic function's graph." 89 | ] 90 | }, 91 | { 92 | "cell_type": "code", 93 | "execution_count": null, 94 | "metadata": { 95 | "collapsed": false, 96 | "jupyter": { 97 | "outputs_hidden": false 98 | } 99 | }, 100 | "outputs": [], 101 | "source": [ 102 | "x = np.arange(-5.0, 5.0, 0.1)\n", 103 | "\n", 104 | "##You can adjust the slope and intercept to verify the changes in the graph\n", 105 | "y = 1*(x**3) + 1*(x**2) + 1*x + 3\n", 106 | "y_noise = 20 * np.random.normal(size=x.size)\n", 107 | "ydata = y + y_noise\n", 108 | "plt.plot(x, ydata, 'bo')\n", 109 | "plt.plot(x,y, 'r') \n", 110 | "plt.ylabel('Dependent Variable')\n", 111 | "plt.xlabel('Indepdendent Variable')\n", 112 | "plt.show()" 113 | ] 114 | }, 115 | { 116 | "cell_type": "markdown", 117 | "metadata": {}, 118 | "source": [ 119 | "As you can see, this function has $x^3$ and $x^2$ as independent variables. Also, the graphic of this function is not a straight line over the 2D plane. So this is a non-linear function." 
120 | ] 121 | }, 122 | { 123 | "cell_type": "markdown", 124 | "metadata": {}, 125 | "source": [ 126 | "Some other types of non-linear functions are:" 127 | ] 128 | }, 129 | { 130 | "cell_type": "markdown", 131 | "metadata": {}, 132 | "source": [ 133 | "### Quadratic" 134 | ] 135 | }, 136 | { 137 | "cell_type": "markdown", 138 | "metadata": {}, 139 | "source": [ 140 | "$$ Y = X^2 $$" 141 | ] 142 | }, 143 | { 144 | "cell_type": "code", 145 | "execution_count": null, 146 | "metadata": { 147 | "collapsed": false, 148 | "jupyter": { 149 | "outputs_hidden": false 150 | } 151 | }, 152 | "outputs": [], 153 | "source": [ 154 | "x = np.arange(-5.0, 5.0, 0.1)\n", 155 | "\n", 156 | "##You can adjust the slope and intercept to verify the changes in the graph\n", 157 | "\n", 158 | "y = np.power(x,2)\n", 159 | "y_noise = 2 * np.random.normal(size=x.size)\n", 160 | "ydata = y + y_noise\n", 161 | "plt.plot(x, ydata, 'bo')\n", 162 | "plt.plot(x,y, 'r') \n", 163 | "plt.ylabel('Dependent Variable')\n", 164 | "plt.xlabel('Indepdendent Variable')\n", 165 | "plt.show()" 166 | ] 167 | }, 168 | { 169 | "cell_type": "markdown", 170 | "metadata": {}, 171 | "source": [ 172 | "### Exponential" 173 | ] 174 | }, 175 | { 176 | "cell_type": "markdown", 177 | "metadata": {}, 178 | "source": [ 179 | "An exponential function with base c is defined by $$ Y = a + b c^X$$ where b ≠0, c > 0 , c ≠1, and x is any real number. The base, c, is constant and the exponent, x, is a variable. \n", 180 | "\n" 181 | ] 182 | }, 183 | { 184 | "cell_type": "code", 185 | "execution_count": null, 186 | "metadata": { 187 | "collapsed": false, 188 | "jupyter": { 189 | "outputs_hidden": false 190 | } 191 | }, 192 | "outputs": [], 193 | "source": [ 194 | "X = np.arange(-5.0, 5.0, 0.1)\n", 195 | "\n", 196 | "##You can adjust the slope and intercept to verify the changes in the graph\n", 197 | "\n", 198 | "Y= np.exp(X)\n", 199 | "\n", 200 | "plt.plot(X,Y) \n", 201 | "plt.ylabel('Dependent Variable')\n", 202 | "plt.xlabel('Indepdendent Variable')\n", 203 | "plt.show()" 204 | ] 205 | }, 206 | { 207 | "cell_type": "markdown", 208 | "metadata": {}, 209 | "source": [ 210 | "### Logarithmic\n", 211 | "\n", 212 | "The response $y$ is a results of applying logarithmic map from input $x$'s to output variable $y$. It is one of the simplest form of __log()__: i.e. $$ y = \\log(x)$$\n", 213 | "\n", 214 | "Please consider that instead of $x$, we can use $X$, which can be polynomial representation of the $x$'s. 
In general form it would be written as \n", 215 | "\\begin{equation}\n", 216 | "y = \\log(X)\n", 217 | "\\end{equation}" 218 | ] 219 | }, 220 | { 221 | "cell_type": "code", 222 | "execution_count": null, 223 | "metadata": { 224 | "collapsed": false, 225 | "jupyter": { 226 | "outputs_hidden": false 227 | } 228 | }, 229 | "outputs": [], 230 | "source": [ 231 | "X = np.arange(-5.0, 5.0, 0.1)\n", 232 | "\n", 233 | "Y = np.log(X)\n", 234 | "\n", 235 | "plt.plot(X,Y) \n", 236 | "plt.ylabel('Dependent Variable')\n", 237 | "plt.xlabel('Indepdendent Variable')\n", 238 | "plt.show()" 239 | ] 240 | }, 241 | { 242 | "cell_type": "markdown", 243 | "metadata": {}, 244 | "source": [ 245 | "### Sigmoidal/Logistic" 246 | ] 247 | }, 248 | { 249 | "cell_type": "markdown", 250 | "metadata": {}, 251 | "source": [ 252 | "$$ Y = a + \\frac{b}{1+ c^{(X-d)}}$$" 253 | ] 254 | }, 255 | { 256 | "cell_type": "code", 257 | "execution_count": null, 258 | "metadata": {}, 259 | "outputs": [], 260 | "source": [ 261 | "X = np.arange(-5.0, 5.0, 0.1)\n", 262 | "\n", 263 | "\n", 264 | "Y = 1-4/(1+np.power(3, X-2))\n", 265 | "\n", 266 | "plt.plot(X,Y) \n", 267 | "plt.ylabel('Dependent Variable')\n", 268 | "plt.xlabel('Indepdendent Variable')\n", 269 | "plt.show()" 270 | ] 271 | }, 272 | { 273 | "cell_type": "markdown", 274 | "metadata": {}, 275 | "source": [ 276 | "\n", 277 | "# Non-Linear Regression example" 278 | ] 279 | }, 280 | { 281 | "cell_type": "markdown", 282 | "metadata": {}, 283 | "source": [ 284 | "For an example, we're going to try and fit a non-linear model to the datapoints corresponding to China's GDP from 1960 to 2014. We download a dataset with two columns, the first, a year between 1960 and 2014, the second, China's corresponding annual gross domestic income in US dollars for that year. " 285 | ] 286 | }, 287 | { 288 | "cell_type": "code", 289 | "execution_count": null, 290 | "metadata": { 291 | "collapsed": false, 292 | "jupyter": { 293 | "outputs_hidden": false 294 | } 295 | }, 296 | "outputs": [], 297 | "source": [ 298 | "import numpy as np\n", 299 | "import pandas as pd\n", 300 | "\n", 301 | "#downloading dataset\n", 302 | "!wget -nv -O china_gdp.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/china_gdp.csv\n", 303 | " \n", 304 | "df = pd.read_csv(\"china_gdp.csv\")\n", 305 | "df.head(10)" 306 | ] 307 | }, 308 | { 309 | "cell_type": "markdown", 310 | "metadata": {}, 311 | "source": [ 312 | "__Did you know?__ When it comes to Machine Learning, you will likely be working with large datasets. As a business, where can you host your data? IBM is offering a unique opportunity for businesses, with 10 Tb of IBM Cloud Object Storage: [Sign up now for free](http://cocl.us/ML0101EN-IBM-Offer-CC)" 313 | ] 314 | }, 315 | { 316 | "cell_type": "markdown", 317 | "metadata": {}, 318 | "source": [ 319 | "### Plotting the Dataset ###\n", 320 | "This is what the datapoints look like. It kind of looks like an either logistic or exponential function. The growth starts off slow, then from 2005 on forward, the growth is very significant. And finally, it decelerate slightly in the 2010s." 
321 | ] 322 | }, 323 | { 324 | "cell_type": "code", 325 | "execution_count": null, 326 | "metadata": { 327 | "collapsed": false, 328 | "jupyter": { 329 | "outputs_hidden": false 330 | } 331 | }, 332 | "outputs": [], 333 | "source": [ 334 | "plt.figure(figsize=(8,5))\n", 335 | "x_data, y_data = (df[\"Year\"].values, df[\"Value\"].values)\n", 336 | "plt.plot(x_data, y_data, 'ro')\n", 337 | "plt.ylabel('GDP')\n", 338 | "plt.xlabel('Year')\n", 339 | "plt.show()" 340 | ] 341 | }, 342 | { 343 | "cell_type": "markdown", 344 | "metadata": {}, 345 | "source": [ 346 | "### Choosing a model ###\n", 347 | "\n", 348 | "From an initial look at the plot, we determine that the logistic function could be a good approximation,\n", 349 | "since it has the property of starting with a slow growth, increasing growth in the middle, and then decreasing again at the end; as illustrated below:" 350 | ] 351 | }, 352 | { 353 | "cell_type": "code", 354 | "execution_count": null, 355 | "metadata": { 356 | "collapsed": false, 357 | "jupyter": { 358 | "outputs_hidden": false 359 | } 360 | }, 361 | "outputs": [], 362 | "source": [ 363 | "X = np.arange(-5.0, 5.0, 0.1)\n", 364 | "Y = 1.0 / (1.0 + np.exp(-X))\n", 365 | "\n", 366 | "plt.plot(X,Y) \n", 367 | "plt.ylabel('Dependent Variable')\n", 368 | "plt.xlabel('Indepdendent Variable')\n", 369 | "plt.show()" 370 | ] 371 | }, 372 | { 373 | "cell_type": "markdown", 374 | "metadata": {}, 375 | "source": [ 376 | "\n", 377 | "\n", 378 | "The formula for the logistic function is the following:\n", 379 | "\n", 380 | "$$ \\hat{Y} = \\frac1{1+e^{\\beta_1(X-\\beta_2)}}$$\n", 381 | "\n", 382 | "$\\beta_1$: Controls the curve's steepness,\n", 383 | "\n", 384 | "$\\beta_2$: Slides the curve on the x-axis." 385 | ] 386 | }, 387 | { 388 | "cell_type": "markdown", 389 | "metadata": {}, 390 | "source": [ 391 | "### Building The Model ###\n", 392 | "Now, let's build our regression model and initialize its parameters. " 393 | ] 394 | }, 395 | { 396 | "cell_type": "code", 397 | "execution_count": null, 398 | "metadata": {}, 399 | "outputs": [], 400 | "source": [ 401 | "def sigmoid(x, Beta_1, Beta_2):\n", 402 | " y = 1 / (1 + np.exp(-Beta_1*(x-Beta_2)))\n", 403 | " return y" 404 | ] 405 | }, 406 | { 407 | "cell_type": "markdown", 408 | "metadata": {}, 409 | "source": [ 410 | "Lets look at a sample sigmoid line that might fit with the data:" 411 | ] 412 | }, 413 | { 414 | "cell_type": "code", 415 | "execution_count": null, 416 | "metadata": { 417 | "collapsed": false, 418 | "jupyter": { 419 | "outputs_hidden": false 420 | } 421 | }, 422 | "outputs": [], 423 | "source": [ 424 | "beta_1 = 0.10\n", 425 | "beta_2 = 1990.0\n", 426 | "\n", 427 | "#logistic function\n", 428 | "Y_pred = sigmoid(x_data, beta_1 , beta_2)\n", 429 | "\n", 430 | "#plot initial prediction against datapoints\n", 431 | "plt.plot(x_data, Y_pred*15000000000000.)\n", 432 | "plt.plot(x_data, y_data, 'ro')" 433 | ] 434 | }, 435 | { 436 | "cell_type": "markdown", 437 | "metadata": {}, 438 | "source": [ 439 | "Our task here is to find the best parameters for our model. 
Let's first normalize our x and y:" 440 | ] 441 | }, 442 | { 443 | "cell_type": "code", 444 | "execution_count": null, 445 | "metadata": {}, 446 | "outputs": [], 447 | "source": [ 448 | "# Let's normalize our data\n", 449 | "xdata = x_data/max(x_data)\n", 450 | "ydata = y_data/max(y_data)" 451 | ] 452 | }, 453 | { 454 | "cell_type": "markdown", 455 | "metadata": {}, 456 | "source": [ 457 | "#### How do we find the best parameters for our fit line?\n", 458 | "We can use __curve_fit__, which uses non-linear least squares to fit our sigmoid function to the data. It finds optimal values for the parameters so that the sum of the squared residuals of sigmoid(xdata, *popt) - ydata is minimized.\n", 459 | "\n", 460 | "popt are our optimized parameters." 461 | ] 462 | }, 463 | { 464 | "cell_type": "code", 465 | "execution_count": null, 466 | "metadata": {}, 467 | "outputs": [], 468 | "source": [ 469 | "from scipy.optimize import curve_fit\n", 470 | "popt, pcov = curve_fit(sigmoid, xdata, ydata)\n", 471 | "#print the final parameters\n", 472 | "print(\" beta_1 = %f, beta_2 = %f\" % (popt[0], popt[1]))" 473 | ] 474 | }, 475 | { 476 | "cell_type": "markdown", 477 | "metadata": {}, 478 | "source": [ 479 | "Now we plot our resulting regression model." 480 | ] 481 | }, 482 | { 483 | "cell_type": "code", 484 | "execution_count": null, 485 | "metadata": {}, 486 | "outputs": [], 487 | "source": [ 488 | "x = np.linspace(1960, 2015, 55)\n", 489 | "x = x/max(x)\n", 490 | "plt.figure(figsize=(8,5))\n", 491 | "y = sigmoid(x, *popt)\n", 492 | "plt.plot(xdata, ydata, 'ro', label='data')\n", 493 | "plt.plot(x,y, linewidth=3.0, label='fit')\n", 494 | "plt.legend(loc='best')\n", 495 | "plt.ylabel('GDP')\n", 496 | "plt.xlabel('Year')\n", 497 | "plt.show()" 498 | ] 499 | }, 500 | { 501 | "cell_type": "markdown", 502 | "metadata": {}, 503 | "source": [ 504 | "## Practice\n", 505 | "Can you calculate the accuracy of our model?" 506 | ] 507 | }, 508 | { 509 | "cell_type": "code", 510 | "execution_count": null, 511 | "metadata": {}, 512 | "outputs": [], 513 | "source": [ 514 | "# write your code here\n", 515 | "\n", 516 | "\n" 517 | ] 518 | }, 519 | { 520 | "cell_type": "markdown", 521 | "metadata": {}, 522 | "source": [ 523 | "Double-click __here__ for the solution.\n", 524 | "\n", 525 | "" 547 | ] 548 | }, 549 | { 550 | "cell_type": "markdown", 551 | "metadata": {}, 552 | "source": [ 553 | "
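One possible approach to the practice exercise above (a sketch only, assuming the `xdata`, `ydata` and `sigmoid` defined earlier are in scope; this is not necessarily the official solution) is to refit on a train split and score the model on held-out data:

```python
# Sketch of a possible solution: split the normalized data, refit, and score.
from scipy.optimize import curve_fit
from sklearn.metrics import r2_score
import numpy as np

# random 80/20 train/test split
msk = np.random.rand(len(xdata)) < 0.8
train_x, test_x = xdata[msk], xdata[~msk]
train_y, test_y = ydata[msk], ydata[~msk]

# fit the sigmoid on the training set only
popt, pcov = curve_fit(sigmoid, train_x, train_y)

# evaluate the fit on the test set
y_hat = sigmoid(test_x, *popt)
print("Mean absolute error: %.4f" % np.mean(np.absolute(y_hat - test_y)))
print("Mean squared error (MSE): %.4f" % np.mean((y_hat - test_y) ** 2))
print("R2-score: %.4f" % r2_score(test_y, y_hat))
```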

Want to learn more?

\n", 554 | "\n", 555 | "IBM SPSS Modeler is a comprehensive analytics platform that has many machine learning algorithms. It has been designed to bring predictive intelligence to decisions made by individuals, by groups, by systems – by your enterprise as a whole. A free trial is available through this course, available here: SPSS Modeler\n", 556 | "\n", 557 | "Also, you can use Watson Studio to run these notebooks faster with bigger datasets. Watson Studio is IBM's leading cloud solution for data scientists, built by data scientists. With Jupyter notebooks, RStudio, Apache Spark and popular libraries pre-packaged in the cloud, Watson Studio enables data scientists to collaborate on their projects without having to install anything. Join the fast-growing community of Watson Studio users today with a free account at Watson Studio\n", 558 | "\n", 559 | "

Thanks for completing this lesson!

\n", 560 | "\n", 561 | "

Author: Saeed Aghabozorgi

\n", 562 | "

Saeed Aghabozorgi, PhD, is a Data Scientist at IBM with a track record of developing enterprise-level applications that substantially increase clients’ ability to turn data into actionable knowledge. He is a researcher in the data mining field and an expert in developing advanced analytic methods like machine learning and statistical modelling on large datasets.

\n", 563 | "\n", 564 | "
\n", 565 | "\n", 566 | "

Copyright © 2018 Cognitive Class. This notebook and its source code are released under the terms of the MIT License.

" 567 | ] 568 | } 569 | ], 570 | "metadata": { 571 | "kernelspec": { 572 | "display_name": "Python 3", 573 | "language": "python", 574 | "name": "python3" 575 | }, 576 | "language_info": { 577 | "codemirror_mode": { 578 | "name": "ipython", 579 | "version": 3 580 | }, 581 | "file_extension": ".py", 582 | "mimetype": "text/x-python", 583 | "name": "python", 584 | "nbconvert_exporter": "python", 585 | "pygments_lexer": "ipython3", 586 | "version": "3.6.7" 587 | } 588 | }, 589 | "nbformat": 4, 590 | "nbformat_minor": 4 591 | } 592 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ## Assignment Intructions: 2 | 3 | Now that you have been equipped with the skills to use different Machine Learning algorithms, over the course of five weeks, you will have the opportunity to practice and apply it on a dataset. In this project, you will complete a notebook where you will build a classifier to predict whether a loan case will be paid off or not. 4 | 5 | You load a historical dataset from previous loan applications, clean the data, and apply different classification algorithm on the data. You are expected to use the following algorithms to build your models: 6 | 7 | * k-Nearest Neighbour 8 | * Decision Tree 9 | * Support Vector Machine 10 | * Logistic Regression 11 | 12 | The results is reported as the accuracy of each classifier, using the following metrics when these are applicable: 13 | 14 | * Jaccard index 15 | * F1-score 16 | * LogLoass 17 | ------------ 18 | ## Setup Instructions: 19 | ### A-Create an account in Watson Studio if you dont have (If you already have it, jump to step B). 20 | 21 | * Browse into https://www.ibm.com/cloud/watson-studio 22 | * Click on 'Start your free trial' 23 | * Enter your email, and click 'Next' 24 | * Enter your Name, and choose a Password. Then click on 'Create Account' 25 | * Go to your email, and confirm your account. 26 | * Click on 'Proceed' 27 | * In "Select Organization and Space" form, leave everything as default, and click on 'Continue' 28 | * It is done. Click on 'Get started!' 29 | 30 | ### B-Sign in into Watson Studio and import your notebook 31 | 32 | * Sign in into https://www.ibm.com/cloud/watson-studio 33 | * Click on 'New Project' 34 | * Select 'Data Science' as type of project. 35 | * Give a name to your project, and a description for your reference, then setup your project as following and click "Create". 36 | 37 | > Notice 1: because you are going to share this project with your peer for evaluation, please make sure you have unchecked `Restrict who can be a collaborator` 38 | 39 | > Notice 2: You have to create an IBM Object Storage, if you dont have any IBM Object Storage (you can use the free Lite plan) 40 | 41 | * From the top-right, Click on 'Add to project', and then select 'Notebook'. C 42 | 43 | * In the 'New notebook' form, click on 'From URL', and enter the Notebook URL: https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/ML0101EN-Proj-Loan-py-v1.ipynb 44 | 45 | * Give the notebook a proper name and description and click on `Create Notebook` to initialize the notebook 46 | 47 | C. Complete the notebook 48 | 49 | * Start running the notebook 50 | * Complete the notebook based on the description in the notebook. 
51 | -------------------------------------------------------------------------------- /Recommender System/ML0101EN-RecSys-Collaborative-Filtering-movies-py-v1.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "button": false, 7 | "deletable": true, 8 | "new_sheet": false, 9 | "run_control": { 10 | "read_only": false 11 | } 12 | }, 13 | "source": [ 14 | "\n", 15 | "\n", 16 | "

COLLABORATIVE FILTERING

" 17 | ] 18 | }, 19 | { 20 | "cell_type": "markdown", 21 | "metadata": { 22 | "button": false, 23 | "deletable": true, 24 | "new_sheet": false, 25 | "run_control": { 26 | "read_only": false 27 | } 28 | }, 29 | "source": [ 30 | "Recommendation systems are a collection of algorithms used to recommend items to users based on information taken from the user. These systems have become ubiquitous can be commonly seen in online stores, movies databases and job finders. In this notebook, we will explore recommendation systems based on Collaborative Filtering and implement simple version of one using Python and the Pandas library." 31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "metadata": { 36 | "button": false, 37 | "deletable": true, 38 | "new_sheet": false, 39 | "run_control": { 40 | "read_only": false 41 | } 42 | }, 43 | "source": [ 44 | "

Table of contents

\n", 45 | "\n", 46 | "
\n", 47 | "
    \n", 48 | "
  1. Acquiring the Data\n", 49 | "
  2. Preprocessing\n", 50 | "
  3. Collaborative Filtering\n", 51 | "
\n", 52 | "
\n", 53 | "
\n", 54 | "
" 55 | ] 56 | }, 57 | { 58 | "cell_type": "markdown", 59 | "metadata": { 60 | "button": false, 61 | "deletable": true, 62 | "new_sheet": false, 63 | "run_control": { 64 | "read_only": false 65 | } 66 | }, 67 | "source": [ 68 | "\n", 69 | "\n", 70 | "\n", 71 | "# Acquiring the Data" 72 | ] 73 | }, 74 | { 75 | "cell_type": "markdown", 76 | "metadata": { 77 | "button": false, 78 | "deletable": true, 79 | "new_sheet": false, 80 | "run_control": { 81 | "read_only": false 82 | } 83 | }, 84 | "source": [ 85 | "To acquire and extract the data, simply run the following Bash scripts: \n", 86 | "Dataset acquired from [GroupLens](http://grouplens.org/datasets/movielens/). Lets download the dataset. To download the data, we will use **`!wget`** to download it from IBM Object Storage. \n", 87 | "__Did you know?__ When it comes to Machine Learning, you will likely be working with large datasets. As a business, where can you host your data? IBM is offering a unique opportunity for businesses, with 10 Tb of IBM Cloud Object Storage: [Sign up now for free](http://cocl.us/ML0101EN-IBM-Offer-CC)" 88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": null, 93 | "metadata": { 94 | "button": false, 95 | "collapsed": false, 96 | "deletable": true, 97 | "jupyter": { 98 | "outputs_hidden": false 99 | }, 100 | "new_sheet": false, 101 | "run_control": { 102 | "read_only": false 103 | } 104 | }, 105 | "outputs": [], 106 | "source": [ 107 | "!wget -O moviedataset.zip https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/moviedataset.zip\n", 108 | "print('unziping ...')\n", 109 | "!unzip -o -j moviedataset.zip " 110 | ] 111 | }, 112 | { 113 | "cell_type": "markdown", 114 | "metadata": { 115 | "button": false, 116 | "deletable": true, 117 | "new_sheet": false, 118 | "run_control": { 119 | "read_only": false 120 | } 121 | }, 122 | "source": [ 123 | "Now you're ready to start working with the data!" 124 | ] 125 | }, 126 | { 127 | "cell_type": "markdown", 128 | "metadata": { 129 | "button": false, 130 | "deletable": true, 131 | "new_sheet": false, 132 | "run_control": { 133 | "read_only": false 134 | } 135 | }, 136 | "source": [ 137 | "
\n", 138 | "\n", 139 | "\n", 140 | "# Preprocessing" 141 | ] 142 | }, 143 | { 144 | "cell_type": "markdown", 145 | "metadata": { 146 | "button": false, 147 | "deletable": true, 148 | "new_sheet": false, 149 | "run_control": { 150 | "read_only": false 151 | } 152 | }, 153 | "source": [ 154 | "First, let's get all of the imports out of the way:" 155 | ] 156 | }, 157 | { 158 | "cell_type": "code", 159 | "execution_count": null, 160 | "metadata": { 161 | "button": false, 162 | "collapsed": false, 163 | "deletable": true, 164 | "jupyter": { 165 | "outputs_hidden": false 166 | }, 167 | "new_sheet": false, 168 | "run_control": { 169 | "read_only": false 170 | } 171 | }, 172 | "outputs": [], 173 | "source": [ 174 | "#Dataframe manipulation library\n", 175 | "import pandas as pd\n", 176 | "#Math functions, we'll only need the sqrt function so let's import only that\n", 177 | "from math import sqrt\n", 178 | "import numpy as np\n", 179 | "import matplotlib.pyplot as plt\n", 180 | "%matplotlib inline" 181 | ] 182 | }, 183 | { 184 | "cell_type": "markdown", 185 | "metadata": { 186 | "button": false, 187 | "deletable": true, 188 | "new_sheet": false, 189 | "run_control": { 190 | "read_only": false 191 | } 192 | }, 193 | "source": [ 194 | "Now let's read each file into their Dataframes:" 195 | ] 196 | }, 197 | { 198 | "cell_type": "code", 199 | "execution_count": null, 200 | "metadata": { 201 | "button": false, 202 | "collapsed": false, 203 | "deletable": true, 204 | "jupyter": { 205 | "outputs_hidden": false 206 | }, 207 | "new_sheet": false, 208 | "run_control": { 209 | "read_only": false 210 | } 211 | }, 212 | "outputs": [], 213 | "source": [ 214 | "#Storing the movie information into a pandas dataframe\n", 215 | "movies_df = pd.read_csv('movies.csv')\n", 216 | "#Storing the user information into a pandas dataframe\n", 217 | "ratings_df = pd.read_csv('ratings.csv')" 218 | ] 219 | }, 220 | { 221 | "cell_type": "markdown", 222 | "metadata": { 223 | "button": false, 224 | "deletable": true, 225 | "new_sheet": false, 226 | "run_control": { 227 | "read_only": false 228 | } 229 | }, 230 | "source": [ 231 | "Let's also take a peek at how each of them are organized:" 232 | ] 233 | }, 234 | { 235 | "cell_type": "code", 236 | "execution_count": null, 237 | "metadata": { 238 | "button": false, 239 | "collapsed": false, 240 | "deletable": true, 241 | "jupyter": { 242 | "outputs_hidden": false 243 | }, 244 | "new_sheet": false, 245 | "run_control": { 246 | "read_only": false 247 | } 248 | }, 249 | "outputs": [], 250 | "source": [ 251 | "#Head is a function that gets the first N rows of a dataframe. N's default is 5.\n", 252 | "movies_df.head()" 253 | ] 254 | }, 255 | { 256 | "cell_type": "markdown", 257 | "metadata": { 258 | "button": false, 259 | "deletable": true, 260 | "new_sheet": false, 261 | "run_control": { 262 | "read_only": false 263 | } 264 | }, 265 | "source": [ 266 | "So each movie has a unique ID, a title with its release year along with it (Which may contain unicode characters) and several different genres in the same field. Let's remove the year from the title column and place it into its own one by using the handy [extract](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.extract.html#pandas.Series.str.extract) function that Pandas has." 
267 | ] 268 | }, 269 | { 270 | "cell_type": "markdown", 271 | "metadata": { 272 | "button": false, 273 | "deletable": true, 274 | "new_sheet": false, 275 | "run_control": { 276 | "read_only": false 277 | } 278 | }, 279 | "source": [ 280 | "Let's remove the year from the __title__ column by using pandas' replace function and store it in a new __year__ column." 281 | ] 282 | }, 283 | { 284 | "cell_type": "code", 285 | "execution_count": null, 286 | "metadata": { 287 | "button": false, 288 | "collapsed": false, 289 | "deletable": true, 290 | "jupyter": { 291 | "outputs_hidden": false 292 | }, 293 | "new_sheet": false, 294 | "run_control": { 295 | "read_only": false 296 | } 297 | }, 298 | "outputs": [], 299 | "source": [ 300 | "#Using regular expressions to find a year stored between parentheses\n", 301 | "#We specify the parentheses so we don't conflict with movies that have years in their titles\n", 302 | "movies_df['year'] = movies_df.title.str.extract('(\\(\\d\\d\\d\\d\\))',expand=False)\n", 303 | "#Removing the parentheses\n", 304 | "movies_df['year'] = movies_df.year.str.extract('(\\d\\d\\d\\d)',expand=False)\n", 305 | "#Removing the years from the 'title' column\n", 306 | "movies_df['title'] = movies_df.title.str.replace('(\\(\\d\\d\\d\\d\\))', '')\n", 307 | "#Applying the strip function to get rid of any ending whitespace characters that may have appeared\n", 308 | "movies_df['title'] = movies_df['title'].apply(lambda x: x.strip())" 309 | ] 310 | }, 311 | { 312 | "cell_type": "markdown", 313 | "metadata": { 314 | "button": false, 315 | "deletable": true, 316 | "new_sheet": false, 317 | "run_control": { 318 | "read_only": false 319 | } 320 | }, 321 | "source": [ 322 | "Let's look at the result!" 323 | ] 324 | }, 325 | { 326 | "cell_type": "code", 327 | "execution_count": null, 328 | "metadata": { 329 | "button": false, 330 | "collapsed": false, 331 | "deletable": true, 332 | "jupyter": { 333 | "outputs_hidden": false 334 | }, 335 | "new_sheet": false, 336 | "run_control": { 337 | "read_only": false 338 | } 339 | }, 340 | "outputs": [], 341 | "source": [ 342 | "movies_df.head()" 343 | ] 344 | }, 345 | { 346 | "cell_type": "markdown", 347 | "metadata": { 348 | "button": false, 349 | "deletable": true, 350 | "new_sheet": false, 351 | "run_control": { 352 | "read_only": false 353 | } 354 | }, 355 | "source": [ 356 | "With that, let's also drop the genres column since we won't need it for this particular recommendation system." 
357 | ] 358 | }, 359 | { 360 | "cell_type": "code", 361 | "execution_count": null, 362 | "metadata": { 363 | "button": false, 364 | "collapsed": false, 365 | "deletable": true, 366 | "jupyter": { 367 | "outputs_hidden": false 368 | }, 369 | "new_sheet": false, 370 | "run_control": { 371 | "read_only": false 372 | } 373 | }, 374 | "outputs": [], 375 | "source": [ 376 | "#Dropping the genres column\n", 377 | "movies_df = movies_df.drop('genres', 1)" 378 | ] 379 | }, 380 | { 381 | "cell_type": "markdown", 382 | "metadata": { 383 | "button": false, 384 | "deletable": true, 385 | "new_sheet": false, 386 | "run_control": { 387 | "read_only": false 388 | } 389 | }, 390 | "source": [ 391 | "Here's the final movies dataframe:" 392 | ] 393 | }, 394 | { 395 | "cell_type": "code", 396 | "execution_count": null, 397 | "metadata": { 398 | "button": false, 399 | "collapsed": false, 400 | "deletable": true, 401 | "jupyter": { 402 | "outputs_hidden": false 403 | }, 404 | "new_sheet": false, 405 | "run_control": { 406 | "read_only": false 407 | } 408 | }, 409 | "outputs": [], 410 | "source": [ 411 | "movies_df.head()" 412 | ] 413 | }, 414 | { 415 | "cell_type": "markdown", 416 | "metadata": { 417 | "button": false, 418 | "deletable": true, 419 | "new_sheet": false, 420 | "run_control": { 421 | "read_only": false 422 | } 423 | }, 424 | "source": [ 425 | "
" 426 | ] 427 | }, 428 | { 429 | "cell_type": "markdown", 430 | "metadata": { 431 | "button": false, 432 | "deletable": true, 433 | "new_sheet": false, 434 | "run_control": { 435 | "read_only": false 436 | } 437 | }, 438 | "source": [ 439 | "Next, let's look at the ratings dataframe." 440 | ] 441 | }, 442 | { 443 | "cell_type": "code", 444 | "execution_count": null, 445 | "metadata": { 446 | "button": false, 447 | "collapsed": false, 448 | "deletable": true, 449 | "jupyter": { 450 | "outputs_hidden": false 451 | }, 452 | "new_sheet": false, 453 | "run_control": { 454 | "read_only": false 455 | } 456 | }, 457 | "outputs": [], 458 | "source": [ 459 | "ratings_df.head()" 460 | ] 461 | }, 462 | { 463 | "cell_type": "markdown", 464 | "metadata": { 465 | "button": false, 466 | "deletable": true, 467 | "new_sheet": false, 468 | "run_control": { 469 | "read_only": false 470 | } 471 | }, 472 | "source": [ 473 | "Every row in the ratings dataframe has a user id associated with at least one movie, a rating and a timestamp showing when they reviewed it. We won't be needing the timestamp column, so let's drop it to save on memory." 474 | ] 475 | }, 476 | { 477 | "cell_type": "code", 478 | "execution_count": null, 479 | "metadata": { 480 | "button": false, 481 | "collapsed": false, 482 | "deletable": true, 483 | "jupyter": { 484 | "outputs_hidden": false 485 | }, 486 | "new_sheet": false, 487 | "run_control": { 488 | "read_only": false 489 | } 490 | }, 491 | "outputs": [], 492 | "source": [ 493 | "#Drop removes a specified row or column from a dataframe\n", 494 | "ratings_df = ratings_df.drop('timestamp', 1)" 495 | ] 496 | }, 497 | { 498 | "cell_type": "markdown", 499 | "metadata": { 500 | "button": false, 501 | "deletable": true, 502 | "new_sheet": false, 503 | "run_control": { 504 | "read_only": false 505 | } 506 | }, 507 | "source": [ 508 | "Here's how the final ratings Dataframe looks like:" 509 | ] 510 | }, 511 | { 512 | "cell_type": "code", 513 | "execution_count": null, 514 | "metadata": { 515 | "button": false, 516 | "collapsed": false, 517 | "deletable": true, 518 | "jupyter": { 519 | "outputs_hidden": false 520 | }, 521 | "new_sheet": false, 522 | "run_control": { 523 | "read_only": false 524 | }, 525 | "scrolled": true 526 | }, 527 | "outputs": [], 528 | "source": [ 529 | "ratings_df.head()" 530 | ] 531 | }, 532 | { 533 | "cell_type": "markdown", 534 | "metadata": { 535 | "button": false, 536 | "deletable": true, 537 | "new_sheet": false, 538 | "run_control": { 539 | "read_only": false 540 | } 541 | }, 542 | "source": [ 543 | "
\n", 544 | "\n", 545 | "\n", 546 | "# Collaborative Filtering" 547 | ] 548 | }, 549 | { 550 | "cell_type": "markdown", 551 | "metadata": { 552 | "button": false, 553 | "deletable": true, 554 | "new_sheet": false, 555 | "run_control": { 556 | "read_only": false 557 | } 558 | }, 559 | "source": [ 560 | "Now, time to start our work on recommendation systems. \n", 561 | "\n", 562 | "The first technique we're going to take a look at is called __Collaborative Filtering__, which is also known as __User-User Filtering__. As hinted by its alternate name, this technique uses other users to recommend items to the input user. It attempts to find users that have similar preferences and opinions as the input and then recommends items that they have liked to the input. There are several methods of finding similar users (Even some making use of Machine Learning), and the one we will be using here is going to be based on the __Pearson Correlation Function__.\n", 563 | "\n", 564 | "\n", 565 | "\n", 566 | "\n", 567 | "The process for creating a User Based recommendation system is as follows:\n", 568 | "- Select a user with the movies the user has watched\n", 569 | "- Based on his rating to movies, find the top X neighbours \n", 570 | "- Get the watched movie record of the user for each neighbour.\n", 571 | "- Calculate a similarity score using some formula\n", 572 | "- Recommend the items with the highest score\n", 573 | "\n", 574 | "\n", 575 | "Let's begin by creating an input user to recommend movies to:\n", 576 | "\n", 577 | "Notice: To add more movies, simply increase the amount of elements in the userInput. Feel free to add more in! Just be sure to write it in with capital letters and if a movie starts with a \"The\", like \"The Matrix\" then write it in like this: 'Matrix, The' ." 578 | ] 579 | }, 580 | { 581 | "cell_type": "code", 582 | "execution_count": null, 583 | "metadata": { 584 | "button": false, 585 | "collapsed": false, 586 | "deletable": true, 587 | "jupyter": { 588 | "outputs_hidden": false 589 | }, 590 | "new_sheet": false, 591 | "run_control": { 592 | "read_only": false 593 | } 594 | }, 595 | "outputs": [], 596 | "source": [ 597 | "userInput = [\n", 598 | " {'title':'Breakfast Club, The', 'rating':5},\n", 599 | " {'title':'Toy Story', 'rating':3.5},\n", 600 | " {'title':'Jumanji', 'rating':2},\n", 601 | " {'title':\"Pulp Fiction\", 'rating':5},\n", 602 | " {'title':'Akira', 'rating':4.5}\n", 603 | " ] \n", 604 | "inputMovies = pd.DataFrame(userInput)\n", 605 | "inputMovies" 606 | ] 607 | }, 608 | { 609 | "cell_type": "markdown", 610 | "metadata": { 611 | "button": false, 612 | "deletable": true, 613 | "new_sheet": false, 614 | "run_control": { 615 | "read_only": false 616 | } 617 | }, 618 | "source": [ 619 | "#### Add movieId to input user\n", 620 | "With the input complete, let's extract the input movies's ID's from the movies dataframe and add them into it.\n", 621 | "\n", 622 | "We can achieve this by first filtering out the rows that contain the input movies' title and then merging this subset with the input dataframe. We also drop unnecessary columns for the input to save memory space." 
623 | ] 624 | }, 625 | { 626 | "cell_type": "code", 627 | "execution_count": null, 628 | "metadata": { 629 | "button": false, 630 | "collapsed": false, 631 | "deletable": true, 632 | "jupyter": { 633 | "outputs_hidden": false 634 | }, 635 | "new_sheet": false, 636 | "run_control": { 637 | "read_only": false 638 | }, 639 | "scrolled": true 640 | }, 641 | "outputs": [], 642 | "source": [ 643 | "#Filtering out the movies by title\n", 644 | "inputId = movies_df[movies_df['title'].isin(inputMovies['title'].tolist())]\n", 645 | "#Then merging it so we can get the movieId. It's implicitly merging it by title.\n", 646 | "inputMovies = pd.merge(inputId, inputMovies)\n", 647 | "#Dropping information we won't use from the input dataframe\n", 648 | "inputMovies = inputMovies.drop('year', 1)\n", 649 | "#Final input dataframe\n", 650 | "#If a movie you added in above isn't here, then it might not be in the original \n", 651 | "#dataframe or it might be spelled differently, please check capitalisation.\n", 652 | "inputMovies" 653 | ] 654 | }, 655 | { 656 | "cell_type": "markdown", 657 | "metadata": { 658 | "button": false, 659 | "deletable": true, 660 | "new_sheet": false, 661 | "run_control": { 662 | "read_only": false 663 | } 664 | }, 665 | "source": [ 666 | "#### The users who have seen the same movies\n", 667 | "With the movie IDs in our input, we can now get the subset of users that have watched and reviewed the movies in our input.\n" 668 | ] 669 | }, 670 | { 671 | "cell_type": "code", 672 | "execution_count": null, 673 | "metadata": { 674 | "button": false, 675 | "collapsed": false, 676 | "deletable": true, 677 | "jupyter": { 678 | "outputs_hidden": false 679 | }, 680 | "new_sheet": false, 681 | "run_control": { 682 | "read_only": false 683 | } 684 | }, 685 | "outputs": [], 686 | "source": [ 687 | "#Filtering out users that have watched movies that the input has watched and storing it\n", 688 | "userSubset = ratings_df[ratings_df['movieId'].isin(inputMovies['movieId'].tolist())]\n", 689 | "userSubset.head()" 690 | ] 691 | }, 692 | { 693 | "cell_type": "markdown", 694 | "metadata": { 695 | "button": false, 696 | "deletable": true, 697 | "new_sheet": false, 698 | "run_control": { 699 | "read_only": false 700 | } 701 | }, 702 | "source": [ 703 | "We now group up the rows by user ID." 704 | ] 705 | }, 706 | { 707 | "cell_type": "code", 708 | "execution_count": null, 709 | "metadata": { 710 | "button": false, 711 | "collapsed": false, 712 | "deletable": true, 713 | "jupyter": { 714 | "outputs_hidden": false 715 | }, 716 | "new_sheet": false, 717 | "run_control": { 718 | "read_only": false 719 | } 720 | }, 721 | "outputs": [], 722 | "source": [ 723 | "#Groupby creates several sub dataframes where they all have the same value in the column specified as the parameter\n", 724 | "userSubsetGroup = userSubset.groupby(['userId'])" 725 | ] 726 | }, 727 | { 728 | "cell_type": "markdown", 729 | "metadata": { 730 | "button": false, 731 | "deletable": true, 732 | "new_sheet": false, 733 | "run_control": { 734 | "read_only": false 735 | } 736 | }, 737 | "source": [ 738 | "Let's look at one of the users, e.g. 
the one with userID=1130" 739 | ] 740 | }, 741 | { 742 | "cell_type": "code", 743 | "execution_count": null, 744 | "metadata": { 745 | "button": false, 746 | "collapsed": false, 747 | "deletable": true, 748 | "jupyter": { 749 | "outputs_hidden": false 750 | }, 751 | "new_sheet": false, 752 | "run_control": { 753 | "read_only": false 754 | } 755 | }, 756 | "outputs": [], 757 | "source": [ 758 | "userSubsetGroup.get_group(1130)" 759 | ] 760 | }, 761 | { 762 | "cell_type": "markdown", 763 | "metadata": { 764 | "button": false, 765 | "deletable": true, 766 | "new_sheet": false, 767 | "run_control": { 768 | "read_only": false 769 | } 770 | }, 771 | "source": [ 772 | "Let's also sort these groups so the users that share the most movies in common with the input have higher priority. This provides a richer recommendation since we won't go through every single user." 773 | ] 774 | }, 775 | { 776 | "cell_type": "code", 777 | "execution_count": null, 778 | "metadata": { 779 | "button": false, 780 | "collapsed": false, 781 | "deletable": true, 782 | "jupyter": { 783 | "outputs_hidden": false 784 | }, 785 | "new_sheet": false, 786 | "run_control": { 787 | "read_only": false 788 | } 789 | }, 790 | "outputs": [], 791 | "source": [ 792 | "#Sorting it so users with the most movies in common with the input will have priority\n", 793 | "userSubsetGroup = sorted(userSubsetGroup, key=lambda x: len(x[1]), reverse=True)" 794 | ] 795 | }, 796 | { 797 | "cell_type": "markdown", 798 | "metadata": { 799 | "button": false, 800 | "deletable": true, 801 | "new_sheet": false, 802 | "run_control": { 803 | "read_only": false 804 | } 805 | }, 806 | "source": [ 807 | "Now let's look at the first few users" 808 | ] 809 | }, 810 | { 811 | "cell_type": "code", 812 | "execution_count": null, 813 | "metadata": { 814 | "button": false, 815 | "collapsed": false, 816 | "deletable": true, 817 | "jupyter": { 818 | "outputs_hidden": false 819 | }, 820 | "new_sheet": false, 821 | "run_control": { 822 | "read_only": false 823 | } 824 | }, 825 | "outputs": [], 826 | "source": [ 827 | "userSubsetGroup[0:3]" 828 | ] 829 | }, 830 | { 831 | "cell_type": "markdown", 832 | "metadata": { 833 | "button": false, 834 | "deletable": true, 835 | "new_sheet": false, 836 | "run_control": { 837 | "read_only": false 838 | } 839 | }, 840 | "source": [ 841 | "#### Similarity of users to input user\n", 842 | "Next, we are going to compare all users (well, not really all of them!) to our specified user and find the ones that are most similar. \n", 843 | "We're going to find out how similar each user is to the input through the __Pearson Correlation Coefficient__. It is used to measure the strength of a linear association between two variables. The formula for finding this coefficient between sets X and Y with N values can be seen in the image below. \n", 844 | "\n", 845 | "Why Pearson Correlation?\n", 846 | "\n", 847 | "Pearson correlation is invariant to scaling, i.e. multiplying all elements by a positive constant or adding any constant to all elements. For example, if you have two vectors X and Y, then pearson(X, Y) == pearson(X, 2 * Y + 3). This is a pretty important property in recommendation systems because two users might rate two series of items totally differently in terms of absolute rates, but they would still be similar users (i.e. 
with similar ideas) with similar ratings on different scales.\n", 848 | "\n", 849 | "![alt text](https://wikimedia.org/api/rest_v1/media/math/render/svg/bd1ccc2979b0fd1c1aec96e386f686ae874f9ec0 \"Pearson Correlation\")\n", 850 | "\n", 851 | "The values given by the formula vary from r = -1 to r = 1, where 1 indicates a perfect positive correlation between the two entities and -1 indicates a perfect negative correlation. \n", 852 | "\n", 853 | "In our case, a 1 means that the two users have similar tastes while a -1 means the opposite." 854 | ] 855 | }, 856 | { 857 | "cell_type": "markdown", 858 | "metadata": { 859 | "button": false, 860 | "deletable": true, 861 | "new_sheet": false, 862 | "run_control": { 863 | "read_only": false 864 | } 865 | }, 866 | "source": [ 867 | "We will select a subset of users to iterate through. This limit is imposed because we don't want to waste too much time going through every single user." 868 | ] 869 | }, 870 | { 871 | "cell_type": "code", 872 | "execution_count": null, 873 | "metadata": { 874 | "button": false, 875 | "collapsed": false, 876 | "deletable": true, 877 | "jupyter": { 878 | "outputs_hidden": false 879 | }, 880 | "new_sheet": false, 881 | "run_control": { 882 | "read_only": false 883 | } 884 | }, 885 | "outputs": [], 886 | "source": [ 887 | "userSubsetGroup = userSubsetGroup[0:100]" 888 | ] 889 | }, 890 | { 891 | "cell_type": "markdown", 892 | "metadata": { 893 | "button": false, 894 | "deletable": true, 895 | "new_sheet": false, 896 | "run_control": { 897 | "read_only": false 898 | } 899 | }, 900 | "source": [ 901 | "Now, we calculate the Pearson Correlation between the input user and the subset group, and store it in a dictionary, where the key is the user Id and the value is the coefficient.\n" 902 | ] 903 | }, 904 | { 905 | "cell_type": "code", 906 | "execution_count": null, 907 | "metadata": { 908 | "button": false, 909 | "collapsed": false, 910 | "deletable": true, 911 | "jupyter": { 912 | "outputs_hidden": false 913 | }, 914 | "new_sheet": false, 915 | "run_control": { 916 | "read_only": false 917 | }, 918 | "scrolled": true 919 | }, 920 | "outputs": [], 921 | "source": [ 922 | "#Store the Pearson Correlation in a dictionary, where the key is the user Id and the value is the coefficient\n", 923 | "pearsonCorrelationDict = {}\n", 924 | "\n", 925 | "#For every user group in our subset\n", 926 | "for name, group in userSubsetGroup:\n", 927 | " #Let's start by sorting the input and current user group so the values aren't mixed up later on\n", 928 | " group = group.sort_values(by='movieId')\n", 929 | " inputMovies = inputMovies.sort_values(by='movieId')\n", 930 | " #Get the N for the formula\n", 931 | " nRatings = len(group)\n", 932 | " #Get the review scores for the movies that they both have in common\n", 933 | " temp_df = inputMovies[inputMovies['movieId'].isin(group['movieId'].tolist())]\n", 934 | " #And then store them in a temporary buffer variable in a list format to facilitate future calculations\n", 935 | " tempRatingList = temp_df['rating'].tolist()\n", 936 | " #Let's also put the current user group reviews in a list format\n", 937 | " tempGroupList = group['rating'].tolist()\n", 938 | " #Now let's calculate the Pearson correlation between two users, call them x and y\n", 939 | " Sxx = sum([i**2 for i in tempRatingList]) - pow(sum(tempRatingList),2)/float(nRatings)\n", 940 | " Syy = sum([i**2 for i in tempGroupList]) - pow(sum(tempGroupList),2)/float(nRatings)\n", 941 | " Sxy = sum( i*j for i, j in zip(tempRatingList, 
tempGroupList)) - sum(tempRatingList)*sum(tempGroupList)/float(nRatings)\n", 942 | " \n", 943 | " #If the denominator is different from zero, divide; otherwise, the correlation is 0.\n", 944 | " if Sxx != 0 and Syy != 0:\n", 945 | " pearsonCorrelationDict[name] = Sxy/sqrt(Sxx*Syy)\n", 946 | " else:\n", 947 | " pearsonCorrelationDict[name] = 0\n" 948 | ] 949 | }, 950 | { 951 | "cell_type": "code", 952 | "execution_count": null, 953 | "metadata": {}, 954 | "outputs": [], 955 | "source": [ 956 | "pearsonCorrelationDict.items()" 957 | ] 958 | }, 959 | { 960 | "cell_type": "code", 961 | "execution_count": null, 962 | "metadata": {}, 963 | "outputs": [], 964 | "source": [ 965 | "pearsonDF = pd.DataFrame.from_dict(pearsonCorrelationDict, orient='index')\n", 966 | "pearsonDF.columns = ['similarityIndex']\n", 967 | "pearsonDF['userId'] = pearsonDF.index\n", 968 | "pearsonDF.index = range(len(pearsonDF))\n", 969 | "pearsonDF.head()" 970 | ] 971 | }, 972 | { 973 | "cell_type": "markdown", 974 | "metadata": { 975 | "button": false, 976 | "deletable": true, 977 | "new_sheet": false, 978 | "run_control": { 979 | "read_only": false 980 | } 981 | }, 982 | "source": [ 983 | "#### The top X similar users to the input user\n", 984 | "Now let's get the top 50 users that are most similar to the input." 985 | ] 986 | }, 987 | { 988 | "cell_type": "code", 989 | "execution_count": null, 990 | "metadata": { 991 | "button": false, 992 | "collapsed": false, 993 | "deletable": true, 994 | "jupyter": { 995 | "outputs_hidden": false 996 | }, 997 | "new_sheet": false, 998 | "run_control": { 999 | "read_only": false 1000 | } 1001 | }, 1002 | "outputs": [], 1003 | "source": [ 1004 | "topUsers=pearsonDF.sort_values(by='similarityIndex', ascending=False)[0:50]\n", 1005 | "topUsers.head()" 1006 | ] 1007 | }, 1008 | { 1009 | "cell_type": "markdown", 1010 | "metadata": { 1011 | "button": false, 1012 | "deletable": true, 1013 | "new_sheet": false, 1014 | "run_control": { 1015 | "read_only": false 1016 | } 1017 | }, 1018 | "source": [ 1019 | "Now, let's start recommending movies to the input user.\n", 1020 | "\n", 1021 | "#### Rating of selected users to all movies\n", 1022 | "We're going to do this by taking the weighted average of the ratings of the movies using the Pearson Correlation as the weight. But to do this, we first need to get the movies watched by the users in our __pearsonDF__ from the ratings dataframe and then store their correlation in a new column called __similarityIndex__. This is achieved below by merging these two tables." 
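To make the weighting concrete before running the cells below, here is a toy version of the same computation (an illustration only, not the notebook's data): two neighbours with Pearson similarities 0.9 and 0.4 both rated one candidate movie.

```python
# Toy illustration of the similarity-weighted average used below.
similarities = [0.9, 0.4]  # Pearson similarity of each neighbour to the input user
ratings = [5.0, 2.0]       # each neighbour's rating of the candidate movie

score = sum(s * r for s, r in zip(similarities, ratings)) / sum(similarities)
print(round(score, 2))  # (0.9*5.0 + 0.4*2.0) / (0.9 + 0.4) = 5.3 / 1.3 ≈ 4.08
```

The more similar neighbour dominates the score, which is exactly the behaviour the merge below sets up.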
1023 | ] 1024 | }, 1025 | { 1026 | "cell_type": "code", 1027 | "execution_count": null, 1028 | "metadata": { 1029 | "button": false, 1030 | "collapsed": false, 1031 | "deletable": true, 1032 | "jupyter": { 1033 | "outputs_hidden": false 1034 | }, 1035 | "new_sheet": false, 1036 | "run_control": { 1037 | "read_only": false 1038 | }, 1039 | "scrolled": true 1040 | }, 1041 | "outputs": [], 1042 | "source": [ 1043 | "topUsersRating=topUsers.merge(ratings_df, left_on='userId', right_on='userId', how='inner')\n", 1044 | "topUsersRating.head()" 1045 | ] 1046 | }, 1047 | { 1048 | "cell_type": "markdown", 1049 | "metadata": { 1050 | "button": false, 1051 | "deletable": true, 1052 | "new_sheet": false, 1053 | "run_control": { 1054 | "read_only": false 1055 | } 1056 | }, 1057 | "source": [ 1058 | "Now all we need to do is simply multiply the movie rating by its weight (the similarity index), then sum up the new ratings and divide that by the sum of the weights.\n", 1059 | "\n", 1060 | "We can easily do this by simply multiplying two columns, then grouping up the dataframe by movieId and then dividing two columns:\n", 1061 | "\n", 1062 | "This aggregates how all of the similar users rate the candidate movies for the input user:" 1063 | ] 1064 | }, 1065 | { 1066 | "cell_type": "code", 1067 | "execution_count": null, 1068 | "metadata": { 1069 | "button": false, 1070 | "collapsed": false, 1071 | "deletable": true, 1072 | "jupyter": { 1073 | "outputs_hidden": false 1074 | }, 1075 | "new_sheet": false, 1076 | "run_control": { 1077 | "read_only": false 1078 | } 1079 | }, 1080 | "outputs": [], 1081 | "source": [ 1082 | "#Multiplies the similarity by the user's ratings\n", 1083 | "topUsersRating['weightedRating'] = topUsersRating['similarityIndex']*topUsersRating['rating']\n", 1084 | "topUsersRating.head()" 1085 | ] 1086 | }, 1087 | { 1088 | "cell_type": "code", 1089 | "execution_count": null, 1090 | "metadata": { 1091 | "button": false, 1092 | "collapsed": false, 1093 | "deletable": true, 1094 | "jupyter": { 1095 | "outputs_hidden": false 1096 | }, 1097 | "new_sheet": false, 1098 | "run_control": { 1099 | "read_only": false 1100 | } 1101 | }, 1102 | "outputs": [], 1103 | "source": [ 1104 | "#Applies a sum to topUsersRating after grouping it up by movieId\n", 1105 | "tempTopUsersRating = topUsersRating.groupby('movieId').sum()[['similarityIndex','weightedRating']]\n", 1106 | "tempTopUsersRating.columns = ['sum_similarityIndex','sum_weightedRating']\n", 1107 | "tempTopUsersRating.head()" 1108 | ] 1109 | }, 1110 | { 1111 | "cell_type": "code", 1112 | "execution_count": null, 1113 | "metadata": { 1114 | "button": false, 1115 | "collapsed": false, 1116 | "deletable": true, 1117 | "jupyter": { 1118 | "outputs_hidden": false 1119 | }, 1120 | "new_sheet": false, 1121 | "run_control": { 1122 | "read_only": false 1123 | } 1124 | }, 1125 | "outputs": [], 1126 | "source": [ 1127 | "#Creates an empty dataframe\n", 1128 | "recommendation_df = pd.DataFrame()\n", 1129 | "#Now we take the weighted average\n", 1130 | "recommendation_df['weighted average recommendation score'] = tempTopUsersRating['sum_weightedRating']/tempTopUsersRating['sum_similarityIndex']\n", 1131 | "recommendation_df['movieId'] = tempTopUsersRating.index\n", 1132 | "recommendation_df.head()" 1133 | ] 1134 | }, 1135 | { 1136 | "cell_type": "markdown", 1137 | "metadata": { 1138 | "button": false, 1139 | "deletable": true, 1140 | "new_sheet": false, 1141 | "run_control": { 1142 | "read_only": false 1143 | } 1144 | }, 1145 | "source": [ 1146 | "Now let's sort it and see the 
top 10 movies that the algorithm recommended!" 1147 | ] 1148 | }, 1149 | { 1150 | "cell_type": "code", 1151 | "execution_count": null, 1152 | "metadata": { 1153 | "button": false, 1154 | "collapsed": false, 1155 | "deletable": true, 1156 | "jupyter": { 1157 | "outputs_hidden": false 1158 | }, 1159 | "new_sheet": false, 1160 | "run_control": { 1161 | "read_only": false 1162 | } 1163 | }, 1164 | "outputs": [], 1165 | "source": [ 1166 | "recommendation_df = recommendation_df.sort_values(by='weighted average recommendation score', ascending=False)\n", 1167 | "recommendation_df.head(10)" 1168 | ] 1169 | }, 1170 | { 1171 | "cell_type": "code", 1172 | "execution_count": null, 1173 | "metadata": { 1174 | "button": false, 1175 | "collapsed": false, 1176 | "deletable": true, 1177 | "jupyter": { 1178 | "outputs_hidden": false 1179 | }, 1180 | "new_sheet": false, 1181 | "run_control": { 1182 | "read_only": false 1183 | }, 1184 | "scrolled": true 1185 | }, 1186 | "outputs": [], 1187 | "source": [ 1188 | "movies_df.loc[movies_df['movieId'].isin(recommendation_df.head(10)['movieId'].tolist())]" 1189 | ] 1190 | }, 1191 | { 1192 | "cell_type": "markdown", 1193 | "metadata": { 1194 | "button": false, 1195 | "deletable": true, 1196 | "new_sheet": false, 1197 | "run_control": { 1198 | "read_only": false 1199 | } 1200 | }, 1201 | "source": [ 1202 | "### Advantages and Disadvantages of Collaborative Filtering\n", 1203 | "\n", 1204 | "##### Advantages\n", 1205 | "* Takes other users' ratings into consideration\n", 1206 | "* Doesn't need to study or extract information from the recommended item\n", 1207 | "* Adapts to the user's interests, which might change over time\n", 1208 | "\n", 1209 | "##### Disadvantages\n", 1210 | "* Approximation function can be slow\n", 1211 | "* There might be a low amount of users to approximate\n", 1212 | "* Privacy issues when trying to learn the user's preferences" 1213 | ] 1214 | }, 1215 | { 1216 | "cell_type": "markdown", 1217 | "metadata": { 1218 | "button": false, 1219 | "deletable": true, 1220 | "new_sheet": false, 1221 | "run_control": { 1222 | "read_only": false 1223 | } 1224 | }, 1225 | "source": [ 1226 | "

Want to learn more?

\n", 1227 | "\n", 1228 | "IBM SPSS Modeler is a comprehensive analytics platform that has many machine learning algorithms. It has been designed to bring predictive intelligence to decisions made by individuals, by groups, by systems – by your enterprise as a whole. A free trial is available through this course, available here: SPSS Modeler\n", 1229 | "\n", 1230 | "Also, you can use Watson Studio to run these notebooks faster with bigger datasets. Watson Studio is IBM's leading cloud solution for data scientists, built by data scientists. With Jupyter notebooks, RStudio, Apache Spark and popular libraries pre-packaged in the cloud, Watson Studio enables data scientists to collaborate on their projects without having to install anything. Join the fast-growing community of Watson Studio users today with a free account at Watson Studio\n", 1231 | "\n", 1232 | "

Thanks for completing this lesson!

\n", 1233 | "\n", 1234 | "

Author: Saeed Aghabozorgi

\n", 1235 | "

Saeed Aghabozorgi, PhD, is a Data Scientist at IBM with a track record of developing enterprise-level applications that substantially increase clients’ ability to turn data into actionable knowledge. He is a researcher in the data mining field and an expert in developing advanced analytic methods like machine learning and statistical modelling on large datasets.

\n", 1236 | "\n", 1237 | "
\n", 1238 | "\n", 1239 | "

Copyright © 2018 Cognitive Class. This notebook and its source code are released under the terms of the MIT License.

" 1240 | ] 1241 | } 1242 | ], 1243 | "metadata": { 1244 | "kernelspec": { 1245 | "display_name": "Python 3", 1246 | "language": "python", 1247 | "name": "python3" 1248 | }, 1249 | "language_info": { 1250 | "codemirror_mode": { 1251 | "name": "ipython", 1252 | "version": 3 1253 | }, 1254 | "file_extension": ".py", 1255 | "mimetype": "text/x-python", 1256 | "name": "python", 1257 | "nbconvert_exporter": "python", 1258 | "pygments_lexer": "ipython3", 1259 | "version": "3.6.7" 1260 | }, 1261 | "widgets": { 1262 | "state": {}, 1263 | "version": "1.1.2" 1264 | } 1265 | }, 1266 | "nbformat": 4, 1267 | "nbformat_minor": 4 1268 | } 1269 | -------------------------------------------------------------------------------- /Recommender System/ML0101EN-RecSys-Content-Based-movies-py-v1.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "\n", 8 | "\n", 9 | "

CONTENT-BASED FILTERING

" 10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "metadata": {}, 15 | "source": [ 16 | "Recommendation systems are a collection of algorithms used to recommend items to users based on information taken from the user. These systems have become ubiquitous, and can be commonly seen in online stores, movies databases and job finders. In this notebook, we will explore Content-based recommendation systems and implement a simple version of one using Python and the Pandas library." 17 | ] 18 | }, 19 | { 20 | "cell_type": "markdown", 21 | "metadata": {}, 22 | "source": [ 23 | "### Table of contents\n", 24 | "\n", 25 | "
\n", 26 | "
    \n", 27 | "
  1. Acquiring the Data\n", 28 | "
  2. Preprocessing\n", 29 | "
  3. Content-Based Filtering\n", 30 | "
\n", 31 | "
\n", 32 | "
" 33 | ] 34 | }, 35 | { 36 | "cell_type": "markdown", 37 | "metadata": {}, 38 | "source": [ 39 | "\n", 40 | "# Acquiring the Data" 41 | ] 42 | }, 43 | { 44 | "cell_type": "markdown", 45 | "metadata": {}, 46 | "source": [ 47 | "To acquire and extract the data, simply run the following Bash scripts: \n", 48 | "Dataset acquired from [GroupLens](http://grouplens.org/datasets/movielens/). Lets download the dataset. To download the data, we will use **`!wget`** to download it from IBM Object Storage. \n", 49 | "__Did you know?__ When it comes to Machine Learning, you will likely be working with large datasets. As a business, where can you host your data? IBM is offering a unique opportunity for businesses, with 10 Tb of IBM Cloud Object Storage: [Sign up now for free](http://cocl.us/ML0101EN-IBM-Offer-CC)" 50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "execution_count": 1, 55 | "metadata": { 56 | "collapsed": false, 57 | "jupyter": { 58 | "outputs_hidden": false 59 | } 60 | }, 61 | "outputs": [ 62 | { 63 | "name": "stdout", 64 | "output_type": "stream", 65 | "text": [ 66 | "--2019-07-11 16:36:32-- https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/moviedataset.zip\n", 67 | "Resolving s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)... 67.228.254.193\n", 68 | "Connecting to s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)|67.228.254.193|:443... connected.\n", 69 | "HTTP request sent, awaiting response... 200 OK\n", 70 | "Length: 160301210 (153M) [application/zip]\n", 71 | "Saving to: ‘moviedataset.zip’\n", 72 | "\n", 73 | "moviedataset.zip 100%[===================>] 152.88M 19.4MB/s in 8.0s \n", 74 | "\n", 75 | "2019-07-11 16:36:41 (19.2 MB/s) - ‘moviedataset.zip’ saved [160301210/160301210]\n", 76 | "\n", 77 | "unziping ...\n", 78 | "Archive: moviedataset.zip\n", 79 | " inflating: links.csv \n", 80 | " inflating: movies.csv \n", 81 | " inflating: ratings.csv \n", 82 | " inflating: README.txt \n", 83 | " inflating: tags.csv \n" 84 | ] 85 | } 86 | ], 87 | "source": [ 88 | "!wget -O moviedataset.zip https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/moviedataset.zip\n", 89 | "print('unziping ...')\n", 90 | "!unzip -o -j moviedataset.zip " 91 | ] 92 | }, 93 | { 94 | "cell_type": "markdown", 95 | "metadata": {}, 96 | "source": [ 97 | "Now you're ready to start working with the data!" 
98 | ] 99 | }, 100 | { 101 | "cell_type": "markdown", 102 | "metadata": {}, 103 | "source": [ 104 | "\n", 105 | "# Preprocessing" 106 | ] 107 | }, 108 | { 109 | "cell_type": "markdown", 110 | "metadata": {}, 111 | "source": [ 112 | "First, let's get all of the imports out of the way:" 113 | ] 114 | }, 115 | { 116 | "cell_type": "code", 117 | "execution_count": 7, 118 | "metadata": { 119 | "collapsed": false, 120 | "jupyter": { 121 | "outputs_hidden": false 122 | } 123 | }, 124 | "outputs": [], 125 | "source": [ 126 | "#Dataframe manipulation library\n", 127 | "import pandas as pd\n", 128 | "#Math functions, we'll only need the sqrt function so let's import only that\n", 129 | "from math import sqrt\n", 130 | "import numpy as np\n", 131 | "import matplotlib.pyplot as plt\n", 132 | "%matplotlib inline" 133 | ] 134 | }, 135 | { 136 | "cell_type": "markdown", 137 | "metadata": {}, 138 | "source": [ 139 | "Now let's read each file into its own dataframe:" 140 | ] 141 | }, 142 | { 143 | "cell_type": "code", 144 | "execution_count": 14, 145 | "metadata": { 146 | "collapsed": false, 147 | "jupyter": { 148 | "outputs_hidden": false 149 | } 150 | }, 151 | "outputs": [ 152 | { 153 | "data": { 154 | "text/html": [ 155 | "
\n", 156 | "\n", 169 | "\n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " \n", 194 | " \n", 195 | " \n", 196 | " \n", 197 | " \n", 198 | " \n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | "
   movieId  title                                genres
0        1  Toy Story (1995)                     Adventure|Animation|Children|Comedy|Fantasy
1        2  Jumanji (1995)                       Adventure|Children|Fantasy
2        3  Grumpier Old Men (1995)              Comedy|Romance
3        4  Waiting to Exhale (1995)             Comedy|Drama|Romance
4        5  Father of the Bride Part II (1995)   Comedy
\n", 211 | "
" 212 | ], 213 | "text/plain": [ 214 | " movieId title \\\n", 215 | "0 1 Toy Story (1995) \n", 216 | "1 2 Jumanji (1995) \n", 217 | "2 3 Grumpier Old Men (1995) \n", 218 | "3 4 Waiting to Exhale (1995) \n", 219 | "4 5 Father of the Bride Part II (1995) \n", 220 | "\n", 221 | " genres \n", 222 | "0 Adventure|Animation|Children|Comedy|Fantasy \n", 223 | "1 Adventure|Children|Fantasy \n", 224 | "2 Comedy|Romance \n", 225 | "3 Comedy|Drama|Romance \n", 226 | "4 Comedy " 227 | ] 228 | }, 229 | "execution_count": 14, 230 | "metadata": {}, 231 | "output_type": "execute_result" 232 | } 233 | ], 234 | "source": [ 235 | "#Storing the movie information into a pandas dataframe\n", 236 | "movies_df = pd.read_csv('movies.csv')\n", 237 | "#Storing the user information into a pandas dataframe\n", 238 | "ratings_df = pd.read_csv('ratings.csv')\n", 239 | "#Head is a function that gets the first N rows of a dataframe. N's default is 5.\n", 240 | "movies_df.head()" 241 | ] 242 | }, 243 | { 244 | "cell_type": "markdown", 245 | "metadata": {}, 246 | "source": [ 247 | "Let's also remove the year from the __title__ column by using pandas' replace function and store in a new __year__ column." 248 | ] 249 | }, 250 | { 251 | "cell_type": "code", 252 | "execution_count": 15, 253 | "metadata": { 254 | "collapsed": false, 255 | "jupyter": { 256 | "outputs_hidden": false 257 | } 258 | }, 259 | "outputs": [ 260 | { 261 | "data": { 262 | "text/html": [ 263 | "
\n", 264 | "\n", 277 | "\n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | "
   movieId  title                        genres                                        year
0        1  Toy Story                    Adventure|Animation|Children|Comedy|Fantasy  1995
1        2  Jumanji                      Adventure|Children|Fantasy                   1995
2        3  Grumpier Old Men             Comedy|Romance                               1995
3        4  Waiting to Exhale            Comedy|Drama|Romance                         1995
4        5  Father of the Bride Part II  Comedy                                       1995
\n", 325 | "
" 326 | ], 327 | "text/plain": [ 328 | " movieId title \\\n", 329 | "0 1 Toy Story \n", 330 | "1 2 Jumanji \n", 331 | "2 3 Grumpier Old Men \n", 332 | "3 4 Waiting to Exhale \n", 333 | "4 5 Father of the Bride Part II \n", 334 | "\n", 335 | " genres year \n", 336 | "0 Adventure|Animation|Children|Comedy|Fantasy 1995 \n", 337 | "1 Adventure|Children|Fantasy 1995 \n", 338 | "2 Comedy|Romance 1995 \n", 339 | "3 Comedy|Drama|Romance 1995 \n", 340 | "4 Comedy 1995 " 341 | ] 342 | }, 343 | "execution_count": 15, 344 | "metadata": {}, 345 | "output_type": "execute_result" 346 | } 347 | ], 348 | "source": [ 349 | "#Using regular expressions to find a year stored between parentheses\n", 350 | "#We specify the parantheses so we don't conflict with movies that have years in their titles\n", 351 | "movies_df['year'] = movies_df.title.str.extract('(\\(\\d\\d\\d\\d\\))',expand=False)\n", 352 | "#Removing the parentheses\n", 353 | "movies_df['year'] = movies_df.year.str.extract('(\\d\\d\\d\\d)',expand=False)\n", 354 | "#Removing the years from the 'title' column\n", 355 | "movies_df['title'] = movies_df.title.str.replace('(\\(\\d\\d\\d\\d\\))', '')\n", 356 | "#Applying the strip function to get rid of any ending whitespace characters that may have appeared\n", 357 | "movies_df['title'] = movies_df['title'].apply(lambda x: x.strip())\n", 358 | "movies_df.head()" 359 | ] 360 | }, 361 | { 362 | "cell_type": "markdown", 363 | "metadata": {}, 364 | "source": [ 365 | "With that, let's also split the values in the __Genres__ column into a __list of Genres__ to simplify future use. This can be achieved by applying Python's split string function on the correct column." 366 | ] 367 | }, 368 | { 369 | "cell_type": "code", 370 | "execution_count": 16, 371 | "metadata": { 372 | "collapsed": false, 373 | "jupyter": { 374 | "outputs_hidden": false 375 | } 376 | }, 377 | "outputs": [ 378 | { 379 | "data": { 380 | "text/html": [ 381 | "
\n", 382 | "\n", 395 | "\n", 396 | " \n", 397 | " \n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | " \n", 429 | " \n", 430 | " \n", 431 | " \n", 432 | " \n", 433 | " \n", 434 | " \n", 435 | " \n", 436 | " \n", 437 | " \n", 438 | " \n", 439 | " \n", 440 | " \n", 441 | " \n", 442 | "
   movieId  title                        genres                                              year
0        1  Toy Story                    [Adventure, Animation, Children, Comedy, Fantasy]  1995
1        2  Jumanji                      [Adventure, Children, Fantasy]                     1995
2        3  Grumpier Old Men             [Comedy, Romance]                                  1995
3        4  Waiting to Exhale            [Comedy, Drama, Romance]                           1995
4        5  Father of the Bride Part II  [Comedy]                                           1995
\n", 443 | "
" 444 | ], 445 | "text/plain": [ 446 | " movieId title \\\n", 447 | "0 1 Toy Story \n", 448 | "1 2 Jumanji \n", 449 | "2 3 Grumpier Old Men \n", 450 | "3 4 Waiting to Exhale \n", 451 | "4 5 Father of the Bride Part II \n", 452 | "\n", 453 | " genres year \n", 454 | "0 [Adventure, Animation, Children, Comedy, Fantasy] 1995 \n", 455 | "1 [Adventure, Children, Fantasy] 1995 \n", 456 | "2 [Comedy, Romance] 1995 \n", 457 | "3 [Comedy, Drama, Romance] 1995 \n", 458 | "4 [Comedy] 1995 " 459 | ] 460 | }, 461 | "execution_count": 16, 462 | "metadata": {}, 463 | "output_type": "execute_result" 464 | } 465 | ], 466 | "source": [ 467 | "#Every genre is separated by a | so we simply have to call the split function on |\n", 468 | "movies_df['genres'] = movies_df.genres.str.split('|')\n", 469 | "movies_df.head()" 470 | ] 471 | }, 472 | { 473 | "cell_type": "markdown", 474 | "metadata": {}, 475 | "source": [ 476 | "Since keeping genres in a list format isn't optimal for the content-based recommendation system technique, we will use the One Hot Encoding technique to convert the list of genres to a vector where each column corresponds to one possible value of the feature. This encoding is needed for feeding categorical data. In this case, we store every different genre in columns that contain either 1 or 0. 1 shows that a movie has that genre and 0 shows that it doesn't. Let's also store this dataframe in another variable since genres won't be important for our first recommendation system." 477 | ] 478 | }, 479 | { 480 | "cell_type": "code", 481 | "execution_count": 18, 482 | "metadata": { 483 | "collapsed": false, 484 | "jupyter": { 485 | "outputs_hidden": false 486 | } 487 | }, 488 | "outputs": [ 489 | { 490 | "data": { 491 | "text/html": [ 492 | "
\n", 493 | "\n", 506 | "\n", 507 | " \n", 508 | " \n", 509 | " \n", 510 | " \n", 511 | " \n", 512 | " \n", 513 | " \n", 514 | " \n", 515 | " \n", 516 | " \n", 517 | " \n", 518 | " \n", 519 | " \n", 520 | " \n", 521 | " \n", 522 | " \n", 523 | " \n", 524 | " \n", 525 | " \n", 526 | " \n", 527 | " \n", 528 | " \n", 529 | " \n", 530 | " \n", 531 | " \n", 532 | " \n", 533 | " \n", 534 | " \n", 535 | " \n", 536 | " \n", 537 | " \n", 538 | " \n", 539 | " \n", 540 | " \n", 541 | " \n", 542 | " \n", 543 | " \n", 544 | " \n", 545 | " \n", 546 | " \n", 547 | " \n", 548 | " \n", 549 | " \n", 550 | " \n", 551 | " \n", 552 | " \n", 553 | " \n", 554 | " \n", 555 | " \n", 556 | " \n", 557 | " \n", 558 | " \n", 559 | " \n", 560 | " \n", 561 | " \n", 562 | " \n", 563 | " \n", 564 | " \n", 565 | " \n", 566 | " \n", 567 | " \n", 568 | " \n", 569 | " \n", 570 | " \n", 571 | " \n", 572 | " \n", 573 | " \n", 574 | " \n", 575 | " \n", 576 | " \n", 577 | " \n", 578 | " \n", 579 | " \n", 580 | " \n", 581 | " \n", 582 | " \n", 583 | " \n", 584 | " \n", 585 | " \n", 586 | " \n", 587 | " \n", 588 | " \n", 589 | " \n", 590 | " \n", 591 | " \n", 592 | " \n", 593 | " \n", 594 | " \n", 595 | " \n", 596 | " \n", 597 | " \n", 598 | " \n", 599 | " \n", 600 | " \n", 601 | " \n", 602 | " \n", 603 | " \n", 604 | " \n", 605 | " \n", 606 | " \n", 607 | " \n", 608 | " \n", 609 | " \n", 610 | " \n", 611 | " \n", 612 | " \n", 613 | " \n", 614 | " \n", 615 | " \n", 616 | " \n", 617 | " \n", 618 | " \n", 619 | " \n", 620 | " \n", 621 | " \n", 622 | " \n", 623 | " \n", 624 | " \n", 625 | " \n", 626 | " \n", 627 | " \n", 628 | " \n", 629 | " \n", 630 | " \n", 631 | " \n", 632 | " \n", 633 | " \n", 634 | " \n", 635 | " \n", 636 | " \n", 637 | " \n", 638 | " \n", 639 | " \n", 640 | " \n", 641 | " \n", 642 | " \n", 643 | " \n", 644 | " \n", 645 | " \n", 646 | " \n", 647 | " \n", 648 | " \n", 649 | " \n", 650 | " \n", 651 | " \n", 652 | " \n", 653 | " \n", 654 | " \n", 655 | "
movieIdtitlegenresyearAdventureAnimationChildrenComedyFantasyRomance...HorrorMysterySci-FiIMAXDocumentaryWarMusicalWesternFilm-Noir(no genres listed)
01Toy Story[Adventure, Animation, Children, Comedy, Fantasy]19951.01.01.01.01.00.0...0.00.00.00.00.00.00.00.00.00.0
12Jumanji[Adventure, Children, Fantasy]19951.00.01.00.01.00.0...0.00.00.00.00.00.00.00.00.00.0
23Grumpier Old Men[Comedy, Romance]19950.00.00.01.00.01.0...0.00.00.00.00.00.00.00.00.00.0
34Waiting to Exhale[Comedy, Drama, Romance]19950.00.00.01.00.01.0...0.00.00.00.00.00.00.00.00.00.0
45Father of the Bride Part II[Comedy]19950.00.00.01.00.00.0...0.00.00.00.00.00.00.00.00.00.0
\n", 656 | "

5 rows × 24 columns

\n", 657 | "
" 658 | ], 659 | "text/plain": [ 660 | " movieId title \\\n", 661 | "0 1 Toy Story \n", 662 | "1 2 Jumanji \n", 663 | "2 3 Grumpier Old Men \n", 664 | "3 4 Waiting to Exhale \n", 665 | "4 5 Father of the Bride Part II \n", 666 | "\n", 667 | " genres year Adventure \\\n", 668 | "0 [Adventure, Animation, Children, Comedy, Fantasy] 1995 1.0 \n", 669 | "1 [Adventure, Children, Fantasy] 1995 1.0 \n", 670 | "2 [Comedy, Romance] 1995 0.0 \n", 671 | "3 [Comedy, Drama, Romance] 1995 0.0 \n", 672 | "4 [Comedy] 1995 0.0 \n", 673 | "\n", 674 | " Animation Children Comedy Fantasy Romance ... Horror Mystery \\\n", 675 | "0 1.0 1.0 1.0 1.0 0.0 ... 0.0 0.0 \n", 676 | "1 0.0 1.0 0.0 1.0 0.0 ... 0.0 0.0 \n", 677 | "2 0.0 0.0 1.0 0.0 1.0 ... 0.0 0.0 \n", 678 | "3 0.0 0.0 1.0 0.0 1.0 ... 0.0 0.0 \n", 679 | "4 0.0 0.0 1.0 0.0 0.0 ... 0.0 0.0 \n", 680 | "\n", 681 | " Sci-Fi IMAX Documentary War Musical Western Film-Noir \\\n", 682 | "0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", 683 | "1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", 684 | "2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", 685 | "3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", 686 | "4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", 687 | "\n", 688 | " (no genres listed) \n", 689 | "0 0.0 \n", 690 | "1 0.0 \n", 691 | "2 0.0 \n", 692 | "3 0.0 \n", 693 | "4 0.0 \n", 694 | "\n", 695 | "[5 rows x 24 columns]" 696 | ] 697 | }, 698 | "execution_count": 18, 699 | "metadata": {}, 700 | "output_type": "execute_result" 701 | } 702 | ], 703 | "source": [ 704 | "#Copying the movie dataframe into a new one since we won't need to use the genre information in our first case.\n", 705 | "moviesWithGenres_df = movies_df.copy()\n", 706 | "\n", 707 | "#For every row in the dataframe, iterate through the list of genres and place a 1 into the corresponding column\n", 708 | "for index, row in movies_df.iterrows():\n", 709 | " for genre in row['genres']:\n", 710 | " moviesWithGenres_df.at[index, genre] = 1\n", 711 | "#Filling in the NaN values with 0 to show that a movie doesn't have that column's genre\n", 712 | "moviesWithGenres_df = moviesWithGenres_df.fillna(0)\n", 713 | "moviesWithGenres_df.head()" 714 | ] 715 | }, 716 | { 717 | "cell_type": "markdown", 718 | "metadata": {}, 719 | "source": [ 720 | "Next, let's look at the ratings dataframe." 721 | ] 722 | }, 723 | { 724 | "cell_type": "code", 725 | "execution_count": null, 726 | "metadata": { 727 | "collapsed": false, 728 | "jupyter": { 729 | "outputs_hidden": false 730 | } 731 | }, 732 | "outputs": [], 733 | "source": [ 734 | "ratings_df.head()" 735 | ] 736 | }, 737 | { 738 | "cell_type": "markdown", 739 | "metadata": {}, 740 | "source": [ 741 | "Every row in the ratings dataframe has a user id associated with at least one movie, a rating and a timestamp showing when they reviewed it. We won't be needing the timestamp column, so let's drop it to save on memory." 
742 | ] 743 | }, 744 | { 745 | "cell_type": "code", 746 | "execution_count": null, 747 | "metadata": { 748 | "collapsed": false, 749 | "jupyter": { 750 | "outputs_hidden": false 751 | } 752 | }, 753 | "outputs": [], 754 | "source": [ 755 | "#Drop removes a specified row or column from a dataframe\n", 756 | "ratings_df = ratings_df.drop('timestamp', axis=1)\n", 757 | "ratings_df.head()" 758 | ] 759 | }, 760 | { 761 | "cell_type": "markdown", 762 | "metadata": {}, 763 | "source": [ 764 | "\n", 765 | "# Content-Based recommendation system" 766 | ] 767 | }, 768 | { 769 | "cell_type": "markdown", 770 | "metadata": {}, 771 | "source": [ 772 | "Now, let's take a look at how to implement __Content-Based__ or __Item-Item recommendation systems__. This technique attempts to figure out what a user's favourite aspects of an item are, and then recommends items that share those aspects. In our case, we're going to try to figure out the input's favorite genres from the movies and ratings given.\n", 773 | "\n", 774 | "Let's begin by creating an input user to recommend movies to:\n", 775 | "\n", 776 | "Notice: To add more movies, simply increase the number of elements in __userInput__. Feel free to add more! Just be sure to match the dataset's capitalization, and if a title starts with \"The\", like \"The Matrix\", write it as: 'Matrix, The' ." 777 | ] 778 | }, 779 | { 780 | "cell_type": "code", 781 | "execution_count": null, 782 | "metadata": { 783 | "collapsed": false, 784 | "jupyter": { 785 | "outputs_hidden": false 786 | } 787 | }, 788 | "outputs": [], 789 | "source": [ 790 | "userInput = [\n", 791 | " {'title':'Breakfast Club, The', 'rating':5},\n", 792 | " {'title':'Toy Story', 'rating':3.5},\n", 793 | " {'title':'Jumanji', 'rating':2},\n", 794 | " {'title':\"Pulp Fiction\", 'rating':5},\n", 795 | " {'title':'Akira', 'rating':4.5}\n", 796 | " ] \n", 797 | "inputMovies = pd.DataFrame(userInput)\n", 798 | "inputMovies" 799 | ] 800 | }, 801 | { 802 | "cell_type": "markdown", 803 | "metadata": {}, 804 | "source": [ 805 | "#### Add movieId to input user\n", 806 | "With the input complete, let's extract the input movies' IDs from the movies dataframe and add them to it.\n", 807 | "\n", 808 | "We can achieve this by first filtering out the rows that contain the input movies' titles and then merging this subset with the input dataframe. We also drop unnecessary columns for the input to save memory space." 809 | ] 810 | }, 811 | { 812 | "cell_type": "code", 813 | "execution_count": null, 814 | "metadata": { 815 | "collapsed": false, 816 | "jupyter": { 817 | "outputs_hidden": false 818 | }, 819 | "scrolled": true 820 | }, 821 | "outputs": [], 822 | "source": [ 823 | "#Filtering out the movies by title\n", 824 | "inputId = movies_df[movies_df['title'].isin(inputMovies['title'].tolist())]\n", 825 | "#Then merging it so we can get the movieId. 
It's implicitly merging it by title.\n", 826 | "inputMovies = pd.merge(inputId, inputMovies)\n", 827 | "#Dropping information we won't use from the input dataframe\n", 828 | "inputMovies = inputMovies.drop('genres', axis=1).drop('year', axis=1)\n", 829 | "#Final input dataframe\n", 830 | "#If a movie you added above isn't here, then it might not be in the original \n", 831 | "#dataframe or it might be spelled differently; please check the capitalisation.\n", 832 | "inputMovies" 833 | ] 834 | }, 835 | { 836 | "cell_type": "markdown", 837 | "metadata": {}, 838 | "source": [ 839 | "We're going to start by learning the input's preferences, so let's get the subset of movies that the input has watched from the dataframe containing genres defined with binary values." 840 | ] 841 | }, 842 | { 843 | "cell_type": "code", 844 | "execution_count": null, 845 | "metadata": { 846 | "collapsed": false, 847 | "jupyter": { 848 | "outputs_hidden": false 849 | } 850 | }, 851 | "outputs": [], 852 | "source": [ 853 | "#Filtering out the movies from the input\n", 854 | "userMovies = moviesWithGenres_df[moviesWithGenres_df['movieId'].isin(inputMovies['movieId'].tolist())]\n", 855 | "userMovies" 856 | ] 857 | }, 858 | { 859 | "cell_type": "markdown", 860 | "metadata": {}, 861 | "source": [ 862 | "We'll only need the actual genre table, so let's clean this up a bit by resetting the index and dropping the movieId, title, genres and year columns." 863 | ] 864 | }, 865 | { 866 | "cell_type": "code", 867 | "execution_count": null, 868 | "metadata": { 869 | "collapsed": false, 870 | "jupyter": { 871 | "outputs_hidden": false 872 | } 873 | }, 874 | "outputs": [], 875 | "source": [ 876 | "#Resetting the index to avoid future issues\n", 877 | "userMovies = userMovies.reset_index(drop=True)\n", 878 | "#Dropping unnecessary columns to save memory and to avoid issues\n", 879 | "userGenreTable = userMovies.drop('movieId', axis=1).drop('title', axis=1).drop('genres', axis=1).drop('year', axis=1)\n", 880 | "userGenreTable" 881 | ] 882 | }, 883 | { 884 | "cell_type": "markdown", 885 | "metadata": {}, 886 | "source": [ 887 | "Now we're ready to start learning the input's preferences!\n", 888 | "\n", 889 | "To do this, we're going to turn each genre into a weight. We can do this by multiplying the input's genre table by the input's ratings and then summing up the resulting table by column. This operation is actually a dot product between a matrix and a vector, so we can accomplish it simply by calling Pandas's \"dot\" function." 890 | ] 891 | }, 892 | { 893 | "cell_type": "code", 894 | "execution_count": null, 895 | "metadata": { 896 | "collapsed": false, 897 | "jupyter": { 898 | "outputs_hidden": false 899 | } 900 | }, 901 | "outputs": [], 902 | "source": [ 903 | "inputMovies['rating']" 904 | ] 905 | }, 906 | { 907 | "cell_type": "code", 908 | "execution_count": null, 909 | "metadata": { 910 | "collapsed": false, 911 | "jupyter": { 912 | "outputs_hidden": false 913 | } 914 | }, 915 | "outputs": [], 916 | "source": [ 917 | "#Dot product to get weights\n", 918 | "userProfile = userGenreTable.transpose().dot(inputMovies['rating'])\n", 919 | "#The user profile\n", 920 | "userProfile" 921 | ] 922 | }, 923 | { 924 | "cell_type": "markdown", 925 | "metadata": {}, 926 | "source": [ 927 | "Now, we have the weights for each of the user's preferences. This is known as the User Profile. Using this, we can recommend movies that satisfy the user's preferences." 
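To make the dot product concrete, here is a tiny worked example with made-up numbers: two rated movies, three genres. The profile is the genre matrix transposed times the ratings vector, the same shape of computation as the userGenreTable cell above:

```python
import numpy as np

G = np.array([[1, 0, 1],   # movie 1: has genres A and C
              [0, 1, 1]])  # movie 2: has genres B and C
r = np.array([5, 2])       # the user's ratings for those two movies

profile = G.T.dot(r)       # per-genre weights: [1*5+0*2, 0*5+1*2, 1*5+1*2]
print(profile)             # -> [5 2 7]: genre C ends up weighted highest
```

Genre C gets the largest weight because it appears in both rated movies, which is exactly how the user profile encodes preferences.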
928 | ] 929 | }, 930 | { 931 | "cell_type": "markdown", 932 | "metadata": {}, 933 | "source": [ 934 | "Let's start by extracting the genre table from the original dataframe:" 935 | ] 936 | }, 937 | { 938 | "cell_type": "code", 939 | "execution_count": null, 940 | "metadata": { 941 | "collapsed": false, 942 | "jupyter": { 943 | "outputs_hidden": false 944 | } 945 | }, 946 | "outputs": [], 947 | "source": [ 948 | "#Now let's get the genres of every movie in our original dataframe\n", 949 | "genreTable = moviesWithGenres_df.set_index(moviesWithGenres_df['movieId'])\n", 950 | "#And drop the unnecessary information\n", 951 | "genreTable = genreTable.drop('movieId', axis=1).drop('title', axis=1).drop('genres', axis=1).drop('year', axis=1)\n", 952 | "genreTable.head()" 953 | ] 954 | }, 955 | { 956 | "cell_type": "code", 957 | "execution_count": null, 958 | "metadata": { 959 | "collapsed": false, 960 | "jupyter": { 961 | "outputs_hidden": false 962 | } 963 | }, 964 | "outputs": [], 965 | "source": [ 966 | "genreTable.shape" 967 | ] 968 | }, 969 | { 970 | "cell_type": "markdown", 971 | "metadata": {}, 972 | "source": [ 973 | "With the input's profile and the complete list of movies and their genres in hand, we're going to take the weighted average of every movie based on the input profile and recommend the top twenty movies that most satisfy it." 974 | ] 975 | }, 976 | { 977 | "cell_type": "code", 978 | "execution_count": null, 979 | "metadata": { 980 | "collapsed": false, 981 | "jupyter": { 982 | "outputs_hidden": false 983 | } 984 | }, 985 | "outputs": [], 986 | "source": [ 987 | "#Multiply the genres by the weights and then take the weighted average\n", 988 | "recommendationTable_df = ((genreTable*userProfile).sum(axis=1))/(userProfile.sum())\n", 989 | "recommendationTable_df.head()" 990 | ] 991 | }, 992 | { 993 | "cell_type": "code", 994 | "execution_count": null, 995 | "metadata": { 996 | "collapsed": false, 997 | "jupyter": { 998 | "outputs_hidden": false 999 | } 1000 | }, 1001 | "outputs": [], 1002 | "source": [ 1003 | "#Sort our recommendations in descending order\n", 1004 | "recommendationTable_df = recommendationTable_df.sort_values(ascending=False)\n", 1005 | "#Just a peek at the values\n", 1006 | "recommendationTable_df.head()" 1007 | ] 1008 | }, 1009 | { 1010 | "cell_type": "markdown", 1011 | "metadata": {}, 1012 | "source": [ 1013 | "Now here's the recommendation table!" 1014 | ] 1015 | }, 1016 | { 1017 | "cell_type": "code", 1018 | "execution_count": null, 1019 | "metadata": { 1020 | "collapsed": false, 1021 | "jupyter": { 1022 | "outputs_hidden": false 1023 | }, 1024 | "scrolled": true 1025 | }, 1026 | "outputs": [], 1027 | "source": [ 1028 | "#The final recommendation table\n", 1029 | "movies_df.loc[movies_df['movieId'].isin(recommendationTable_df.head(20).keys())]" 1030 | ] 1031 | }, 1032 | { 1033 | "cell_type": "markdown", 1034 | "metadata": {}, 1035 | "source": [ 1036 | "### Advantages and Disadvantages of Content-Based Filtering\n", 1037 | "\n", 1038 | "##### Advantages\n", 1039 | "* Learns the user's preferences\n", 1040 | "* Highly personalized for the user\n", 1041 | "\n", 1042 | "##### Disadvantages\n", 1043 | "* Doesn't take into account what others think of the item, so low-quality items may be recommended\n", 1044 | "* Extracting data is not always intuitive\n", 1045 | "* Determining which characteristics of an item the user likes or dislikes is not always obvious" 1046 | ] 1047 | }, 1048 | { 1049 | "cell_type": "markdown", 1050 | "metadata": {}, 1051 | "source": [ 1052 | "<hr>
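The scoring cell above boils the whole recommender down to one formula: each movie's score is the sum of its genre flags weighted by the user profile, normalized by the total profile weight, so scores land between 0 and 1. A self-contained sketch using the hypothetical profile from the worked example earlier (the candidate flags are made up):

```python
import numpy as np

profile = np.array([5, 2, 7])      # hypothetical per-genre weights from the example above
candidates = np.array([[1, 1, 0],  # genre flags for three unseen movies
                       [0, 0, 1],
                       [1, 1, 1]])

# Weighted average per movie, mirroring (genreTable*userProfile).sum(axis=1)/userProfile.sum()
scores = candidates.dot(profile) / profile.sum()
print(scores)  # -> [0.5 0.5 1. ]; the third movie matches the profile best
```

A movie that carries every genre the user weighted highly scores 1.0, while a movie sharing none of them scores 0, which is why sorting this series descending and taking head(20) yields the recommendation table.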

Want to learn more?

\n", 1053 | "\n", 1054 | "IBM SPSS Modeler is a comprehensive analytics platform that has many machine learning algorithms. It has been designed to bring predictive intelligence to decisions made by individuals, by groups, by systems – by your enterprise as a whole. A free trial is available through this course, available here: SPSS Modeler\n", 1055 | "\n", 1056 | "Also, you can use Watson Studio to run these notebooks faster with bigger datasets. Watson Studio is IBM's leading cloud solution for data scientists, built by data scientists. With Jupyter notebooks, RStudio, Apache Spark and popular libraries pre-packaged in the cloud, Watson Studio enables data scientists to collaborate on their projects without having to install anything. Join the fast-growing community of Watson Studio users today with a free account at Watson Studio\n", 1057 | "\n", 1058 | "

Thanks for completing this lesson!

\n", 1059 | "\n", 1060 | "

Author: Saeed Aghabozorgi

\n", 1061 | "

Saeed Aghabozorgi, PhD, is a Data Scientist at IBM with a track record of developing enterprise-level applications that substantially increase clients’ ability to turn data into actionable knowledge. He is a researcher in the data mining field and an expert in developing advanced analytic methods like machine learning and statistical modelling on large datasets.</h4>

\n", 1062 | "\n", 1063 | "
\n", 1064 | "\n", 1065 | "

Copyright © 2018 Cognitive Class. This notebook and its source code are released under the terms of the MIT License.

" 1066 | ] 1067 | } 1068 | ], 1069 | "metadata": { 1070 | "kernelspec": { 1071 | "display_name": "Python 3", 1072 | "language": "python", 1073 | "name": "python3" 1074 | }, 1075 | "language_info": { 1076 | "codemirror_mode": { 1077 | "name": "ipython", 1078 | "version": 3 1079 | }, 1080 | "file_extension": ".py", 1081 | "mimetype": "text/x-python", 1082 | "name": "python", 1083 | "nbconvert_exporter": "python", 1084 | "pygments_lexer": "ipython3", 1085 | "version": "3.6.7" 1086 | }, 1087 | "widgets": { 1088 | "state": {}, 1089 | "version": "1.1.2" 1090 | } 1091 | }, 1092 | "nbformat": 4, 1093 | "nbformat_minor": 4 1094 | } 1095 | --------------------------------------------------------------------------------