├── Multiple Linear Regression in Python ├── init └── Housing+Case+Study+using+RFE (2).ipynb ├── Industry Relevance of Linear Regression └── init ├── Simple Linear Regression in Python └── init └── README.md /Multiple Linear Regression in Python/init: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /Industry Relevance of Linear Regression/init: -------------------------------------------------------------------------------- 1 | Initialise 2 | -------------------------------------------------------------------------------- /Simple Linear Regression in Python/init: -------------------------------------------------------------------------------- 1 | initialising folder 2 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Welcome to Linear Regression Module 2 | 3 | ## TOC: 4 | - How to download files? 5 | - What is where? 6 | 7 | ### How to download files? 8 | ![Screenshot (390)](https://user-images.githubusercontent.com/82654736/141263640-9c257566-a78d-421d-972f-c6990e7b1269.png) 9 | 10 | Click on Code button and then click on Download ZIP 11 | OR 12 | Use `git clone https://github.com/ContentUpgrad/Linear-Regression.git` command on your terminal if git is installed in your machine. 13 | 14 | 15 | ### What is where? 16 | The folder structure is given below: 17 | 18 | ![Screenshot (392)](https://user-images.githubusercontent.com/82654736/141263874-50c65df3-0f97-417d-a6d9-57c0157c138a.png) 19 | 20 | 21 | As you can see there are three main folders when you log in: 22 | 23 | 1. **Industry Relevance of Linear Regression** This is where all the code files regarding Industry Relevance of Linear Regression sessions are kept 24 | 2. 
**Multiple Linear Regression in Python** This is where all the code files regarding Multiple Linear Regression in Python session are kept 25 | 3. **Simple Linear Regression in Python** This is where all the code files regarding Simple Linear Regression in Python session are kept 26 | 27 | When you click on any folder you will find the code and data folders as shown below: 28 | ![Screenshot (394)](https://user-images.githubusercontent.com/82654736/141264101-99f161db-9d64-492d-acde-13770881c108.png) 29 | 30 | You will find all the code files of the session in the code folder; the data folder will be empty. Please note that you need to follow the instructions given in the segment to download the data files and place them in the data folder manually. 31 | 32 | #### Industry Relevance of Linear Regression 33 | You will find the following files in the code folder of Industry Relevance of Linear Regression 34 | ![Screenshot (394)](https://user-images.githubusercontent.com/82654736/141264294-44581d9d-fdfd-41b9-a95b-ebff540c2b1f.png) 35 | 36 | 37 | #### Multiple Linear Regression in Python 38 | You will find the following files in the code folder of Multiple Linear Regression in Python 39 | ![Screenshot (397)](https://user-images.githubusercontent.com/82654736/141264551-e250616e-c4e2-419e-8357-9abc2ec42372.png) 40 | 41 | 42 | #### Simple Linear Regression in Python 43 | You will find the following files in the code folder of Simple Linear Regression in Python 44 | ![Screenshot (398)](https://user-images.githubusercontent.com/82654736/141264584-dd3831db-b4d4-4207-ae8f-103a64f9b27c.png) 45 | 46 | -------------------------------------------------------------------------------- /Multiple Linear Regression in Python/Housing+Case+Study+using+RFE (2).ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Model Selection using RFE (Housing Case Study)" 8 | ] 9 
| }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "### Importing and Understanding Data" 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": 1, 20 | "metadata": {}, 21 | "outputs": [], 22 | "source": [ 23 | "# Suppress warnings\n", 24 | "\n", 25 | "import warnings\n", 26 | "warnings.filterwarnings('ignore')" 27 | ] 28 | }, 29 | { 30 | "cell_type": "code", 31 | "execution_count": null, 32 | "metadata": {}, 33 | "outputs": [], 34 | "source": [ 35 | "\n", 36 | "# # Recommended versions\n", 37 | "# numpy \t1.26.4\n", 38 | "# pandas\t2.2.2\n", 39 | "# matplotlib\t3.7.1\n", 40 | "# seaborn\t0.10.0\n", 41 | "# statsmodels\t0.14.4\n", 42 | "# sklearn\t1.5.2" 43 | ] 44 | }, 45 | { 46 | "cell_type": "code", 47 | "execution_count": 2, 48 | "metadata": {}, 49 | "outputs": [], 50 | "source": [ 51 | "import pandas as pd\n", 52 | "import numpy as np" 53 | ] 54 | }, 55 | { 56 | "cell_type": "code", 57 | "execution_count": 3, 58 | "metadata": {}, 59 | "outputs": [], 60 | "source": [ 61 | "# Importing Housing.csv\n", 62 | "housing = pd.read_csv('Housing.csv')" 63 | ] 64 | }, 65 | { 66 | "cell_type": "code", 67 | "execution_count": 4, 68 | "metadata": {}, 69 | "outputs": [ 70 | { 71 | "data": { 72 | "text/html": [ 73 | "
" 190 | ], 191 | "text/plain": [ 192 | " price area bedrooms bathrooms stories mainroad guestroom basement \\\n", 193 | "0 13300000 7420 4 2 3 yes no no \n", 194 | "1 12250000 8960 4 4 4 yes no no \n", 195 | "2 12250000 9960 3 2 2 yes no yes \n", 196 | "3 12215000 7500 4 2 2 yes no yes \n", 197 | "4 11410000 7420 4 1 2 yes yes yes \n", 198 | "\n", 199 | " hotwaterheating airconditioning parking prefarea furnishingstatus \n", 200 | "0 no yes 2 yes furnished \n", 201 | "1 no yes 3 no furnished \n", 202 | "2 no no 2 yes semi-furnished \n", 203 | "3 no yes 3 yes furnished \n", 204 | "4 no yes 2 no furnished " 205 | ] 206 | }, 207 | "execution_count": 4, 208 | "metadata": {}, 209 | "output_type": "execute_result" 210 | } 211 | ], 212 | "source": [ 213 | "# Looking at the first five rows\n", 214 | "housing.head()" 215 | ] 216 | }, 217 | { 218 | "cell_type": "markdown", 219 | "metadata": {}, 220 | "source": [ 221 | "### Data Preparation" 222 | ] 223 | }, 224 | { 225 | "cell_type": "code", 226 | "execution_count": 5, 227 | "metadata": {}, 228 | "outputs": [], 229 | "source": [ 230 | "# List of variables to map\n", 231 | "\n", 232 | "varlist = ['mainroad', 'guestroom', 'basement', 'hotwaterheating', 'airconditioning', 'prefarea']\n", 233 | "\n", 234 | "# Defining the map function\n", 235 | "def binary_map(x):\n", 236 | " return x.map({'yes': 1, \"no\": 0})\n", 237 | "\n", 238 | "# Applying the function to the housing list\n", 239 | "housing[varlist] = housing[varlist].apply(binary_map)" 240 | ] 241 | }, 242 | { 243 | "cell_type": "code", 244 | "execution_count": 6, 245 | "metadata": { 246 | "scrolled": false 247 | }, 248 | "outputs": [ 249 | { 250 | "data": { 251 | "text/html": [ 252 | "
" 369 | ], 370 | "text/plain": [ 371 | " price area bedrooms bathrooms stories mainroad guestroom \\\n", 372 | "0 13300000 7420 4 2 3 1 0 \n", 373 | "1 12250000 8960 4 4 4 1 0 \n", 374 | "2 12250000 9960 3 2 2 1 0 \n", 375 | "3 12215000 7500 4 2 2 1 0 \n", 376 | "4 11410000 7420 4 1 2 1 1 \n", 377 | "\n", 378 | " basement hotwaterheating airconditioning parking prefarea \\\n", 379 | "0 0 0 1 2 1 \n", 380 | "1 0 0 1 3 0 \n", 381 | "2 1 0 0 2 1 \n", 382 | "3 1 0 1 3 1 \n", 383 | "4 1 0 1 2 0 \n", 384 | "\n", 385 | " furnishingstatus \n", 386 | "0 furnished \n", 387 | "1 furnished \n", 388 | "2 semi-furnished \n", 389 | "3 furnished \n", 390 | "4 furnished " 391 | ] 392 | }, 393 | "execution_count": 6, 394 | "metadata": {}, 395 | "output_type": "execute_result" 396 | } 397 | ], 398 | "source": [ 399 | "# Check the housing dataframe now\n", 400 | "\n", 401 | "housing.head()" 402 | ] 403 | }, 404 | { 405 | "cell_type": "markdown", 406 | "metadata": {}, 407 | "source": [ 408 | "### Dummy Variables" 409 | ] 410 | }, 411 | { 412 | "cell_type": "markdown", 413 | "metadata": {}, 414 | "source": [ 415 | "The variable `furnishingstatus` has three levels. We need to convert these levels into integer as well. For this, we will use something called `dummy variables`." 416 | ] 417 | }, 418 | { 419 | "cell_type": "code", 420 | "execution_count": 7, 421 | "metadata": {}, 422 | "outputs": [ 423 | { 424 | "data": { 425 | "text/html": [ 426 | "
" 483 | ], 484 | "text/plain": [ 485 | " furnished semi-furnished unfurnished\n", 486 | "0 1 0 0\n", 487 | "1 1 0 0\n", 488 | "2 0 1 0\n", 489 | "3 1 0 0\n", 490 | "4 1 0 0" 491 | ] 492 | }, 493 | "execution_count": 7, 494 | "metadata": {}, 495 | "output_type": "execute_result" 496 | } 497 | ], 498 | "source": [ 499 | "# Get the dummy variables for the feature 'furnishingstatus' and store it in a new variable - 'status'\n", 500 | "\n", 501 | "status = pd.get_dummies(housing['furnishingstatus'])\n", 502 | "\n", 503 | "# Check what the dataset 'status' looks like\n", 504 | "status.head()" 505 | ] 506 | }, 507 | { 508 | "cell_type": "markdown", 509 | "metadata": {}, 510 | "source": [ 511 | "Now, you don't need three columns. You can drop the `furnished` column, as the type of furnishing can be identified with just the last two columns where — \n", 512 | "- `00` will correspond to `furnished`\n", 513 | "- `01` will correspond to `unfurnished`\n", 514 | "- `10` will correspond to `semi-furnished`" 515 | ] 516 | }, 517 | { 518 | "cell_type": "code", 519 | "execution_count": 8, 520 | "metadata": {}, 521 | "outputs": [ 522 | { 523 | "data": { 524 | "text/html": [ 525 | "
" 654 | ], 655 | "text/plain": [ 656 | " price area bedrooms bathrooms stories mainroad guestroom \\\n", 657 | "0 13300000 7420 4 2 3 1 0 \n", 658 | "1 12250000 8960 4 4 4 1 0 \n", 659 | "2 12250000 9960 3 2 2 1 0 \n", 660 | "3 12215000 7500 4 2 2 1 0 \n", 661 | "4 11410000 7420 4 1 2 1 1 \n", 662 | "\n", 663 | " basement hotwaterheating airconditioning parking prefarea \\\n", 664 | "0 0 0 1 2 1 \n", 665 | "1 0 0 1 3 0 \n", 666 | "2 1 0 0 2 1 \n", 667 | "3 1 0 1 3 1 \n", 668 | "4 1 0 1 2 0 \n", 669 | "\n", 670 | " furnishingstatus semi-furnished unfurnished \n", 671 | "0 furnished 0 0 \n", 672 | "1 furnished 0 0 \n", 673 | "2 semi-furnished 1 0 \n", 674 | "3 furnished 0 0 \n", 675 | "4 furnished 0 0 " 676 | ] 677 | }, 678 | "execution_count": 8, 679 | "metadata": {}, 680 | "output_type": "execute_result" 681 | } 682 | ], 683 | "source": [ 684 | "# Let's drop the first column from status df using 'drop_first = True'\n", 685 | "status = pd.get_dummies(housing['furnishingstatus'], drop_first = True)\n", 686 | "\n", 687 | "# Add the results to the original housing dataframe\n", 688 | "housing = pd.concat([housing, status], axis = 1)\n", 689 | "\n", 690 | "# Now let's see the head of our dataframe.\n", 691 | "housing.head()" 692 | ] 693 | }, 694 | { 695 | "cell_type": "code", 696 | "execution_count": 9, 697 | "metadata": {}, 698 | "outputs": [ 699 | { 700 | "data": { 701 | "text/html": [ 702 | "
" 825 | ], 826 | "text/plain": [ 827 | " price area bedrooms bathrooms stories mainroad guestroom \\\n", 828 | "0 13300000 7420 4 2 3 1 0 \n", 829 | "1 12250000 8960 4 4 4 1 0 \n", 830 | "2 12250000 9960 3 2 2 1 0 \n", 831 | "3 12215000 7500 4 2 2 1 0 \n", 832 | "4 11410000 7420 4 1 2 1 1 \n", 833 | "\n", 834 | " basement hotwaterheating airconditioning parking prefarea \\\n", 835 | "0 0 0 1 2 1 \n", 836 | "1 0 0 1 3 0 \n", 837 | "2 1 0 0 2 1 \n", 838 | "3 1 0 1 3 1 \n", 839 | "4 1 0 1 2 0 \n", 840 | "\n", 841 | " semi-furnished unfurnished \n", 842 | "0 0 0 \n", 843 | "1 0 0 \n", 844 | "2 1 0 \n", 845 | "3 0 0 \n", 846 | "4 0 0 " 847 | ] 848 | }, 849 | "execution_count": 9, 850 | "metadata": {}, 851 | "output_type": "execute_result" 852 | } 853 | ], 854 | "source": [ 855 | "# Drop 'furnishingstatus' as we have created the dummies for it\n", 856 | "housing.drop(['furnishingstatus'], axis = 1, inplace = True)\n", 857 | "\n", 858 | "housing.head()" 859 | ] 860 | }, 861 | { 862 | "cell_type": "markdown", 863 | "metadata": {}, 864 | "source": [ 865 | "## Splitting the Data into Training and Testing Sets" 866 | ] 867 | }, 868 | { 869 | "cell_type": "code", 870 | "execution_count": 10, 871 | "metadata": {}, 872 | "outputs": [], 873 | "source": [ 874 | "from sklearn.model_selection import train_test_split\n", 875 | "\n", 876 | "# We specify this so that the train and test data set always have the same rows, respectively\n", 877 | "\n", 878 | "df_train, df_test = train_test_split(housing, train_size = 0.7, test_size = 0.3, random_state = 100)" 879 | ] 880 | }, 881 | { 882 | "cell_type": "markdown", 883 | "metadata": {}, 884 | "source": [ 885 | "### Rescaling the Features \n", 886 | "\n", 887 | "We will use MinMax scaling." 
888 | ] 889 | }, 890 | { 891 | "cell_type": "code", 892 | "execution_count": 11, 893 | "metadata": {}, 894 | "outputs": [], 895 | "source": [ 896 | "from sklearn.preprocessing import MinMaxScaler\n", 897 | "scaler = MinMaxScaler()" 898 | ] 899 | }, 900 | { 901 | "cell_type": "code", 902 | "execution_count": 12, 903 | "metadata": {}, 904 | "outputs": [ 905 | { 906 | "data": { 907 | "text/html": [ 908 | "
" 1031 | ], 1032 | "text/plain": [ 1033 | " price area bedrooms bathrooms stories mainroad guestroom \\\n", 1034 | "359 0.169697 0.155227 0.4 0.0 0.000000 1 0 \n", 1035 | "19 0.615152 0.403379 0.4 0.5 0.333333 1 0 \n", 1036 | "159 0.321212 0.115628 0.4 0.5 0.000000 1 1 \n", 1037 | "35 0.548133 0.454417 0.4 0.5 1.000000 1 0 \n", 1038 | "28 0.575758 0.538015 0.8 0.5 0.333333 1 0 \n", 1039 | "\n", 1040 | " basement hotwaterheating airconditioning parking prefarea \\\n", 1041 | "359 0 0 0 0.333333 0 \n", 1042 | "19 0 0 1 0.333333 1 \n", 1043 | "159 1 0 1 0.000000 0 \n", 1044 | "35 0 0 1 0.666667 0 \n", 1045 | "28 1 1 0 0.666667 0 \n", 1046 | "\n", 1047 | " semi-furnished unfurnished \n", 1048 | "359 0 1 \n", 1049 | "19 1 0 \n", 1050 | "159 0 0 \n", 1051 | "35 0 0 \n", 1052 | "28 0 1 " 1053 | ] 1054 | }, 1055 | "execution_count": 12, 1056 | "metadata": {}, 1057 | "output_type": "execute_result" 1058 | } 1059 | ], 1060 | "source": [ 1061 | "# Apply scaler() to all the columns except the 'yes-no' and 'dummy' variables\n", 1062 | "num_vars = ['area', 'bedrooms', 'bathrooms', 'stories', 'parking','price']\n", 1063 | "\n", 1064 | "df_train[num_vars] = scaler.fit_transform(df_train[num_vars])\n", 1065 | "\n", 1066 | "df_train.head()" 1067 | ] 1068 | }, 1069 | { 1070 | "cell_type": "markdown", 1071 | "metadata": {}, 1072 | "source": [ 1073 | "### Dividing into X and Y sets for the model building" 1074 | ] 1075 | }, 1076 | { 1077 | "cell_type": "code", 1078 | "execution_count": 13, 1079 | "metadata": {}, 1080 | "outputs": [], 1081 | "source": [ 1082 | "y_train = df_train.pop('price')\n", 1083 | "X_train = df_train" 1084 | ] 1085 | }, 1086 | { 1087 | "cell_type": "markdown", 1088 | "metadata": {}, 1089 | "source": [ 1090 | "## Building our model\n", 1091 | "\n", 1092 | "This time, we will be using the **LinearRegression function from SciKit Learn** for its compatibility with RFE (which is a utility from sklearn)" 1093 | ] 1094 | }, 1095 | { 1096 | "cell_type": "markdown", 1097 | 
"metadata": {}, 1098 | "source": [ 1099 | "### RFE\n", 1100 | "Recursive feature elimination" 1101 | ] 1102 | }, 1103 | { 1104 | "cell_type": "code", 1105 | "execution_count": 14, 1106 | "metadata": {}, 1107 | "outputs": [], 1108 | "source": [ 1109 | "# Importing RFE and LinearRegression\n", 1110 | "from sklearn.feature_selection import RFE\n", 1111 | "from sklearn.linear_model import LinearRegression" 1112 | ] 1113 | }, 1114 | { 1115 | "cell_type": "code", 1116 | "execution_count": 15, 1117 | "metadata": {}, 1118 | "outputs": [], 1119 | "source": [ 1120 | "# Running RFE with the output number of the variable equal to 10\n", 1121 | "lm = LinearRegression()\n", 1122 | "lm.fit(X_train, y_train)\n", 1123 | "\n", 1124 | "rfe = RFE(lm, 10) # running RFE\n", 1125 | "rfe = rfe.fit(X_train, y_train)" 1126 | ] 1127 | }, 1128 | { 1129 | "cell_type": "code", 1130 | "execution_count": 16, 1131 | "metadata": {}, 1132 | "outputs": [ 1133 | { 1134 | "data": { 1135 | "text/plain": [ 1136 | "[('area', True, 1),\n", 1137 | " ('bedrooms', True, 1),\n", 1138 | " ('bathrooms', True, 1),\n", 1139 | " ('stories', True, 1),\n", 1140 | " ('mainroad', True, 1),\n", 1141 | " ('guestroom', True, 1),\n", 1142 | " ('basement', False, 3),\n", 1143 | " ('hotwaterheating', True, 1),\n", 1144 | " ('airconditioning', True, 1),\n", 1145 | " ('parking', True, 1),\n", 1146 | " ('prefarea', True, 1),\n", 1147 | " ('semi-furnished', False, 4),\n", 1148 | " ('unfurnished', False, 2)]" 1149 | ] 1150 | }, 1151 | "execution_count": 16, 1152 | "metadata": {}, 1153 | "output_type": "execute_result" 1154 | } 1155 | ], 1156 | "source": [ 1157 | "list(zip(X_train.columns,rfe.support_,rfe.ranking_))" 1158 | ] 1159 | }, 1160 | { 1161 | "cell_type": "code", 1162 | "execution_count": 17, 1163 | "metadata": { 1164 | "scrolled": false 1165 | }, 1166 | "outputs": [ 1167 | { 1168 | "data": { 1169 | "text/plain": [ 1170 | "Index(['area', 'bedrooms', 'bathrooms', 'stories', 'mainroad', 'guestroom',\n", 1171 | " 
'hotwaterheating', 'airconditioning', 'parking', 'prefarea'],\n", 1172 | " dtype='object')" 1173 | ] 1174 | }, 1175 | "execution_count": 17, 1176 | "metadata": {}, 1177 | "output_type": "execute_result" 1178 | } 1179 | ], 1180 | "source": [ 1181 | "col = X_train.columns[rfe.support_]\n", 1182 | "col" 1183 | ] 1184 | }, 1185 | { 1186 | "cell_type": "code", 1187 | "execution_count": 18, 1188 | "metadata": {}, 1189 | "outputs": [ 1190 | { 1191 | "data": { 1192 | "text/plain": [ 1193 | "Index(['basement', 'semi-furnished', 'unfurnished'], dtype='object')" 1194 | ] 1195 | }, 1196 | "execution_count": 18, 1197 | "metadata": {}, 1198 | "output_type": "execute_result" 1199 | } 1200 | ], 1201 | "source": [ 1202 | "X_train.columns[~rfe.support_]" 1203 | ] 1204 | }, 1205 | { 1206 | "cell_type": "markdown", 1207 | "metadata": {}, 1208 | "source": [ 1209 | "### Building model using statsmodels, for the detailed statistics" 1210 | ] 1211 | }, 1212 | { 1213 | "cell_type": "code", 1214 | "execution_count": 19, 1215 | "metadata": {}, 1216 | "outputs": [], 1217 | "source": [ 1218 | "# Creating X_train dataframe with RFE selected variables\n", 1219 | "X_train_rfe = X_train[col]" 1220 | ] 1221 | }, 1222 | { 1223 | "cell_type": "code", 1224 | "execution_count": 20, 1225 | "metadata": {}, 1226 | "outputs": [], 1227 | "source": [ 1228 | "# Adding a constant variable \n", 1229 | "import statsmodels.api as sm \n", 1230 | "X_train_rfe = sm.add_constant(X_train_rfe)" 1231 | ] 1232 | }, 1233 | { 1234 | "cell_type": "code", 1235 | "execution_count": 21, 1236 | "metadata": {}, 1237 | "outputs": [], 1238 | "source": [ 1239 | "lm = sm.OLS(y_train,X_train_rfe).fit() # Running the linear model" 1240 | ] 1241 | }, 1242 | { 1243 | "cell_type": "code", 1244 | "execution_count": 22, 1245 | "metadata": {}, 1246 | "outputs": [ 1247 | { 1248 | "name": "stdout", 1249 | "output_type": "stream", 1250 | "text": [ 1251 | " OLS Regression Results \n", 1252 | 
"==============================================================================\n", 1253 | "Dep. Variable: price R-squared: 0.669\n", 1254 | "Model: OLS Adj. R-squared: 0.660\n", 1255 | "Method: Least Squares F-statistic: 74.89\n", 1256 | "Date: Tue, 09 Oct 2018 Prob (F-statistic): 1.28e-82\n", 1257 | "Time: 13:15:31 Log-Likelihood: 374.65\n", 1258 | "No. Observations: 381 AIC: -727.3\n", 1259 | "Df Residuals: 370 BIC: -683.9\n", 1260 | "Df Model: 10 \n", 1261 | "Covariance Type: nonrobust \n", 1262 | "===================================================================================\n", 1263 | " coef std err t P>|t| [0.025 0.975]\n", 1264 | "-----------------------------------------------------------------------------------\n", 1265 | "const 0.0027 0.018 0.151 0.880 -0.033 0.038\n", 1266 | "area 0.2363 0.030 7.787 0.000 0.177 0.296\n", 1267 | "bedrooms 0.0661 0.037 1.794 0.074 -0.006 0.139\n", 1268 | "bathrooms 0.1982 0.022 8.927 0.000 0.155 0.242\n", 1269 | "stories 0.0977 0.019 5.251 0.000 0.061 0.134\n", 1270 | "mainroad 0.0556 0.014 3.848 0.000 0.027 0.084\n", 1271 | "guestroom 0.0381 0.013 2.934 0.004 0.013 0.064\n", 1272 | "hotwaterheating 0.0897 0.022 4.104 0.000 0.047 0.133\n", 1273 | "airconditioning 0.0711 0.011 6.235 0.000 0.049 0.093\n", 1274 | "parking 0.0637 0.018 3.488 0.001 0.028 0.100\n", 1275 | "prefarea 0.0643 0.012 5.445 0.000 0.041 0.088\n", 1276 | "==============================================================================\n", 1277 | "Omnibus: 86.105 Durbin-Watson: 2.098\n", 1278 | "Prob(Omnibus): 0.000 Jarque-Bera (JB): 286.069\n", 1279 | "Skew: 0.992 Prob(JB): 7.60e-63\n", 1280 | "Kurtosis: 6.753 Cond. No. 
13.2\n", 1281 | "==============================================================================\n", 1282 | "\n", 1283 | "Warnings:\n", 1284 | "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n" 1285 | ] 1286 | } 1287 | ], 1288 | "source": [ 1289 | "#Let's see the summary of our linear model\n", 1290 | "print(lm.summary())" 1291 | ] 1292 | }, 1293 | { 1294 | "cell_type": "markdown", 1295 | "metadata": {}, 1296 | "source": [ 1297 | "`Bedrooms` is insignificant in presence of other variables; can be dropped" 1298 | ] 1299 | }, 1300 | { 1301 | "cell_type": "code", 1302 | "execution_count": 23, 1303 | "metadata": {}, 1304 | "outputs": [], 1305 | "source": [ 1306 | "X_train_new = X_train_rfe.drop([\"bedrooms\"], axis = 1)" 1307 | ] 1308 | }, 1309 | { 1310 | "cell_type": "markdown", 1311 | "metadata": {}, 1312 | "source": [ 1313 | "Rebuilding the model without `bedrooms`" 1314 | ] 1315 | }, 1316 | { 1317 | "cell_type": "code", 1318 | "execution_count": 24, 1319 | "metadata": {}, 1320 | "outputs": [], 1321 | "source": [ 1322 | "# Adding a constant variable \n", 1323 | "import statsmodels.api as sm \n", 1324 | "X_train_lm = sm.add_constant(X_train_new)" 1325 | ] 1326 | }, 1327 | { 1328 | "cell_type": "code", 1329 | "execution_count": 25, 1330 | "metadata": {}, 1331 | "outputs": [], 1332 | "source": [ 1333 | "lm = sm.OLS(y_train,X_train_lm).fit() # Running the linear model" 1334 | ] 1335 | }, 1336 | { 1337 | "cell_type": "code", 1338 | "execution_count": 26, 1339 | "metadata": {}, 1340 | "outputs": [ 1341 | { 1342 | "name": "stdout", 1343 | "output_type": "stream", 1344 | "text": [ 1345 | " OLS Regression Results \n", 1346 | "==============================================================================\n", 1347 | "Dep. Variable: price R-squared: 0.666\n", 1348 | "Model: OLS Adj. 
R-squared: 0.658\n", 1349 | "Method: Least Squares F-statistic: 82.37\n", 1350 | "Date: Tue, 09 Oct 2018 Prob (F-statistic): 6.67e-83\n", 1351 | "Time: 13:15:31 Log-Likelihood: 373.00\n", 1352 | "No. Observations: 381 AIC: -726.0\n", 1353 | "Df Residuals: 371 BIC: -686.6\n", 1354 | "Df Model: 9 \n", 1355 | "Covariance Type: nonrobust \n", 1356 | "===================================================================================\n", 1357 | " coef std err t P>|t| [0.025 0.975]\n", 1358 | "-----------------------------------------------------------------------------------\n", 1359 | "const 0.0242 0.013 1.794 0.074 -0.002 0.051\n", 1360 | "area 0.2367 0.030 7.779 0.000 0.177 0.297\n", 1361 | "bathrooms 0.2070 0.022 9.537 0.000 0.164 0.250\n", 1362 | "stories 0.1096 0.017 6.280 0.000 0.075 0.144\n", 1363 | "mainroad 0.0536 0.014 3.710 0.000 0.025 0.082\n", 1364 | "guestroom 0.0390 0.013 2.991 0.003 0.013 0.065\n", 1365 | "hotwaterheating 0.0921 0.022 4.213 0.000 0.049 0.135\n", 1366 | "airconditioning 0.0710 0.011 6.212 0.000 0.049 0.094\n", 1367 | "parking 0.0669 0.018 3.665 0.000 0.031 0.103\n", 1368 | "prefarea 0.0653 0.012 5.513 0.000 0.042 0.089\n", 1369 | "==============================================================================\n", 1370 | "Omnibus: 91.542 Durbin-Watson: 2.107\n", 1371 | "Prob(Omnibus): 0.000 Jarque-Bera (JB): 315.402\n", 1372 | "Skew: 1.044 Prob(JB): 3.25e-69\n", 1373 | "Kurtosis: 6.938 Cond. No. 
10.0\n", 1374 | "==============================================================================\n", 1375 | "\n", 1376 | "Warnings:\n", 1377 | "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n" 1378 | ] 1379 | } 1380 | ], 1381 | "source": [ 1382 | "#Let's see the summary of our linear model\n", 1383 | "print(lm.summary())" 1384 | ] 1385 | }, 1386 | { 1387 | "cell_type": "code", 1388 | "execution_count": 27, 1389 | "metadata": {}, 1390 | "outputs": [ 1391 | { 1392 | "data": { 1393 | "text/plain": [ 1394 | "Index(['const', 'area', 'bathrooms', 'stories', 'mainroad', 'guestroom',\n", 1395 | " 'hotwaterheating', 'airconditioning', 'parking', 'prefarea'],\n", 1396 | " dtype='object')" 1397 | ] 1398 | }, 1399 | "execution_count": 27, 1400 | "metadata": {}, 1401 | "output_type": "execute_result" 1402 | } 1403 | ], 1404 | "source": [ 1405 | "X_train_new.columns" 1406 | ] 1407 | }, 1408 | { 1409 | "cell_type": "code", 1410 | "execution_count": 28, 1411 | "metadata": {}, 1412 | "outputs": [], 1413 | "source": [ 1414 | "X_train_new = X_train_new.drop(['const'], axis=1)" 1415 | ] 1416 | }, 1417 | { 1418 | "cell_type": "code", 1419 | "execution_count": 29, 1420 | "metadata": {}, 1421 | "outputs": [ 1422 | { 1423 | "data": { 1424 | "text/html": [ 1425 | "
\n", 1426 | "\n", 1439 | "\n", 1440 | " \n", 1441 | " \n", 1442 | " \n", 1443 | " \n", 1444 | " \n", 1445 | " \n", 1446 | " \n", 1447 | " \n", 1448 | " \n", 1449 | " \n", 1450 | " \n", 1451 | " \n", 1452 | " \n", 1453 | " \n", 1454 | " \n", 1455 | " \n", 1456 | " \n", 1457 | " \n", 1458 | " \n", 1459 | " \n", 1460 | " \n", 1461 | " \n", 1462 | " \n", 1463 | " \n", 1464 | " \n", 1465 | " \n", 1466 | " \n", 1467 | " \n", 1468 | " \n", 1469 | " \n", 1470 | " \n", 1471 | " \n", 1472 | " \n", 1473 | " \n", 1474 | " \n", 1475 | " \n", 1476 | " \n", 1477 | " \n", 1478 | " \n", 1479 | " \n", 1480 | " \n", 1481 | " \n", 1482 | " \n", 1483 | " \n", 1484 | " \n", 1485 | " \n", 1486 | " \n", 1487 | " \n", 1488 | " \n", 1489 | " \n", 1490 | " \n", 1491 | " \n", 1492 | " \n", 1493 | " \n", 1494 | "
   Features         VIF
0  area             4.52
3  mainroad         4.26
2  stories          2.12
7  parking          2.10
6  airconditioning  1.75
1  bathrooms        1.58
8  prefarea         1.47
4  guestroom        1.30
5  hotwaterheating  1.12
\n", 1495 | "
" 1496 | ], 1497 | "text/plain": [ 1498 | " Features VIF\n", 1499 | "0 area 4.52\n", 1500 | "3 mainroad 4.26\n", 1501 | "2 stories 2.12\n", 1502 | "7 parking 2.10\n", 1503 | "6 airconditioning 1.75\n", 1504 | "1 bathrooms 1.58\n", 1505 | "8 prefarea 1.47\n", 1506 | "4 guestroom 1.30\n", 1507 | "5 hotwaterheating 1.12" 1508 | ] 1509 | }, 1510 | "execution_count": 29, 1511 | "metadata": {}, 1512 | "output_type": "execute_result" 1513 | } 1514 | ], 1515 | "source": [ 1516 | "# Calculate the VIFs for the new model\n", 1517 | "from statsmodels.stats.outliers_influence import variance_inflation_factor\n", 1518 | "\n", 1519 | "vif = pd.DataFrame()\n", 1520 | "X = X_train_new\n", 1521 | "vif['Features'] = X.columns\n", 1522 | "vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]\n", 1523 | "vif['VIF'] = round(vif['VIF'], 2)\n", 1524 | "vif = vif.sort_values(by = \"VIF\", ascending = False)\n", 1525 | "vif" 1526 | ] 1527 | }, 1528 | { 1529 | "cell_type": "markdown", 1530 | "metadata": {}, 1531 | "source": [ 1532 | "## Residual Analysis of the train data\n", 1533 | "\n", 1534 | "So, now to check if the error terms are also normally distributed (which is infact, one of the major assumptions of linear regression), let us plot the histogram of the error terms and see what it looks like." 
1535 | ] 1536 | }, 1537 | { 1538 | "cell_type": "code", 1539 | "execution_count": 30, 1540 | "metadata": {}, 1541 | "outputs": [], 1542 | "source": [ 1543 | "y_train_price = lm.predict(X_train_lm)" 1544 | ] 1545 | }, 1546 | { 1547 | "cell_type": "code", 1548 | "execution_count": 31, 1549 | "metadata": {}, 1550 | "outputs": [], 1551 | "source": [ 1552 | "# Importing the required libraries for plots.\n", 1553 | "import matplotlib.pyplot as plt\n", 1554 | "import seaborn as sns\n", 1555 | "%matplotlib inline" 1556 | ] 1557 | }, 1558 | { 1559 | "cell_type": "code", 1560 | "execution_count": 32, 1561 | "metadata": {}, 1562 | "outputs": [ 1563 | { 1564 | "name": "stderr", 1565 | "output_type": "stream", 1566 | "text": [ 1567 | "C:\\Users\\admin\\Anaconda3\\lib\\site-packages\\matplotlib\\axes\\_axes.py:6462: UserWarning: The 'normed' kwarg is deprecated, and has been replaced by the 'density' kwarg.\n", 1568 | " warnings.warn(\"The 'normed' kwarg is deprecated, and has been \"\n" 1569 | ] 1570 | }, 1571 | { 1572 | "data": { 1573 | "text/plain": [ 1574 | "Text(0.5,0,'Errors')" 1575 | ] 1576 | }, 1577 | "execution_count": 32, 1578 | "metadata": {}, 1579 | "output_type": "execute_result" 1580 | }, 1581 | { 1582 | "data": { 1583 | "image/png": "\n", 1584 | "text/plain": [ 1585 | "
" 1586 | ] 1587 | }, 1588 | "metadata": {}, 1589 | "output_type": "display_data" 1590 | } 1591 | ], 1592 | "source": [ 1593 | "# Plot the histogram of the error terms\n", 1594 | "fig = plt.figure()\n", 1595 | "sns.distplot((y_train - y_train_price), bins = 20)\n", 1596 | "fig.suptitle('Error Terms', fontsize = 20) # Plot heading \n", 1597 | "plt.xlabel('Errors', fontsize = 18) # X-label" 1598 | ] 1599 | }, 1600 | { 1601 | "cell_type": "markdown", 1602 | "metadata": {}, 1603 | "source": [ 1604 | "## Making Predictions" 1605 | ] 1606 | }, 1607 | { 1608 | "cell_type": "markdown", 1609 | "metadata": {}, 1610 | "source": [ 1611 | "#### Applying the scaling on the test sets" 1612 | ] 1613 | }, 1614 | { 1615 | "cell_type": "code", 1616 | "execution_count": 33, 1617 | "metadata": {}, 1618 | "outputs": [], 1619 | "source": [ 1620 | "num_vars = ['area', 'bedrooms', 'bathrooms', 'stories', 'parking','price']\n", 1621 | "\n", 1622 | "df_test[num_vars] = scaler.transform(df_test[num_vars])" 1623 | ] 1624 | }, 1625 | { 1626 | "cell_type": "markdown", 1627 | "metadata": {}, 1628 | "source": [ 1629 | "#### Dividing into X_test and y_test" 1630 | ] 1631 | }, 1632 | { 1633 | "cell_type": "code", 1634 | "execution_count": 34, 1635 | "metadata": {}, 1636 | "outputs": [], 1637 | "source": [ 1638 | "y_test = df_test.pop('price')\n", 1639 | "X_test = df_test" 1640 | ] 1641 | }, 1642 | { 1643 | "cell_type": "code", 1644 | "execution_count": 35, 1645 | "metadata": {}, 1646 | "outputs": [], 1647 | "source": [ 1648 | "# Now let's use our model to make predictions.\n", 1649 | "\n", 1650 | "# Creating X_test_new dataframe by dropping variables from X_test\n", 1651 | "X_test_new = X_test[X_train_new.columns]\n", 1652 | "\n", 1653 | "# Adding a constant variable \n", 1654 | "X_test_new = sm.add_constant(X_test_new)" 1655 | ] 1656 | }, 1657 | { 1658 | "cell_type": "code", 1659 | "execution_count": 36, 1660 | "metadata": {}, 1661 | "outputs": [], 1662 | "source": [ 1663 | "# Making predictions\n", 
1664 | "y_pred = lm.predict(X_test_new)" 1665 | ] 1666 | }, 1667 | { 1668 | "cell_type": "markdown", 1669 | "metadata": {}, 1670 | "source": [ 1671 | "## Model Evaluation" 1672 | ] 1673 | }, 1674 | { 1675 | "cell_type": "code", 1676 | "execution_count": 37, 1677 | "metadata": {}, 1678 | "outputs": [ 1679 | { 1680 | "data": { 1681 | "text/plain": [ 1682 | "Text(0,0.5,'y_pred')" 1683 | ] 1684 | }, 1685 | "execution_count": 37, 1686 | "metadata": {}, 1687 | "output_type": "execute_result" 1688 | }, 1689 | { 1690 | "data": { 1691 | "image/png": "\n", 1692 | "text/plain": [ 1693 | "
" 1694 | ] 1695 | }, 1696 | "metadata": {}, 1697 | "output_type": "display_data" 1698 | } 1699 | ], 1700 | "source": [ 1701 | "# Plotting y_test and y_pred to understand the spread.\n", 1702 | "fig = plt.figure()\n", 1703 | "plt.scatter(y_test,y_pred)\n", 1704 | "fig.suptitle('y_test vs y_pred', fontsize=20) # Plot heading \n", 1705 | "plt.xlabel('y_test', fontsize=18) # X-label\n", 1706 | "plt.ylabel('y_pred', fontsize=16) # Y-label" 1707 | ] 1708 | } 1709 | ], 1710 | "metadata": { 1711 | "kernelspec": { 1712 | "display_name": "Python 3", 1713 | "language": "python", 1714 | "name": "python3" 1715 | }, 1716 | "language_info": { 1717 | "codemirror_mode": { 1718 | "name": "ipython", 1719 | "version": 3 1720 | }, 1721 | "file_extension": ".py", 1722 | "mimetype": "text/x-python", 1723 | "name": "python", 1724 | "nbconvert_exporter": "python", 1725 | "pygments_lexer": "ipython3", 1726 | "version": "3.6.5" 1727 | } 1728 | }, 1729 | "nbformat": 4, 1730 | "nbformat_minor": 2 1731 | } 1732 | --------------------------------------------------------------------------------