├── Housing Price Prediction.md
├── README.md
├── output_29_1.png
└── output_31_1.png
/Housing Price Prediction.md:
--------------------------------------------------------------------------------
1 | Data Analysis with Python
2 |
6 | # House Sales in King County, USA
7 |
8 | This dataset contains house sale prices for King County, which includes Seattle. It includes homes sold between May 2014 and May 2015.
9 |
- `id`: A notation for a house
- `date`: Date the house was sold
- `price`: Sale price (the prediction target)
- `bedrooms`: Number of bedrooms
- `bathrooms`: Number of bathrooms/bedrooms
- `sqft_living`: Square footage of the home
- `sqft_lot`: Square footage of the lot
- `floors`: Total floors (levels) in the house
- `waterfront`: Whether the house has a view of a waterfront
- `view`: Has been viewed
- `condition`: How good the condition is overall
- `grade`: Overall grade given to the housing unit, based on the King County grading system
- `sqft_above`: Square footage of the house apart from the basement
- `sqft_basement`: Square footage of the basement
- `yr_built`: Year built
- `yr_renovated`: Year when the house was renovated
- `zipcode`: ZIP code
- `lat`: Latitude coordinate
- `long`: Longitude coordinate
- `sqft_living15`: Living-room area in 2015 (implies some renovations); this might or might not have affected the lot-size area
- `sqft_lot15`: Lot-size area in 2015 (implies some renovations)
63 |
64 | You will require the following libraries:
65 |
66 |
67 | ```python
68 | import pandas as pd
69 | import matplotlib.pyplot as plt
70 | import numpy as np
71 | import seaborn as sns
72 | from sklearn.pipeline import Pipeline
73 | from sklearn.preprocessing import StandardScaler,PolynomialFeatures
74 | %matplotlib inline
75 | ```
76 |
77 | # 1.0 Importing the Data
78 |
79 | Load the csv:
80 |
81 |
82 | ```python
83 | file_name='https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/coursera/project/kc_house_data_NaN.csv'
84 | df=pd.read_csv(file_name)
85 | ```
86 |
87 |
88 | We use the method `head()` to display the first 5 rows of the dataframe.
89 |
90 |
91 | ```python
92 | df.head(5)
93 | ```
94 |
95 |
96 |
97 |
98 |
99 |
| | Unnamed: 0 | id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | ... | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 7129300520 | 20141013T000000 | 221900.0 | 3.0 | 1.00 | 1180 | 5650 | 1.0 | 0 | ... | 7 | 1180 | 0 | 1955 | 0 | 98178 | 47.5112 | -122.257 | 1340 | 5650 |
| 1 | 1 | 6414100192 | 20141209T000000 | 538000.0 | 3.0 | 2.25 | 2570 | 7242 | 2.0 | 0 | ... | 7 | 2170 | 400 | 1951 | 1991 | 98125 | 47.7210 | -122.319 | 1690 | 7639 |
| 2 | 2 | 5631500400 | 20150225T000000 | 180000.0 | 2.0 | 1.00 | 770 | 10000 | 1.0 | 0 | ... | 6 | 770 | 0 | 1933 | 0 | 98028 | 47.7379 | -122.233 | 2720 | 8062 |
| 3 | 3 | 2487200875 | 20141209T000000 | 604000.0 | 4.0 | 3.00 | 1960 | 5000 | 1.0 | 0 | ... | 7 | 1050 | 910 | 1965 | 0 | 98136 | 47.5208 | -122.393 | 1360 | 5000 |
| 4 | 4 | 1954400510 | 20150218T000000 | 510000.0 | 3.0 | 2.00 | 1680 | 8080 | 1.0 | 0 | ... | 8 | 1680 | 0 | 1987 | 0 | 98074 | 47.6168 | -122.045 | 1800 | 7503 |

5 rows × 22 columns
263 |
264 |
265 |
266 |
267 | #### Question 1
268 | Display the data types of each column using the attribute `dtypes`, then take a screenshot and submit it, including your code in the image.
269 |
270 |
271 | ```python
272 | df.dtypes
273 | ```
274 |
275 |
276 |
277 |
278 | Unnamed: 0 int64
279 | id int64
280 | date object
281 | price float64
282 | bedrooms float64
283 | bathrooms float64
284 | sqft_living int64
285 | sqft_lot int64
286 | floors float64
287 | waterfront int64
288 | view int64
289 | condition int64
290 | grade int64
291 | sqft_above int64
292 | sqft_basement int64
293 | yr_built int64
294 | yr_renovated int64
295 | zipcode int64
296 | lat float64
297 | long float64
298 | sqft_living15 int64
299 | sqft_lot15 int64
300 | dtype: object
301 |
302 |
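The `date` column is stored as a generic `object`. It is not needed for the graded questions, but if it were, it could be parsed into a proper datetime; a minimal sketch using two sample values from the dataset (the format string is an assumption based on how the dates look):

```python
import pandas as pd

# Two date strings as they appear in the dataset
s = pd.Series(['20141013T000000', '20141209T000000'])

# Parse with an explicit format: year, month, day, 'T', hour, minute, second
parsed = pd.to_datetime(s, format='%Y%m%dT%H%M%S')
print(parsed.dt.year.tolist())  # [2014, 2014]
```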
303 |
304 | We use the method describe to obtain a statistical summary of the dataframe.
305 |
306 |
307 | ```python
308 | df.describe()
309 | ```
310 |
311 |
312 |
313 |
314 |
315 |
| | Unnamed: 0 | id | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | ... | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 21613.00000 | 2.161300e+04 | 2.161300e+04 | 21600.000000 | 21603.000000 | 21613.000000 | 2.161300e+04 | 21613.000000 | 21613.000000 | 21613.000000 | ... | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 |
| mean | 10806.00000 | 4.580302e+09 | 5.400881e+05 | 3.372870 | 2.115736 | 2079.899736 | 1.510697e+04 | 1.494309 | 0.007542 | 0.234303 | ... | 7.656873 | 1788.390691 | 291.509045 | 1971.005136 | 84.402258 | 98077.939805 | 47.560053 | -122.213896 | 1986.552492 | 12768.455652 |
| std | 6239.28002 | 2.876566e+09 | 3.671272e+05 | 0.926657 | 0.768996 | 918.440897 | 4.142051e+04 | 0.539989 | 0.086517 | 0.766318 | ... | 1.175459 | 828.090978 | 442.575043 | 29.373411 | 401.679240 | 53.505026 | 0.138564 | 0.140828 | 685.391304 | 27304.179631 |
| min | 0.00000 | 1.000102e+06 | 7.500000e+04 | 1.000000 | 0.500000 | 290.000000 | 5.200000e+02 | 1.000000 | 0.000000 | 0.000000 | ... | 1.000000 | 290.000000 | 0.000000 | 1900.000000 | 0.000000 | 98001.000000 | 47.155900 | -122.519000 | 399.000000 | 651.000000 |
| 25% | 5403.00000 | 2.123049e+09 | 3.219500e+05 | 3.000000 | 1.750000 | 1427.000000 | 5.040000e+03 | 1.000000 | 0.000000 | 0.000000 | ... | 7.000000 | 1190.000000 | 0.000000 | 1951.000000 | 0.000000 | 98033.000000 | 47.471000 | -122.328000 | 1490.000000 | 5100.000000 |
| 50% | 10806.00000 | 3.904930e+09 | 4.500000e+05 | 3.000000 | 2.250000 | 1910.000000 | 7.618000e+03 | 1.500000 | 0.000000 | 0.000000 | ... | 7.000000 | 1560.000000 | 0.000000 | 1975.000000 | 0.000000 | 98065.000000 | 47.571800 | -122.230000 | 1840.000000 | 7620.000000 |
| 75% | 16209.00000 | 7.308900e+09 | 6.450000e+05 | 4.000000 | 2.500000 | 2550.000000 | 1.068800e+04 | 2.000000 | 0.000000 | 0.000000 | ... | 8.000000 | 2210.000000 | 560.000000 | 1997.000000 | 0.000000 | 98118.000000 | 47.678000 | -122.125000 | 2360.000000 | 10083.000000 |
| max | 21612.00000 | 9.900000e+09 | 7.700000e+06 | 33.000000 | 8.000000 | 13540.000000 | 1.651359e+06 | 3.500000 | 1.000000 | 4.000000 | ... | 13.000000 | 9410.000000 | 4820.000000 | 2015.000000 | 2015.000000 | 98199.000000 | 47.777600 | -121.315000 | 6210.000000 | 871200.000000 |

8 rows × 21 columns
551 |
552 |
553 |
554 |
555 | # 2.0 Data Wrangling
556 |
557 | #### Question 2
558 | Drop the columns `"id"` and `"Unnamed: 0"` from axis 1 using the method `drop()`, then use the method `describe()` to obtain a statistical summary of the data. Take a screenshot and submit it; make sure the `inplace` parameter is set to `True`.
559 |
560 |
561 | ```python
562 | df.drop('id', axis=1, inplace=True)
563 | df.drop('Unnamed: 0', axis=1, inplace=True)
564 | df.describe()
565 | ```
566 |
567 |
568 |
569 |
570 |
571 |
| | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2.161300e+04 | 21600.000000 | 21603.000000 | 21613.000000 | 2.161300e+04 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 |
| mean | 5.400881e+05 | 3.372870 | 2.115736 | 2079.899736 | 1.510697e+04 | 1.494309 | 0.007542 | 0.234303 | 3.409430 | 7.656873 | 1788.390691 | 291.509045 | 1971.005136 | 84.402258 | 98077.939805 | 47.560053 | -122.213896 | 1986.552492 | 12768.455652 |
| std | 3.671272e+05 | 0.926657 | 0.768996 | 918.440897 | 4.142051e+04 | 0.539989 | 0.086517 | 0.766318 | 0.650743 | 1.175459 | 828.090978 | 442.575043 | 29.373411 | 401.679240 | 53.505026 | 0.138564 | 0.140828 | 685.391304 | 27304.179631 |
| min | 7.500000e+04 | 1.000000 | 0.500000 | 290.000000 | 5.200000e+02 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 290.000000 | 0.000000 | 1900.000000 | 0.000000 | 98001.000000 | 47.155900 | -122.519000 | 399.000000 | 651.000000 |
| 25% | 3.219500e+05 | 3.000000 | 1.750000 | 1427.000000 | 5.040000e+03 | 1.000000 | 0.000000 | 0.000000 | 3.000000 | 7.000000 | 1190.000000 | 0.000000 | 1951.000000 | 0.000000 | 98033.000000 | 47.471000 | -122.328000 | 1490.000000 | 5100.000000 |
| 50% | 4.500000e+05 | 3.000000 | 2.250000 | 1910.000000 | 7.618000e+03 | 1.500000 | 0.000000 | 0.000000 | 3.000000 | 7.000000 | 1560.000000 | 0.000000 | 1975.000000 | 0.000000 | 98065.000000 | 47.571800 | -122.230000 | 1840.000000 | 7620.000000 |
| 75% | 6.450000e+05 | 4.000000 | 2.500000 | 2550.000000 | 1.068800e+04 | 2.000000 | 0.000000 | 0.000000 | 4.000000 | 8.000000 | 2210.000000 | 560.000000 | 1997.000000 | 0.000000 | 98118.000000 | 47.678000 | -122.125000 | 2360.000000 | 10083.000000 |
| max | 7.700000e+06 | 33.000000 | 8.000000 | 13540.000000 | 1.651359e+06 | 3.500000 | 1.000000 | 4.000000 | 5.000000 | 13.000000 | 9410.000000 | 4820.000000 | 2015.000000 | 2015.000000 | 98199.000000 | 47.777600 | -121.315000 | 6210.000000 | 871200.000000 |
786 |
787 |
788 |
789 |
790 |
791 |
792 | We can see that we have missing values in the columns `bedrooms` and `bathrooms`.
793 |
794 |
795 | ```python
796 | print("number of NaN values for the column bedrooms :", df['bedrooms'].isnull().sum())
797 | print("number of NaN values for the column bathrooms :", df['bathrooms'].isnull().sum())
798 |
799 | ```
800 |
801 | number of NaN values for the column bedrooms : 13
802 | number of NaN values for the column bathrooms : 10
803 |
804 |
805 |
806 | We can replace the missing values of the column `'bedrooms'` with the mean of that column using the method `replace()`. Don't forget to set the `inplace` parameter to `True`.
807 |
808 |
809 | ```python
810 | mean=df['bedrooms'].mean()
811 | df['bedrooms'].replace(np.nan,mean, inplace=True)
812 | ```
813 |
814 |
815 | We also replace the missing values of the column `'bathrooms'` with the mean of that column using the method `replace()`. Don't forget to set the `inplace` parameter to `True`.
816 |
817 |
818 | ```python
819 | mean=df['bathrooms'].mean()
820 | df['bathrooms'].replace(np.nan,mean, inplace=True)
821 | ```
822 |
823 |
824 | ```python
825 | print("number of NaN values for the column bedrooms :", df['bedrooms'].isnull().sum())
826 | print("number of NaN values for the column bathrooms :", df['bathrooms'].isnull().sum())
827 | ```
828 |
829 | number of NaN values for the column bedrooms : 0
830 | number of NaN values for the column bathrooms : 0
831 |
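The same mean imputation can also be sketched with `fillna`; the series below is synthetic, just to illustrate the mechanics:

```python
import numpy as np
import pandas as pd

# Hypothetical bedroom counts with one missing value
s = pd.Series([3.0, np.nan, 4.0, 3.0])

# Fill the missing entry with the mean of the non-missing values
s = s.fillna(s.mean())
print(s.isnull().sum())  # 0
```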
832 |
833 | # 3.0 Exploratory data analysis
834 |
835 | #### Question 3
836 | Use the method `value_counts()` to count the number of houses with unique floor values, then use the method `.to_frame()` to convert the result to a dataframe.
837 |
838 |
839 |
840 | ```python
841 | df['floors'].value_counts().to_frame()
842 | ```
843 |
844 |
845 |
846 |
847 |
848 |
| | floors |
|---|---|
| 1.0 | 10680 |
| 2.0 | 8241 |
| 1.5 | 1910 |
| 3.0 | 613 |
| 2.5 | 161 |
| 3.5 | 8 |
893 |
894 |
895 |
896 |
897 |
898 |
899 | ### Question 4
900 | Use the function `boxplot` in the seaborn library to determine whether houses with or without a waterfront view have more price outliers.
901 |
902 |
903 | ```python
904 | sns.boxplot(x='waterfront', y='price', data=df)
905 | ```
906 |
907 |
908 |
909 |
910 |
911 |
912 |
913 |
914 |
915 | 
916 |
917 |
918 | ### Question 5
919 | Use the function `regplot` in the seaborn library to determine whether the feature `sqft_above` is negatively or positively correlated with price.
920 |
921 |
922 | ```python
923 | sns.regplot(x='sqft_above', y='price', data=df)
924 | ```
925 |
926 |
927 |
928 |
929 |
930 |
931 |
932 |
933 |
934 | 
935 |
936 |
937 |
938 | We can use the Pandas method `corr()` to find the feature other than price that is most correlated with price.
939 |
940 |
941 | ```python
942 | df.corr()['price'].sort_values()
943 | ```
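Note that newer pandas releases raise a `TypeError` when `corr()` meets a non-numeric column such as `date`, so it can be safer to restrict to numeric columns first. A minimal sketch on a synthetic frame (the values are made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    'date': ['20141013T000000', '20141209T000000', '20150225T000000'],
    'price': [221900.0, 538000.0, 180000.0],
    'sqft_living': [1180, 2570, 770],
})

# Keep only numeric columns before computing correlations
corr = toy.select_dtypes(include='number').corr()['price'].sort_values()
print(corr)
```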
944 |
945 | # Module 4: Model Development
946 |
947 | Import libraries
948 |
949 |
950 | ```python
951 | import matplotlib.pyplot as plt
952 | from sklearn.linear_model import LinearRegression
953 |
954 | ```
955 |
956 |
957 | We can fit a linear regression model using the longitude feature `'long'` and calculate the R^2.
958 |
959 |
960 | ```python
961 | X = df[['long']]
962 | Y = df['price']
963 | lm = LinearRegression()
965 | lm.fit(X,Y)
966 | lm.score(X, Y)
967 | ```
968 |
969 | ### Question 6
970 | Fit a linear regression model to predict the 'price'
using the feature 'sqft_living' then calculate the R^2. Take a screenshot of your code and the value of the R^2.
971 |
972 |
973 | ```python
974 | U = df[['sqft_living']]
975 | V = df['price']
976 | lm.fit(U,V)
977 | lm.score(U,V)
978 | ```
979 |
980 |
981 |
982 |
983 | 0.49285321790379316
984 |
985 |
986 |
987 | ### Question 7
988 | Fit a linear regression model to predict the 'price' using the list of features:
989 |
990 |
991 | ```python
992 | features =["floors", "waterfront","lat" ,"bedrooms" ,"sqft_basement" ,"view" ,"bathrooms","sqft_living15","sqft_above","grade","sqft_living"]
993 | X = df[features]
994 | Y = df['price']
995 | lm.fit(X,Y)
996 | ```
997 |
998 |
999 |
1000 |
1001 | LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
1002 | normalize=False)
1003 |
1004 |
1005 |
1006 | Then calculate the R^2. Take a screenshot of your code.
1007 |
1008 |
1009 | ```python
1010 | lm.score(X,Y)
1011 | ```
1012 |
1013 |
1014 |
1015 |
1016 | 0.6576951666037504
1017 |
1018 |
1019 |
1020 | #### this will help with Question 8
1021 |
Create a list of tuples; the first element in each tuple contains the name of the estimator:

- `'scale'`
- `'polynomial'`
- `'model'`

The second element in each tuple contains the model constructor:

- `StandardScaler()`
- `PolynomialFeatures(include_bias=False)`
- `LinearRegression()`
1037 |
1038 |
1039 |
1040 | ```python
1041 | Input=[('scale',StandardScaler()),('polynomial', PolynomialFeatures(include_bias=False)),('model',LinearRegression())]
1042 | ```
1043 |
1044 | ### Question 8
1045 | Use the list to create a pipeline object, fit it using the features in the list `features` to predict `'price'`, then calculate the R^2.
1046 |
1047 |
1048 | ```python
1049 | pipe=Pipeline(Input)
1050 | pipe
1051 | ```
1052 |
1053 |
1054 |
1055 |
1056 | Pipeline(memory=None,
1057 | steps=[('scale', StandardScaler(copy=True, with_mean=True, with_std=True)), ('polynomial', PolynomialFeatures(degree=2, include_bias=False, interaction_only=False)), ('model', LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
1058 | normalize=False))])
1059 |
1060 |
1061 |
1062 |
1063 | ```python
1064 | pipe.fit(X,Y)
1065 | ```
1066 |
1067 | /opt/conda/envs/Python36/lib/python3.6/site-packages/sklearn/preprocessing/data.py:645: DataConversionWarning: Data with input dtype int64, float64 were all converted to float64 by StandardScaler.
1068 | return self.partial_fit(X, y)
1069 | /opt/conda/envs/Python36/lib/python3.6/site-packages/sklearn/base.py:467: DataConversionWarning: Data with input dtype int64, float64 were all converted to float64 by StandardScaler.
1070 | return self.fit(X, y, **fit_params).transform(X)
1071 |
1072 |
1073 |
1074 |
1075 |
1076 | Pipeline(memory=None,
1077 | steps=[('scale', StandardScaler(copy=True, with_mean=True, with_std=True)), ('polynomial', PolynomialFeatures(degree=2, include_bias=False, interaction_only=False)), ('model', LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
1078 | normalize=False))])
1079 |
1080 |
1081 |
1082 |
1083 | ```python
1084 | pipe.score(X,Y)
1085 | ```
1086 |
1087 | /opt/conda/envs/Python36/lib/python3.6/site-packages/sklearn/pipeline.py:511: DataConversionWarning: Data with input dtype int64, float64 were all converted to float64 by StandardScaler.
1088 | Xt = transform.transform(Xt)
1089 |
1090 |
1091 |
1092 |
1093 |
1094 | 0.7513427797293394
1095 |
1096 |
1097 |
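The jump in R^2 over the plain linear model comes from the extra polynomial terms. With 11 input features, `PolynomialFeatures(degree=2, include_bias=False)` produces 77 features: the 11 originals plus the 66 degree-2 products. A quick check:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.zeros((1, 11))  # one dummy row with 11 features, as in the list above
pf = PolynomialFeatures(degree=2, include_bias=False)
n_out = pf.fit_transform(X).shape[1]
print(n_out)  # 77 = 11 linear terms + 66 degree-2 terms
```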
1098 | # Module 5: Model Evaluation and Refinement
1099 |
1100 | Import the necessary modules:
1101 |
1102 |
1103 | ```python
1104 | from sklearn.model_selection import cross_val_score
1105 | from sklearn.model_selection import train_test_split
1106 | print("done")
1107 | ```
1108 |
1109 | done
1110 |
1111 |
1112 | We will split the data into training and testing sets.
1113 |
1114 |
1115 | ```python
1116 | features =["floors", "waterfront","lat" ,"bedrooms" ,"sqft_basement" ,"view" ,"bathrooms","sqft_living15","sqft_above","grade","sqft_living"]
1117 | X = df[features ]
1118 | Y = df['price']
1119 |
1120 | x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.15, random_state=1)
1121 |
1122 |
1123 | print("number of test samples :", x_test.shape[0])
1124 | print("number of training samples:",x_train.shape[0])
1125 | ```
1126 |
1127 | number of test samples : 3242
1128 | number of training samples: 18371
1129 |
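The split sizes follow from `test_size=0.15`: scikit-learn takes the ceiling of 0.15 × 21,613 rows for the test set and leaves the rest for training. A quick sanity check:

```python
import math

n = 21613                     # rows in the dataset
n_test = math.ceil(0.15 * n)  # 3242
n_train = n - n_test          # 18371
print(n_test, n_train)
```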
1130 |
1131 | ### Question 9
1132 | Create and fit a Ridge regression object using the training data, setting the regularization parameter to 0.1, then calculate the R^2 using the test data.
1133 |
1134 |
1135 |
1136 | ```python
1137 | from sklearn.linear_model import Ridge
1138 | ```
1139 |
1140 |
1141 | ```python
1142 | RidgeModel = Ridge(alpha=0.1)
1143 | RidgeModel.fit(x_train, y_train)
1144 | RidgeModel.score(x_test, y_test)
1145 | ```
1146 |
1147 |
1148 |
1149 |
1150 | 0.6478759163939111
1151 |
1152 |
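The effect of `alpha` can be sketched on synthetic data (everything below is made up for illustration): larger values shrink the learned coefficients toward zero, trading a little bias for lower variance.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([10.0, -5.0, 2.0]) + rng.normal(scale=0.1, size=100)

# The same model with light and heavy regularization
light = Ridge(alpha=0.1).fit(X, y)
heavy = Ridge(alpha=1000.0).fit(X, y)
print(np.abs(light.coef_).sum(), np.abs(heavy.coef_).sum())
```

Heavy regularization yields uniformly smaller coefficient magnitudes.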
1153 |
1154 | ### Question 10
1155 | Perform a second-order polynomial transform on both the training data and the testing data. Create and fit a Ridge regression object using the training data, setting the regularization parameter to 0.1. Calculate the R^2 using the test data. Take a screenshot of your code and the R^2.
1156 |
1157 |
1158 | ```python
1159 | pr = PolynomialFeatures(degree=2)
1160 | x_train_pr = pr.fit_transform(x_train)
1161 | x_test_pr = pr.transform(x_test)  # use transform, not fit_transform, on the test set
1162 |
1163 | RidgeModel = Ridge(alpha=0.1)
1164 | RidgeModel.fit(x_train_pr, y_train)
1165 | RidgeModel.score(x_test_pr, y_test)
1166 | ```
1167 |
1168 |
1169 |
1170 |
1171 | 0.7002744268659787
1172 |
1173 |
1174 |
1175 | About this Project:
1176 |
1177 | This project is part of a graded exercise in the "Data Analysis with Python" course on Coursera, offered by IBM.
1178 |
1179 |
1180 |
1181 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # House-sale-price-prediction-using-python
2 | This project analyzes and predicts housing sale prices based on features such as square footage, number of bedrooms, views, location, etc. It uses the dataset of house sale prices for King County, USA, covering homes sold between May 2014 and May 2015.
3 |
4 | It uses Python code to clean the data, analyse it, create models for price prediction, and evaluate and refine those models. Major activities covered include:
5 | - Numerical representation of data using correlation, linear and polynomial regression, R-squared values, etc.
6 | - Graphical representation of data using boxplots and seaborn's regplot.
7 | - Model refinement using a ridge regression object.
8 | - Polynomial transforms of training and test data, etc.
9 |
--------------------------------------------------------------------------------
/output_29_1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/calistus-igwilo/House-sale-price-prediction-using-python/23d5b4735e772ff99a759b5548f76355ed63346d/output_29_1.png
--------------------------------------------------------------------------------
/output_31_1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/calistus-igwilo/House-sale-price-prediction-using-python/23d5b4735e772ff99a759b5548f76355ed63346d/output_31_1.png
--------------------------------------------------------------------------------