├── Housing Price Prediction.md
├── README.md
├── output_29_1.png
└── output_31_1.png
/Housing Price Prediction.md:
--------------------------------------------------------------------------------
1 | Data Analysis with Python
2 |
6 | # House Sales in King County, USA
7 |
8 | This dataset contains house sale prices for King County, which includes Seattle. It includes homes sold between May 2014 and May 2015.
9 |
- `id`: A notation for a house
- `date`: Date the house was sold
- `price`: Sale price (the prediction target)
- `bedrooms`: Number of bedrooms
- `bathrooms`: Number of bathrooms/bedrooms
- `sqft_living`: Square footage of the home
- `sqft_lot`: Square footage of the lot
- `floors`: Total floors (levels) in the house
- `waterfront`: Whether the house has a view of a waterfront
- `view`: Has been viewed
- `condition`: How good the condition is overall
- `grade`: Overall grade given to the housing unit, based on the King County grading system
- `sqft_above`: Square footage of the house apart from the basement
- `sqft_basement`: Square footage of the basement
- `yr_built`: Year built
- `yr_renovated`: Year when the house was renovated
- `zipcode`: ZIP code
- `lat`: Latitude coordinate
- `long`: Longitude coordinate
- `sqft_living15`: Living-room area in 2015 (implies some renovations); this might or might not have affected the lot-size area
- `sqft_lot15`: Lot-size area in 2015 (implies some renovations)
63 |
64 | You will require the following libraries:
65 |
66 |
67 | ```python
68 | import pandas as pd
69 | import matplotlib.pyplot as plt
70 | import numpy as np
71 | import seaborn as sns
72 | from sklearn.pipeline import Pipeline
73 | from sklearn.preprocessing import StandardScaler,PolynomialFeatures
74 | %matplotlib inline
75 | ```
76 |
77 | # 1.0 Importing the Data
78 |
79 | Load the csv:
80 |
81 |
82 | ```python
83 | file_name='https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/coursera/project/kc_house_data_NaN.csv'
84 | df=pd.read_csv(file_name)
85 | ```
86 |
87 |
88 | We use the method `head()` to display the first 5 rows of the dataframe.
89 |
90 |
91 | ```python
92 | df.head(5)
93 | ```
94 |
95 |
96 |
97 |
98 |
99 |
| | Unnamed: 0 | id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | ... | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 7129300520 | 20141013T000000 | 221900.0 | 3.0 | 1.00 | 1180 | 5650 | 1.0 | 0 | ... | 7 | 1180 | 0 | 1955 | 0 | 98178 | 47.5112 | -122.257 | 1340 | 5650 |
| 1 | 1 | 6414100192 | 20141209T000000 | 538000.0 | 3.0 | 2.25 | 2570 | 7242 | 2.0 | 0 | ... | 7 | 2170 | 400 | 1951 | 1991 | 98125 | 47.7210 | -122.319 | 1690 | 7639 |
| 2 | 2 | 5631500400 | 20150225T000000 | 180000.0 | 2.0 | 1.00 | 770 | 10000 | 1.0 | 0 | ... | 6 | 770 | 0 | 1933 | 0 | 98028 | 47.7379 | -122.233 | 2720 | 8062 |
| 3 | 3 | 2487200875 | 20141209T000000 | 604000.0 | 4.0 | 3.00 | 1960 | 5000 | 1.0 | 0 | ... | 7 | 1050 | 910 | 1965 | 0 | 98136 | 47.5208 | -122.393 | 1360 | 5000 |
| 4 | 4 | 1954400510 | 20150218T000000 | 510000.0 | 3.0 | 2.00 | 1680 | 8080 | 1.0 | 0 | ... | 8 | 1680 | 0 | 1987 | 0 | 98074 | 47.6168 | -122.045 | 1800 | 7503 |

5 rows × 22 columns
263 |
264 |
265 |
266 |
267 | #### Question 1
268 | Display the data types of each column using the attribute `dtypes`, then take a screenshot and submit it, including your code in the image.
269 |
270 |
271 | ```python
272 | df.dtypes
273 | ```
274 |
275 |
276 |
277 |
278 | Unnamed: 0 int64
279 | id int64
280 | date object
281 | price float64
282 | bedrooms float64
283 | bathrooms float64
284 | sqft_living int64
285 | sqft_lot int64
286 | floors float64
287 | waterfront int64
288 | view int64
289 | condition int64
290 | grade int64
291 | sqft_above int64
292 | sqft_basement int64
293 | yr_built int64
294 | yr_renovated int64
295 | zipcode int64
296 | lat float64
297 | long float64
298 | sqft_living15 int64
299 | sqft_lot15 int64
300 | dtype: object
301 |
302 |
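The `date` column is stored as a generic `object`. It is not needed for the graded questions, but if it were, it could be parsed into a proper datetime; a minimal sketch using two sample values from the dataset (the format string is an assumption based on how the dates look):

```python
import pandas as pd

# Two date strings as they appear in the dataset
s = pd.Series(['20141013T000000', '20141209T000000'])

# Parse with an explicit format: year, month, day, 'T', hour, minute, second
parsed = pd.to_datetime(s, format='%Y%m%dT%H%M%S')
print(parsed.dt.year.tolist())  # [2014, 2014]
```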
303 |
304 | We use the method describe to obtain a statistical summary of the dataframe.
305 |
306 |
307 | ```python
308 | df.describe()
309 | ```
310 |
311 |
312 |
313 |
314 |
315 |
| | Unnamed: 0 | id | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | ... | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 21613.00000 | 2.161300e+04 | 2.161300e+04 | 21600.000000 | 21603.000000 | 21613.000000 | 2.161300e+04 | 21613.000000 | 21613.000000 | 21613.000000 | ... | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 |
| mean | 10806.00000 | 4.580302e+09 | 5.400881e+05 | 3.372870 | 2.115736 | 2079.899736 | 1.510697e+04 | 1.494309 | 0.007542 | 0.234303 | ... | 7.656873 | 1788.390691 | 291.509045 | 1971.005136 | 84.402258 | 98077.939805 | 47.560053 | -122.213896 | 1986.552492 | 12768.455652 |
| std | 6239.28002 | 2.876566e+09 | 3.671272e+05 | 0.926657 | 0.768996 | 918.440897 | 4.142051e+04 | 0.539989 | 0.086517 | 0.766318 | ... | 1.175459 | 828.090978 | 442.575043 | 29.373411 | 401.679240 | 53.505026 | 0.138564 | 0.140828 | 685.391304 | 27304.179631 |
| min | 0.00000 | 1.000102e+06 | 7.500000e+04 | 1.000000 | 0.500000 | 290.000000 | 5.200000e+02 | 1.000000 | 0.000000 | 0.000000 | ... | 1.000000 | 290.000000 | 0.000000 | 1900.000000 | 0.000000 | 98001.000000 | 47.155900 | -122.519000 | 399.000000 | 651.000000 |
| 25% | 5403.00000 | 2.123049e+09 | 3.219500e+05 | 3.000000 | 1.750000 | 1427.000000 | 5.040000e+03 | 1.000000 | 0.000000 | 0.000000 | ... | 7.000000 | 1190.000000 | 0.000000 | 1951.000000 | 0.000000 | 98033.000000 | 47.471000 | -122.328000 | 1490.000000 | 5100.000000 |
| 50% | 10806.00000 | 3.904930e+09 | 4.500000e+05 | 3.000000 | 2.250000 | 1910.000000 | 7.618000e+03 | 1.500000 | 0.000000 | 0.000000 | ... | 7.000000 | 1560.000000 | 0.000000 | 1975.000000 | 0.000000 | 98065.000000 | 47.571800 | -122.230000 | 1840.000000 | 7620.000000 |
| 75% | 16209.00000 | 7.308900e+09 | 6.450000e+05 | 4.000000 | 2.500000 | 2550.000000 | 1.068800e+04 | 2.000000 | 0.000000 | 0.000000 | ... | 8.000000 | 2210.000000 | 560.000000 | 1997.000000 | 0.000000 | 98118.000000 | 47.678000 | -122.125000 | 2360.000000 | 10083.000000 |
| max | 21612.00000 | 9.900000e+09 | 7.700000e+06 | 33.000000 | 8.000000 | 13540.000000 | 1.651359e+06 | 3.500000 | 1.000000 | 4.000000 | ... | 13.000000 | 9410.000000 | 4820.000000 | 2015.000000 | 2015.000000 | 98199.000000 | 47.777600 | -121.315000 | 6210.000000 | 871200.000000 |

8 rows × 21 columns
551 |
552 |
553 |
554 |
555 | # 2.0 Data Wrangling
556 |
557 | #### Question 2
558 | Drop the columns `"id"` and `"Unnamed: 0"` from axis 1 using the method `drop()`, then use the method `describe()` to obtain a statistical summary of the data. Take a screenshot and submit it; make sure the `inplace` parameter is set to `True`.
559 |
560 |
561 | ```python
562 | df.drop('id', axis=1, inplace=True)
563 | df.drop('Unnamed: 0', axis=1, inplace=True)
564 | df.describe()
565 | ```
566 |
567 |
568 |
569 |
570 |
571 |
| | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2.161300e+04 | 21600.000000 | 21603.000000 | 21613.000000 | 2.161300e+04 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 |
| mean | 5.400881e+05 | 3.372870 | 2.115736 | 2079.899736 | 1.510697e+04 | 1.494309 | 0.007542 | 0.234303 | 3.409430 | 7.656873 | 1788.390691 | 291.509045 | 1971.005136 | 84.402258 | 98077.939805 | 47.560053 | -122.213896 | 1986.552492 | 12768.455652 |
| std | 3.671272e+05 | 0.926657 | 0.768996 | 918.440897 | 4.142051e+04 | 0.539989 | 0.086517 | 0.766318 | 0.650743 | 1.175459 | 828.090978 | 442.575043 | 29.373411 | 401.679240 | 53.505026 | 0.138564 | 0.140828 | 685.391304 | 27304.179631 |
| min | 7.500000e+04 | 1.000000 | 0.500000 | 290.000000 | 5.200000e+02 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 290.000000 | 0.000000 | 1900.000000 | 0.000000 | 98001.000000 | 47.155900 | -122.519000 | 399.000000 | 651.000000 |
| 25% | 3.219500e+05 | 3.000000 | 1.750000 | 1427.000000 | 5.040000e+03 | 1.000000 | 0.000000 | 0.000000 | 3.000000 | 7.000000 | 1190.000000 | 0.000000 | 1951.000000 | 0.000000 | 98033.000000 | 47.471000 | -122.328000 | 1490.000000 | 5100.000000 |
| 50% | 4.500000e+05 | 3.000000 | 2.250000 | 1910.000000 | 7.618000e+03 | 1.500000 | 0.000000 | 0.000000 | 3.000000 | 7.000000 | 1560.000000 | 0.000000 | 1975.000000 | 0.000000 | 98065.000000 | 47.571800 | -122.230000 | 1840.000000 | 7620.000000 |
| 75% | 6.450000e+05 | 4.000000 | 2.500000 | 2550.000000 | 1.068800e+04 | 2.000000 | 0.000000 | 0.000000 | 4.000000 | 8.000000 | 2210.000000 | 560.000000 | 1997.000000 | 0.000000 | 98118.000000 | 47.678000 | -122.125000 | 2360.000000 | 10083.000000 |
| max | 7.700000e+06 | 33.000000 | 8.000000 | 13540.000000 | 1.651359e+06 | 3.500000 | 1.000000 | 4.000000 | 5.000000 | 13.000000 | 9410.000000 | 4820.000000 | 2015.000000 | 2015.000000 | 98199.000000 | 47.777600 | -121.315000 | 6210.000000 | 871200.000000 |
786 |
787 |
788 |
789 |
790 |
791 |
792 | We can see that we have missing values in the columns `bedrooms` and `bathrooms`.
793 |
794 |
795 | ```python
796 | print("number of NaN values for the column bedrooms :", df['bedrooms'].isnull().sum())
797 | print("number of NaN values for the column bathrooms :", df['bathrooms'].isnull().sum())
798 |
799 | ```
800 |
801 | number of NaN values for the column bedrooms : 13
802 | number of NaN values for the column bathrooms : 10
803 |
804 |
805 |
806 | We can replace the missing values of the column `'bedrooms'` with the mean of that column using the method `replace()`. Don't forget to set the `inplace` parameter to `True`.
807 |
808 |
809 | ```python
810 | mean=df['bedrooms'].mean()
811 | df['bedrooms'].replace(np.nan,mean, inplace=True)
812 | ```
813 |
814 |
815 | We also replace the missing values of the column `'bathrooms'` with the mean of that column using the method `replace()`. Don't forget to set the `inplace` parameter to `True`.
816 |
817 |
818 | ```python
819 | mean=df['bathrooms'].mean()
820 | df['bathrooms'].replace(np.nan,mean, inplace=True)
821 | ```
822 |
823 |
824 | ```python
825 | print("number of NaN values for the column bedrooms :", df['bedrooms'].isnull().sum())
826 | print("number of NaN values for the column bathrooms :", df['bathrooms'].isnull().sum())
827 | ```
828 |
829 | number of NaN values for the column bedrooms : 0
830 | number of NaN values for the column bathrooms : 0
831 |
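The same mean imputation can also be sketched with `fillna`; the series below is synthetic, just to illustrate the mechanics:

```python
import numpy as np
import pandas as pd

# Hypothetical bedroom counts with one missing value
s = pd.Series([3.0, np.nan, 4.0, 3.0])

# Fill the missing entry with the mean of the non-missing values
s = s.fillna(s.mean())
print(s.isnull().sum())  # 0
```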
832 |
833 | # 3.0 Exploratory data analysis
834 |
835 | #### Question 3
836 | Use the method `value_counts()` to count the number of houses with unique floor values, then use the method `.to_frame()` to convert the result to a dataframe.
837 |
838 |
839 |
840 | ```python
841 | df['floors'].value_counts().to_frame()
842 | ```
843 |
844 |
845 |
846 |
847 |
848 |
| | floors |
|---|---|
| 1.0 | 10680 |
| 2.0 | 8241 |
| 1.5 | 1910 |
| 3.0 | 613 |
| 2.5 | 161 |
| 3.5 | 8 |
893 |
894 |
895 |
896 |
897 |
898 |
899 | ### Question 4
900 | Use the function `boxplot` in the seaborn library to determine whether houses with or without a waterfront view have more price outliers.
901 |
902 |
903 | ```python
904 | sns.boxplot(x='waterfront', y='price', data=df)
905 | ```
906 |
907 |
908 |
909 |
910 |
911 |
912 |
913 |
914 |
915 | 
916 |
917 |
918 | ### Question 5
919 | Use the function `regplot` in the seaborn library to determine whether the feature `sqft_above` is negatively or positively correlated with price.
920 |
921 |
922 | ```python
923 | sns.regplot(x='sqft_above', y='price', data=df)
924 | ```
925 |
926 |
927 |
928 |
929 |
930 |
931 |
932 |
933 |
934 | 
935 |
936 |
937 |
938 | We can use the Pandas method `corr()` to find the feature other than price that is most correlated with price.
939 |
940 |
941 | ```python
942 | df.corr()['price'].sort_values()
943 | ```
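Note that newer pandas releases raise a `TypeError` when `corr()` meets a non-numeric column such as `date`, so it can be safer to restrict to numeric columns first. A minimal sketch on a synthetic frame (the values are made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    'date': ['20141013T000000', '20141209T000000', '20150225T000000'],
    'price': [221900.0, 538000.0, 180000.0],
    'sqft_living': [1180, 2570, 770],
})

# Keep only numeric columns before computing correlations
corr = toy.select_dtypes(include='number').corr()['price'].sort_values()
print(corr)
```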
944 |
945 | # Module 4: Model Development
946 |
947 | Import libraries
948 |
949 |
950 | ```python
951 | import matplotlib.pyplot as plt
952 | from sklearn.linear_model import LinearRegression
953 |
954 | ```
955 |
956 |
957 | We can fit a linear regression model using the longitude feature `'long'` and calculate the R^2.
958 |
959 |
960 | ```python
961 | X = df[['long']]
962 | Y = df['price']
963 | lm = LinearRegression()
965 | lm.fit(X,Y)
966 | lm.score(X, Y)
967 | ```
968 |
969 | ### Question 6
970 | Fit a linear regression model to predict the 'price'
using the feature 'sqft_living' then calculate the R^2. Take a screenshot of your code and the value of the R^2.
971 |
972 |
973 | ```python
974 | U = df[['sqft_living']]
975 | V = df['price']
976 | lm.fit(U,V)
977 | lm.score(U,V)
978 | ```
979 |
980 |
981 |
982 |
983 | 0.49285321790379316
984 |
985 |
986 |
987 | ### Question 7
988 | Fit a linear regression model to predict the 'price' using the list of features:
989 |
990 |
991 | ```python
992 | features =["floors", "waterfront","lat" ,"bedrooms" ,"sqft_basement" ,"view" ,"bathrooms","sqft_living15","sqft_above","grade","sqft_living"]
993 | X = df[features]
994 | Y = df['price']
995 | lm.fit(X,Y)
996 | ```
997 |
998 |
999 |
1000 |
1001 | LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
1002 | normalize=False)
1003 |
1004 |
1005 |
1006 | Then calculate the R^2. Take a screenshot of your code.
1007 |
1008 |
1009 | ```python
1010 | lm.score(X,Y)
1011 | ```
1012 |
1013 |
1014 |
1015 |
1016 | 0.6576951666037504
1017 |
1018 |
1019 |
1020 | #### this will help with Question 8
1021 |
Create a list of tuples; the first element in each tuple contains the name of the estimator:

- `'scale'`
- `'polynomial'`
- `'model'`

The second element in each tuple contains the model constructor:

- `StandardScaler()`
- `PolynomialFeatures(include_bias=False)`
- `LinearRegression()`
1037 |
1038 |
1039 |
1040 | ```python
1041 | Input=[('scale',StandardScaler()),('polynomial', PolynomialFeatures(include_bias=False)),('model',LinearRegression())]
1042 | ```
1043 |
1044 | ### Question 8
1045 | Use the list to create a pipeline object, fit it using the features in the list `features` to predict `'price'`, then calculate the R^2.
1046 |
1047 |
1048 | ```python
1049 | pipe=Pipeline(Input)
1050 | pipe
1051 | ```
1052 |
1053 |
1054 |
1055 |
1056 | Pipeline(memory=None,
1057 | steps=[('scale', StandardScaler(copy=True, with_mean=True, with_std=True)), ('polynomial', PolynomialFeatures(degree=2, include_bias=False, interaction_only=False)), ('model', LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
1058 | normalize=False))])
1059 |
1060 |
1061 |
1062 |
1063 | ```python
1064 | pipe.fit(X,Y)
1065 | ```
1066 |
1067 | /opt/conda/envs/Python36/lib/python3.6/site-packages/sklearn/preprocessing/data.py:645: DataConversionWarning: Data with input dtype int64, float64 were all converted to float64 by StandardScaler.
1068 | return self.partial_fit(X, y)
1069 | /opt/conda/envs/Python36/lib/python3.6/site-packages/sklearn/base.py:467: DataConversionWarning: Data with input dtype int64, float64 were all converted to float64 by StandardScaler.
1070 | return self.fit(X, y, **fit_params).transform(X)
1071 |
1072 |
1073 |
1074 |
1075 |
1076 | Pipeline(memory=None,
1077 | steps=[('scale', StandardScaler(copy=True, with_mean=True, with_std=True)), ('polynomial', PolynomialFeatures(degree=2, include_bias=False, interaction_only=False)), ('model', LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
1078 | normalize=False))])
1079 |
1080 |
1081 |
1082 |
1083 | ```python
1084 | pipe.score(X,Y)
1085 | ```
1086 |
1087 | /opt/conda/envs/Python36/lib/python3.6/site-packages/sklearn/pipeline.py:511: DataConversionWarning: Data with input dtype int64, float64 were all converted to float64 by StandardScaler.
1088 | Xt = transform.transform(Xt)
1089 |
1090 |
1091 |
1092 |
1093 |
1094 | 0.7513427797293394
1095 |
1096 |
1097 |
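The jump in R^2 over the plain linear model comes from the extra polynomial terms. With 11 input features, `PolynomialFeatures(degree=2, include_bias=False)` produces 77 features: the 11 originals plus the 66 degree-2 products. A quick check:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.zeros((1, 11))  # one dummy row with 11 features, as in the list above
pf = PolynomialFeatures(degree=2, include_bias=False)
n_out = pf.fit_transform(X).shape[1]
print(n_out)  # 77 = 11 linear terms + 66 degree-2 terms
```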
1098 | # Module 5: Model Evaluation and Refinement
1099 |
1100 | Import the necessary modules:
1101 |
1102 |
1103 | ```python
1104 | from sklearn.model_selection import cross_val_score
1105 | from sklearn.model_selection import train_test_split
1106 | print("done")
1107 | ```
1108 |
1109 | done
1110 |
1111 |
1112 | We will split the data into training and testing sets.
1113 |
1114 |
1115 | ```python
1116 | features =["floors", "waterfront","lat" ,"bedrooms" ,"sqft_basement" ,"view" ,"bathrooms","sqft_living15","sqft_above","grade","sqft_living"]
1117 | X = df[features ]
1118 | Y = df['price']
1119 |
1120 | x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.15, random_state=1)
1121 |
1122 |
1123 | print("number of test samples :", x_test.shape[0])
1124 | print("number of training samples:",x_train.shape[0])
1125 | ```
1126 |
1127 | number of test samples : 3242
1128 | number of training samples: 18371
1129 |
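The split sizes follow from `test_size=0.15`: scikit-learn takes the ceiling of 0.15 × 21,613 rows for the test set and leaves the rest for training. A quick sanity check:

```python
import math

n = 21613                     # rows in the dataset
n_test = math.ceil(0.15 * n)  # 3242
n_train = n - n_test          # 18371
print(n_test, n_train)
```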
1130 |
1131 | ### Question 9
1132 | Create and fit a Ridge regression object using the training data, setting the regularization parameter to 0.1, then calculate the R^2 using the test data.
1133 |
1134 |
1135 |
1136 | ```python
1137 | from sklearn.linear_model import Ridge
1138 | ```
1139 |
1140 |
1141 | ```python
1142 | RidgeModel = Ridge(alpha=0.1)
1143 | RidgeModel.fit(x_train, y_train)
1144 | RidgeModel.score(x_test, y_test)
1145 | ```
1146 |
1147 |
1148 |
1149 |
1150 | 0.6478759163939111
1151 |
1152 |
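The effect of `alpha` can be sketched on synthetic data (everything below is made up for illustration): larger values shrink the learned coefficients toward zero, trading a little bias for lower variance.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([10.0, -5.0, 2.0]) + rng.normal(scale=0.1, size=100)

# The same model with light and heavy regularization
light = Ridge(alpha=0.1).fit(X, y)
heavy = Ridge(alpha=1000.0).fit(X, y)
print(np.abs(light.coef_).sum(), np.abs(heavy.coef_).sum())
```

Heavy regularization yields uniformly smaller coefficient magnitudes.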
1153 |
1154 | ### Question 10
1155 | Perform a second-order polynomial transform on both the training data and the testing data. Create and fit a Ridge regression object using the training data, setting the regularization parameter to 0.1. Calculate the R^2 using the test data. Take a screenshot of your code and the R^2.
1156 |
1157 |
1158 | ```python
1159 | pr = PolynomialFeatures(degree=2)
1160 | x_train_pr = pr.fit_transform(x_train)
1161 | x_test_pr = pr.transform(x_test)  # use transform, not fit_transform, on the test set
1162 |
1163 | RidgeModel = Ridge(alpha=0.1)
1164 | RidgeModel.fit(x_train_pr, y_train)
1165 | RidgeModel.score(x_test_pr, y_test)
1166 | ```
1167 |
1168 |
1169 |
1170 |
1171 | 0.7002744268659787
1172 |
1173 |
1174 |
1175 | About this Project:
1176 |
1177 | This project is part of a graded exercise in the "Data Analysis with Python" course on Coursera, offered by IBM.
1178 |
1179 |
1180 |
1181 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # House-sale-price-prediction-using-python
2 | This project analyzes and predicts housing sale prices based on features such as square footage, number of bedrooms, views, location, etc. It uses the dataset of house sale prices for King County, USA, covering homes sold between May 2014 and May 2015.
3 |
4 | It uses Python code to clean the data, analyse it, create models for price prediction, and evaluate and refine those models. Major activities covered include:
5 | - Numerical representation of data using correlation, linear and polynomial regression, R-squared values, etc.
6 | - Graphical representation of data using boxplots and seaborn's regplot.
7 | - Model refinement using a ridge regression object.
8 | - Polynomial transforms of training and test data, etc.
9 |
--------------------------------------------------------------------------------
/output_29_1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/calistus-igwilo/House-sale-price-prediction-using-python/23d5b4735e772ff99a759b5548f76355ed63346d/output_29_1.png
--------------------------------------------------------------------------------
/output_31_1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/calistus-igwilo/House-sale-price-prediction-using-python/23d5b4735e772ff99a759b5548f76355ed63346d/output_31_1.png
--------------------------------------------------------------------------------