"
737 | ]
738 | },
739 | "metadata": {
740 | "needs_background": "light"
741 | },
742 | "output_type": "display_data"
743 | }
744 | ],
745 | "source": [
746 | "df.hist(column='estprojectcost', bins=50, range = [0,1.000000e+04], color='gray')"
747 | ]
748 | },
749 | {
750 | "cell_type": "markdown",
751 | "metadata": {},
752 | "source": [
753 | "#### We can see that we have a lot of 0 values for Estimated Project Costs. Maybe 0 was entered where data was missing or where estimated project cost was unknown."
754 | ]
755 | },
756 | {
757 | "cell_type": "markdown",
758 | "metadata": {},
759 | "source": [
760 | "### 2. Issue Date Month"
761 | ]
762 | },
763 | {
764 | "cell_type": "markdown",
765 | "metadata": {},
766 | "source": [
767 | "#### Moving on to the next feature, we now create a histogram of the Issued Date Month feature"
768 | ]
769 | },
770 | {
771 | "cell_type": "code",
772 | "execution_count": 18,
773 | "metadata": {},
774 | "outputs": [
775 | {
776 | "data": {
777 | "text/plain": [
778 | ""
779 | ]
780 | },
781 | "execution_count": 18,
782 | "metadata": {},
783 | "output_type": "execute_result"
784 | },
785 | {
786 | "data": {
787 | "image/png": "\n",
788 | "text/plain": [
789 | "
"
790 | ]
791 | },
792 | "metadata": {
793 | "needs_background": "light"
794 | },
795 | "output_type": "display_data"
796 | }
797 | ],
798 | "source": [
799 | "sns.countplot(x=df['issueddate_mth'], color='gray')"
800 | ]
801 | },
802 | {
803 | "cell_type": "markdown",
804 | "metadata": {},
805 | "source": [
806 | "#### From the above count plot we see that few permits were issued between November and February, and the most permits were issued in June. Another way to look at the permits issued per month, in descending order, is:"
807 | ]
808 | },
809 | {
810 | "cell_type": "code",
811 | "execution_count": 19,
812 | "metadata": {},
813 | "outputs": [
814 | {
815 | "data": {
816 | "text/plain": [
817 | ""
818 | ]
819 | },
820 | "execution_count": 19,
821 | "metadata": {},
822 | "output_type": "execute_result"
823 | },
824 | {
825 | "data": {
826 | "image/png": "\n",
827 | "text/plain": [
828 | "
"
829 | ]
830 | },
831 | "metadata": {
832 | "needs_background": "light"
833 | },
834 | "output_type": "display_data"
835 | }
836 | ],
837 | "source": [
838 | "df['issueddate_mth'].value_counts().plot(kind='bar', color='gray')"
839 | ]
840 | },
841 | {
842 | "cell_type": "markdown",
843 | "metadata": {},
844 | "source": [
845 | "#### From the above plot we can see that the highest number of permits were issued in June, followed by May, August, April, July, March, September, and October, while the fewest permits were issued in December, followed by November, February, and January."
846 | ]
847 | },
848 | {
849 | "cell_type": "markdown",
850 | "metadata": {},
851 | "source": [
852 | "## Relationship between Permit Issue Year and Estimated Project Cost"
853 | ]
854 | },
855 | {
856 | "cell_type": "markdown",
857 | "metadata": {},
858 | "source": [
859 | "### We want to understand the relationship between Permit Issue Year and Estimated Project Cost, but only for \"New\" construction of type \"V B\" with fewer than 3 stories. Thus we will filter the dataset based on these values."
860 | ]
861 | },
862 | {
863 | "cell_type": "code",
864 | "execution_count": 20,
865 | "metadata": {},
866 | "outputs": [
867 | {
868 | "data": {
869 | "text/html": [
870 | "<div>\n",
871 | "<table>\n",
872 | "  <thead>\n",
873 | "    <tr><th></th><th>X</th><th>Y</th><th>OBJECTID</th><th>permittypemapped</th><th>permitnum</th><th>workclass</th><th>permitclass</th><th>proposedworkdescription</th><th>permitclassmapped</th><th>applieddate</th><th>...</th><th>totalsqft</th><th>voiddate</th><th>workclassmapped</th><th>GlobalID</th><th>CreationDate</th><th>Creator</th><th>EditDate</th><th>Editor</th><th>const_type</th><th>occupancyclass</th></tr>\n",
874 | "  </thead>\n",
875 | "  <tbody>\n",
876 | "    <tr><th>1</th><td>-78.534184</td><td>35.729309</td><td>48521</td><td>Building</td><td>147288</td><td>New Building</td><td>101.0</td><td>SFD</td><td>Residential</td><td>2018-03-02T15:16:37.000Z</td><td>...</td><td>1684.0</td><td>NaN</td><td>New</td><td>f114dc19-3b62-459b-bd6c-9084162403c8</td><td>2018-03-16T01:55:55.663Z</td><td>justin.greco@raleighnc.gov_ral</td><td>2018-06-12T22:02:31.949Z</td><td>OpenData_ral</td><td>V B</td><td>RESIDENT 3 SFD/DUP</td></tr>\n",
877 | "    <tr><th>2</th><td>-78.534323</td><td>35.728595</td><td>48522</td><td>Building</td><td>147287</td><td>New Building</td><td>101.0</td><td>SFD</td><td>Residential</td><td>2018-03-02T15:08:25.000Z</td><td>...</td><td>2378.0</td><td>NaN</td><td>New</td><td>d4b182cb-af25-4c3f-92a9-a59d4b82ada3</td><td>2018-03-16T01:55:55.663Z</td><td>justin.greco@raleighnc.gov_ral</td><td>2018-06-13T22:02:40.102Z</td><td>OpenData_ral</td><td>V B</td><td>RESIDENT 3 SFD/DUP</td></tr>\n",
878 | "    <tr><th>3</th><td>-78.531789</td><td>35.729794</td><td>48523</td><td>Building</td><td>147286</td><td>New Building</td><td>101.0</td><td>NEW SFD</td><td>Residential</td><td>2018-03-02T15:00:47.000Z</td><td>...</td><td>1392.0</td><td>NaN</td><td>New</td><td>ecc76e8c-48d3-4529-a7ae-f1d616592c08</td><td>2018-03-16T01:55:55.663Z</td><td>justin.greco@raleighnc.gov_ral</td><td>2018-06-27T22:02:34.320Z</td><td>OpenData_ral</td><td>V B</td><td>RESIDENT 3 SFD/DUP</td></tr>\n",
879 | "    <tr><th>4</th><td>-78.533914</td><td>35.729473</td><td>48524</td><td>Building</td><td>147284</td><td>New Building</td><td>101.0</td><td>NEW SFD</td><td>Residential</td><td>2018-03-02T14:32:33.000Z</td><td>...</td><td>1392.0</td><td>NaN</td><td>New</td><td>a1074b43-bc40-4efa-bc7f-a167c351c327</td><td>2018-03-16T01:55:55.663Z</td><td>justin.greco@raleighnc.gov_ral</td><td>2018-06-12T22:02:31.949Z</td><td>OpenData_ral</td><td>V B</td><td>RESIDENT 3 SFD/DUP</td></tr>\n",
880 | "    <tr><th>9</th><td>-78.594252</td><td>35.909492</td><td>48529</td><td>Building</td><td>147194</td><td>New Building</td><td>318.0</td><td>PAVILLION NEAR POOL, GRILLS, SEATING</td><td>Residential</td><td>2018-02-27T21:31:37.000Z</td><td>...</td><td>504.0</td><td>NaN</td><td>New</td><td>39eb46ff-e7d2-4c23-95f8-71c4f67b3924</td><td>2018-03-16T01:55:55.663Z</td><td>justin.greco@raleighnc.gov_ral</td><td>2018-04-23T19:46:09.547Z</td><td>OpenData_ral</td><td>V B</td><td>ASSEMBLY 3</td></tr>\n",
881 | "  </tbody>\n",
882 | "</table>\n",
883 | "<p>5 rows × 87 columns</p>\n",
884 | "</div>"
1036 | ],
1037 | "text/plain": [
1038 | " X Y OBJECTID permittypemapped permitnum workclass \\\n",
1039 | "1 -78.534184 35.729309 48521 Building 147288 New Building \n",
1040 | "2 -78.534323 35.728595 48522 Building 147287 New Building \n",
1041 | "3 -78.531789 35.729794 48523 Building 147286 New Building \n",
1042 | "4 -78.533914 35.729473 48524 Building 147284 New Building \n",
1043 | "9 -78.594252 35.909492 48529 Building 147194 New Building \n",
1044 | "\n",
1045 | " permitclass proposedworkdescription permitclassmapped \\\n",
1046 | "1 101.0 SFD Residential \n",
1047 | "2 101.0 SFD Residential \n",
1048 | "3 101.0 NEW SFD Residential \n",
1049 | "4 101.0 NEW SFD Residential \n",
1050 | "9 318.0 PAVILLION NEAR POOL, GRILLS, SEATING Residential \n",
1051 | "\n",
1052 | " applieddate ... totalsqft voiddate \\\n",
1053 | "1 2018-03-02T15:16:37.000Z ... 1684.0 NaN \n",
1054 | "2 2018-03-02T15:08:25.000Z ... 2378.0 NaN \n",
1055 | "3 2018-03-02T15:00:47.000Z ... 1392.0 NaN \n",
1056 | "4 2018-03-02T14:32:33.000Z ... 1392.0 NaN \n",
1057 | "9 2018-02-27T21:31:37.000Z ... 504.0 NaN \n",
1058 | "\n",
1059 | " workclassmapped GlobalID \\\n",
1060 | "1 New f114dc19-3b62-459b-bd6c-9084162403c8 \n",
1061 | "2 New d4b182cb-af25-4c3f-92a9-a59d4b82ada3 \n",
1062 | "3 New ecc76e8c-48d3-4529-a7ae-f1d616592c08 \n",
1063 | "4 New a1074b43-bc40-4efa-bc7f-a167c351c327 \n",
1064 | "9 New 39eb46ff-e7d2-4c23-95f8-71c4f67b3924 \n",
1065 | "\n",
1066 | " CreationDate Creator \\\n",
1067 | "1 2018-03-16T01:55:55.663Z justin.greco@raleighnc.gov_ral \n",
1068 | "2 2018-03-16T01:55:55.663Z justin.greco@raleighnc.gov_ral \n",
1069 | "3 2018-03-16T01:55:55.663Z justin.greco@raleighnc.gov_ral \n",
1070 | "4 2018-03-16T01:55:55.663Z justin.greco@raleighnc.gov_ral \n",
1071 | "9 2018-03-16T01:55:55.663Z justin.greco@raleighnc.gov_ral \n",
1072 | "\n",
1073 | " EditDate Editor const_type occupancyclass \n",
1074 | "1 2018-06-12T22:02:31.949Z OpenData_ral V B RESIDENT 3 SFD/DUP \n",
1075 | "2 2018-06-13T22:02:40.102Z OpenData_ral V B RESIDENT 3 SFD/DUP \n",
1076 | "3 2018-06-27T22:02:34.320Z OpenData_ral V B RESIDENT 3 SFD/DUP \n",
1077 | "4 2018-06-12T22:02:31.949Z OpenData_ral V B RESIDENT 3 SFD/DUP \n",
1078 | "9 2018-04-23T19:46:09.547Z OpenData_ral V B ASSEMBLY 3 \n",
1079 | "\n",
1080 | "[5 rows x 87 columns]"
1081 | ]
1082 | },
1083 | "execution_count": 20,
1084 | "metadata": {},
1085 | "output_type": "execute_result"
1086 | }
1087 | ],
1088 | "source": [
1089 | "df_filtered = df[(df.numberstories < 3) & (df.workclassmapped == \"New\") & (df.const_type == \"V B\")]\n",
1090 | "df_filtered.head()"
1091 | ]
1092 | },
1093 | {
1094 | "cell_type": "code",
1095 | "execution_count": 21,
1096 | "metadata": {},
1097 | "outputs": [
1098 | {
1099 | "data": {
1100 | "text/plain": [
1101 | "(31044, 87)"
1102 | ]
1103 | },
1104 | "execution_count": 21,
1105 | "metadata": {},
1106 | "output_type": "execute_result"
1107 | }
1108 | ],
1109 | "source": [
1110 | "df_filtered.shape"
1111 | ]
1112 | },
1113 | {
1114 | "cell_type": "markdown",
1115 | "metadata": {},
1116 | "source": [
1117 | "### Now for this newly filtered dataset, we are interested in the relationship between the Issued Date Year and the Estimated Project Cost Columns"
1118 | ]
1119 | },
1120 | {
1121 | "cell_type": "markdown",
1122 | "metadata": {},
1123 | "source": [
1124 | "#### Let's begin by describing our columns of interest"
1125 | ]
1126 | },
1127 | {
1128 | "cell_type": "code",
1129 | "execution_count": 22,
1130 | "metadata": {},
1131 | "outputs": [
1132 | {
1133 | "data": {
1134 | "text/plain": [
1135 | "count 30851.000000\n",
1136 | "mean 2008.365401\n",
1137 | "std 4.604880\n",
1138 | "min 2002.000000\n",
1139 | "25% 2005.000000\n",
1140 | "50% 2007.000000\n",
1141 | "75% 2012.000000\n",
1142 | "max 2018.000000\n",
1143 | "Name: issueddate_yr, dtype: float64"
1144 | ]
1145 | },
1146 | "execution_count": 22,
1147 | "metadata": {},
1148 | "output_type": "execute_result"
1149 | }
1150 | ],
1151 | "source": [
1152 | "df_filtered.issueddate_yr.describe()"
1153 | ]
1154 | },
1155 | {
1156 | "cell_type": "code",
1157 | "execution_count": 23,
1158 | "metadata": {},
1159 | "outputs": [
1160 | {
1161 | "data": {
1162 | "text/plain": [
1163 | "count 3.104400e+04\n",
1164 | "mean 2.120274e+05\n",
1165 | "std 8.762292e+05\n",
1166 | "min 0.000000e+00\n",
1167 | "25% 9.834500e+04\n",
1168 | "50% 1.500000e+05\n",
1169 | "75% 2.562760e+05\n",
1170 | "max 1.000000e+08\n",
1171 | "Name: estprojectcost, dtype: float64"
1172 | ]
1173 | },
1174 | "execution_count": 23,
1175 | "metadata": {},
1176 | "output_type": "execute_result"
1177 | }
1178 | ],
1179 | "source": [
1180 | "df_filtered.estprojectcost.describe()"
1181 | ]
1182 | },
1183 | {
1184 | "cell_type": "markdown",
1185 | "metadata": {},
1186 | "source": [
1187 | "#### Next we will check for null values in these columns"
1188 | ]
1189 | },
1190 | {
1191 | "cell_type": "code",
1192 | "execution_count": 24,
1193 | "metadata": {},
1194 | "outputs": [
1195 | {
1196 | "data": {
1197 | "text/plain": [
1198 | "(193, 87)"
1199 | ]
1200 | },
1201 | "execution_count": 24,
1202 | "metadata": {},
1203 | "output_type": "execute_result"
1204 | }
1205 | ],
1206 | "source": [
1207 | "df_filtered_yr_na = df_filtered[df_filtered.issueddate_yr.isna()]\n",
1208 | "df_filtered_yr_na.shape"
1209 | ]
1210 | },
1211 | {
1212 | "cell_type": "code",
1213 | "execution_count": 25,
1214 | "metadata": {},
1215 | "outputs": [
1216 | {
1217 | "data": {
1218 | "text/plain": [
1219 | "(0, 87)"
1220 | ]
1221 | },
1222 | "execution_count": 25,
1223 | "metadata": {},
1224 | "output_type": "execute_result"
1225 | }
1226 | ],
1227 | "source": [
1228 | "df_filtered_cost_na = df_filtered[df_filtered.estprojectcost.isna()]\n",
1229 | "df_filtered_cost_na.shape"
1230 | ]
1231 | },
1232 | {
1233 | "cell_type": "markdown",
1234 | "metadata": {},
1235 | "source": [
1236 | "#### We can see that while the issued date year column has 193 null values, the estimated project cost column has no null values. However, let's circle back to the histograms we created for the estimated project cost feature. There were many 0 values for the estimated project cost column in the original dataset. Let's look at the 0 values in the filtered dataset."
1237 | ]
1238 | },
1239 | {
1240 | "cell_type": "code",
1241 | "execution_count": 26,
1242 | "metadata": {},
1243 | "outputs": [
1244 | {
1245 | "data": {
1246 | "text/plain": [
1247 | "(18, 87)"
1248 | ]
1249 | },
1250 | "execution_count": 26,
1251 | "metadata": {},
1252 | "output_type": "execute_result"
1253 | }
1254 | ],
1255 | "source": [
1256 | "df_filtered_cost_0 = df_filtered[(df_filtered.estprojectcost == 0)]\n",
1257 | "df_filtered_cost_0.shape"
1258 | ]
1259 | },
1260 | {
1261 | "cell_type": "markdown",
1262 | "metadata": {},
1263 | "source": [
1264 | "#### In our filtered dataset the estimated project cost column has 18 rows with a value of 0. Let's check the issued date year column for these rows."
1265 | ]
1266 | },
1267 | {
1268 | "cell_type": "code",
1269 | "execution_count": 27,
1270 | "metadata": {},
1271 | "outputs": [
1272 | {
1273 | "data": {
1274 | "text/plain": [
1275 | "1223 NaN\n",
1276 | "2800 NaN\n",
1277 | "6810 NaN\n",
1278 | "132323 NaN\n",
1279 | "133517 NaN\n",
1280 | "133731 NaN\n",
1281 | "136022 NaN\n",
1282 | "138124 NaN\n",
1283 | "138185 NaN\n",
1284 | "138222 NaN\n",
1285 | "138272 NaN\n",
1286 | "138383 NaN\n",
1287 | "138498 NaN\n",
1288 | "138690 NaN\n",
1289 | "138694 NaN\n",
1290 | "141292 NaN\n",
1291 | "141522 2018.0\n",
1292 | "141922 NaN\n",
1293 | "Name: issueddate_yr, dtype: float64"
1294 | ]
1295 | },
1296 | "execution_count": 27,
1297 | "metadata": {},
1298 | "output_type": "execute_result"
1299 | }
1300 | ],
1301 | "source": [
1302 | "df_filtered_cost_0.issueddate_yr"
1303 | ]
1304 | },
1305 | {
1306 | "cell_type": "markdown",
1307 | "metadata": {},
1308 | "source": [
1309 | "#### Only 1 of the 18 rows has a non-null value for the issued date year column. Since we want to understand the relationship between the Issued Date Year and Estimated Project Cost features, and these rows have an estimated cost of 0 (and all but one also have a null year), I am going to drop them since they will not be useful in establishing a relationship."
1310 | ]
1311 | },
1312 | {
1313 | "cell_type": "code",
1314 | "execution_count": 28,
1315 | "metadata": {},
1316 | "outputs": [
1317 | {
1318 | "data": {
1319 | "text/plain": [
1320 | "(31026, 87)"
1321 | ]
1322 | },
1323 | "execution_count": 28,
1324 | "metadata": {},
1325 | "output_type": "execute_result"
1326 | }
1327 | ],
1328 | "source": [
1329 | "df_filtered = df_filtered[(df_filtered.estprojectcost != 0)]\n",
1330 | "df_filtered.shape"
1331 | ]
1332 | },
1333 | {
1334 | "cell_type": "markdown",
1335 | "metadata": {},
1336 | "source": [
1337 | "#### Next we decide how to impute missing values for the Issued Date Year column. As observed earlier, 193 rows had null values for this column. Of those, we have already dropped 17, leaving us with 176 rows with null values for this column. Generally, missing values are imputed using either the mean or median value of a column. Considering that year is a categorical value, I did not think it would be a good idea to do so. My next thought was to check the issued date column, do some feature engineering, and impute the null values for the issued year from there."
1338 | ]
1339 | },
1340 | {
1341 | "cell_type": "code",
1342 | "execution_count": 29,
1343 | "metadata": {},
1344 | "outputs": [
1345 | {
1346 | "data": {
1347 | "text/plain": [
1348 | "array([nan], dtype=object)"
1349 | ]
1350 | },
1351 | "execution_count": 29,
1352 | "metadata": {},
1353 | "output_type": "execute_result"
1354 | }
1355 | ],
1356 | "source": [
1357 | "df_filtered_yr_na.issueddate.unique()"
1358 | ]
1359 | },
1360 | {
1361 | "cell_type": "markdown",
1362 | "metadata": {},
1363 | "source": [
1364 | "#### Unfortunately, the issueddate column values are also null for those rows. So I decided to go ahead and drop these rows."
1365 | ]
1366 | },
1367 | {
1368 | "cell_type": "code",
1369 | "execution_count": 30,
1370 | "metadata": {},
1371 | "outputs": [
1372 | {
1373 | "data": {
1374 | "text/plain": [
1375 | "(30850, 87)"
1376 | ]
1377 | },
1378 | "execution_count": 30,
1379 | "metadata": {},
1380 | "output_type": "execute_result"
1381 | }
1382 | ],
1383 | "source": [
1384 | "df_no_null = df_filtered[df_filtered.issueddate_yr.notna()]\n",
1385 | "df_no_null.shape"
1386 | ]
1387 | },
1388 | {
1389 | "cell_type": "markdown",
1390 | "metadata": {},
1391 | "source": [
1392 | "#### I begin analysing the relationship between the two variables with a paired regression plot"
1393 | ]
1394 | },
1395 | {
1396 | "cell_type": "code",
1397 | "execution_count": 31,
1398 | "metadata": {},
1399 | "outputs": [
1400 | {
1401 | "data": {
1402 | "text/plain": [
1403 | ""
1404 | ]
1405 | },
1406 | "execution_count": 31,
1407 | "metadata": {},
1408 | "output_type": "execute_result"
1409 | },
1410 | {
1411 | "data": {
1412 | "image/png": "\n",
1413 | "text/plain": [
1414 | ""
1415 | ]
1416 | },
1417 | "metadata": {
1418 | "needs_background": "light"
1419 | },
1420 | "output_type": "display_data"
1421 | }
1422 | ],
1423 | "source": [
1424 | "sns.pairplot(df_no_null, x_vars=['issueddate_yr'], y_vars='estprojectcost', height=7, aspect=0.7, kind='reg')"
1425 | ]
1426 | },
1427 | {
1428 | "cell_type": "markdown",
1429 | "metadata": {},
1430 | "source": [
1431 | "#### Out of curiosity, I googled the construction type 'V B' and learned that it was for single family homes with wooden frames. Finding this interesting, I then limited the Estimated Project Cost variable to less than 4 million to get a more detailed idea of its relationship with the Issued Date Year variable."
1432 | ]
1433 | },
1434 | {
1435 | "cell_type": "code",
1436 | "execution_count": 32,
1437 | "metadata": {},
1438 | "outputs": [],
1439 | "source": [
1440 | "df_limited = df_filtered[df_filtered.estprojectcost < 4000000]"
1441 | ]
1442 | },
1443 | {
1444 | "cell_type": "code",
1445 | "execution_count": 33,
1446 | "metadata": {},
1447 | "outputs": [
1448 | {
1449 | "data": {
1450 | "text/plain": [
1451 | ""
1452 | ]
1453 | },
1454 | "execution_count": 33,
1455 | "metadata": {},
1456 | "output_type": "execute_result"
1457 | },
1458 | {
1459 | "data": {
1460 | "image/png": "\n",
1461 | "text/plain": [
1462 | ""
1463 | ]
1464 | },
1465 | "metadata": {
1466 | "needs_background": "light"
1467 | },
1468 | "output_type": "display_data"
1469 | }
1470 | ],
1471 | "source": [
1472 | "sns.pairplot(df_limited, x_vars=['issueddate_yr'], y_vars='estprojectcost', height=7, aspect=0.7, kind='reg')"
1473 | ]
1474 | },
1475 | {
1476 | "cell_type": "markdown",
1477 | "metadata": {},
1478 | "source": [
1479 | "#### From the above plot, I came to the following conclusions:\n",
1480 | "#### 1. Most of the single family homes have an estimated project cost of up to 500K, and far fewer have estimated costs above 1M.\n",
1481 | "#### 2. Permits for houses with an estimated cost above 1M were issued more regularly from 2006 onwards.\n",
1482 | "#### 3. The most permits for projects estimated above 1M were granted in 2008. I found this interesting because of the recession of 2008. Did people have that kind of money in 2008? I decided to look at the estimated project costs for the year 2008 in more detail."
1483 | ]
1484 | },
1485 | {
1486 | "cell_type": "code",
1487 | "execution_count": 34,
1488 | "metadata": {},
1489 | "outputs": [],
1490 | "source": [
1491 | "df_2008 = df_filtered[df_filtered.issueddate_yr == 2008]"
1492 | ]
1493 | },
1494 | {
1495 | "cell_type": "code",
1496 | "execution_count": 35,
1497 | "metadata": {},
1498 | "outputs": [
1499 | {
1500 | "data": {
1501 | "text/plain": [
1502 | ""
1503 | ]
1504 | },
1505 | "execution_count": 35,
1506 | "metadata": {},
1507 | "output_type": "execute_result"
1508 | },
1509 | {
1510 | "data": {
1511 | "image/png": "\n",
1512 | "text/plain": [
1513 | ""
1514 | ]
1515 | },
1516 | "metadata": {
1517 | "needs_background": "light"
1518 | },
1519 | "output_type": "display_data"
1520 | }
1521 | ],
1522 | "source": [
1523 | "# I will run a paired regression plot between estimated project cost and issueddate mth since the year is going to be 2008\n",
1524 | "sns.pairplot(df_2008, x_vars=['issueddate_mth'], y_vars='estprojectcost', height=7, aspect=0.7, kind='reg')"
1525 | ]
1526 | },
1527 | {
1528 | "cell_type": "markdown",
1529 | "metadata": {},
1530 | "source": [
1531 | "#### Again we can see some outliers, so let's limit the dataset as we did earlier and then create a plot"
1532 | ]
1533 | },
1534 | {
1535 | "cell_type": "code",
1536 | "execution_count": 36,
1537 | "metadata": {},
1538 | "outputs": [
1539 | {
1540 | "data": {
1541 | "text/plain": [
1542 | ""
1543 | ]
1544 | },
1545 | "execution_count": 36,
1546 | "metadata": {},
1547 | "output_type": "execute_result"
1548 | },
1549 | {
1550 | "data": {
1551 | "image/png": "\n",
1552 | "text/plain": [
1553 | ""
1554 | ]
1555 | },
1556 | "metadata": {
1557 | "needs_background": "light"
1558 | },
1559 | "output_type": "display_data"
1560 | }
1561 | ],
1562 | "source": [
1563 | "df_limited_2008 = df_2008[df_2008.estprojectcost < 4000000]\n",
1564 | "sns.pairplot(df_limited_2008, x_vars=['issueddate_mth'], y_vars='estprojectcost', height=7, aspect=0.7, kind='reg')"
1565 | ]
1566 | },
1567 | {
1568 | "cell_type": "markdown",
1569 | "metadata": {},
1570 | "source": [
1571 | "#### From the above plot, it seems that people kept building houses all through 2008, right through the Great Recession."
1572 | ]
1573 | },
1574 | {
1575 | "cell_type": "markdown",
1576 | "metadata": {},
1577 | "source": [
1578 | "### Linear Regression"
1579 | ]
1580 | },
1581 | {
1582 | "cell_type": "markdown",
1583 | "metadata": {},
1584 | "source": [
1585 | "#### While the above plots visually explain the relationship between the two variables, for success metrics, I will begin by running a linear regression."
1586 | ]
1587 | },
1588 | {
1589 | "cell_type": "code",
1590 | "execution_count": 37,
1591 | "metadata": {},
1592 | "outputs": [
1593 | {
1594 | "name": "stdout",
1595 | "output_type": "stream",
1596 | "text": [
1597 | "The regression intercept is: -24785847.7335\n",
1598 | "The regression coefficient is: [ 12445.196861]\n"
1599 | ]
1600 | }
1601 | ],
1602 | "source": [
1603 | "### SCIKIT-LEARN ###\n",
1604 | "# create X and y\n",
1605 | "feature_cols = ['issueddate_yr']\n",
1606 | "X = df_no_null[feature_cols]\n",
1607 | "y = df_no_null.estprojectcost\n",
1608 | "\n",
1609 | "# instantiate and fit\n",
1610 | "lm = LinearRegression()\n",
1611 | "lm.fit(X,y)\n",
1612 | "\n",
1613 | "# print the coefficients\n",
1614 | "print(\"The regression intercept is: \"+ str(lm.intercept_))\n",
1615 | "print(\"The regression coefficient is: \"+ str(lm.coef_))"
1616 | ]
1617 | },
1618 | {
1619 | "cell_type": "code",
1620 | "execution_count": 38,
1621 | "metadata": {},
1622 | "outputs": [
1623 | {
1624 | "name": "stdout",
1625 | "output_type": "stream",
1626 | "text": [
1627 | "The mean absolute error for the linear regression model is: 105795.616638\n"
1628 | ]
1629 | }
1630 | ],
1631 | "source": [
1632 | "y_pred = lm.predict(X)\n",
1633 | "y_true = df_no_null.estprojectcost\n",
1634 | "mae = mean_absolute_error(y_true, y_pred)\n",
1635 | "print(\"The mean absolute error for the linear regression model is: \" + str(mae))"
1636 | ]
1637 | },
1638 | {
1639 | "cell_type": "code",
1640 | "execution_count": 39,
1641 | "metadata": {},
1642 | "outputs": [
1643 | {
1644 | "name": "stdout",
1645 | "output_type": "stream",
1646 | "text": [
1647 | "The mean squared error for the linear regression model is: 682166512211.0\n"
1648 | ]
1649 | }
1650 | ],
1651 | "source": [
1652 | "mse = mean_squared_error(y_true, y_pred)\n",
1653 | "print(\"The mean squared error for the linear regression model is: \" + str(mse))"
1654 | ]
1655 | },
1656 | {
1657 | "cell_type": "code",
1658 | "execution_count": 40,
1659 | "metadata": {},
1660 | "outputs": [
1661 | {
1662 | "name": "stdout",
1663 | "output_type": "stream",
1664 | "text": [
1665 | "The r-squared score for the linear regression model is: 0.00479073989308\n"
1666 | ]
1667 | }
1668 | ],
1669 | "source": [
1670 | "r2 = r2_score(y_true, y_pred)\n",
1671 | "print(\"The r-squared score for the linear regression model is: \" + str(r2))"
1672 | ]
1673 | },
1674 | {
1675 | "cell_type": "markdown",
1676 | "metadata": {},
1677 | "source": [
1678 | "#### From the above success metrics, we can see:\n",
1679 | "#### 1. The mean squared error is pretty big. This is due to the variance in the estimated cost of single family homes.\n",
1680 | "#### 2. The r-squared value of about 0.005 shows that the Issued Date Year variable explains less than 1% of the variance in the Estimated Project Cost variable, even though the estimated project cost has increased as the years go by.\n",
1681 | "#### 3. The metrics tell us that this is not the best model and can certainly be improved. We can do this by creating regression models for each quartile of the estimated project cost variable against the issued date year variable. This might give us better success metrics for each individual model."
1682 | ]
1683 | }
1684 | ],
1685 | "metadata": {
1686 | "kernelspec": {
1687 | "display_name": "Python 3 [ analysis-preview-py3 ]",
1688 | "language": "python",
1689 | "name": "analysis-preview-py3-latest"
1690 | },
1691 | "language_info": {
1692 | "codemirror_mode": {
1693 | "name": "ipython",
1694 | "version": 3
1695 | },
1696 | "file_extension": ".py",
1697 | "mimetype": "text/x-python",
1698 | "name": "python",
1699 | "nbconvert_exporter": "python",
1700 | "pygments_lexer": "ipython3",
1701 | "version": "3.6.6"
1702 | }
1703 | },
1704 | "nbformat": 4,
1705 | "nbformat_minor": 2
1706 | }
1707 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Data Analyst Interview Project
2 | Directions for the data analysis project that forms part of the data analyst interview process.
3 |
4 | ## Introduction
5 | The following is an outline for a simple data analysis project. It is being assigned as part of the interview for the Data Analyst position at ConnectWise, Inc.
6 |
7 | The aim of this test is to assess your overall preparedness for typical tasks that will be necessary to perform as part of a data science team. The project was designed to have an anticipated completion time of 60-90 minutes for an entry-level data analyst.
8 |
9 | ## Rules
10 |
11 | * Unless explicitly stated otherwise, please provide your analysis through code or comments.
12 | * You are free to use whatever language and environment desired to complete this task. However, we expect to be able to recreate your work if necessary. Please provide any environment files (requirements.txt, package.json, etc.) in your completed analysis. Interactive environments such as Jupyter Notebooks are preferred.
13 | * Submissions for this project will only be accepted and considered from applicants who have been specifically requested to submit one.
14 |
15 | ## Submission
16 | Once complete, please send a link to your repository to [sresar@connectwise.com](mailto:sresar@connectwise.com).
17 |
18 | ## Directions
19 |
20 | 1. Fork this repository to create a new working copy for your work.
21 | 1. To conduct your analysis, we have provided a dataset for download [here](https://s3.amazonaws.com/cc-analytics-datasets/Building_Permits.csv). The provided dataset comes from the City of Raleigh Open Data website and is based upon pending/granted building permits. Documentation on the dataset can be found [here](http://data-ral.opendata.arcgis.com/datasets/building-permits).
22 | 1. Load the data from the provided source via web request rather than downloading a local copy and loading from disk.
23 | 1. Review the summary statistics for the included features. Please be sure to include the following in your exploratory data analysis:
24 | - Number of rows and columns in the dataset
25 | - Total different types of construction
26 | - Mean and median number of stories
27 | - Standard deviation for the X and Y coordinates of the permits
28 | 1. Plot the distributions for each of the following features: _Estimated Project Cost_ and _Issue Date Month_. Describe the distributions for these fields and explain what insights you might be able to gather.
29 | 1. The executive team is interested in the relationship between _Permit Issue Year_ and _Estimated Project Cost_, but only for "New" construction of type "V B" with less than 3 stories. Perform a simple regression analysis of this relationship and describe what insights we can glean from this using success metrics. _(Hint: Implement handling for missing values and explain your reasoning.)_
30 | 1. Commit all changes and analysis, then email your completed submission.
31 |
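
The direction to load the data via web request rather than from a local copy could be approached roughly as follows. This is only a sketch under stated assumptions: it uses the Python standard library, the helper names `read_csv_rows` and `load_permits` are hypothetical, and the UTF-8 encoding is an assumption about the file rather than something the dataset documentation confirms.

```python
import csv
import io
import urllib.request

# URL taken from the directions above.
DATA_URL = "https://s3.amazonaws.com/cc-analytics-datasets/Building_Permits.csv"


def read_csv_rows(binary_stream, encoding="utf-8"):
    """Parse CSV bytes from a file-like object into a list of dicts.

    Hypothetical helper; the encoding default is an assumption.
    """
    text_stream = io.TextIOWrapper(binary_stream, encoding=encoding, newline="")
    return list(csv.DictReader(text_stream))


def load_permits(url=DATA_URL):
    """Fetch the CSV over HTTP and parse it without saving a local copy."""
    with urllib.request.urlopen(url) as response:
        return read_csv_rows(response)
```

Calling `load_permits()` performs the download, so nothing touches the network until it is invoked. In a pandas-based notebook, passing the same URL directly to `pandas.read_csv` is a common alternative.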
--------------------------------------------------------------------------------