├── DataAnalyst_Interview_NidhiShah.ipynb └── README.md /DataAnalyst_Interview_NidhiShah.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Connectwise - Data Analyst Interview\n", 8 | "#### Nidhi Shah" 9 | ] 10 | }, 11 | { 12 | "cell_type": "markdown", 13 | "metadata": {}, 14 | "source": [ 15 | "## Read and check csv file" 16 | ] 17 | }, 18 | { 19 | "cell_type": "markdown", 20 | "metadata": {}, 21 | "source": [ 22 | "### Imports" 23 | ] 24 | }, 25 | { 26 | "cell_type": "code", 27 | "execution_count": 1, 28 | "metadata": {}, 29 | "outputs": [ 30 | { 31 | "name": "stderr", 32 | "output_type": "stream", 33 | "text": [ 34 | "/opt/maxpoint/envs/analysis-preview-py3-83.0-5147/lib/python3.6/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. 
This module will be removed in 0.20.\n", 35 | " \"This module will be removed in 0.20.\", DeprecationWarning)\n" 36 | ] 37 | } 38 | ], 39 | "source": [ 40 | "import numpy as np\n", 41 | "import pandas as pd\n", 42 | "import seaborn as sns\n", 43 | "import matplotlib.pyplot as plt\n", 44 | "from scipy import stats\n", 45 | "import io\n", 46 | "import requests\n", 47 | "from sklearn.linear_model import LinearRegression\n", 48 | "from sklearn import metrics\n", 49 | "from sklearn.model_selection import train_test_split\n", 50 | "from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score" 51 | ] 52 | }, 53 | { 54 | "cell_type": "markdown", 55 | "metadata": {}, 56 | "source": [ 57 | "### Read the CSV file into a pandas DataFrame from a URL" 58 | ] 59 | }, 60 | { 61 | "cell_type": "code", 62 | "execution_count": 2, 63 | "metadata": {}, 64 | "outputs": [ 65 | { 66 | "name": "stderr", 67 | "output_type": "stream", 68 | "text": [ 69 | "/opt/maxpoint/envs/analysis-preview-py3-83.0-5147/lib/python3.6/site-packages/IPython/core/interactiveshell.py:3018: DtypeWarning: Columns (17,44,50,55,63,69,72) have mixed types. Specify dtype option on import or set low_memory=False.\n", 70 | " interactivity=interactivity, compiler=compiler, result=result)\n" 71 | ] 72 | } 73 | ], 74 | "source": [ 75 | "url = 'https://s3.amazonaws.com/cc-analytics-datasets/Building_Permits.csv'\n", 76 | "s = requests.get(url).content\n", 77 | "df = pd.read_csv(io.StringIO(s.decode('utf-8')))" 78 | ] 79 | }, 80 | { 81 | "cell_type": "markdown", 82 | "metadata": {}, 83 | "source": [ 84 | "### Check that the DataFrame was loaded correctly" 85 | ] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "execution_count": 3, 90 | "metadata": {}, 91 | "outputs": [ 92 | { 93 | "data": { 94 | "text/html": [ 95 | "
" 261 | ], 262 | "text/plain": [ 263 | " X Y OBJECTID permittypemapped permitnum \\\n", 264 | "0 -78.734835 35.903045 48520 Building 147303 \n", 265 | "1 -78.534184 35.729309 48521 Building 147288 \n", 266 | "2 -78.534323 35.728595 48522 Building 147287 \n", 267 | "3 -78.531789 35.729794 48523 Building 147286 \n", 268 | "4 -78.533914 35.729473 48524 Building 147284 \n", 269 | "\n", 270 | " workclass permitclass proposedworkdescription permitclassmapped \\\n", 271 | "0 Alterations/repairs 434.0 REPAIR FIRE DAMAGE Residential \n", 272 | "1 New Building 101.0 SFD Residential \n", 273 | "2 New Building 101.0 SFD Residential \n", 274 | "3 New Building 101.0 NEW SFD Residential \n", 275 | "4 New Building 101.0 NEW SFD Residential \n", 276 | "\n", 277 | " applieddate ... totalsqft voiddate \\\n", 278 | "0 2018-03-02T18:12:31.000Z ... 2064.0 NaN \n", 279 | "1 2018-03-02T15:16:37.000Z ... 1684.0 NaN \n", 280 | "2 2018-03-02T15:08:25.000Z ... 2378.0 NaN \n", 281 | "3 2018-03-02T15:00:47.000Z ... 1392.0 NaN \n", 282 | "4 2018-03-02T14:32:33.000Z ... 
1392.0 NaN \n", 283 | "\n", 284 | " workclassmapped GlobalID \\\n", 285 | "0 Existing e94cf493-fe7f-49d6-927f-b6127a3435b6 \n", 286 | "1 New f114dc19-3b62-459b-bd6c-9084162403c8 \n", 287 | "2 New d4b182cb-af25-4c3f-92a9-a59d4b82ada3 \n", 288 | "3 New ecc76e8c-48d3-4529-a7ae-f1d616592c08 \n", 289 | "4 New a1074b43-bc40-4efa-bc7f-a167c351c327 \n", 290 | "\n", 291 | " CreationDate Creator \\\n", 292 | "0 2018-03-16T01:55:55.663Z justin.greco@raleighnc.gov_ral \n", 293 | "1 2018-03-16T01:55:55.663Z justin.greco@raleighnc.gov_ral \n", 294 | "2 2018-03-16T01:55:55.663Z justin.greco@raleighnc.gov_ral \n", 295 | "3 2018-03-16T01:55:55.663Z justin.greco@raleighnc.gov_ral \n", 296 | "4 2018-03-16T01:55:55.663Z justin.greco@raleighnc.gov_ral \n", 297 | "\n", 298 | " EditDate Editor const_type occupancyclass \n", 299 | "0 2018-06-15T22:03:10.675Z OpenData_ral V B RESIDENT 3 SFD/DUP \n", 300 | "1 2018-06-12T22:02:31.949Z OpenData_ral V B RESIDENT 3 SFD/DUP \n", 301 | "2 2018-06-13T22:02:40.102Z OpenData_ral V B RESIDENT 3 SFD/DUP \n", 302 | "3 2018-06-27T22:02:34.320Z OpenData_ral V B RESIDENT 3 SFD/DUP \n", 303 | "4 2018-06-12T22:02:31.949Z OpenData_ral V B RESIDENT 3 SFD/DUP \n", 304 | "\n", 305 | "[5 rows x 87 columns]" 306 | ] 307 | }, 308 | "execution_count": 3, 309 | "metadata": {}, 310 | "output_type": "execute_result" 311 | } 312 | ], 313 | "source": [ 314 | "df.head()" 315 | ] 316 | }, 317 | { 318 | "cell_type": "markdown", 319 | "metadata": {}, 320 | "source": [ 321 | "## Summary Statistics" 322 | ] 323 | }, 324 | { 325 | "cell_type": "markdown", 326 | "metadata": {}, 327 | "source": [ 328 | "### 1. 
Find out number of rows and columns in the dataset" 329 | ] 330 | }, 331 | { 332 | "cell_type": "code", 333 | "execution_count": 4, 334 | "metadata": {}, 335 | "outputs": [ 336 | { 337 | "name": "stdout", 338 | "output_type": "stream", 339 | "text": [ 340 | "Number of rows in dataset: 141953\n" 341 | ] 342 | } 343 | ], 344 | "source": [ 345 | "rows = df.shape[0]\n", 346 | "print('Number of rows in dataset: ' + str(rows))" 347 | ] 348 | }, 349 | { 350 | "cell_type": "code", 351 | "execution_count": 5, 352 | "metadata": {}, 353 | "outputs": [ 354 | { 355 | "name": "stdout", 356 | "output_type": "stream", 357 | "text": [ 358 | "Number of columns in dataset: 87\n" 359 | ] 360 | } 361 | ], 362 | "source": [ 363 | "cols = df.shape[1]\n", 364 | "print('Number of columns in dataset: ' + str(cols))" 365 | ] 366 | }, 367 | { 368 | "cell_type": "markdown", 369 | "metadata": {}, 370 | "source": [ 371 | "### 2. Number of different types of construction" 372 | ] 373 | }, 374 | { 375 | "cell_type": "markdown", 376 | "metadata": {}, 377 | "source": [ 378 | "#### To count the different types of construction, we look at the unique values of the const_type column" 379 | ] 380 | }, 381 | { 382 | "cell_type": "code", 383 | "execution_count": 6, 384 | "metadata": {}, 385 | "outputs": [], 386 | "source": [ 387 | "const_types = list(df.const_type.unique())" 388 | ] 389 | }, 390 | { 391 | "cell_type": "code", 392 | "execution_count": 7, 393 | "metadata": {}, 394 | "outputs": [ 395 | { 396 | "data": { 397 | "text/plain": [ 398 | "['V B',\n", 399 | " nan,\n", 400 | " 'II B',\n", 401 | " 'II A',\n", 402 | " 'IIIB',\n", 403 | " 'I A',\n", 404 | " 'I B',\n", 405 | " 'IV',\n", 406 | " 'V A',\n", 407 | " 'IV U',\n", 408 | " 'VI U',\n", 409 | " 'V U',\n", 410 | " 'VI P',\n", 411 | " 'V P',\n", 412 | " 'IIIA',\n", 413 | " 'I',\n", 414 | " 'III',\n", 415 | " 'II',\n", 416 | " 'IV P']" 417 | ] 418 | }, 419 | "execution_count": 7, 420 | "metadata": {}, 421 | "output_type": "execute_result" 422 | } 423 | 
], 424 | "source": [ 425 | "const_types" 426 | ] 427 | }, 428 | { 429 | "cell_type": "markdown", 430 | "metadata": {}, 431 | "source": [ 432 | "#### As we can see above, the list of unique construction types contains a nan value (missing data). So to count the actual construction types, we subtract 1 from the list length" 433 | ] 434 | }, 435 | { 436 | "cell_type": "code", 437 | "execution_count": 8, 438 | "metadata": {}, 439 | "outputs": [ 440 | { 441 | "name": "stdout", 442 | "output_type": "stream", 443 | "text": [ 444 | "Number of different construction types: 18\n" 445 | ] 446 | } 447 | ], 448 | "source": [ 449 | "no_diff_const = (len(const_types) - 1)\n", 450 | "print('Number of different construction types: ' + str(no_diff_const))" 451 | ] 452 | }, 453 | { 454 | "cell_type": "markdown", 455 | "metadata": {}, 456 | "source": [ 457 | "### 3. Mean and median number of stories" 458 | ] 459 | }, 460 | { 461 | "cell_type": "code", 462 | "execution_count": 9, 463 | "metadata": {}, 464 | "outputs": [ 465 | { 466 | "name": "stdout", 467 | "output_type": "stream", 468 | "text": [ 469 | "Mean of number of stories = 9.70742065631\n" 470 | ] 471 | } 472 | ], 473 | "source": [ 474 | "mean = df.numberstories.mean()\n", 475 | "print('Mean of number of stories = ' + str(mean))" 476 | ] 477 | }, 478 | { 479 | "cell_type": "code", 480 | "execution_count": 10, 481 | "metadata": {}, 482 | "outputs": [ 483 | { 484 | "name": "stdout", 485 | "output_type": "stream", 486 | "text": [ 487 | "Median of number of stories = 2.0\n" 488 | ] 489 | } 490 | ], 491 | "source": [ 492 | "median = df.numberstories.median()\n", 493 | "print('Median of number of stories = ' + str(median))" 494 | ] 495 | }, 496 | { 497 | "cell_type": "code", 498 | "execution_count": 11, 499 | "metadata": {}, 500 | "outputs": [ 501 | { 502 | "data": { 503 | "text/plain": [ 504 | "count 122442.000000\n", 505 | "mean 9.707421\n", 506 | "std 1114.892012\n", 507 | "min 0.000000\n", 508 | "25% 1.000000\n", 
509 | "50% 2.000000\n", 510 | "75% 2.000000\n", 511 | "max 342365.000000\n", 512 | "Name: numberstories, dtype: float64" 513 | ] 514 | }, 515 | "execution_count": 11, 516 | "metadata": {}, 517 | "output_type": "execute_result" 518 | } 519 | ], 520 | "source": [ 521 | "df.numberstories.describe()" 522 | ] 523 | }, 524 | { 525 | "cell_type": "markdown", 526 | "metadata": {}, 527 | "source": [ 528 | "### 4. Standard deviation for the X and Y coordinates of the permits" 529 | ] 530 | }, 531 | { 532 | "cell_type": "code", 533 | "execution_count": 12, 534 | "metadata": {}, 535 | "outputs": [ 536 | { 537 | "name": "stdout", 538 | "output_type": "stream", 539 | "text": [ 540 | "The standard deviations for the X and Y coordinates of the permits are 0.0583920750027 and 0.0688053361904 respectively.\n" 541 | ] 542 | } 543 | ], 544 | "source": [ 545 | "# Assumption: the X coordinate is the permit latitude (latitude_perm) and the Y coordinate is the permit longitude (longitude_perm); the X and Y columns of the dataset are not used. Note that in the df.head() output the X column holds longitude-like values (about -78.7) and Y latitude-like values (about 35.9), so the conventional mapping may be the reverse.\n", 546 | "std_x = df.latitude_perm.std()\n", 547 | "std_y = df.longitude_perm.std()\n", 548 | "print('The standard deviations for the X and Y coordinates of the permits are '+str(std_x)+' and '+str(std_y)+' respectively.')" 549 | ] 550 | }, 551 | { 552 | "cell_type": "markdown", 553 | "metadata": {}, 554 | "source": [ 555 | "## Plot distributions:" 556 | ] 557 | }, 558 | { 559 | "cell_type": "markdown", 560 | "metadata": {}, 561 | "source": [ 562 | "### 1. 
Estimated Project Cost" 563 | ] 564 | }, 565 | { 566 | "cell_type": "markdown", 567 | "metadata": {}, 568 | "source": [ 569 | "#### Creating a histogram of estimated project cost" 570 | ] 571 | }, 572 | { 573 | "cell_type": "code", 574 | "execution_count": 13, 575 | "metadata": {}, 576 | "outputs": [ 577 | { 578 | "data": { 579 | "text/plain": [ 580 | "array([[]], dtype=object)" 581 | ] 582 | }, 583 | "execution_count": 13, 584 | "metadata": {}, 585 | "output_type": "execute_result" 586 | }, 587 | { 588 | "data": { 589 | "image/png": "\n", 590 | "text/plain": [ 591 | "
" 592 | ] 593 | }, 594 | "metadata": { 595 | "needs_background": "light" 596 | }, 597 | "output_type": "display_data" 598 | } 599 | ], 600 | "source": [ 601 | "df.hist(column='estprojectcost', bins=50, color='gray')" 602 | ] 603 | }, 604 | { 605 | "cell_type": "markdown", 606 | "metadata": {}, 607 | "source": [ 608 | "#### As seen above, a simple histogram over the full range tells us little about the Estimated Project Cost, because at least three quarters of the values fall into the first bin\n", 609 | "#### So let's describe the variable to learn more about it" 610 | ] 611 | }, 612 | { 613 | "cell_type": "code", 614 | "execution_count": 14, 615 | "metadata": {}, 616 | "outputs": [ 617 | { 618 | "data": { 619 | "text/plain": [ 620 | "count 1.419530e+05\n", 621 | "mean 1.920127e+05\n", 622 | "std 1.493678e+06\n", 623 | "min 0.000000e+00\n", 624 | "25% 1.000000e+04\n", 625 | "50% 5.440400e+04\n", 626 | "75% 1.350000e+05\n", 627 | "max 1.700000e+08\n", 628 | "Name: estprojectcost, dtype: float64" 629 | ] 630 | }, 631 | "execution_count": 14, 632 | "metadata": {}, 633 | "output_type": "execute_result" 634 | } 635 | ], 636 | "source": [ 637 | "df.estprojectcost.describe()" 638 | ] 639 | }, 640 | { 641 | "cell_type": "markdown", 642 | "metadata": {}, 643 | "source": [ 644 | "#### The standard deviation (about 1.49e+06) is several times the mean (about 1.92e+05), and the maximum (1.7e+08) is far above the 75th percentile (1.35e+05), so the values are widely spread and heavily right-skewed. 
We can plot a boxplot to look at the spread of values" 645 | ] 646 | }, 647 | { 648 | "cell_type": "code", 649 | "execution_count": 15, 650 | "metadata": {}, 651 | "outputs": [ 652 | { 653 | "data": { 654 | "text/plain": [ 655 | "" 656 | ] 657 | }, 658 | "execution_count": 15, 659 | "metadata": {}, 660 | "output_type": "execute_result" 661 | }, 662 | { 663 | "data": { 664 | "image/png": "\n", 665 | "text/plain": [ 666 | "
" 667 | ] 668 | }, 669 | "metadata": { 670 | "needs_background": "light" 671 | }, 672 | "output_type": "display_data" 673 | } 674 | ], 675 | "source": [ 676 | "sns.boxplot(x=df['estprojectcost'])" 677 | ] 678 | }, 679 | { 680 | "cell_type": "markdown", 681 | "metadata": {}, 682 | "source": [ 683 | "#### The boxplot is compressed towards zero by a few very large outliers, so most of the values for Estimated Project Cost sit in a narrow band at the low end. Limiting the range of the histogram to 0 through the 3rd quartile (1.35e+05) will therefore show the spread of the bulk of the values better" 684 | ] 685 | }, 686 | { 687 | "cell_type": "code", 688 | "execution_count": 16, 689 | "metadata": { 690 | "scrolled": true 691 | }, 692 | "outputs": [ 693 | { 694 | "data": { 695 | "image/png": "\n", 696 | "text/plain": [ 697 | "
" 698 | ] 699 | }, 700 | "metadata": { 701 | "needs_background": "light" 702 | }, 703 | "output_type": "display_data" 704 | } 705 | ], 706 | "source": [ 707 | "estprojectcost_hist = df.hist(column='estprojectcost', bins=50, range=[0, 1.35e5], color='gray')" 708 | ] 709 | }, 710 | { 711 | "cell_type": "markdown", 712 | "metadata": {}, 713 | "source": [ 714 | "#### Seeing that the values are concentrated towards the low end, for deeper analysis we can plot a histogram from the minimum value to the 1st quartile (0 to 10,000)." 715 | ] 716 | }, 717 | { 718 | "cell_type": "code", 719 | "execution_count": 17, 720 | "metadata": {}, 721 | "outputs": [ 722 | { 723 | "data": { 724 | "text/plain": [ 725 | "array([[]], dtype=object)" 726 | ] 727 | }, 728 | "execution_count": 17, 729 | "metadata": {}, 730 | "output_type": "execute_result" 731 | }, 732 | { 733 | "data": { 734 | "image/png": "\n", 735 | "text/plain": [ 736 | "
" 737 | ] 738 | }, 739 | "metadata": { 740 | "needs_background": "light" 741 | }, 742 | "output_type": "display_data" 743 | } 744 | ], 745 | "source": [ 746 | "df.hist(column='estprojectcost', bins=50, range=[0, 1.0e4], color='gray')" 747 | ] 748 | }, 749 | { 750 | "cell_type": "markdown", 751 | "metadata": {}, 752 | "source": [ 753 | "#### We can see that there are a lot of 0 values for Estimated Project Cost. Possibly 0 was entered where the data was missing or where the estimated project cost was unknown." 754 | ] 755 | }, 756 | { 757 | "cell_type": "markdown", 758 | "metadata": {}, 759 | "source": [ 760 | "### 2. Issued Date Month" 761 | ] 762 | }, 763 | { 764 | "cell_type": "markdown", 765 | "metadata": {}, 766 | "source": [ 767 | "#### Moving on to the next feature, we now create a histogram of the Issued Date Month feature" 768 | ] 769 | }, 770 | { 771 | "cell_type": "code", 772 | "execution_count": 18, 773 | "metadata": {}, 774 | "outputs": [ 775 | { 776 | "data": { 777 | "text/plain": [ 778 | "" 779 | ] 780 | }, 781 | "execution_count": 18, 782 | "metadata": {}, 783 | "output_type": "execute_result" 784 | }, 785 | { 786 | "data": { 787 | "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAZUAAAELCAYAAAARNxsIAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAAG8BJREFUeJzt3Xu0XnV95/H3RwIiViRIdDDBJtqUljKt0AzSOsOy0kKgltBWptgLqdKVXtDRdlwVxjXNIKVTWqe09IKDEAHHASleoBWLKUKdVrmEixAuTiJYiEQSG0SqU23wO3/s34GHw3OSk5P9nIeQ92utZ529v/u393fv51y+Z+/9e347VYUkSX143rh3QJL03GFRkST1xqIiSeqNRUWS1BuLiiSpNxYVSVJvLCqSpN5YVCRJvbGoSJJ6M2fcOzDbDjjggFq4cOG4d0OSdim33nrrV6tq3vba7XZFZeHChaxZs2bcuyFJu5Qk/ziddl7+kiT1xqIiSeqNRUWS1BuLiiSpNyMrKklWJdmUZO2QZe9MUkkOaPNJcl6S9UnuTHL4QNvlSda11/KB+A8nuautc16SjOpYJEnTM8ozlYuBpZODSQ4CfgJ4cCB8HLC4vVYA57e2+wMrgdcARwArk8xt65zf2k6s94xckqTZNbKiUlWfAbYMWXQu8NvA4CMnlwGXVudGYL8kBwLHAquraktVPQqsBpa2ZftW1eeqe3TlpcCJozoWSdL0zOo9lSQnAF+uqs9PWjQfeGhgfkOLbSu+YUh8qrwrkqxJsmbz5s07cQSSpG2ZtaKSZB/g3cDvDFs8JFYziA9VVRdU1ZKqWjJv3nY/ECpJmqHZ/ET9q4BFwOfbPfUFwG1JjqA70zhooO0C4OEWf92k+A0tvmBIe2kkzjzzzJFsd+XKlSPZrjQus3amUlV3VdVLq2phVS2kKwyHV9VXgKuBU1ovsCOBx6pqI3AtcEySue0G/THAtW3Z40mObL2+TgGumq1jkSQNN8ouxZcBnwMOTrIhyanbaH4NcD+wHng/8BsAVbUFOAu4pb3e02IAvw5c2Nb5IvDJURyHJGn6Rnb5q6retJ3lCwemCzhtinargFVD4muAQ3duLyVJffIT9ZKk3lhUJEm92e2ep6LnBntjSc9OnqlIknpjUZEk9cbLX5K8nKjeeKYiSeqNRUWS1BuLiiSpNxYVSVJvLCqSpN5YVCRJvbGoSJJ6Y1GRJPXGoiJJ6o1FRZLUG4uKJKk3jv2lXoxq7Chw/ChpV+KZiiSpNxYVSVJvRlZUkqxKsinJ2oHYHya5L8mdST6WZL+BZWckWZ/kC0mOHYgvbbH1SU4fiC9KclOSdUk+nGSvUR2LJGl6RnlP5WLgz4BLB2KrgTOqamuSc4AzgHclOQQ4GfgB4OXA3yb53rbOnwM/AWwAbklydVXdA5wDnFtVlyd5H3AqcP4Ij2eX4j2OXZvPN9GuamRnKlX1GWDLpNinqmprm70RWNCmlwGXV9W3quoBYD1wRHutr6r7q+rbwOXAsiQBXg9c2da/BDhxVMciSZqecd5TeQvwyTY9H3hoYNmGFpsq/hLgawMFaiIuSRqjsRSVJO8GtgIfmggNaVYziE+Vb0WSNUnWbN68eUd3V5I0TbNeVJIsB94A/EJVTRSCDcBBA80WAA9vI/5VYL8kcybFh6qqC6pqSVUtmTdvXj8HIkl6hlktKkmWAu8CTqiqbw4suho4OcnzkywCFgM3A7cAi1tPr73obuZf3YrR9cAb2/rLgatm6zgkScONskvxZcDngIOTbEhyKl1vsBcBq5Pc0XptUVV3A1cA9wB/A5xWVU+0eyZvBa4F7gWuaG2hK06/lWQ93T2Wi0Z1LJKk6RlZl+KqetOQ8JR/+KvqbODsIfFrgGuGxO+n6x0mSXqWcOwvSbPOz+E8dzlMiySpNxYVSVJvvPwl6TnPy22zxzMVSVJvLCqSpN5YVCRJvbGoSJJ6Y1GRJPX
GoiJJ6o1FRZLUG4uKJKk3FhVJUm8sKpKk3lhUJEm9sahIknpjUZEk9caiIknqjUVFktQbi4okqTcWFUlSb0ZWVJKsSrIpydqB2P5JVidZ177ObfEkOS/J+iR3Jjl8YJ3lrf26JMsH4j+c5K62znlJMqpjkSRNzyjPVC4Glk6KnQ5cV1WLgevaPMBxwOL2WgGcD10RAlYCrwGOAFZOFKLWZsXAepNzSZJm2ciKSlV9BtgyKbwMuKRNXwKcOBC/tDo3AvslORA4FlhdVVuq6lFgNbC0Ldu3qj5XVQVcOrAtSdKYzPY9lZdV1UaA9vWlLT4feGig3YYW21Z8w5C4JGmMni036ofdD6kZxIdvPFmRZE2SNZs3b57hLkqStmfOLOd7JMmBVbWxXcLa1OIbgIMG2i0AHm7x102K39DiC4a0H6qqLgAuAFiyZEkBnHnmmTtzHNu0cuXKkW1bkp7NZvtM5WpgogfXcuCqgfgprRfYkcBj7fLYtcAxSea2G/THANe2ZY8nObL1+jplYFuSpDEZ2ZlKksvozjIOSLKBrhfX7wNXJDkVeBA4qTW/BjgeWA98E3gzQFVtSXIWcEtr956qmrj5/+t0PcxeAHyyvSRJYzSyolJVb5pi0dFD2hZw2hTbWQWsGhJfAxy6M/soSerXbN9T2W15D0fS7uDZ0vtLkvQcYFGRJPXGoiJJ6o1FRZLUG2/US1KPdvdOOZ6pSJJ6Y1GRJPXGoiJJ6o1FRZLUG4uKJKk39v6SpF3Ys623mWcqkqTeWFQkSb2xqEiSemNRkST1xqIiSeqNRUWS1BuLiiSpNxYVSVJvLCqSpN6Mpagk+c0kdydZm+SyJHsnWZTkpiTrknw4yV6t7fPb/Pq2fOHAds5o8S8kOXYcxyJJesqsF5Uk84H/BCypqkOBPYCTgXOAc6tqMfAocGpb5VTg0ar6HuDc1o4kh7T1fgBYCvxFkj1m81gkSU83raKS5LrpxHbAHOAFSeYA+wAbgdcDV7bllwAntullbZ62/OgkafHLq+pbVfUAsB44Yif2SZK0k7Y5oGSSven+6B+QZC6Qtmhf4OUzSVhVX07yXuBB4P8BnwJuBb5WVVtbsw3A/DY9H3iorbs1yWPAS1r8xoFND64jSRqD7Y1S/KvAO+gKyK08VVS+Dvz5TBK24rQMWAR8DfhL4LghTWtilSmWTRUflnMFsALgFa94xQ7usSRpurZ5+auq/qSqFgHvrKpXVtWi9vqhqvqzGeb8ceCBqtpcVf8KfBT4UWC/djkMYAHwcJveABwE0Ja/GNgyGB+yzuTjuKCqllTVknnz5s1wtyVJ2zOteypV9adJfjTJzyc5ZeI1w5wPAkcm2afdGzkauAe4Hnhja7McuKpNX93macs/XVXV4ie33mGLgMXAzTPcJ0lSD6b1kK4kHwReBdwBPNHCBVy6owmr6qYkVwK3AVuB24ELgE8Alyf53Ra7qK1yEfDBJOvpzlBObtu5O8kVdAVpK3BaVT2BJGlspvvkxyXAIe0MYadV1Upg8iPF7mdI762q+hfgpCm2czZwdh/7JEnaedP9nMpa4N+MckckSbu+6Z6pHADck+Rm4FsTwao6YSR7JUnaJU23qPy3Ue6EJOm5YVpFpar+btQ7Ikna9U2399fjPPXBwr2APYFvVNW+o9oxSdKuZ7pnKi8anE9yIo6zJUmaZEajFFfVx+kGgJQk6UnTvfz1MwOzz6P73Eovn1mRJD13TLf3108NTG8FvkQ3KKQkSU+a7j2VN496RyRJu77pPqRrQZKPJdmU5JEkH0myYNQ7J0natUz3Rv0H6EYFfjndg7D+qsUkSXrSdIvKvKr6QFVtba+LAR9MIkl6mukWla8m+cUke7TXLwL/NModkyTteqZbVN4C/EfgK8BGuodlefNekvQ00+1SfBawvKoeBUiyP/BeumIjSRIw/TOVH5woKABVtQU4bDS7JEnaVU23qDwvydyJmXamMt2zHEnSbmK6heF/AJ9tz5Y
\n", 788 | "text/plain": [ 789 | "
" 790 | ] 791 | }, 792 | "metadata": { 793 | "needs_background": "light" 794 | }, 795 | "output_type": "display_data" 796 | } 797 | ], 798 | "source": [ 799 | "sns.countplot(df['issueddate_mth'], color='gray')" 800 | ] 801 | }, 802 | { 803 | "cell_type": "markdown", 804 | "metadata": {}, 805 | "source": [ 806 | "#### From the above count plot we see that the number of permits issued was lowest from November through February and highest in June. Another way to look at the permits issued per month, in descending order, is:" 807 | ] 808 | }, 809 | { 810 | "cell_type": "code", 811 | "execution_count": 19, 812 | "metadata": {}, 813 | "outputs": [ 814 | { 815 | "data": { 816 | "text/plain": [ 817 | "" 818 | ] 819 | }, 820 | "execution_count": 19, 821 | "metadata": {}, 822 | "output_type": "execute_result" 823 | }, 824 | { 825 | "data": { 826 | "image/png": "\n", 827 | "text/plain": [ 828 | "
" 829 | ] 830 | }, 831 | "metadata": { 832 | "needs_background": "light" 833 | }, 834 | "output_type": "display_data" 835 | } 836 | ], 837 | "source": [ 838 | "df['issueddate_mth'].value_counts().plot(kind='bar', color='gray')" 839 | ] 840 | }, 841 | { 842 | "cell_type": "markdown", 843 | "metadata": {}, 844 | "source": [ 845 | "#### From the above plot we can see that the highest number of permits was issued in June, followed by May, August, April, July, March, September, and October, while the fewest permits were issued in December, followed by November, February, and January." 846 | ] 847 | }, 848 | { 849 | "cell_type": "markdown", 850 | "metadata": {}, 851 | "source": [ 852 | "## Relationship between Permit Issue Year and Estimated Project Cost" 853 | ] 854 | }, 855 | { 856 | "cell_type": "markdown", 857 | "metadata": {}, 858 | "source": [ 859 | "### We want to understand the relationship between Permit Issue Year and Estimated Project Cost, but only for \"New\" construction of type \"V B\" with fewer than 3 stories. Thus we will filter the dataset based on these values." 860 | ] 861 | }, 862 | { 863 | "cell_type": "code", 864 | "execution_count": 20, 865 | "metadata": {}, 866 | "outputs": [ 867 | { 868 | "data": { 869 | "text/html": [ 870 | "
" 1036 | ], 1037 | "text/plain": [ 1038 | " X Y OBJECTID permittypemapped permitnum workclass \\\n", 1039 | "1 -78.534184 35.729309 48521 Building 147288 New Building \n", 1040 | "2 -78.534323 35.728595 48522 Building 147287 New Building \n", 1041 | "3 -78.531789 35.729794 48523 Building 147286 New Building \n", 1042 | "4 -78.533914 35.729473 48524 Building 147284 New Building \n", 1043 | "9 -78.594252 35.909492 48529 Building 147194 New Building \n", 1044 | "\n", 1045 | " permitclass proposedworkdescription permitclassmapped \\\n", 1046 | "1 101.0 SFD Residential \n", 1047 | "2 101.0 SFD Residential \n", 1048 | "3 101.0 NEW SFD Residential \n", 1049 | "4 101.0 NEW SFD Residential \n", 1050 | "9 318.0 PAVILLION NEAR POOL, GRILLS, SEATING Residential \n", 1051 | "\n", 1052 | " applieddate ... totalsqft voiddate \\\n", 1053 | "1 2018-03-02T15:16:37.000Z ... 1684.0 NaN \n", 1054 | "2 2018-03-02T15:08:25.000Z ... 2378.0 NaN \n", 1055 | "3 2018-03-02T15:00:47.000Z ... 1392.0 NaN \n", 1056 | "4 2018-03-02T14:32:33.000Z ... 1392.0 NaN \n", 1057 | "9 2018-02-27T21:31:37.000Z ... 
504.0 NaN \n", 1058 | "\n", 1059 | " workclassmapped GlobalID \\\n", 1060 | "1 New f114dc19-3b62-459b-bd6c-9084162403c8 \n", 1061 | "2 New d4b182cb-af25-4c3f-92a9-a59d4b82ada3 \n", 1062 | "3 New ecc76e8c-48d3-4529-a7ae-f1d616592c08 \n", 1063 | "4 New a1074b43-bc40-4efa-bc7f-a167c351c327 \n", 1064 | "9 New 39eb46ff-e7d2-4c23-95f8-71c4f67b3924 \n", 1065 | "\n", 1066 | " CreationDate Creator \\\n", 1067 | "1 2018-03-16T01:55:55.663Z justin.greco@raleighnc.gov_ral \n", 1068 | "2 2018-03-16T01:55:55.663Z justin.greco@raleighnc.gov_ral \n", 1069 | "3 2018-03-16T01:55:55.663Z justin.greco@raleighnc.gov_ral \n", 1070 | "4 2018-03-16T01:55:55.663Z justin.greco@raleighnc.gov_ral \n", 1071 | "9 2018-03-16T01:55:55.663Z justin.greco@raleighnc.gov_ral \n", 1072 | "\n", 1073 | " EditDate Editor const_type occupancyclass \n", 1074 | "1 2018-06-12T22:02:31.949Z OpenData_ral V B RESIDENT 3 SFD/DUP \n", 1075 | "2 2018-06-13T22:02:40.102Z OpenData_ral V B RESIDENT 3 SFD/DUP \n", 1076 | "3 2018-06-27T22:02:34.320Z OpenData_ral V B RESIDENT 3 SFD/DUP \n", 1077 | "4 2018-06-12T22:02:31.949Z OpenData_ral V B RESIDENT 3 SFD/DUP \n", 1078 | "9 2018-04-23T19:46:09.547Z OpenData_ral V B ASSEMBLY 3 \n", 1079 | "\n", 1080 | "[5 rows x 87 columns]" 1081 | ] 1082 | }, 1083 | "execution_count": 20, 1084 | "metadata": {}, 1085 | "output_type": "execute_result" 1086 | } 1087 | ], 1088 | "source": [ 1089 | "df_filtered = df[(df.numberstories < 3) & (df.workclassmapped == \"New\") & (df.const_type == \"V B\")]\n", 1090 | "df_filtered.head()" 1091 | ] 1092 | }, 1093 | { 1094 | "cell_type": "code", 1095 | "execution_count": 21, 1096 | "metadata": {}, 1097 | "outputs": [ 1098 | { 1099 | "data": { 1100 | "text/plain": [ 1101 | "(31044, 87)" 1102 | ] 1103 | }, 1104 | "execution_count": 21, 1105 | "metadata": {}, 1106 | "output_type": "execute_result" 1107 | } 1108 | ], 1109 | "source": [ 1110 | "df_filtered.shape" 1111 | ] 1112 | }, 1113 | { 1114 | "cell_type": "markdown", 1115 | "metadata": {}, 1116 | 
"source": [ 1117 | "### Now for this newly filtered dataset, we are interested in the relationship between the Issued Date Year and the Estimated Project Cost columns" 1118 | ] 1119 | }, 1120 | { 1121 | "cell_type": "markdown", 1122 | "metadata": {}, 1123 | "source": [ 1124 | "#### Let's begin by describing our columns of interest" 1125 | ] 1126 | }, 1127 | { 1128 | "cell_type": "code", 1129 | "execution_count": 22, 1130 | "metadata": {}, 1131 | "outputs": [ 1132 | { 1133 | "data": { 1134 | "text/plain": [ 1135 | "count 30851.000000\n", 1136 | "mean 2008.365401\n", 1137 | "std 4.604880\n", 1138 | "min 2002.000000\n", 1139 | "25% 2005.000000\n", 1140 | "50% 2007.000000\n", 1141 | "75% 2012.000000\n", 1142 | "max 2018.000000\n", 1143 | "Name: issueddate_yr, dtype: float64" 1144 | ] 1145 | }, 1146 | "execution_count": 22, 1147 | "metadata": {}, 1148 | "output_type": "execute_result" 1149 | } 1150 | ], 1151 | "source": [ 1152 | "df_filtered.issueddate_yr.describe()" 1153 | ] 1154 | }, 1155 | { 1156 | "cell_type": "code", 1157 | "execution_count": 23, 1158 | "metadata": {}, 1159 | "outputs": [ 1160 | { 1161 | "data": { 1162 | "text/plain": [ 1163 | "count 3.104400e+04\n", 1164 | "mean 2.120274e+05\n", 1165 | "std 8.762292e+05\n", 1166 | "min 0.000000e+00\n", 1167 | "25% 9.834500e+04\n", 1168 | "50% 1.500000e+05\n", 1169 | "75% 2.562760e+05\n", 1170 | "max 1.000000e+08\n", 1171 | "Name: estprojectcost, dtype: float64" 1172 | ] 1173 | }, 1174 | "execution_count": 23, 1175 | "metadata": {}, 1176 | "output_type": "execute_result" 1177 | } 1178 | ], 1179 | "source": [ 1180 | "df_filtered.estprojectcost.describe()" 1181 | ] 1182 | }, 1183 | { 1184 | "cell_type": "markdown", 1185 | "metadata": {}, 1186 | "source": [ 1187 | "#### Next we will check for null values in these columns" 1188 | ] 1189 | }, 1190 | { 1191 | "cell_type": "code", 1192 | "execution_count": 24, 1193 | "metadata": {}, 1194 | "outputs": [ 1195 | { 1196 | "data": { 1197 | "text/plain": [ 1198 | "(193, 87)" 
 1199 | ] 1200 | }, 1201 | "execution_count": 24, 1202 | "metadata": {}, 1203 | "output_type": "execute_result" 1204 | } 1205 | ], 1206 | "source": [ 1207 | "df_filtered_yr_na = df_filtered[df_filtered.issueddate_yr.isna()]\n", 1208 | "df_filtered_yr_na.shape" 1209 | ] 1210 | }, 1211 | { 1212 | "cell_type": "code", 1213 | "execution_count": 25, 1214 | "metadata": {}, 1215 | "outputs": [ 1216 | { 1217 | "data": { 1218 | "text/plain": [ 1219 | "(0, 87)" 1220 | ] 1221 | }, 1222 | "execution_count": 25, 1223 | "metadata": {}, 1224 | "output_type": "execute_result" 1225 | } 1226 | ], 1227 | "source": [ 1228 | "df_filtered_cost_na = df_filtered[df_filtered.estprojectcost.isna()]\n", 1229 | "df_filtered_cost_na.shape" 1230 | ] 1231 | }, 1232 | { 1233 | "cell_type": "markdown", 1234 | "metadata": {}, 1235 | "source": [ 1236 | "#### We can see that while the issued date year column has 193 null values, the estimated project cost column has no null values. However, let's circle back to the histograms we created for the estimated project cost feature. There were many 0 values for the estimated project cost column in the original dataset. Let's look at the 0 values in the filtered dataset." 1237 | ] 1238 | }, 1239 | { 1240 | "cell_type": "code", 1241 | "execution_count": 26, 1242 | "metadata": {}, 1243 | "outputs": [ 1244 | { 1245 | "data": { 1246 | "text/plain": [ 1247 | "(18, 87)" 1248 | ] 1249 | }, 1250 | "execution_count": 26, 1251 | "metadata": {}, 1252 | "output_type": "execute_result" 1253 | } 1254 | ], 1255 | "source": [ 1256 | "df_filtered_cost_0 = df_filtered[(df_filtered.estprojectcost == 0)]\n", 1257 | "df_filtered_cost_0.shape" 1258 | ] 1259 | }, 1260 | { 1261 | "cell_type": "markdown", 1262 | "metadata": {}, 1263 | "source": [ 1264 | "#### In our filtered dataset the estimated project cost column has 18 rows with a value of 0. Let's check the issued date year column for these rows." 
 1265 | ] 1266 | }, 1267 | { 1268 | "cell_type": "code", 1269 | "execution_count": 27, 1270 | "metadata": {}, 1271 | "outputs": [ 1272 | { 1273 | "data": { 1274 | "text/plain": [ 1275 | "1223 NaN\n", 1276 | "2800 NaN\n", 1277 | "6810 NaN\n", 1278 | "132323 NaN\n", 1279 | "133517 NaN\n", 1280 | "133731 NaN\n", 1281 | "136022 NaN\n", 1282 | "138124 NaN\n", 1283 | "138185 NaN\n", 1284 | "138222 NaN\n", 1285 | "138272 NaN\n", 1286 | "138383 NaN\n", 1287 | "138498 NaN\n", 1288 | "138690 NaN\n", 1289 | "138694 NaN\n", 1290 | "141292 NaN\n", 1291 | "141522 2018.0\n", 1292 | "141922 NaN\n", 1293 | "Name: issueddate_yr, dtype: float64" 1294 | ] 1295 | }, 1296 | "execution_count": 27, 1297 | "metadata": {}, 1298 | "output_type": "execute_result" 1299 | } 1300 | ], 1301 | "source": [ 1302 | "df_filtered_cost_0.issueddate_yr" 1303 | ] 1304 | }, 1305 | { 1306 | "cell_type": "markdown", 1307 | "metadata": {}, 1308 | "source": [ 1309 | "#### Only 1 of the 18 rows has a non-null value for the issued date year column. Since we want to understand the relationship between the Issued Date Year and Estimated Project Cost features, and these rows have a cost of 0 and (in all but one case) a null year, I am going to drop them, since they will not be useful in establishing a relationship." 1310 | ] 1311 | }, 1312 | { 1313 | "cell_type": "code", 1314 | "execution_count": 28, 1315 | "metadata": {}, 1316 | "outputs": [ 1317 | { 1318 | "data": { 1319 | "text/plain": [ 1320 | "(31026, 87)" 1321 | ] 1322 | }, 1323 | "execution_count": 28, 1324 | "metadata": {}, 1325 | "output_type": "execute_result" 1326 | } 1327 | ], 1328 | "source": [ 1329 | "df_filtered = df_filtered[(df_filtered.estprojectcost != 0)]\n", 1330 | "df_filtered.shape" 1331 | ] 1332 | }, 1333 | { 1334 | "cell_type": "markdown", 1335 | "metadata": {}, 1336 | "source": [ 1337 | "#### Next we decide how to impute missing values for the Issued Date Year column. As observed earlier, 193 rows had null values for this column. 
Out of those, we have already dropped 17, leaving us with 176 rows with null values for this column. Generally, missing values are imputed using either the mean or the median of a column. Considering that year is a categorical value, I did not think it would be a good idea to do so. My next thought was to check the issued date column and use it to derive the missing issued year values." 1338 | ] 1339 | }, 1340 | { 1341 | "cell_type": "code", 1342 | "execution_count": 29, 1343 | "metadata": {}, 1344 | "outputs": [ 1345 | { 1346 | "data": { 1347 | "text/plain": [ 1348 | "array([nan], dtype=object)" 1349 | ] 1350 | }, 1351 | "execution_count": 29, 1352 | "metadata": {}, 1353 | "output_type": "execute_result" 1354 | } 1355 | ], 1356 | "source": [ 1357 | "df_filtered_yr_na.issueddate.unique()" 1358 | ] 1359 | }, 1360 | { 1361 | "cell_type": "markdown", 1362 | "metadata": {}, 1363 | "source": [ 1364 | "#### Unfortunately, the issueddate column values are also null for those rows, so I decided to go ahead and drop them." 
 1365 | ] 1366 | }, 1367 | { 1368 | "cell_type": "code", 1369 | "execution_count": 30, 1370 | "metadata": {}, 1371 | "outputs": [ 1372 | { 1373 | "data": { 1374 | "text/plain": [ 1375 | "(30850, 87)" 1376 | ] 1377 | }, 1378 | "execution_count": 30, 1379 | "metadata": {}, 1380 | "output_type": "execute_result" 1381 | } 1382 | ], 1383 | "source": [ 1384 | "df_no_null = df_filtered[df_filtered.issueddate_yr.notna()]\n", 1385 | "df_no_null.shape" 1386 | ] 1387 | }, 1388 | { 1389 | "cell_type": "markdown", 1390 | "metadata": {}, 1391 | "source": [ 1392 | "#### I begin analysing the relationship between the two variables with a paired regression plot" 1393 | ] 1394 | }, 1395 | { 1396 | "cell_type": "code", 1397 | "execution_count": 31, 1398 | "metadata": {}, 1399 | "outputs": [ 1400 | { 1401 | "data": { 1402 | "text/plain": [ 1403 | "" 1404 | ] 1405 | }, 1406 | "execution_count": 31, 1407 | "metadata": {}, 1408 | "output_type": "execute_result" 1409 | }, 1410 | { 1411 | "data": { 1412 | "image/png": 
"\n", 1413 | "text/plain": [ 1414 | "
" 1415 | ] 1416 | }, 1417 | "metadata": { 1418 | "needs_background": "light" 1419 | }, 1420 | "output_type": "display_data" 1421 | } 1422 | ], 1423 | "source": [ 1424 | "sns.pairplot(df_no_null, x_vars=['issueddate_yr'], y_vars='estprojectcost', size=7, aspect=0.7, kind = 'reg')" 1425 | ] 1426 | }, 1427 | { 1428 | "cell_type": "markdown", 1429 | "metadata": {}, 1430 | "source": [ 1431 | "#### Out of curiosity, I googled the construction type 'V B' and learned that it refers to single-family homes with wooden frames. Finding this piece of information interesting, I then limited the Estimated Project Cost variable to less than 4 million to get a more detailed view of its relationship with the Issued Date Year variable" 1432 | ] 1433 | }, 1434 | { 1435 | "cell_type": "code", 1436 | "execution_count": 32, 1437 | "metadata": {}, 1438 | "outputs": [], 1439 | "source": [ 1440 | "df_limited = df_filtered[df_filtered.estprojectcost < 4000000]" 1441 | ] 1442 | }, 1443 | { 1444 | "cell_type": "code", 1445 | "execution_count": 33, 1446 | "metadata": {}, 1447 | "outputs": [ 1448 | { 1449 | "data": { 1450 | "text/plain": [ 1451 | "" 1452 | ] 1453 | }, 1454 | "execution_count": 33, 1455 | "metadata": {}, 1456 | "output_type": "execute_result" 1457 | }, 1458 | { 1459 | "data": { 1460 | "image/png": "\n", 1461 | "text/plain": [ 1462 | "
" 1463 | ] 1464 | }, 1465 | "metadata": { 1466 | "needs_background": "light" 1467 | }, 1468 | "output_type": "display_data" 1469 | } 1470 | ], 1471 | "source": [ 1472 | "sns.pairplot(df_limited, x_vars=['issueddate_yr'], y_vars='estprojectcost', size=7, aspect=0.7, kind = 'reg')" 1473 | ] 1474 | }, 1475 | { 1476 | "cell_type": "markdown", 1477 | "metadata": {}, 1478 | "source": [ 1479 | "#### From the above plot, I came to the following conclusions:\n", 1480 | "#### 1. Most of the single-family homes have an estimated project cost of up to 500K, and there are fewer estimated costs above 1M.\n", 1481 | "#### 2. Permits for houses with an estimated cost above 1M were issued more regularly from 2006 onwards\n", 1482 | "#### 3. The largest number of permits for projects estimated above 1M was granted in 2008. I found this interesting because of the recession of 2008. Did people have that kind of money in 2008? I decided to look at the estimated project costs for the year 2008 in more detail" 1483 | ] 1484 | }, 1485 | { 1486 | "cell_type": "code", 1487 | "execution_count": 34, 1488 | "metadata": {}, 1489 | "outputs": [], 1490 | "source": [ 1491 | "df_2008 = df_filtered[df_filtered.issueddate_yr == 2008]" 1492 | ] 1493 | }, 1494 | { 1495 | "cell_type": "code", 1496 | "execution_count": 35, 1497 | "metadata": {}, 1498 | "outputs": [ 1499 | { 1500 | "data": { 1501 | "text/plain": [ 1502 | "" 1503 | ] 1504 | }, 1505 | "execution_count": 35, 1506 | "metadata": {}, 1507 | "output_type": "execute_result" 1508 | }, 1509 | { 1510 | "data": { 1511 | "image/png": "\n", 1512 | "text/plain": [ 1513 | "
" 1514 | ] 1515 | }, 1516 | "metadata": { 1517 | "needs_background": "light" 1518 | }, 1519 | "output_type": "display_data" 1520 | } 1521 | ], 1522 | "source": [ 1523 | "# Run a paired regression plot between estimated project cost and issueddate_mth, since the year is fixed at 2008\n", 1524 | "sns.pairplot(df_2008, x_vars=['issueddate_mth'], y_vars='estprojectcost', size=7, aspect=0.7, kind = 'reg')" 1525 | ] 1526 | }, 1527 | { 1528 | "cell_type": "markdown", 1529 | "metadata": {}, 1530 | "source": [ 1531 | "#### Again we can see some outliers, so let's limit the dataset as we did earlier and then create a plot" 1532 | ] 1533 | }, 1534 | { 1535 | "cell_type": "code", 1536 | "execution_count": 36, 1537 | "metadata": {}, 1538 | "outputs": [ 1539 | { 1540 | "data": { 1541 | "text/plain": [ 1542 | "" 1543 | ] 1544 | }, 1545 | "execution_count": 36, 1546 | "metadata": {}, 1547 | "output_type": "execute_result" 1548 | }, 1549 | { 1550 | "data": { 1551 | "image/png": "\n", 1552 | "text/plain": [ 1553 | "
" 1554 | ] 1555 | }, 1556 | "metadata": { 1557 | "needs_background": "light" 1558 | }, 1559 | "output_type": "display_data" 1560 | } 1561 | ], 1562 | "source": [ 1563 | "df_limited_2008 = df_2008[df_2008.estprojectcost < 4000000]\n", 1564 | "sns.pairplot(df_limited_2008, x_vars=['issueddate_mth'], y_vars='estprojectcost', size=7, aspect=0.7, kind = 'reg')" 1565 | ] 1566 | }, 1567 | { 1568 | "cell_type": "markdown", 1569 | "metadata": {}, 1570 | "source": [ 1571 | "#### From the above plot, it seems like people kept building houses all through 2008 - right through the Great Recession" 1572 | ] 1573 | }, 1574 | { 1575 | "cell_type": "markdown", 1576 | "metadata": {}, 1577 | "source": [ 1578 | "### Linear Regression" 1579 | ] 1580 | }, 1581 | { 1582 | "cell_type": "markdown", 1583 | "metadata": {}, 1584 | "source": [ 1585 | "#### While the above plots visually explain the relationship between the two variables, to obtain success metrics I will begin by running a linear regression. " 1586 | ] 1587 | }, 1588 | { 1589 | "cell_type": "code", 1590 | "execution_count": 37, 1591 | "metadata": {}, 1592 | "outputs": [ 1593 | { 1594 | "name": "stdout", 1595 | "output_type": "stream", 1596 | "text": [ 1597 | "The regression intercept is: -24785847.7335\n", 1598 | "The regression coefficient is: [ 12445.196861]\n" 1599 | ] 1600 | } 1601 | ], 1602 | "source": [ 1603 | "### SCIKIT-LEARN ###\n", 1604 | "# create X and y\n", 1605 | "feature_cols = ['issueddate_yr']\n", 1606 | "X = df_no_null[feature_cols]\n", 1607 | "y = df_no_null.estprojectcost\n", 1608 | "\n", 1609 | "# instantiate and fit\n", 1610 | "lm = LinearRegression()\n", 1611 | "lm.fit(X,y)\n", 1612 | "\n", 1613 | "# print the coefficients\n", 1614 | "print(\"The regression intercept is: \"+ str(lm.intercept_))\n", 1615 | "print(\"The regression coefficient is: \"+ str(lm.coef_))" 1616 | ] 1617 | }, 1618 | { 1619 | "cell_type": "code", 1620 | "execution_count": 38, 1621 | "metadata": {}, 1622 | "outputs": [ 1623 | { 1624 | 
"name": "stdout", 1625 | "output_type": "stream", 1626 | "text": [ 1627 | "The mean absolute error for the linear regression model is: 105795.616638\n" 1628 | ] 1629 | } 1630 | ], 1631 | "source": [ 1632 | "y_pred = lm.predict(X)\n", 1633 | "y_true = df_no_null.estprojectcost\n", 1634 | "mae = mean_absolute_error(y_true, y_pred)\n", 1635 | "print(\"The mean absolute error for the linear regression model is: \" + str(mae))" 1636 | ] 1637 | }, 1638 | { 1639 | "cell_type": "code", 1640 | "execution_count": 39, 1641 | "metadata": {}, 1642 | "outputs": [ 1643 | { 1644 | "name": "stdout", 1645 | "output_type": "stream", 1646 | "text": [ 1647 | "The mean squared error for the linear regression model is: 682166512211.0\n" 1648 | ] 1649 | } 1650 | ], 1651 | "source": [ 1652 | "mse = mean_squared_error(y_true, y_pred)\n", 1653 | "print(\"The mean squared error for the linear regression model is: \" + str(mse))" 1654 | ] 1655 | }, 1656 | { 1657 | "cell_type": "code", 1658 | "execution_count": 40, 1659 | "metadata": {}, 1660 | "outputs": [ 1661 | { 1662 | "name": "stdout", 1663 | "output_type": "stream", 1664 | "text": [ 1665 | "The r-squared score for the linear regression model is: 0.00479073989308\n" 1666 | ] 1667 | } 1668 | ], 1669 | "source": [ 1670 | "r2 = r2_score(y_true, y_pred)\n", 1671 | "print(\"The r-squared score for the linear regression model is: \" + str(r2))" 1672 | ] 1673 | }, 1674 | { 1675 | "cell_type": "markdown", 1676 | "metadata": {}, 1677 | "source": [ 1678 | "#### From the above success metrics, we can see:\n", 1679 | "#### 1. The mean squared error is very large (about 6.8e11, a root mean squared error of roughly 826,000). This is due to the variance in the estimated cost of single-family homes. \n", 1680 | "#### 2. The r-squared value (about 0.005) shows that the Issued Date Year variable explains almost none of the variance in the Estimated Project Cost variable, even though the fitted slope is positive (the estimated project cost rises by about 12,445 per year).\n", 1681 | "#### 3. 
The metrics tell us that this is not the best model and that it can certainly be improved. One option is to fit a separate regression model for each quartile of the Estimated Project Cost variable against the Issued Date Year variable. This might give us better success metrics for each individual model." 1682 | ] 1683 | } 1684 | ], 1685 | "metadata": { 1686 | "kernelspec": { 1687 | "display_name": "Python 3 [ analysis-preview-py3 ]", 1688 | "language": "python", 1689 | "name": "analysis-preview-py3-latest" 1690 | }, 1691 | "language_info": { 1692 | "codemirror_mode": { 1693 | "name": "ipython", 1694 | "version": 3 1695 | }, 1696 | "file_extension": ".py", 1697 | "mimetype": "text/x-python", 1698 | "name": "python", 1699 | "nbconvert_exporter": "python", 1700 | "pygments_lexer": "ipython3", 1701 | "version": "3.6.6" 1702 | } 1703 | }, 1704 | "nbformat": 4, 1705 | "nbformat_minor": 2 1706 | } 1707 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Data Analyst Interview Project 2 | Provided directions to the data analysis project as part of the data analyst interview process 3 | 4 | ## Introduction 5 | The following is an outline for a simple data analysis project. It is being assigned as part of the interview for the Data Analyst position at ConnectWise, Inc. 6 | 7 | The aim of this test is to assess your overall preparedness for typical tasks that will be necessary to perform as part of a data science team. The project was designed to have an anticipated completion time of 60-90 minutes for an entry-level data analyst. 8 | 9 | ## Rules 10 | 11 | * Unless explicitly stated, please provide your analysis through code or comments. 12 | * You are free to use whatever language and environment desired to complete this task. However, we expect to be able to recreate your work if necessary. 
Please provide any environment files (requirements.txt, package.json, etc.) in your completed analysis. Interactive environments such as Jupyter Notebooks are preferred. 13 | * Submissions for this project will only be accepted and considered from applicants that have been specifically requested to do so. 14 | 15 | ## Submission 16 | Once complete, please send a link to your repository to [sresar@connectwise.com](mailto:sresar@connectwise.com). 17 | 18 | ## Directions 19 | 20 | 1. Fork this repository to create a new working copy for your work. 21 | 1. To conduct your analysis, we have provided a dataset for download [here](https://s3.amazonaws.com/cc-analytics-datasets/Building_Permits.csv). The provided dataset comes from the City of Raleigh Open Data website and is based upon pending/granted building permits. Documentation on the dataset can be found [here](http://data-ral.opendata.arcgis.com/datasets/building-permits). 22 | 1. Load the data from the provided source via web request rather than downloading a local copy and loading from disk. 23 | 1. Review the summary statistics for the included features. Please be sure to include the following in your exploratory data analysis: 24 | - Number of rows and columns in the dataset 25 | - Total different types of construction 26 | - Mean and median number of stories 27 | - Standard deviation for the X and Y coordinates of the permits 28 | 1. Plot the distributions for each of the following features: _Estimated Project Cost_ and _Issue Date Month_. Describe the distributions for these fields and explain what insights you might be able to gather. 29 | 1. The executive team is interested in the behavior between _Permit Issue Year_ and _Estimated Project Cost_, but only for "New" construction of type "V B" with less than 3 stories. Perform a simple regression analysis of this relationship and describe what insights we can glean from this using success metrics. 
_(Hint: Implement handling for missing values and explain your reasoning.)_ 30 | 1. Commit all changes and analysis, then email your completed submission. 31 | --------------------------------------------------------------------------------
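
The notebook closes by suggesting a separate regression for each quartile of estimated project cost, after handling missing values as the hint in the directions asks. A minimal sketch of that idea follows. It is not the notebook's code: the data is synthetic, the helper names `fit_line`, `r_squared` and `quartile` are hypothetical, and only the Python standard library is used so the example runs without the dataset. In the notebook itself the same steps would more naturally be written with pandas and scikit-learn.

```python
import random
from statistics import mean, quantiles


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def r_squared(xs, ys, intercept, slope):
    """Share of the variance in ys explained by the fitted line."""
    y_bar = mean(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Synthetic stand-in for (issueddate_yr, estprojectcost); NOT the real data.
random.seed(0)
rows = []
for _ in range(500):
    year = random.randint(2000, 2018)
    cost = random.lognormvariate(12, 0.8) + 5000 * (year - 2000)
    if random.random() < 0.05:
        cost = None  # simulate a missing estimated cost
    rows.append((year, cost))

# Missing-value handling: drop rows without a cost, since an imputed
# response value would bias the regression toward the imputation rule.
clean = [(year, cost) for year, cost in rows if cost is not None]

# Quartile cut points of the cost variable.
q1, q2, q3 = quantiles([cost for _, cost in clean], n=4)


def quartile(cost):
    """Return 1..4 depending on which cost quartile the value falls in."""
    return 1 + (cost > q1) + (cost > q2) + (cost > q3)


# Fit one regression of cost on year per quartile.
results = {}
for q in range(1, 5):
    xs = [year for year, cost in clean if quartile(cost) == q]
    ys = [cost for year, cost in clean if quartile(cost) == q]
    intercept, slope = fit_line(xs, ys)
    results[q] = (slope, r_squared(xs, ys, intercept, slope))

for q, (slope, r2) in results.items():
    print(f"quartile {q}: slope={slope:.1f}, r-squared={r2:.4f}")
```

One caveat with this design: the groups are formed on the dependent variable, so each model sees a truncated range of costs. The per-quartile metrics are therefore not directly comparable to those of the single model fitted on all rows.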