├── AB_Testing.ipynb ├── Conversion Rate.ipynb ├── Employee_Retention_PeopleAnalytics.ipynb ├── Identify_Fraudulent_Activities.ipynb ├── Machine_Learning_Algorithms_Python.ipynb ├── README.md └── raw-data └── readme /AB_Testing.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# A/B Test" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "A/B testing is a controlled experiment with two variants, A and B: a control group and an experiment group. It is a hypothesis test that checks whether there is a statistically or practically significant difference between the control and experiment groups.
\n", 15 | "A/B testing plays a vital role in website optimization." 16 | ] 17 | }, 18 | { 19 | "cell_type": "markdown", 20 | "metadata": {}, 21 | "source": [ 22 | "## Goal:\n", 23 | "### 1. Analyze results from an A/B Test\n", 24 | "### 2. Design an algorithm to automate some steps" 25 | ] 26 | }, 27 | { 28 | "cell_type": "markdown", 29 | "metadata": {}, 30 | "source": [ 31 | "#### Problem description:
\n", 32 | "Company XYZ is a worldwide e-commerce company. Its Spain-based users have a much higher conversion rate than users in any other Spanish-speaking country. The website for all Spanish-speaking countries was translated by a Spaniard.
\n", 33 | "They hypothesize that websites translated by local people will have a higher conversion rate. Therefore, they designed an A/B test to test this hypothesis." 34 | ] 35 | }, 36 | { 37 | "cell_type": "code", 38 | "execution_count": 348, 39 | "metadata": {}, 40 | "outputs": [], 41 | "source": [ 42 | "import pandas as pd\n", 43 | "import numpy as np\n", 44 | "from scipy import stats\n", 45 | "import matplotlib.pyplot as plt\n", 46 | "%matplotlib inline\n", 47 | "import seaborn as sns" 48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": 288, 53 | "metadata": {}, 54 | "outputs": [], 55 | "source": [ 56 | "# load the two tables into pandas DataFrames\n", 57 | "test = pd.read_csv(r'C:\\Users\\lshen\\Downloads\\Translation_Test\\test_table.csv')\n", 58 | "user = pd.read_csv(r'C:\\Users\\lshen\\Downloads\\Translation_Test\\user_table.csv')" 59 | ] 60 | }, 61 | { 62 | "cell_type": "markdown", 63 | "metadata": {}, 64 | "source": [ 65 | "### Step 1: Data Exploration" 66 | ] 67 | }, 68 | { 69 | "cell_type": "code", 70 | "execution_count": 289, 71 | "metadata": {}, 72 | "outputs": [ 73 | { 74 | "data": { 75 | "text/html": [ 76 | "
" 169 | ], 170 | "text/plain": [ 171 | " user_id date source device browser_language ads_channel \\\n", 172 | "0 315281 2015-12-03 Direct Web ES NaN \n", 173 | "1 497851 2015-12-04 Ads Web ES Google \n", 174 | "2 848402 2015-12-04 Ads Web ES Facebook \n", 175 | "3 290051 2015-12-03 Ads Mobile Other Facebook \n", 176 | "4 548435 2015-11-30 Ads Web ES Google \n", 177 | "\n", 178 | " browser conversion test \n", 179 | "0 IE 1 0 \n", 180 | "1 IE 0 1 \n", 181 | "2 Chrome 0 0 \n", 182 | "3 Android_App 0 1 \n", 183 | "4 FireFox 0 1 " 184 | ] 185 | }, 186 | "execution_count": 289, 187 | "metadata": {}, 188 | "output_type": "execute_result" 189 | } 190 | ], 191 | "source": [ 192 | "test.head()" 193 | ] 194 | }, 195 | { 196 | "cell_type": "code", 197 | "execution_count": 290, 198 | "metadata": {}, 199 | "outputs": [ 200 | { 201 | "data": { 202 | "text/html": [ 203 | "
\n", 265 | "
" 266 | ], 267 | "text/plain": [ 268 | " user_id sex age country\n", 269 | "0 765821 M 20 Mexico\n", 270 | "1 343561 F 27 Nicaragua\n", 271 | "2 118744 M 23 Colombia\n", 272 | "3 987753 F 27 Venezuela\n", 273 | "4 554597 F 20 Spain" 274 | ] 275 | }, 276 | "execution_count": 290, 277 | "metadata": {}, 278 | "output_type": "execute_result" 279 | } 280 | ], 281 | "source": [ 282 | "user.head()" 283 | ] 284 | }, 285 | { 286 | "cell_type": "code", 287 | "execution_count": 291, 288 | "metadata": {}, 289 | "outputs": [ 290 | { 291 | "name": "stdout", 292 | "output_type": "stream", 293 | "text": [ 294 | "Total number of user_id: 453321\n", 295 | "Total number of user_id: 453321\n" 296 | ] 297 | } 298 | ], 299 | "source": [ 300 | "# check whether user_id is unique in the test table: yes, the row count equals the unique count, so each user_id has exactly one record\n", 301 | "print ('Total number of user_id: {}'.format(test.user_id.size))\n", 302 | "print ('Total number of user_id: {}'.format(test.user_id.nunique()))" 303 | ] 304 | }, 305 | { 306 | "cell_type": "code", 307 | "execution_count": 292, 308 | "metadata": {}, 309 | "outputs": [ 310 | { 311 | "name": "stdout", 312 | "output_type": "stream", 313 | "text": [ 314 | "Total records in test table: 453321\n", 315 | "Total records in user table: 452867\n" 316 | ] 317 | } 318 | ], 319 | "source": [ 320 | "print ('Total records in test table: {}'.format(len(test)))\n", 321 | "print ('Total records in user table: {}'.format(len(user)))" 322 | ] 323 | }, 324 | { 325 | "cell_type": "markdown", 326 | "metadata": {}, 327 | "source": [ 328 | "The counts above show that some user_id values in the test table do not exist in the user table. Because the analysis is broken down by country, which is a very important variable, we will drop the records that have no demographic information." 
329 | ] 330 | }, 331 | { 332 | "cell_type": "code", 333 | "execution_count": 293, 334 | "metadata": {}, 335 | "outputs": [], 336 | "source": [ 337 | "# merge two tables based on user_id, which will return the records with demographic info.\n", 338 | "data = test.merge(user,how = 'inner', on='user_id')" 339 | ] 340 | }, 341 | { 342 | "cell_type": "code", 343 | "execution_count": 294, 344 | "metadata": {}, 345 | "outputs": [ 346 | { 347 | "data": { 348 | "text/html": [ 349 | "
" 460 | ], 461 | "text/plain": [ 462 | " user_id date source device browser_language ads_channel \\\n", 463 | "0 315281 2015-12-03 Direct Web ES NaN \n", 464 | "1 497851 2015-12-04 Ads Web ES Google \n", 465 | "2 848402 2015-12-04 Ads Web ES Facebook \n", 466 | "3 290051 2015-12-03 Ads Mobile Other Facebook \n", 467 | "4 548435 2015-11-30 Ads Web ES Google \n", 468 | "\n", 469 | " browser conversion test sex age country \n", 470 | "0 IE 1 0 M 32 Spain \n", 471 | "1 IE 0 1 M 21 Mexico \n", 472 | "2 Chrome 0 0 M 34 Spain \n", 473 | "3 Android_App 0 1 F 22 Mexico \n", 474 | "4 FireFox 0 1 M 19 Mexico " 475 | ] 476 | }, 477 | "execution_count": 294, 478 | "metadata": {}, 479 | "output_type": "execute_result" 480 | } 481 | ], 482 | "source": [ 483 | "data.head()" 484 | ] 485 | }, 486 | { 487 | "cell_type": "code", 488 | "execution_count": 295, 489 | "metadata": {}, 490 | "outputs": [ 491 | { 492 | "data": { 493 | "text/plain": [ 494 | "(452867, 12)" 495 | ] 496 | }, 497 | "execution_count": 295, 498 | "metadata": {}, 499 | "output_type": "execute_result" 500 | } 501 | ], 502 | "source": [ 503 | "data.shape" 504 | ] 505 | }, 506 | { 507 | "cell_type": "code", 508 | "execution_count": 296, 509 | "metadata": {}, 510 | "outputs": [ 511 | { 512 | "data": { 513 | "text/plain": [ 514 | "user_id int64\n", 515 | "date object\n", 516 | "source object\n", 517 | "device object\n", 518 | "browser_language object\n", 519 | "ads_channel object\n", 520 | "browser object\n", 521 | "conversion int64\n", 522 | "test int64\n", 523 | "sex object\n", 524 | "age int64\n", 525 | "country object\n", 526 | "dtype: object" 527 | ] 528 | }, 529 | "execution_count": 296, 530 | "metadata": {}, 531 | "output_type": "execute_result" 532 | } 533 | ], 534 | "source": [ 535 | "# check columns' data types\n", 536 | "data.dtypes" 537 | ] 538 | }, 539 | { 540 | "cell_type": "code", 541 | "execution_count": 297, 542 | "metadata": {}, 543 | "outputs": [ 544 | { 545 | "data": { 546 | "text/html": [ 547 | "
" 748 | ], 749 | "text/plain": [ 750 | " user_id date source device browser_language \\\n", 751 | "count 452867.000000 452867 452867 452867 452867 \n", 752 | "unique NaN 5 3 2 3 \n", 753 | "top NaN 2015-12-04 Ads Web ES \n", 754 | "freq NaN 141024 181693 251316 377160 \n", 755 | "mean 499944.805166 NaN NaN NaN NaN \n", 756 | "std 288676.264784 NaN NaN NaN NaN \n", 757 | "min 1.000000 NaN NaN NaN NaN \n", 758 | "25% 249819.000000 NaN NaN NaN NaN \n", 759 | "50% 500019.000000 NaN NaN NaN NaN \n", 760 | "75% 749543.000000 NaN NaN NaN NaN \n", 761 | "max 1000000.000000 NaN NaN NaN NaN \n", 762 | "\n", 763 | " ads_channel browser conversion test sex \\\n", 764 | "count 181693 452867 452867.000000 452867.000000 452867 \n", 765 | "unique 5 7 NaN NaN 2 \n", 766 | "top Facebook Android_App NaN NaN M \n", 767 | "freq 68358 154977 NaN NaN 264485 \n", 768 | "mean NaN NaN 0.049560 0.476462 NaN \n", 769 | "std NaN NaN 0.217034 0.499446 NaN \n", 770 | "min NaN NaN 0.000000 0.000000 NaN \n", 771 | "25% NaN NaN 0.000000 0.000000 NaN \n", 772 | "50% NaN NaN 0.000000 0.000000 NaN \n", 773 | "75% NaN NaN 0.000000 1.000000 NaN \n", 774 | "max NaN NaN 1.000000 1.000000 NaN \n", 775 | "\n", 776 | " age country \n", 777 | "count 452867.000000 452867 \n", 778 | "unique NaN 17 \n", 779 | "top NaN Mexico \n", 780 | "freq NaN 128484 \n", 781 | "mean 27.130740 NaN \n", 782 | "std 6.776678 NaN \n", 783 | "min 18.000000 NaN \n", 784 | "25% 22.000000 NaN \n", 785 | "50% 26.000000 NaN \n", 786 | "75% 31.000000 NaN \n", 787 | "max 70.000000 NaN " 788 | ] 789 | }, 790 | "execution_count": 297, 791 | "metadata": {}, 792 | "output_type": "execute_result" 793 | } 794 | ], 795 | "source": [ 796 | "data.describe(include = 'all')" 797 | ] 798 | }, 799 | { 800 | "cell_type": "code", 801 | "execution_count": 298, 802 | "metadata": {}, 803 | "outputs": [ 804 | { 805 | "data": { 806 | "text/plain": [ 807 | "user_id 0\n", 808 | "date 0\n", 809 | "source 0\n", 810 | "device 0\n", 811 | "browser_language 0\n", 
812 | "ads_channel 271174\n", 813 | "browser 0\n", 814 | "conversion 0\n", 815 | "test 0\n", 816 | "sex 0\n", 817 | "age 0\n", 818 | "country 0\n", 819 | "dtype: int64" 820 | ] 821 | }, 822 | "execution_count": 298, 823 | "metadata": {}, 824 | "output_type": "execute_result" 825 | } 826 | ], 827 | "source": [ 828 | "# check whether there are any null values\n", 829 | "# about 60% of ads_channel values are missing\n", 830 | "data.isnull().sum()" 831 | ] 832 | }, 833 | { 834 | "cell_type": "code", 835 | "execution_count": 299, 836 | "metadata": {}, 837 | "outputs": [ 838 | { 839 | "data": { 840 | "text/plain": [ 841 | "2015-12-04 141024\n", 842 | "2015-12-03 99399\n", 843 | "2015-11-30 70948\n", 844 | "2015-12-01 70915\n", 845 | "2015-12-02 70581\n", 846 | "Name: date, dtype: int64" 847 | ] 848 | }, 849 | "execution_count": 299, 850 | "metadata": {}, 851 | "output_type": "execute_result" 852 | } 853 | ], 854 | "source": [ 855 | "data.date.value_counts()" 856 | ] 857 | }, 858 | { 859 | "cell_type": "code", 860 | "execution_count": 300, 861 | "metadata": {}, 862 | "outputs": [ 863 | { 864 | "data": { 865 | "text/plain": [ 866 | "Ads 181693\n", 867 | "SEO 180436\n", 868 | "Direct 90738\n", 869 | "Name: source, dtype: int64" 870 | ] 871 | }, 872 | "execution_count": 300, 873 | "metadata": {}, 874 | "output_type": "execute_result" 875 | } 876 | ], 877 | "source": [ 878 | "data.source.value_counts()" 879 | ] 880 | }, 881 | { 882 | "cell_type": "code", 883 | "execution_count": 301, 884 | "metadata": {}, 885 | "outputs": [ 886 | { 887 | "data": { 888 | "text/plain": [ 889 | "Web 251316\n", 890 | "Mobile 201551\n", 891 | "Name: device, dtype: int64" 892 | ] 893 | }, 894 | "execution_count": 301, 895 | "metadata": {}, 896 | "output_type": "execute_result" 897 | } 898 | ], 899 | "source": [ 900 | "data.device.value_counts()" 901 | ] 902 | }, 903 | { 904 | "cell_type": "code", 905 | "execution_count": 302, 906 | "metadata": {}, 907 | "outputs": [ 908 | { 909 | "data": { 910 | 
"text/plain": [ 911 | "ES 377160\n", 912 | "EN 63079\n", 913 | "Other 12628\n", 914 | "Name: browser_language, dtype: int64" 915 | ] 916 | }, 917 | "execution_count": 302, 918 | "metadata": {}, 919 | "output_type": "execute_result" 920 | } 921 | ], 922 | "source": [ 923 | "data.browser_language.value_counts()" 924 | ] 925 | }, 926 | { 927 | "cell_type": "code", 928 | "execution_count": 303, 929 | "metadata": {}, 930 | "outputs": [ 931 | { 932 | "data": { 933 | "text/plain": [ 934 | "Facebook 68358\n", 935 | "Google 68113\n", 936 | "Yahoo 27409\n", 937 | "Bing 13670\n", 938 | "Other 4143\n", 939 | "Name: ads_channel, dtype: int64" 940 | ] 941 | }, 942 | "execution_count": 303, 943 | "metadata": {}, 944 | "output_type": "execute_result" 945 | } 946 | ], 947 | "source": [ 948 | "data.ads_channel.value_counts()" 949 | ] 950 | }, 951 | { 952 | "cell_type": "code", 953 | "execution_count": 304, 954 | "metadata": {}, 955 | "outputs": [ 956 | { 957 | "data": { 958 | "text/plain": [ 959 | "Android_App 154977\n", 960 | "Chrome 101822\n", 961 | "IE 61656\n", 962 | "Iphone_App 46574\n", 963 | "Safari 41033\n", 964 | "FireFox 40721\n", 965 | "Opera 6084\n", 966 | "Name: browser, dtype: int64" 967 | ] 968 | }, 969 | "execution_count": 304, 970 | "metadata": {}, 971 | "output_type": "execute_result" 972 | } 973 | ], 974 | "source": [ 975 | "data.browser.value_counts()" 976 | ] 977 | }, 978 | { 979 | "cell_type": "code", 980 | "execution_count": 305, 981 | "metadata": {}, 982 | "outputs": [ 983 | { 984 | "data": { 985 | "text/plain": [ 986 | "0 430423\n", 987 | "1 22444\n", 988 | "Name: conversion, dtype: int64" 989 | ] 990 | }, 991 | "execution_count": 305, 992 | "metadata": {}, 993 | "output_type": "execute_result" 994 | } 995 | ], 996 | "source": [ 997 | "data.conversion.value_counts()" 998 | ] 999 | }, 1000 | { 1001 | "cell_type": "code", 1002 | "execution_count": 306, 1003 | "metadata": {}, 1004 | "outputs": [ 1005 | { 1006 | "data": { 1007 | "text/plain": [ 1008 | "0 
237093\n", 1009 | "1 215774\n", 1010 | "Name: test, dtype: int64" 1011 | ] 1012 | }, 1013 | "execution_count": 306, 1014 | "metadata": {}, 1015 | "output_type": "execute_result" 1016 | } 1017 | ], 1018 | "source": [ 1019 | "data.test.value_counts()" 1020 | ] 1021 | }, 1022 | { 1023 | "cell_type": "code", 1024 | "execution_count": 307, 1025 | "metadata": {}, 1026 | "outputs": [ 1027 | { 1028 | "data": { 1029 | "text/plain": [ 1030 | "Mexico 128484\n", 1031 | "Colombia 54060\n", 1032 | "Spain 51782\n", 1033 | "Argentina 46733\n", 1034 | "Peru 33666\n", 1035 | "Venezuela 32054\n", 1036 | "Chile 19737\n", 1037 | "Ecuador 15895\n", 1038 | "Guatemala 15125\n", 1039 | "Bolivia 11124\n", 1040 | "Honduras 8568\n", 1041 | "El Salvador 8175\n", 1042 | "Paraguay 7347\n", 1043 | "Nicaragua 6723\n", 1044 | "Costa Rica 5309\n", 1045 | "Uruguay 4134\n", 1046 | "Panama 3951\n", 1047 | "Name: country, dtype: int64" 1048 | ] 1049 | }, 1050 | "execution_count": 307, 1051 | "metadata": {}, 1052 | "output_type": "execute_result" 1053 | } 1054 | ], 1055 | "source": [ 1056 | "data.country.value_counts()" 1057 | ] 1058 | }, 1059 | { 1060 | "cell_type": "markdown", 1061 | "metadata": {}, 1062 | "source": [ 1063 | "Let's first confirm that, in the control group (test == 0), Spain converts better than the other countries." 
1064 | ] 1065 | }, 1066 | { 1067 | "cell_type": "code", 1068 | "execution_count": 358, 1069 | "metadata": {}, 1070 | "outputs": [ 1071 | { 1072 | "data": { 1073 | "text/plain": [ 1074 | "country\n", 1075 | "Spain 0.079719\n", 1076 | "El Salvador 0.053554\n", 1077 | "Nicaragua 0.052647\n", 1078 | "Costa Rica 0.052256\n", 1079 | "Colombia 0.052089\n", 1080 | "Honduras 0.050906\n", 1081 | "Guatemala 0.050643\n", 1082 | "Venezuela 0.050344\n", 1083 | "Peru 0.049914\n", 1084 | "Mexico 0.049495\n", 1085 | "Bolivia 0.049369\n", 1086 | "Ecuador 0.049154\n", 1087 | "Paraguay 0.048493\n", 1088 | "Chile 0.048107\n", 1089 | "Panama 0.046796\n", 1090 | "Argentina 0.015071\n", 1091 | "Uruguay 
0.012048\n", 1092 | "Name: conversion, dtype: float64" 1093 | ] 1094 | }, 1095 | "execution_count": 358, 1096 | "metadata": {}, 1097 | "output_type": "execute_result" 1098 | } 1099 | ], 1100 | "source": [ 1101 | "# Yes, Spain has the highest conversion rate.\n", 1102 | "data[data['test']==0].groupby('country').conversion.mean().sort_values(ascending = False)" 1103 | ] 1104 | }, 1105 | { 1106 | "cell_type": "code", 1107 | "execution_count": 362, 1108 | "metadata": {}, 1109 | "outputs": [], 1110 | "source": [ 1111 | "# group by country, and do NOT set country as index\n", 1112 | "data_country = data[data['test']==0].groupby('country', as_index = False).conversion.mean()" 1113 | ] 1114 | }, 1115 | { 1116 | "cell_type": "code", 1117 | "execution_count": 368, 1118 | "metadata": {}, 1119 | "outputs": [ 1120 | { 1121 | "data": { 1122 | "text/plain": [ 1123 | "" 1124 | ] 1125 | }, 1126 | "execution_count": 368, 1127 | "metadata": {}, 1128 | "output_type": "execute_result" 1129 | }, 1130 | { 1131 | "data": { 1132 | "image/png": "\n", 1133 | "text/plain": [ 1134 | "" 1135 | ] 1136 | }, 1137 | "metadata": {}, 1138 | "output_type": "display_data" 1139 | } 1140 | ], 1141 | "source": [ 1142 | "g = sns.catplot(x = 'country', y = 'conversion', \\\n", 1143 | " data = data[data['test']==0].groupby('country', as_index = False).conversion.mean(),\\\n", 1144 | " kind = 'bar', height = 3, aspect = 5)\n", 1145 | "g.set_xticklabels(rotation=30)" 1146 | ] 1147 | }, 1148 | { 1149 | "cell_type": "markdown", 1150 | "metadata": {}, 1151 | "source": [ 1152 | "### Step 2: Calculate the t statistic" 1153 | ] 1154 | }, 1155 | { 1156 | "cell_type": "markdown", 1157 | "metadata": {}, 1158 | "source": [ 1159 | "#### Hypothesis:\n", 1160 | "Null Hypothesis: the population mean (conversion rate) of the local-translation group is the same as the population mean of the Spaniard-translation group: mu1 = mu2
\n", 1161 | "Alternative Hypothesis: mu1 != mu2
\n", 1162 | "We use a significance level of alpha = 0.05 and a two-tailed test." 1163 | ] 1164 | }, 1165 | { 1166 | "cell_type": "markdown", 1167 | "metadata": {}, 1168 | "source": [ 1169 | "Break the data into two groups, control and experiment, excluding the Spain data" 1170 | ] 1171 | }, 1172 | { 1173 | "cell_type": "code", 1174 | "execution_count": 309, 1175 | "metadata": {}, 1176 | "outputs": [], 1177 | "source": [ 1178 | "controll = data[(data['test']==0)&(data['country']!='Spain')]\n", 1179 | "exp = data[(data['test']==1)&(data['country']!='Spain')]" 1180 | ] 1181 | }, 1182 | { 1183 | "cell_type": "markdown", 1184 | "metadata": {}, 1185 | "source": [ 1186 | "#### I will use the example below to illustrate Simpson's Paradox" 1187 | ] 1188 | }, 1189 | { 1190 | "cell_type": "markdown", 1191 | "metadata": {}, 1192 | "source": [ 1193 | "Compare the control and test groups across all countries as a whole" 1194 | ] 1195 | }, 1196 | { 1197 | "cell_type": "code", 1198 | "execution_count": 314, 1199 | "metadata": {}, 1200 | "outputs": [ 1201 | { 1202 | "name": "stdout", 1203 | "output_type": "stream", 1204 | "text": [ 1205 | "The avg conversion rate of control group: 0.04829179055749524\n", 1206 | "The avg conversion rate of exp group: 0.043411161678422794\n" 1207 | ] 1208 | } 1209 | ], 1210 | "source": [ 1211 | "# calculate the mean conversion rate for both groups\n", 1212 | "print ('The avg conversion rate of control group: {}'.format(controll.conversion.mean()))\n", 1213 | "print ('The avg conversion rate of exp group: {}'.format(exp.conversion.mean()))" 1214 | ] 1215 | }, 1216 | { 1217 | "cell_type": "code", 1218 | "execution_count": 315, 1219 | "metadata": {}, 1220 | "outputs": [], 1221 | "source": [ 1222 | "# calculate the t statistic and p-value (Welch's t-test, unequal variances)\n", 1223 | "t,p = stats.ttest_ind(a=controll['conversion'], b=exp['conversion'],equal_var = False)" 1224 | ] 1225 | }, 1226 | { 1227 | "cell_type": "code", 1228 | "execution_count": 316, 1229 | "metadata": {}, 1230 | 
"outputs": [ 1231 | { 1232 | "name": "stdout", 1233 | "output_type": "stream", 1234 | "text": [ 1235 | "7.35389520308 1.92891785778e-13\n" 1236 | ] 1237 | } 1238 | ], 1239 | "source": [ 1240 | "print (t,p)" 1241 | ] 1242 | }, 1243 | { 1244 | "cell_type": "markdown", 1245 | "metadata": {}, 1246 | "source": [ 1247 | "Based on the analysis above, the test group performs significantly worse than the control group. It seems that after the change, the conversion rate drops significantly!
\n", 1248 | "When a result differs this much from what we expected, something is probably wrong with the experiment.
\n", 1249 | "Let's dig deeper into the sample" 1250 | ] 1251 | }, 1252 | { 1253 | "cell_type": "code", 1254 | "execution_count": 319, 1255 | "metadata": {}, 1256 | "outputs": [ 1257 | { 1258 | "data": { 1259 | "text/plain": [ 1260 | "date\n", 1261 | "2015-11-30 0.051204\n", 1262 | "2015-12-01 0.046249\n", 1263 | "2015-12-02 0.048472\n", 1264 | "2015-12-03 0.049255\n", 1265 | "2015-12-04 0.047085\n", 1266 | "Name: conversion, dtype: float64" 1267 | ] 1268 | }, 1269 | "execution_count": 319, 1270 | "metadata": {}, 1271 | "output_type": "execute_result" 1272 | } 1273 | ], 1274 | "source": [ 1275 | "# the conversion rate in the test group is consistently lower on every day\n", 1276 | "controll.groupby('date').conversion.mean()" 1277 | ] 1278 | }, 1279 | { 1280 | "cell_type": "code", 1281 | "execution_count": 320, 1282 | "metadata": {}, 1283 | "outputs": [ 1284 | { 1285 | "data": { 1286 | "text/plain": [ 1287 | "date\n", 1288 | "2015-11-30 0.043878\n", 1289 | "2015-12-01 0.041371\n", 1290 | "2015-12-02 0.044216\n", 1291 | "2015-12-03 0.043898\n", 1292 | "2015-12-04 0.043459\n", 1293 | "Name: conversion, dtype: float64" 1294 | ] 1295 | }, 1296 | "execution_count": 320, 1297 | "metadata": {}, 1298 | "output_type": "execute_result" 1299 | } 1300 | ], 1301 | "source": [ 1302 | "exp.groupby('date').conversion.mean()" 1303 | ] 1304 | }, 1305 | { 1306 | "cell_type": "code", 1307 | "execution_count": 321, 1308 | "metadata": {}, 1309 | "outputs": [], 1310 | "source": [ 1311 | "c_country = pd.Series(controll.groupby('country').size(), name = 'controll')" 1312 | ] 1313 | }, 1314 | { 1315 | "cell_type": "code", 1316 | "execution_count": 322, 1317 | "metadata": {}, 1318 | "outputs": [], 1319 | "source": [ 1320 | "e_country = pd.Series(exp.groupby('country').size(), name = 'exp')" 1321 | ] 1322 | }, 1323 | { 1324 | "cell_type": "code", 1325 | "execution_count": 323, 1326 | "metadata": {}, 1327 | "outputs": [ 1328 | { 1329 | "data": { 1330 | "text/html": [ 1331 | "
" 1442 | ], 1443 | "text/plain": [ 1444 | " controll exp\n", 1445 | "country \n", 1446 | "Argentina 9356 37377\n", 1447 | "Bolivia 5550 5574\n", 1448 | "Chile 9853 9884\n", 1449 | "Colombia 27088 26972\n", 1450 | "Costa Rica 2660 2649\n", 1451 | "Ecuador 8036 7859\n", 1452 | "El Salvador 4108 4067\n", 1453 | "Guatemala 7622 7503\n", 1454 | "Honduras 4361 4207\n", 1455 | "Mexico 64209 64275\n", 1456 | "Nicaragua 3419 3304\n", 1457 | "Panama 1966 1985\n", 1458 | "Paraguay 3650 3697\n", 1459 | "Peru 16869 16797\n", 1460 | "Uruguay 415 3719\n", 1461 | "Venezuela 16149 15905" 1462 | ] 1463 | }, 1464 | "execution_count": 323, 1465 | "metadata": {}, 1466 | "output_type": "execute_result" 1467 | } 1468 | ], 1469 | "source": [ 1470 | "pd.concat([c_country,e_country],axis = 1)" 1471 | ] 1472 | }, 1473 | { 1474 | "cell_type": "code", 1475 | "execution_count": 324, 1476 | "metadata": {}, 1477 | "outputs": [ 1478 | { 1479 | "data": { 1480 | "text/plain": [ 1481 | "" 1482 | ] 1483 | }, 1484 | "execution_count": 324, 1485 | "metadata": {}, 1486 | "output_type": "execute_result" 1487 | }, 1488 | { 1489 | "data": { 1490 | "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAYcAAAE4CAYAAACwgj/eAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAIABJREFUeJzt3XmcXFWZ//HPN2EJQtgDOgQmqFEIGBYDMqAgZIgwIKDCsAhEWaIOCuqA4jgOCKK4MsYNUQIB2VcRQUB2QoAkEMIS+BEhQAaEQBBZRAw8vz/OqaTStzrdfe+t9ML3/XrVq+ueuvXUqe6ueu4921VEYGZm1mxQb1fAzMz6HicHMzMrcHIwM7MCJwczMytwcjAzswInBzMzK3ByMDOzAicHMzMrcHIwM7OC5Xq7AmWtvfbaMWLEiN6uhplZvzFjxoznImJYd/btt8lhxIgRTJ8+vberYWbWb0h6vLv7ulnJzMwKnBzMzKzAycHMzAr6bZ+Dmb11/eMf/2DevHm89tprvV2VPmnIkCEMHz6c5ZdfvnQMJwcz63fmzZvH0KFDGTFiBJJ6uzp9SkTw/PPPM2/ePDbccMPScdysZGb9zmuvvcZaa63lxNCCJNZaa63KZ1VODmbWLzkxdK6O342Tg5mZFbjPwcz6vRHH/r7WeHNP3q3WeC1fY+5cbr/9dg444IAeP2/33Xfn/vvv56abbuIHP/gBV155Ze31c3Iw66dafSHOHdLJF83xL7a5NtZTc+fO5dxzz22ZHBYuXMhyy/Xu17OblczMSjjrrLMYPXo0m222GQcddBCPP/44Y8eOZfTo0YwdO5YnnngCgE996lMceeSRbLvttrzzne/k4osvBuDYY4/l1ltvZfPNN+eUU07hzDPPZJ999uGjH/0o48aNIyI45phj2HTTTXnf+97HBRdcsEzfn88czMx66IEHHuCkk05iypQprL322ixYsIDx48dz8MEHM378eCZNmsSRRx7J5ZdfDsDTTz/NbbfdxkMPPcQee+zB3nvvzcknn7xEk9CZZ57J1KlTmTVrFmuuuSaXXHIJM2fO5N577+W5555jq622Yvvtt19m79FnDmZmPXTDDTew9957s/baawOw5pprMnXq1EVNRAcddBC33Xbbov332msvBg0axKhRo3jmmWc6jbvzzjuz5pprAnDbbbex//77M3jwYNZdd1122GEHpk2b1sZ3tSQnBzOzHoqILoeLNj++4oorLvHczqy88srd2m9ZcHIwM+uhsWPHcuGFF/L8888DsGDBArbddlvOP/98AM455xw++MEPLjXG0KFDeemllzp9fPvtt+eCCy7gjTfeYP78+dxyyy1svfXW9b2JLrjPwcz6vWUx9LTZJptswte//nV22GEHBg8ezBZbbMHEiRM55JBD+P73v8+wYcM444wzlhpj9OjRLLfccmy22WZ86lOfYo011lji8Y997GNMnTqVzTbbDEl873vf4+1vfztz585t4ztbTL196lLWmDFjwhf7sbeyt/JQ1tmzZ7Pxxhv3djX6tFa/I0kzImJMd57vZiUzMytwcjAzs4JuJQdJq0u6WNJDkmZL+hdJa0q6TtIj+ecaeV9JmihpjqRZkrZsijM+7/+IpPFN5e+XdF9+zkR5RS0zs17V3TOHHwN/iIiNgM2A2cCxwPURMRK4Pm8D7AqMzLcJwC8AJK0JHAd8ANgaOK6RUPI+E5qet0u1t2VmZlV0mRwkrQpsD5wOEBGvR8RfgD2ByXm3ycBe+f6ewFmR3AGsLukdwEeA6yJiQUS8AFwH7JIfWzUipkbqHT+rKZaZmfWC7pw5vBOYD5wh6R5Jv5a0MrBuRDwNkH+uk/dfD3iy6fnzctnSyue1KC+QNEHSdEnT58+f342qm5lZGd2Z57AcsCXwhYi4U9KPWdyE1Eqr/oIoUV4sjDgNOA3SUNalVdrM3kKOX63meANr6G8Z3TlzmAfMi4g78/bFpGTxTG4SIv98tmn/9ZuePxx4qovy4S3Kzcysl3SZHCLiz8CTkt6bi8YCDwJXAI0RR+OB3+b7VwAH51FL2wAv5mana4Bxktb
IHdHjgGvyYy9J2iaPUjq4KZaZWZ/0m9/8hq233prNN9+cz3zmMzz++OOMHDmS5557jjfffJMPfehDXHvttcydO5eNNtqI8ePHM3r0aPbee29effXV3q5+l7o7WukLwDmSZgGbA98GTgZ2lvQIsHPeBrgKeBSYA/wK+A+AiFgAnAhMy7cTchnA54Bf5+f8Cbi62tsyM2uf2bNnc8EFFzBlyhRmzpzJ4MGDufnmm/nqV7/KZz/7WX74wx8yatQoxo0bB8DDDz/MhAkTmDVrFquuuio///nPe/kddK1baytFxEyg1ZTrsS32DeCITuJMAia1KJ8ObNqdupiZ9bbrr7+eGTNmsNVWWwHwt7/9jXXWWYfjjz+eiy66iFNPPZWZM2cu2n/99ddnu+22A+DAAw9k4sSJHH300b1S9+7ywntmZj0UEYwfP57vfOc7S5S/+uqrzJuXBl++/PLLDB06FKCwvHd/mOfr5TPMzHpo7NixXHzxxTz7bBqHs2DBAh5//HG++tWv8slPfpITTjiBww8/fNH+TzzxBFOnTgXgvPPO63I5777AZw5m1v8t46Gno0aN4lvf+hbjxo3jzTffZPnll+dHP/oR06ZNY8qUKQwePJhLLrmEM844gx133JGNN96YyZMn85nPfIaRI0fyuc99bpnWtwwnBzOzEvbdd1/23XffJcruuOOORfcvvfRSAObOncugQYM49dRTl2n9qnKzkpmZFTg5mJm10YgRI7j//vt7uxo95uRgZv1Sf72K5bJQx+/GycHM+p0hQ4bw/PPPO0G0EBE8//zzDBkypFIcd0ibWb8zfPhw5s2bh1dnbm3IkCEMHz686x2XwsnBzPqd5Zdfng033LC3qzGguVnJzMwKnBzMzKzAycHMzAqcHMzMrMDJwczMCpwczMyswMnBzMwKnBzMzKzAycHMzAqcHMzMrMDJwczMCpwczMyswMnBzMwKupUcJM2VdJ+kmZKm57I1JV0n6ZH8c41cLkkTJc2RNEvSlk1xxuf9H5E0vqn8/Tn+nPxc1f1Gzcys+3py5rBjRGweEWPy9rHA9RExErg+bwPsCozMtwnALyAlE+A44APA1sBxjYSS95nQ9LxdSr8jMzOrrEqz0p7A5Hx/MrBXU/lZkdwBrC7pHcBHgOsiYkFEvABcB+ySH1s1IqZGuqzTWU2xzMysF3Q3OQRwraQZkibksnUj4mmA/HOdXL4e8GTTc+flsqWVz2tRbmZmvaS7V4LbLiKekrQOcJ2kh5ayb6v+gihRXgycEtMEgA022GDpNTYzs9K6deYQEU/ln88Cl5H6DJ7JTULkn8/m3ecB6zc9fTjwVBflw1uUt6rHaRExJiLGDBs2rDtVNzOzErpMDpJWljS0cR8YB9wPXAE0RhyNB36b718BHJxHLW0DvJibna4BxklaI3dEjwOuyY+9JGmbPErp4KZYZmbWC7rTrLQucFkeXboccG5E/EHSNOBCSYcCTwD75P2vAv4NmAO8CnwaICIWSDoRmJb3OyEiFuT7nwPOBFYCrs43MzPrJV0mh4h4FNisRfnzwNgW5QEc0UmsScCkFuXTgU27UV8zM1sGPEPazMwKnBzMzKzAycHMzAqcHMzMrMDJwczMCpwczMyswMnBzMwKnBzMzKzAycHMzAqcHMzMrMDJwczMCpwczMyswMnBzMwKnBzMzKzAycHMzAqcHMzMrMDJwczMCpwczMyswMnBzMwKnBzMzKzAycHMzAqcHMzMrMDJwczMCrqdHCQNlnSPpCvz9oaS7pT0iKQLJK2Qy1fM23Py4yOaYnwtlz8s6SNN5bvksjmSjq3v7ZmZWRk9OXM4CpjdtP1d4JSIGAm8AByayw8FXoiIdwOn5P2QNArYD9gE2AX4eU44g4GfAbsCo4D9875mZtZLupUcJA0HdgN+nbcF7ARcnHeZDOyV7++Zt8mPj8377wmcHxF/j4jHgDnA1vk2JyIejYjXgfPzvmZm1ku6e+bwv8BXgDfz9lr
AXyJiYd6eB6yX768HPAmQH38x77+ovMNzOis3M7Ne0mVykLQ78GxEzGgubrFrdPFYT8tb1WWCpOmSps+fP38ptTYzsyq6c+awHbCHpLmkJp+dSGcSq0taLu8zHHgq358HrA+QH18NWNBc3uE5nZUXRMRpETEmIsYMGzasG1U3M7MyukwOEfG1iBgeESNIHco3RMQngRuBvfNu44Hf5vtX5G3y4zdEROTy/fJopg2BkcBdwDRgZB79tEJ+jStqeXdmZlbKcl3v0qmvAudL+hZwD3B6Lj8dOFvSHNIZw34AEfGApAuBB4GFwBER8QaApM8D1wCDgUkR8UCFepmZWUU9Sg4RcRNwU77/KGmkUcd9XgP26eT5JwEntSi/CriqJ3UxM7P28QxpMzMrcHIwM7MCJwczMyuo0iHdPxy/WouyF5d9PczM+hGfOZiZWYGTg5mZFTg5mJlZgZODmZkVODmYmVmBk4OZmRU4OZiZWYGTg5mZFTg5mJlZgZODmZkVODmYmVmBk4OZmRU4OZiZWYGTg5mZFTg5mJlZgZODmZkVODmYmVmBk4OZmRU4OZiZWUGXyUHSEEl3SbpX0gOSvpnLN5R0p6RHJF0gaYVcvmLenpMfH9EU62u5/GFJH2kq3yWXzZF0bP1v08zMeqI7Zw5/B3aKiM2AzYFdJG0DfBc4JSJGAi8Ah+b9DwVeiIh3A6fk/ZA0CtgP2ATYBfi5pMGSBgM/A3YFRgH7533NzKyXdJkcInk5by6fbwHsBFycyycDe+X7e+Zt8uNjJSmXnx8Rf4+Ix4A5wNb5NiciHo2I14Hz875mZtZLutXnkI/wZwLPAtcBfwL+EhEL8y7zgPXy/fWAJwHy4y8CazWXd3hOZ+VmZtZLupUcIuKNiNgcGE460t+41W75pzp5rKflBZImSJouafr8+fO7rriZmZXSo9FKEfEX4CZgG2B1Scvlh4YDT+X784D1AfLjqwELmss7PKez8lavf1pEjImIMcOGDetJ1c3MrAe6M1ppmKTV8/2VgH8FZgM3Anvn3cYDv833r8jb5MdviIjI5fvl0UwbAiOBu4BpwMg8+mkFUqf1FXW8OTMzK2e5rnfhHcDkPKpoEHBhRFwp6UHgfEnfAu4BTs/7nw6cLWkO6YxhP4CIeEDShcCDwELgiIh4A0DS54FrgMHApIh4oLZ3aGZmPdZlcoiIWcAWLcofJfU/dCx/Ddink1gnASe1KL8KuKob9TUzs2XAM6TNzKzAycHMzAqcHMzMrMDJwczMCpwczMysoDtDWa0fGHHs7wtlc4ccUNzx+BeXQW3MrL/zmYOZmRU4OZiZWYGTg5mZFTg5mJlZgZODmZkVODmYmVmBk4OZmRU4OZiZWYGTg5mZFTg5mJlZgZODmZkVODmYmVmBk4OZmRU4OZiZWYGTg5mZFTg5mJlZgZODmZkVODmYmVlBl8lB0vqSbpQ0W9IDko7K5WtKuk7SI/nnGrlckiZKmiNplqQtm2KNz/s/Iml8U/n7Jd2XnzNRktrxZs3MrHu6c+awEPjPiNgY2AY4QtIo4Fjg+ogYCVyftwF2BUbm2wTgF5CSCXAc8AFga+C4RkLJ+0xoet4u1d+amZmV1WVyiIinI+LufP8lYDawHrAnMDnvNhnYK9/fEzgrkjuA1SW9A/gIcF1ELIiIF4DrgF3yY6tGxNSICOCsplhmZtYLetTnIGkEsAVwJ7BuRDwNKYEA6+Td1gOebHravFy2tPJ5Lcpbvf4ESdMlTZ8/f35Pqm5mZj3Q7eQgaRXgEuCLEfHXpe3aoixKlBcLI06LiDERMWbYsGFdVdnMzErqVnKQtDwpMZwTEZfm4mdykxD557O5fB6wftPThwNPdVE+vEW5mZn1ku6MVhJwOjA7In7U9NAVQGPE0Xjgt03lB+dRS9sAL+Zmp2uAcZLWyB3R44Br8mMvSdomv9bBTbHMzKwXLNeNfbYDDgLukzQzl/0XcDJwoaRDgSeAffJjVwH/BswBXgU
+DRARCySdCEzL+50QEQvy/c8BZwIrAVfnm5mZ9ZIuk0NE3EbrfgGAsS32D+CITmJNAia1KJ8ObNpVXczMbNnwDGkzMytwcjAzswInBzMzK3ByMDOzgu6MVjIzG/BGHPv7Qtnck3frhZr0DU4OZmadOX61FmUvLvt69AI3K5mZWYHPHMzM+pFl1fzl5GBm1t+1ofnLzUpmZlbg5GBmZgVODmZmVuDkYGZmBU4OZmZW4ORgZmYFTg5mZlbg5GBmZgVODmZmVuDkYGZmBU4OZmZW4ORgZmYFTg5mZlbg5GBmZgVdJgdJkyQ9K+n+prI1JV0n6ZH8c41cLkkTJc2RNEvSlk3PGZ/3f0TS+Kby90u6Lz9noiTV/SbNzKxnunPmcCawS4eyY4HrI2IkcH3eBtgVGJlvE4BfQEomwHHAB4CtgeMaCSXvM6HpeR1fy8zMlrEuk0NE3AIs6FC8JzA5358M7NVUflYkdwCrS3oH8BHguohYEBEvANcBu+THVo2IqRERwFlNsczMrJeU7XNYNyKeBsg/18nl6wFPNu03L5ctrXxei3IzM+tFdV8mtFV/QZQobx1cmkBqgmKDDTYoU78+oeU1YIccUNyx4mX+zMzKKnvm8ExuEiL/fDaXzwPWb9pvOPBUF+XDW5S3FBGnRcSYiBgzbNiwklU3M7OulD1zuAIYD5ycf/62qfzzks4ndT6/GBFPS7oG+HZTJ/Q44GsRsUDSS5K2Ae4EDgZ+UrJOnRyRl41mZvbW1WVykHQe8GFgbUnzSKOOTgYulHQo8ASwT979KuDfgDnAq8CnAXISOBGYlvc7ISIandyfI42IWgm4Ot/MzKwXdZkcImL/Th4a22LfAI7oJM4kYFKL8unApl3Vw8zMlh3PkDYzs4K6RyuZdapVnxB4pJZZX+TkYNZBp0ns5N2WcU3Meo+Tg1l3Hb9aizKf4djA5D4HMzMrcHIwM7MCJwczMytwcjAzswInBzMzK/BopS54BVUzeyvymYOZmRU4OZiZWYGTg5mZFbjPwWwZ6HbfFbj/yvoEJwfr17yYn1l7ODmYWb/Tr0cR9pM1upwczGwRN39ZgzukzcyswGcO1ql+fepuZpX4zMHMzAp85mBmbeUz0P7JZw5mZlbg5GBmZgV9JjlI2kXSw5LmSDq2t+tjZvZW1ieSg6TBwM+AXYFRwP6SRvVurczM3rr6Sof01sCciHgUQNL5wJ7Ag71aKzOzClp3xvdCRUroE2cOwHrAk03b83KZmZn1AkVEb9cBSfsAH4mIw/L2QcDWEfGFDvtNACbkzfcCD3cj/NrAczVW1zEds6/Gc0zH7Mo/R8Sw7gTsK81K84D1m7aHA0913CkiTgNO60lgSdMjYky16jmmY9Yfsz/U0THfujH7SrPSNGCkpA0lrQDsB1zRy3UyM3vL6hNnDhGxUNLngWuAwcCkiHigl6tlZvaW1SeSA0BEXAVc1YbQPWqGckzHXIYx+0MdHfMtGrNPdEibmVnf0lf6HMzMrA9xcjAzswInB+v3JA2SdH9v18NsIOkzHdJ1krQGMBJYNFE9Im7pvRoVSRLwSeCdEXGCpA2At0fEXb1ctSVIGgZ8lbTmVfPvc6cKMQcDkyPiwOo1hIh4U9K9kjaIiCfqiNlM0jos+d5Lv4ak7YCZEfGKpAOBLYEfR8TjFeu4GfChvHlrRNxbJV6O2R8+R5cAk4CrI+LN3q5PZyRtGhG1HsC0++8z4DqkJR0GHEWaSDcT2AaYWuXLLMcdAhwKbMKSf4xDSsb7BfAmsFNEbJz/0NdGxFYV6tiOL/JrgQuAo4HPAuOB+RHx1bIxc9xrgI9GxOtV4jTFuwHYCrgLeKVRHhF7VIi5B/BD4J+AZ4F/BmZHxCYVYs4CNgNGA2cDpwMfj4gdKsQ8CjgcuDQXfQw4LSJ+UiFmuz5Hu1H8DJ1QId6/Ap/O9bsIODMiHqpYxxuBwhdjxc/RbcAKwJnAuRHxl9IVpH1/nyVExIC6AfeR/vFm5u2
NgAtqiHsRcCLwJ9IX5LWkI76y8e7OP+9pKru3Yh2vJSWw2cAOpCOq71aMOSP/nNVUdnMNv89fkiY/fgP4cuNWId4OrW4V63gvsFbjbwTsSPrSrRKz8Xf/H+DQ5rIKMWcBKzdtr9z89yoZs/bPEXAqcBZpHbXj8mucXvV/KcdejXTw8iRwOylhLF8y1vubbtsBPwK+V0MdRwLfAeYA5wI796W/T8fbQOxzeC0iXgOQtGKko4j31hD33RHxDeCViJgM7Aa8r0K8f+TmlYBFR/1VT4vXiojTgX9ExM2Rzmq2qRjzH/nn05J2k7QF6WilqqeAK0n9XkObbqVExM3AQ01xZueyKv4REc8DgyQNiogbgc0rxnxJ0teAA4Hf5/+B5SvGFPBG0/YbuayKdnyOto2Ig4EXIuKbwL+w5LI5pUhaC/gUcBhwD/BjUnPddWXiRcSMptuUiPgy8IGq9YyIR4D/Jp3d7wBMlPSQpI+XCNeu77lFBmKfwzxJqwOXA9dJeoEW6zSV0PiS/IukTYE/AyMqxJsIXAasI+kkYG/SP04VS3yRk9531S/yb0laDfhP4CfAqsCXKsYkfzkgaWjajJerxJP078D3gZtIX4w/kXRMRFxcIexfJK0C3AKcI+lZYGGVegL7AgeQzhr+nPuavl8x5hnAnZIuy9t7kZqrqmjH5+hv+eerkv4JeB7YsEpASZeSjprPJjVTPp0fukDS9JIx12zaHEQ6g3h7xXqOJp3N7EZKWh+NiLvz72Eqi5sEu6td33OL1Xka0tdupOy8B7BCDbEOA9bIMR8ltUF/tmLMjYAjgM8DG9dQx91Jp9ebAjcCM4A9evvv0EldNyUd5T2ebzOATSrEuxdYp2l7GNWb6VYmLeeyHKkp8UjS2Vmv//5a1HXLXL+jgC1qjl3L54jUhLg68AnSwdXTwIkVY+7Uht/lY/kz/hjwCKm59oMVY94CHASs1OKxg/rC36fjbcB1SMOi0TDr0nRmFG0YxVJGh6OSgohYsKzq0h2SJgNHRe5Ayx3nP4ySHfFNcW8Hvh6pqQZJHwa+HRHblox3X0S8r2l7ECk5VGn6q52kbUhnYBuTOigHAy9HxGoVYz4QES/l7aHAqIi4s0LMDVqV1/U5krQiMCQiXqwh1qYUB2GcVTLWIOBfImJK1Xq1w7L8/hhwzUqSvkDq7HqGxW34QRodUibegRHxG0lfbvV4RPyohyFn5Po0twk3tgN4Z4k6fiUivifpJ7QeZXFkT2M2GR1NIysi4oXc71DVyo3EkOPeJGnlCvH+kEdAnZe396XkWl2SXqLF75H8N4qIVctVEYCfklYdvggYAxxM6qis4hekM4eGV1qU9dTvWfx/OYTU/PMwaaRRKfmgbTdSc+xyuazMZ6g55nHAh0nJ4SrSpYZvI3V891ikYdE/IPWH1EZSozO6YxLr6ee99u+Pzgy45EA6rX5vpI7EOjS+sEp3ljaLiEptrJ2YnX+WamPtwiBJa0TEC7DoyKWO/5tHJX2D1FYMqYP2sbLBIuIYSZ8gjS4RaVTRZV08rbNYtfytlxJ/jqTBEfEGcEY+i6pC0dQEkL/gKv2NOp5xSdoS+EyVmMDvgNdII23qmpOwN2lo8D0R8WlJ6wK/rhjz2vy/dGnz77WiM0gHraeQRr19mhKDBtr0/dHSQEwOTwKVT1UbIuKX+e7PI2J+1XiSNoqIh/KHrdXr3d3TmBHxu/xzcn6NVdNmamao6IfA7ZIaHbv7ACfVEPcQ4JukjjiR2mQ/XSVgRFwCXFK9akuqcxIcqTN2BWCmpO+R2t2rnDFBSrRHks4WAP6D1GZem0idp6Xn4GTDI6LUGfxS/C0nw4X5//5Zqh89f5n0N3lD0t+o54xxpYi4XpIiTXg8XtKtpITRY02TaDeMiBPbMYl2ICaHR4GbJP0e+HujsMqpa3a7pMdIE8IubRxJl/Bl0qVOf9jisQCqTLQZQzpCGZo29RfgkIiYUTZmRJyVR33sRPqQfDwiHiw
brynuC6QO1EqW0gTUeJ3SH+jOJsFRoWmF1Ck5iDQI4UukoZyfqBAP0vj+iaTRbgFcz+LL6ZbSoRl1EKmJqurB0dWSxkXEtRXjNJueR+38itTk8jJpImRpbTpzfC33ZzyidO2a/wPWqRDv5+RJtKT5Vy+RDoyqJvBFBlyHdG6DLIg8dLJi7K1J7cV7AQ8C50fEb6rGrUuefXtERNyatz9IOuPp8dGapFUj4q+ddYCV7fiS9DuW/mVeakazpBNII2DOJiWxTwJDI+J7ZeLlmPeSPnx/jIgtJO0I7B8Rpb54VfOyIe3U4XO0EJgLXBJ5bH3JmB8DfkNKNv+gniPy5vgjgFUjYlbFOB2PytcH3lHlqDyfdc0mjdY6kTSq8HsRcUfJeHdHxJaS7omILXLZvRGxWdk6Fl5joCWHZUHS2qRZk5+MiMEV4mxLU+cclB9lkeNNiYjtuirrZqwrI2L3fLbU/E/S+ECXOnWX1Fgm4uOkseON5Lo/MDci/qtk3Dsj4gNdlfUw5vSIGJOTxBa5+eKuiNi6Qszalg1p80CE2kl6lHRgdV9dbfmStm9VHhXWGFIblrapm6Q7gW2BaTlJDCPVsY7BIsAAalaS9L8R8cXOjkzLHpE2xV+VtGbNfsC7SBPYqnxJnJ3jzGTx7NagxCiLpv6LuyT9kjRiJ0gjdm4qU7+I2D3/rLUDLPKsZUknRkTzB/t3kqosGvaGpE8C55Pe+/4sOWu4jHZMgpsLTJF0BUuuAVWm2bNtAxFy/Tp6Mb/WL0ueQTwC3F9jJy/AMU33h5A+kzOo0DwLfKBxVA6LRuitUCFeO9Zrasck2iUMmOTA4lEvP2hT/HtJsxFPiIipNcQbQxqLXscHpWP/RXOTQK2nhpLeCxwdEYdXDDVM0jsj4tEcd0PSxLWyDiAtm/Bj0nueksuq2JM0uuZLpGaG1YDSi8RlT+VbY9mQ0hoDEUhr6izxZZ3Pbqt4jPT3aB4a/AzwHlL7/kElYj5N6g+8mpr6AyPio83buQmodFNi1o6lbY7+qBdwAAAV4UlEQVRuuj+E1M9U+kAjIs6RNAMYSzqb3ysiZnfxtB4ZMMmhqdN184j4cfNjSqtWVl1n5501H/HcT2pWebqrHbsSETtWr86SlKb7/4DUGXs5aeLWz0lrzLTqTO+pL5G+KBqjakZQYahkRMwlfZnXJiJegUVnjb/rYvfuxqzc99XCXZImNNqv8zDM75C+yMvaotWZXURsL+mBkjEfy7cV8q0d5pFm31dR+1F5i0EhUySV/k7Ko5Nepen/UjUvWT/g+hwaHTUdyhZ12pSIV2tzVVOcoaRF3O5iyaOoKktMr06aVDWCJfsxetz2nNs0f0Fa92UX4CuklSS/UaVTssNrrEhaQgTgoYj4+9L27yJWrUuq55ifIZ0p/I105FipvyXHbMdy0O8jrcB7EymZrwUcFhHzKsScDXyk8WWTv4z+EBGjqnye6tahv2UQ6TM1t2qnv6SNWHxUfn3Vo3K1Xq9pYkSUWixP0n20mKQYFZaT72jAnDlI2p/UjLBhh/bSoaQFvsqqu7nqCtLSHrd2KN+BNLytiquAO6hnktGKEXFmvv+wpKOBYyNN3KrLSNJKkkOAzZRmy5btkD+btCrrR0hf6J9kcZt8WUeT1nt6rmKcjjEbKjcvAETEffkI92zSkMbtqySG7D+B2yT9ifQFtCHwH0qz2CeXCZibZ75CMYFX6R9o7m9ZCJwXJZe+yAcYnwXeTfoM/TIiqvYxNTTPbF5IOoM6tGywaM8kxSUMmDMHSf9M+gf+DnBs00Mvkda2r+uPXImkK4H/6jjcLs9ROK5jG2oPYxfOmirEeojUqduYxXkOKfkKyk3W6xC/5bIHEbF3yXj35OGmsyJitKTlgWsqHpH/gTSv49WyMbr5OjdHtYv9nE4a3PBpUlPS/wI/jYifVaxX48xOpDO7SmeMatOFo+oi6QLSENtbSf+PcyPii71bq+6r8/MPA+jMIdKsw8epf02
Uxulb4aH0sj2eQzCi1TjsiJiex2lXcbakw0nXSWhuqiozJ+Fp0nDdhj83bVearJfVvexB3UuqA3yNNPnxTpb8fZYeItpJ80Kl5aBJ/VeH5T6xx5QW4qs66ROWPLMbXfHMDvL1RiQdlUet3Vyl3R06/Xw2RlV9K3q2jM6oxhF5Tri1zTZW62s2vEga1vtsiXjtmKS4hAGTHBryH+G7pNmHovpEm93rqls2ZCmPrVQx9uukawN8ncUfmFKLcbWjk7uDupc9OC2PR/8GqelulXy/il8CN1DvWkC1Ni8ARMQpklbKHZIPR1rptFLMzs7sKLmgXdaO641cTRqyfG7e3i///Cvpkpw9ORNv1I+IWChVvV7SEg4lHbg2Fpv8MKkJ+D2SToiIszt7YieaR7otJC2UWO/SMVHzWui9fSNdgq/ytRE6ib0uKVnsTtO1A3oY4zzg8Bblh1L9Mox/Atbu7b9BN+v6c9Js0c+Sxr/fA5xRId7gNtTx9t7+PXWznh8lrZj6WN7eHLiiYsz7SEek9+btdYHfVYxZ+/VGgCmdlZGOynsS6w1SUvkrqTl6YdP9v1as5++AdZu21yWtK7Ymae5HT+N9qOP/PLBlnf9XA+7MAXgmah7vC6D6rjT2ReCyPGGrMbxtDGlo38cqVvMB0vC2Pi8i/iPfPTW37Vdd9uCxHOcC4IbIn5aKbpQ0gfTBrtRM10mzwiIR0dMrgTU7njT566Yca2aeN1JF7QvaRcSV+e6LpJVJ67CKpA9EvnaF0hI3q+THetTPGBVWO+iGERHxTNP2s8B7ImKBpH909qSluAaYJunfm+L+mmrLtC9hICaH6blj6XKW/EBX+fBBaqrZKnL7YB558UegR8kh/yG3VVqnpzEe+/cRcUPF+kE68pmZh0vW0kbeLmqx7IGk7aP8sgfvJR1BHwFMykOGz4+I2ypUszGJ7mtNZWXXzG80b6xDWvag8ffekfSlXuX/c2FEvNihGaRqcqx9Qbv8mTmc4lDrKheOOoz0916FdND2V+CwPKrqOxXi1u3WPBjlorz9CeCWXM+/dP60Tj1MPliVdGhE3A49XwJ8aQbMaKUGSWe0KI6K/4CoH1xpTNL4VuWRl/KuEHcNUudk8/DDKktdNOZ7NCxa9iCqDWtsxF6DNFO60tpX7ZC/IA6PfK1jSe8AfhYRZS4y34h5Omkl1mNJXzpHAstHxGdLxhNpee0n8/YI6lnQ7nbSSKAZNC1tEmmp9UqUrnOuaLowVV+Sf6fN1xu5jbSQYakvYC1eeG8k6Wx5EmkF5trOHAZccmgXSd8nXU2ueTmBWdFHhuE1KK0B05gZ+3BElDllbY53GOkCSsNJ60BtA0yt40u8w+usT1qlcv8KMXYg/V12BaaR+nBKf/FIehtpifUNImJC/iC+t6l5pEzM+yNi06btQaT/o9KzenM9vw6MI33xXEO6NnOVFVRnRMT7yz6/k5gzI2LzOmPmuLtRnDtRdZmTPk1Lrsb6NlLn+8cjorbWoAGXHCS9hzSzd92I2FRpGYg9IuJbNcT+OPBB0gfwlih5pbF2UboO82TS4m4iXStgfJWj/DxUcCvgjojYPM8c/WZE7Fu9xku8jkhfkqXOxJRWj50JXEjqjH2li6d0J+YFpKPcg/P/0kqkxFj6C07ST0lnYY3FEfcD5kTEF6rWt06SfgacGRHTaoz5LVInf6nLt3YS81TgbaTmuV+ThkjfFRGVRmvVTUted2QFYHnglahpufL8Gl4+Y2nyuOljSLMbG5n1/ipHZi1eY23g+Zo6PWujtBDXARHxcN5+D2nGaOkjQEnTImIrSTNJq1X+vY4jQNW87IHy9Seq1KlFzMaS3bWumZ8PMj6UN0sfZKj1yqmLRLWlWB4knYE+Tlo9tuy8nuaYL5GusPZ3arqegxZPemz8XIV0Ma5xZWMuC5L2AraO8kvUt+0guGEgdki/LSLu6tA5V3p2dJ5QdDKwgHSRjrOBtUnXVj44Iv5QpbI1W76RGAA
i4v8pzRSuYl7umLwcuE7SC6Tx6VXVsuxBc5JpNS69Ymf86/lsoRH/XTR19JeVB0dUHSABadz8k6SzkDupt0Ny1xpjAe27wlr++aqkfyItlbPMrrNcVkRcLunYrvfs1K/IB8E53ixJ5wJODkvxXP4QNz7Qe1Nt5dOfAv9FGp99A7BrRNyRm1fOA/pScpieOycbE2qah8uWEhGN4bXH51FQq5EmHlV1MfBa5LWaJA2W9Lbo+VIVtV/LoMnxpL/v+pLOIXUmVrrOteqdpPl2YGfSMicHkCZCnRcRZVdNXSTSigOow/Wzq2rD4Ibf5YOX7wN3kz73v6pUyTboMJR5EGn4epWWh1oPgluKGidN9IUbaZjhH0nj/f+PNCpgRIV4M5vuz+7w2D29/X471GdFUgfqpaQlh79EWkCvSsyzu1NWIu4dwCpN26vQByedkVY43Y00gavyBEPaNEkz/+0/RVpC4Qs1xNuDNDnxFdIs7jeBByrGPIw0ue4F0iS4v5HmpJSNNwjYtsPvYLXe/p/ppK5nNN1+RRpAUGoibY53NWk9rbvz9t7A1bXWubd/aW38Y6xMuoZw1Th3t7rfansg3lq858HAgzXEndmdsh7EG0ZaOfcq0hneDVW+eHLM67tT1sOYhRm9FeOtSLrk6kWkEVrfANarIe69OTHek7d3BE6rGPM+0hnDzLy9EdVXBZha5++zHbf8mflSzTFrPQhudRtwzUpackGqRjv0i6Qx9DNLhNxM0l9Jp/8r5fvk7dpOt6tQ54sDAhAlOhElfY3UnNbxPb8OnFamnh28ImnLyKu7Sno/6UiyrHNI4713o2nFzzKBlJZufhuwdm4GaZy7r0q6XkIVtU3SlDSZNJHyatIIsvsr1q3ZPyLieUmDJA2KiBslfbdizNci4jVJSFoxIh5SurJgFdcqXdzo0sjfmn1NRLwhaQ/glBpjPgr8a55ENygiXqordsNAHK10Lqk9rzHJajfSEdVGwEURUfUSgn2O0nLlnYrcflwy9nci4mtd79njuFuRrvfc6Nx+B7BfRJTqQ2iMy2+MWsllpZbCVrpy4BdJieD/WJwc/gr8KiJ+WqaOOXZtkzQlvcni61A3f5DrGAX0R2Av0izjtUnLPWwVEdtWiHkZqc/mi6RVfV8gDaL4twoxGyOgFpI6pyu/93ZQut7GaqQDmOZrh5da+l5pOfVPUJxtXtv8joGYHK4BPhERL+ftVUidnx8jnT2M6s36tZvS0tdb5c27osRywB3ibUdqBnhF0oGktVt+XCXhNMVenrTsReN6AaUn7Em6IyK2yX//iaSkc3FEvKtCzC9ExE/KPr8/y0ekfyO16zeun31O9GwJ7KXF3yHH/ENEvF5HzL4sD+aAxUm8kcRKTSZVWkfsRYqzzeu4hG96jQGYHGYDmzX+4XKGnRkRG6sPXd6wHVRcHPBDQJnFAZtjziJdd2E0aRTU6aSZmKUuTiPpK42zN0n7RMRFTY99O8qP+96dtDTD+qTrXa9KampZ6lyAbsTdlLRsdfPomtLLVksanuu3HemL4jbgqKh+5bZaKS3c93TkWdZ5SO+6ka7V3dNYHa+wdnrUePGtNoyAqk1TM3fj7DNIzZ23RcRjFeLWOnerlUHtDN5LzgXukHSc0pr0U4Dz8pHQg71btbZrLA44PiIOJq1XVPWaBgtzW+6epDOGH7PkWvI9tV/T/Y7NVbuUDRoRV0bEixFxf0TsGBHvryExHEf6Iv8JqUP2e6RRPFWcQbrexD8B65GaP1s1NfW2i1jyGhZvsHjRuJ6aTGrqvY80f6K+o9u0vMstpCVDvpl/Hl9X/BoMzbdV8m0o6XdxtaT9lvbELtyudO3w9qmzd7uv3EhX12q0G4/p7fosw/d9X4ftQR3LSsS8mfQl/ghpXP3gKjFpGv5Lh6HAHbe7Ge9wYGS+L9IX7YvALGCLqr9P6r+mQa2jtNr4v9SqnveW/T023V+OGkf50YYRUMvo97tmld8D6UD3ddLqrLP
y72FWnXUcUKOVtOQiZpUmf/VTf8ht7s2LA1Zdx2Zf0gSrQyLiz5I2IDVdlRWd3G+13R1HkRYdgzQZbDRpmN8WpL6HD7V+WrfUfk0D0iTNA1n8N9qfNKu3r5kvaY/IZ1+S9gSeKxmrnVdYa8cIqLaLdB2HKr+I2mewdzSgkkP+IN+rmheg6uskvZvUHnyMllwccCppiGdpOSGcA2yV2/XvimrXEa57aPDCWNyRvTtwVqRO0z9KqjoyrfZrGgCHkGbdn0JKhrfnsr7ms8A5SgsFirRMx8ElY23W4e+8UtP/QES1kUXtWt6lrSQ1RmuVEhGPS/og6az5DKVrZazS1fN6YiB2SN9AGq1zF03D/CJiz96rVXspXSPgv6LDevuSxgDHRURPrqPbMXbtndx1knQ3abjyC6RF4naKvHyEpNkRsXFNrzOCGq5p0N/k0X6KNoyjr1tfHAHVyRykNUkJ7OCIeKhk3ONIfRfvjYj3KK0rdVFEbFepwk0G1JlD9s2m+yIdRZe+RkA/MaLVl1ZETM9falXUcgW8Nvof0vpKg0lLdTcSww7Ao1UCq8ar1WnJVWgLoo9crU/SgRHxm04mkxIRP+qVinXQyQiom3u3Vi3t3mE7SCs6V11S/mOkptO7ASLiKUm1Lmw44JJDRNwsaXNSO/m/k9aFObV3a9V2S2uOWali7EGx5FyJ5+lDo9wi4so8CXBoRDSfpk8n9ZdUcUzT/UVXqyNN4Oqp5sl93wSOq1Cvdlo5/2zHCqp1mkzqy7iV1P4+itT/1KdEDfOBOvF6RISkxgKjK3f1hJ4aMM1KSuub78fiDr4LgKMjYqmzhwcCSeeR1hH6VYfyQ4FxUeHCPGp9Bbz7IuIrZWP2V6rhanU5zoCeb7MsqOmyvZKWI/WF1XaJzL5O0tGkuR07k2axHwKcGzVO2hxIyeFN0lHEoRExJ5c9GhFVR5f0eXlW9GWkoW2NUVpjSFec+lhE/Lli/D59BbxlJY8uKX21uqY4d/fVLzJJ/7OUhyMiTlxmlVmKjr/Dvvw7rVMeIHBuRNwuaWeaLg0bEdfV+VoDqVnpE6Qzhxvz1PLzodaLn/RZEfEMsK2kHUkLsQH8PiJuKBuzaQTUlGi6OI2k7SW9KyL+VLnifZyKV6vbgrRa6UDWqi18ZeBQ0iqtfSI50N4RUH3ZI8APJb2D1DpyTpRbULRLA+bMoSG3ve1Fal7aidQ2eVlEXNurFetn2jkCqk6Slnq0GCUXNsuxxzfCkBZ2mxsRt5eM1XwN4beRllqGPvxlljs4jyIlhguBH0bFtbqsHrmfbb98G0Jq9j0/Iv5fba8x0JJDM0lrAvsA+0bJBa7eqpa2dktze29va1rQrJUo83fPE76GR8TP8vZdpOtFBPCVvjKMt13y5+bLpAX3JpOWTSk9Jt/aS9IWwCRgdEQMri3uQE4OVp6kORHx7p4+NhBImkJaPvzJvD2TdBa6CnBGRIztzfq1Ux6A8HHSNTt+Fnl1Y+tblFY03oV05jCWtMzNeRFxeV2v0WeGJFqfM03S4R0L8wioPrM0iaSvNN3fp8Nj3y4ZdoVGYshui4gFedZ97UMG+5j/JC0K+N/AU5L+mm8vNbXxWy+RtLOkScA8YAJpeZx3RcS+dSYG8JmDdaLdI6Dq0jxKpa4RLF2cNf0pKlwjwqyK3Ix6LnBJRCxo52sNpNFKVqN2jIBqE3Vyv9V2d90p6fAW80Y+Q/W1lcxKi4gdl9VrOTnYUkXEjcDSOn17W92rvAJ8Cbhc0gHk5QlIy8CvSBoJZzbguVnJ+jVJb5DG5ou0VEjzENEhEbF8hdg7AZvkzQf64FmTWds4OZiZWYFHK5mZWYGTg5mZFTg5mJlZgZOD2TIg6YuS3tbb9TDrLndImy0DkuYCYyLiuRaPDY6IN5Z9rcw65zMHs0zSwZJmSbpX0tmS/lnS9bnsekkb5P3OlLR30/Nezj8/LOkmSRdLekjSOUqOJC1JcWNjoUBJL0s6QdKdwH9Luqw
p3s6SLl2mb96sA0+CMwMkbUK6XvZ2EfFcXpl0MnBWREyWdAgwka4nwW1BmhvxFDAlx5uYr8m8Y9OZw8rA/RHxP/kiQrMlDYuI+cCngTNqf5NmPeAzB7NkJ+Dixpd3XrfmX0jr2ACcTboaXlfuioh5EfEmMBMY0cl+bwCX5NeKHP9ASavn17265Pswq4XPHMwS0fVyG43HF5IPrPJR/wpN+/y96f4bdP4Ze61DP8MZwO+A14CLImJhN+tt1hY+czBLrgf+XdJasOiCN7eT1suHdOGb2/L9uaS1lgD2BLqzRMdLwNDOHoyIp0hNUf8NnNmzqpvVz2cOZkBEPCDpJODmvF7TPcCRwCRJxwCNvgCAXwG/zVeIu57W113u6DTgaklPL2VlzXOAYRHxYJX3YlYHD2U16yMk/RS4JyJO7+26mDk5mPUBkmaQzkB2joi/d7W/Wbs5OZiZWYE7pM3MrMDJwczMCpwczMyswMnBzMwKnBzMzKzAycHMzAr+P7z9xgjlNa7UAAAAAElFTkSuQmCC\n", 1491 | "text/plain": [ 1492 | "" 1493 | ] 1494 | }, 1495 | "metadata": {}, 1496 | "output_type": "display_data" 1497 | } 1498 | ], 1499 | "source": [ 1500 | "(pd.concat([c_country,e_country],axis = 1)).plot(kind='bar')" 1501 | ] 1502 | }, 1503 | { 1504 | "cell_type": "markdown", 1505 | "metadata": {}, 1506 | "source": [ 1507 | "The sample is biased. For example, Argentina and Uruguay's exp group has a larger sample size than the control group" 1508 | ] 1509 | }, 1510 | { 1511 | "cell_type": "markdown", 1512 | "metadata": {}, 1513 | "source": [ 1514 | "#### We should look at the comparison in each segment(country)" 1515 | ] 1516 | }, 1517 | { 1518 | "cell_type": "code", 1519 | "execution_count": 327, 1520 | "metadata": {}, 1521 | "outputs": [], 1522 | "source": [ 1523 | "# get the conversion rate for each country in controll group\n", 1524 | "c_cr = pd.Series(controll.groupby('country').conversion.mean(),name = 'controll conversion rate')" 1525 | ] 1526 | }, 1527 | { 1528 | "cell_type": "code", 1529 | "execution_count": 328, 1530 | "metadata": {}, 1531 | "outputs": [], 1532 | "source": [ 1533 | "# get the conversion rate for each country in experiment group\n", 1534 | "e_cr = pd.Series(exp.groupby('country').conversion.mean(), name = 'exp conversion rate')" 1535 | ] 1536 | }, 1537 | { 1538 | "cell_type": "code", 1539 | "execution_count": 329, 1540 | "metadata": {}, 1541 | "outputs": [ 1542 | { 1543 | "data": { 1544 | "text/plain": [ 1545 
| "country\n", 1546 | "Argentina 0.015071\n", 1547 | "Bolivia 0.049369\n", 1548 | "Chile 0.048107\n", 1549 | "Colombia 0.052089\n", 1550 | "Costa Rica 0.052256\n", 1551 | "Ecuador 0.049154\n", 1552 | "El Salvador 0.053554\n", 1553 | "Guatemala 0.050643\n", 1554 | "Honduras 0.050906\n", 1555 | "Mexico 0.049495\n", 1556 | "Nicaragua 0.052647\n", 1557 | "Panama 0.046796\n", 1558 | "Paraguay 0.048493\n", 1559 | "Peru 0.049914\n", 1560 | "Uruguay 0.012048\n", 1561 | "Venezuela 0.050344\n", 1562 | "Name: controll conversion rate, dtype: float64" 1563 | ] 1564 | }, 1565 | "execution_count": 329, 1566 | "metadata": {}, 1567 | "output_type": "execute_result" 1568 | } 1569 | ], 1570 | "source": [ 1571 | "c_cr" 1572 | ] 1573 | }, 1574 | { 1575 | "cell_type": "markdown", 1576 | "metadata": {}, 1577 | "source": [ 1578 | "Compute the t-statistic and p-value for each country (control vs. experiment)" 1579 | ] 1580 | }, 1581 | { 1582 | "cell_type": "code", 1583 | "execution_count": 330, 1584 | "metadata": {}, 1585 | "outputs": [], 1586 | "source": [ 1587 | "country_list = list(controll.country.unique())" 1588 | ] 1589 | }, 1590 | { 1591 | "cell_type": "code", 1592 | "execution_count": 331, 1593 | "metadata": {}, 1594 | "outputs": [ 1595 | { 1596 | "data": { 1597 | "text/plain": [ 1598 | "['Mexico',\n", 1599 | " 'Colombia',\n", 1600 | " 'El Salvador',\n", 1601 | " 'Nicaragua',\n", 1602 | " 'Peru',\n", 1603 | " 'Chile',\n", 1604 | " 'Argentina',\n", 1605 | " 'Ecuador',\n", 1606 | " 'Venezuela',\n", 1607 | " 'Guatemala',\n", 1608 | " 'Honduras',\n", 1609 | " 'Panama',\n", 1610 | " 'Paraguay',\n", 1611 | " 'Costa Rica',\n", 1612 | " 'Bolivia',\n", 1613 | " 'Uruguay']" 1614 | ] 1615 | }, 1616 | "execution_count": 331, 1617 | "metadata": {}, 1618 | "output_type": "execute_result" 1619 | } 1620 | ], 1621 | "source": [ 1622 | "country_list" 1623 | ] 1624 | }, 1625 | { 1626 | "cell_type": "code", 1627 | "execution_count": 332, 1628 | "metadata": {}, 1629 | "outputs": [], 1630 | "source": [ 1631 | "lin = []\n", 1632 | "for 
c in country_list:\n", 1633 | "    # Welch's t-test (equal_var=False) on conversion for country c: control vs. experiment\n", 1634 | "    t, p = stats.ttest_ind(a=controll[controll['country']==c].conversion,\n", 1635 | "                           b=exp[exp['country']==c].conversion,\n", 1636 | "                           equal_var=False)\n", 1637 | "    lin.append([t, p])" 1638 | ] 1639 | }, 1640 | { 1641 | "cell_type": "code", 1642 | "execution_count": 333, 1643 | "metadata": {}, 1644 | "outputs": [ 1645 | { 1646 | "data": { 1647 | "text/plain": [ 1648 | "[[-1.3866735952325449, 0.16554372211039645],\n", 1649 | " [0.79999178223708245, 0.42371907413141141],\n", 1650 | " [1.1549940887832975, 0.2481266743266678],\n", 1651 | " [-0.27880850314757355, 0.78040038589047944],\n", 1652 | " [-0.28982358545511927, 0.77195298851535477],\n", 1653 | " [-1.0303728644383661, 0.30284764308444695],\n", 1654 | " [0.9638326839451179, 0.33514654687468659],\n", 1655 | " [0.048257426198918048, 0.96151169060066222],\n", 1656 | " [0.56261424690935702, 0.57370152343872549],\n", 1657 | " [0.56496315146205101, 0.57210720819120686],\n", 1658 | " [0.72013284328217941, 0.47146285652575859],\n", 1659 | " [-0.378167043801935, 0.70532683727258894],\n", 1660 | " [-0.14628996329799995, 0.88369650349623641],\n", 1661 | " [-0.40176067651471453, 0.68787635370739864],\n", 1662 | " [0.35995817724402418, 0.71888524684510746],\n", 1663 | " [-0.15134316107212104, 0.87976397365142245]]" 1664 | ] 1665 | }, 1666 | "execution_count": 333, 1667 | "metadata": {}, 1668 | "output_type": "execute_result" 1669 | } 1670 | ], 1671 | "source": [ 1672 | "lin" 1673 | ] 1674 | }, 1675 | { 1676 | "cell_type": "code", 1677 | "execution_count": 334, 1678 | "metadata": {}, 1679 | "outputs": [], 1680 | "source": [ 1681 | "stats = pd.DataFrame(lin, columns=['t', 'p'], index=country_list)  # caution: this rebinds `stats` and shadows the scipy.stats module imported above; re-run the import before re-running the t-test loop" 1682 | ] 1683 | }, 1684 | { 1685 | "cell_type": "code", 1686 | "execution_count": 335, 1687 | "metadata": {}, 1688 | "outputs": [ 1689 | { 1690 | "data": { 1691 | "text/html": [ 1692 | "
" 1798 | ], 1799 | "text/plain": [ 1800 | " t p\n", 1801 | "Mexico -1.386674 0.165544\n", 1802 | "Colombia 0.799992 0.423719\n", 1803 | "El Salvador 1.154994 0.248127\n", 1804 | "Nicaragua -0.278809 0.780400\n", 1805 | "Peru -0.289824 0.771953\n", 1806 | "Chile -1.030373 0.302848\n", 1807 | "Argentina 0.963833 0.335147\n", 1808 | "Ecuador 0.048257 0.961512\n", 1809 | "Venezuela 0.562614 0.573702\n", 1810 | "Guatemala 0.564963 0.572107\n", 1811 | "Honduras 0.720133 0.471463\n", 1812 | "Panama -0.378167 0.705327\n", 1813 | "Paraguay -0.146290 0.883697\n", 1814 | "Costa Rica -0.401761 0.687876\n", 1815 | "Bolivia 0.359958 0.718885\n", 1816 | "Uruguay -0.151343 0.879764" 1817 | ] 1818 | }, 1819 | "execution_count": 335, 1820 | "metadata": {}, 1821 | "output_type": "execute_result" 1822 | } 1823 | ], 1824 | "source": [ 1825 | "stats" 1826 | ] 1827 | }, 1828 | { 1829 | "cell_type": "code", 1830 | "execution_count": 336, 1831 | "metadata": {}, 1832 | "outputs": [ 1833 | { 1834 | "data": { 1835 | "text/html": [ 1836 | "
" 1976 | ], 1977 | "text/plain": [ 1978 | " controll conversion rate exp conversion rate t p\n", 1979 | "Argentina 0.015071 0.013725 0.963833 0.335147\n", 1980 | "Bolivia 0.049369 0.047901 0.359958 0.718885\n", 1981 | "Chile 0.048107 0.051295 -1.030373 0.302848\n", 1982 | "Colombia 0.052089 0.050571 0.799992 0.423719\n", 1983 | "Costa Rica 0.052256 0.054738 -0.401761 0.687876\n", 1984 | "Ecuador 0.049154 0.048988 0.048257 0.961512\n", 1985 | "El Salvador 0.053554 0.047947 1.154994 0.248127\n", 1986 | "Guatemala 0.050643 0.048647 0.564963 0.572107\n", 1987 | "Honduras 0.050906 0.047540 0.720133 0.471463\n", 1988 | "Mexico 0.049495 0.051186 -1.386674 0.165544\n", 1989 | "Nicaragua 0.052647 0.054177 -0.278809 0.780400\n", 1990 | "Panama 0.046796 0.049370 -0.378167 0.705327\n", 1991 | "Paraguay 0.048493 0.049229 -0.146290 0.883697\n", 1992 | "Peru 0.049914 0.050604 -0.289824 0.771953\n", 1993 | "Uruguay 0.012048 0.012907 -0.151343 0.879764\n", 1994 | "Venezuela 0.050344 0.048978 0.562614 0.573702" 1995 | ] 1996 | }, 1997 | "execution_count": 336, 1998 | "metadata": {}, 1999 | "output_type": "execute_result" 2000 | } 2001 | ], 2002 | "source": [ 2003 | "pd.concat([c_cr,e_cr,stats],axis = 1)" 2004 | ] 2005 | }, 2006 | { 2007 | "cell_type": "markdown", 2008 | "metadata": {}, 2009 | "source": [ 2010 | "### Conclusion:\n", 2011 | "If we look at the A/B test results in each segment, we can see that the p values is not less than the alpha 0.05, which means that we cannot reject null hypothesis.

\n", 2012 | "Therefore, there is no significant improvement of the converstion rate after the change.

\n", 2013 | "Also, it's not becoming worse after the change." 2014 | ] 2015 | }, 2016 | { 2017 | "cell_type": "markdown", 2018 | "metadata": {}, 2019 | "source": [ 2020 | "#### Some extra\n", 2021 | "Below is a step-by-step calculation of the t-statistic without using the stats.ttest_ind() function" 2022 | ] 2023 | }, 2024 | { 2025 | "cell_type": "code", 2026 | "execution_count": 337, 2027 | "metadata": {}, 2028 | "outputs": [], 2029 | "source": [ 2030 | "# take Mexico as an example\n", 2031 | "controll_m = controll[controll['country']=='Mexico']\n", 2032 | "exp_m = exp[exp['country']=='Mexico']" 2033 | ] 2034 | }, 2035 | { 2036 | "cell_type": "markdown", 2037 | "metadata": {}, 2038 | "source": [ 2039 | "Calculate the sample sizes" 2040 | ] 2041 | }, 2042 | { 2043 | "cell_type": "code", 2044 | "execution_count": 338, 2045 | "metadata": {}, 2046 | "outputs": [], 2047 | "source": [ 2048 | "na = len(controll_m)\n", 2049 | "nb = len(exp_m)" 2050 | ] 2051 | }, 2052 | { 2053 | "cell_type": "code", 2054 | "execution_count": 339, 2055 | "metadata": {}, 2056 | "outputs": [ 2057 | { 2058 | "name": "stdout", 2059 | "output_type": "stream", 2060 | "text": [ 2061 | "Sample size of controll group: 64209\n", 2062 | "Sample size of experiment group: 64275\n" 2063 | ] 2064 | } 2065 | ], 2066 | "source": [ 2067 | "print ('Sample size of controll group: {}'.format(na))\n", 2068 | "print ('Sample size of experiment group: {}'.format(nb))" 2069 | ] 2070 | }, 2071 | { 2072 | "cell_type": "markdown", 2073 | "metadata": {}, 2074 | "source": [ 2075 | "Degrees of freedom" 2076 | ] 2077 | }, 2078 | { 2079 | "cell_type": "code", 2080 | "execution_count": 340, 2081 | "metadata": {}, 2082 | "outputs": [ 2083 | { 2084 | "name": "stdout", 2085 | "output_type": "stream", 2086 | "text": [ 2087 | "128482\n" 2088 | ] 2089 | } 2090 | ], 2091 | "source": [ 2092 | "df = na+nb-2\n", 2093 | "print (df)" 2094 | ] 2095 | }, 2096 | { 2097 | "cell_type": "markdown", 2098 | "metadata": {}, 2099 | "source": [ 2100 | 
"Calculate the conversion rate (sample mean) of the control and experiment groups for Mexico" 2101 | ] 2102 | }, 2103 | { 2104 | "cell_type": "code", 2105 | "execution_count": 341, 2106 | "metadata": {}, 2107 | "outputs": [], 2108 | "source": [ 2109 | "xa = controll_m.conversion.mean()\n", 2110 | "xb = exp_m.conversion.mean()" 2111 | ] 2112 | }, 2113 | { 2114 | "cell_type": "code", 2115 | "execution_count": 343, 2116 | "metadata": {}, 2117 | "outputs": [ 2118 | { 2119 | "name": "stdout", 2120 | "output_type": "stream", 2121 | "text": [ 2122 | "Conversion rate of controll group: 0.04949461913438926\n", 2123 | "Conversion rate of experiment group: 0.05118630882924932\n" 2124 | ] 2125 | } 2126 | ], 2127 | "source": [ 2128 | "# the conversion rate of the experiment group is 0.17 percentage points higher, but is the difference significant, or is it due to chance?\n", 2129 | "print ('Conversion rate of controll group: {}'.format(xa))\n", 2130 | "print ('Conversion rate of experiment group: {}'.format(xb))" 2131 | ] 2132 | }, 2133 | { 2134 | "cell_type": "markdown", 2135 | "metadata": {}, 2136 | "source": [ 2137 | "Calculate the standard deviation" 2138 | ] 2139 | }, 2140 | { 2141 | "cell_type": "code", 2142 | "execution_count": 344, 2143 | "metadata": {}, 2144 | "outputs": [], 2145 | "source": [ 2146 | "# in an IPython notebook, use shift+tab to get function details (press tab again for more detail)\n", 2147 | "# ddof defaults to 1: 1 gives the sample std, 0 the population std\n", 2148 | "sa = controll_m.conversion.std()\n", 2149 | "sb = exp_m.conversion.std()" 2150 | ] 2151 | }, 2152 | { 2153 | "cell_type": "code", 2154 | "execution_count": 345, 2155 | "metadata": {}, 2156 | "outputs": [], 2157 | "source": [ 2158 | "# calculate the standard error of the difference in means (unpooled variances)\n", 2159 | "se = np.sqrt(pow(sa,2)/na+pow(sb,2)/nb)" 2160 | ] 2161 | }, 2162 | { 2163 | "cell_type": "code", 2164 | "execution_count": 346, 2165 | "metadata": {}, 2166 | "outputs": [], 2167 | "source": [ 2168 | "t = (xb-xa)/se" 2169 | ] 2170 | }, 2171 | { 2172 | "cell_type": "code", 2173 | "execution_count": 347, 2174 | "metadata": {}, 2175 | "outputs": [ 2176 | { 2177 | "name": "stdout", 2178 | "output_type": "stream", 2179 | "text": [ 2180 | "1.38667359523\n" 2181 | ] 2182 | } 2183 | ], 2184 | "source": [ 2185 | "print (t)" 2186 | ] 2187 | }, 2188 | { 2189 | "cell_type": "markdown", 2190 | "metadata": {}, 2191 | "source": [ 2192 | "Look up the t-table at alpha = 0.05 (two-sided) and df = 128482 to get the critical value: t = 1.96
\n", 2193 | "Since 1.39 is less than the critical value 1.96, we cannot reject the null hypothesis" 2194 | ] 2195 | } 2196 | ], 2197 | "metadata": { 2198 | "kernelspec": { 2199 | "display_name": "Python 3", 2200 | "language": "python", 2201 | "name": "python3" 2202 | }, 2203 | "language_info": { 2204 | "codemirror_mode": { 2205 | "name": "ipython", 2206 | "version": 3 2207 | }, 2208 | "file_extension": ".py", 2209 | "mimetype": "text/x-python", 2210 | "name": "python", 2211 | "nbconvert_exporter": "python", 2212 | "pygments_lexer": "ipython3", 2213 | "version": "3.5.4" 2214 | } 2215 | }, 2216 | "nbformat": 4, 2217 | "nbformat_minor": 2 2218 | } 2219 | -------------------------------------------------------------------------------- /Employee_Retention_PeopleAnalytics.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Employee Retention - People Analytics" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "## Goal:\n", 15 | "### 1. Predict Employee Retention\n", 16 | "#### ----create a table with 3 columns: day, employee_headcount, company_id\n", 17 | "### 2. What are the main factors that drive employee churn" 18 | ] 19 | }, 20 | { 21 | "cell_type": "code", 22 | "execution_count": 811, 23 | "metadata": {}, 24 | "outputs": [], 25 | "source": [ 26 | "import pandas as pd\n", 27 | "import numpy as np\n", 28 | "import matplotlib.pyplot as plt\n", 29 | "%matplotlib inline\n", 30 | "import datetime\n", 31 | "from ggplot import *" 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": 812, 37 | "metadata": {}, 38 | "outputs": [], 39 | "source": [ 40 | "data = pd.read_csv(r'C:\\Users\\lshen\\Downloads\\employee_retention_data.csv')" 41 | ] 42 | }, 43 | { 44 | "cell_type": "code", 45 | "execution_count": 813, 46 | "metadata": {}, 47 | "outputs": [ 48 | { 49 | "data": { 50 | "text/html": [ 51 | "
" 132 | ], 133 | "text/plain": [ 134 | " employee_id company_id dept seniority salary join_date \\\n", 135 | "0 13021.0 7 customer_service 28 89000.0 2014-03-24 \n", 136 | "1 825355.0 7 marketing 20 183000.0 2013-04-29 \n", 137 | "2 927315.0 4 marketing 14 101000.0 2014-10-13 \n", 138 | "3 662910.0 7 customer_service 20 115000.0 2012-05-14 \n", 139 | "4 256971.0 2 data_science 23 276000.0 2011-10-17 \n", 140 | "\n", 141 | " quit_date \n", 142 | "0 2015-10-30 \n", 143 | "1 2014-04-04 \n", 144 | "2 NaN \n", 145 | "3 2013-06-07 \n", 146 | "4 2014-08-22 " 147 | ] 148 | }, 149 | "execution_count": 813, 150 | "metadata": {}, 151 | "output_type": "execute_result" 152 | } 153 | ], 154 | "source": [ 155 | "data.head()" 156 | ] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "execution_count": 814, 161 | "metadata": {}, 162 | "outputs": [ 163 | { 164 | "data": { 165 | "text/plain": [ 166 | "employee_id float64\n", 167 | "company_id int64\n", 168 | "dept object\n", 169 | "seniority int64\n", 170 | "salary float64\n", 171 | "join_date object\n", 172 | "quit_date object\n", 173 | "dtype: object" 174 | ] 175 | }, 176 | "execution_count": 814, 177 | "metadata": {}, 178 | "output_type": "execute_result" 179 | } 180 | ], 181 | "source": [ 182 | "data.dtypes" 183 | ] 184 | }, 185 | { 186 | "cell_type": "code", 187 | "execution_count": 815, 188 | "metadata": {}, 189 | "outputs": [ 190 | { 191 | "data": { 192 | "text/plain": [ 193 | "(24702, 7)" 194 | ] 195 | }, 196 | "execution_count": 815, 197 | "metadata": {}, 198 | "output_type": "execute_result" 199 | } 200 | ], 201 | "source": [ 202 | "data.shape" 203 | ] 204 | }, 205 | { 206 | "cell_type": "code", 207 | "execution_count": 816, 208 | "metadata": {}, 209 | "outputs": [ 210 | { 211 | "data": { 212 | "text/plain": [ 213 | "employee_id 0\n", 214 | "company_id 0\n", 215 | "dept 0\n", 216 | "seniority 0\n", 217 | "salary 0\n", 218 | "join_date 0\n", 219 | "quit_date 11192\n", 220 | "dtype: int64" 221 | ] 222 | }, 223 | 
"execution_count": 816, 224 | "metadata": {}, 225 | "output_type": "execute_result" 226 | } 227 | ], 228 | "source": [ 229 | "data.isnull().sum()" 230 | ] 231 | }, 232 | { 233 | "cell_type": "markdown", 234 | "metadata": {}, 235 | "source": [ 236 | "### change into proper data types" 237 | ] 238 | }, 239 | { 240 | "cell_type": "code", 241 | "execution_count": 817, 242 | "metadata": {}, 243 | "outputs": [], 244 | "source": [ 245 | "# change join and quit date's type to date time\n", 246 | "# one way -----data['join_date'] = data.join_date.astype(datetime.datetime)\n", 247 | "data['join_date'] = pd.to_datetime(data.join_date)\n", 248 | "data['quit_date'] = pd.to_datetime(data.quit_date)" 249 | ] 250 | }, 251 | { 252 | "cell_type": "code", 253 | "execution_count": 818, 254 | "metadata": {}, 255 | "outputs": [ 256 | { 257 | "data": { 258 | "text/plain": [ 259 | "employee_id float64\n", 260 | "company_id int64\n", 261 | "dept object\n", 262 | "seniority int64\n", 263 | "salary float64\n", 264 | "join_date datetime64[ns]\n", 265 | "quit_date datetime64[ns]\n", 266 | "dtype: object" 267 | ] 268 | }, 269 | "execution_count": 818, 270 | "metadata": {}, 271 | "output_type": "execute_result" 272 | } 273 | ], 274 | "source": [ 275 | "data.dtypes" 276 | ] 277 | }, 278 | { 279 | "cell_type": "code", 280 | "execution_count": 819, 281 | "metadata": {}, 282 | "outputs": [ 283 | { 284 | "data": { 285 | "text/html": [ 286 | "
" 447 | ], 448 | "text/plain": [ 449 | " employee_id company_id dept seniority \\\n", 450 | "count 24702.000000 24702.000000 24702 24702.000000 \n", 451 | "unique NaN NaN 6 NaN \n", 452 | "top NaN NaN customer_service NaN \n", 453 | "freq NaN NaN 9180 NaN \n", 454 | "first NaN NaN NaN NaN \n", 455 | "last NaN NaN NaN NaN \n", 456 | "mean 501604.403530 3.426969 NaN 14.127803 \n", 457 | "std 288909.026101 2.700011 NaN 8.089520 \n", 458 | "min 36.000000 1.000000 NaN 1.000000 \n", 459 | "25% 250133.750000 1.000000 NaN 7.000000 \n", 460 | "50% 500793.000000 2.000000 NaN 14.000000 \n", 461 | "75% 753137.250000 5.000000 NaN 21.000000 \n", 462 | "max 999969.000000 12.000000 NaN 99.000000 \n", 463 | "\n", 464 | " salary join_date quit_date \n", 465 | "count 24702.000000 24702 13510 \n", 466 | "unique NaN 995 664 \n", 467 | "top NaN 2012-01-03 00:00:00 2015-05-08 00:00:00 \n", 468 | "freq NaN 105 111 \n", 469 | "first NaN 2011-01-24 00:00:00 2011-10-13 00:00:00 \n", 470 | "last NaN 2015-12-10 00:00:00 2015-12-09 00:00:00 \n", 471 | "mean 138183.345478 NaN NaN \n", 472 | "std 76058.184573 NaN NaN \n", 473 | "min 17000.000000 NaN NaN \n", 474 | "25% 79000.000000 NaN NaN \n", 475 | "50% 123000.000000 NaN NaN \n", 476 | "75% 187000.000000 NaN NaN \n", 477 | "max 408000.000000 NaN NaN " 478 | ] 479 | }, 480 | "execution_count": 819, 481 | "metadata": {}, 482 | "output_type": "execute_result" 483 | } 484 | ], 485 | "source": [ 486 | "data.describe(include = 'all')" 487 | ] 488 | }, 489 | { 490 | "cell_type": "markdown", 491 | "metadata": {}, 492 | "source": [ 493 | "### Get the number of new hires per company per day" 494 | ] 495 | }, 496 | { 497 | "cell_type": "code", 498 | "execution_count": 820, 499 | "metadata": {}, 500 | "outputs": [], 501 | "source": [ 502 | "new_hire_by_date = data.groupby(['company_id','join_date'], as_index = False).employee_id.count()" 503 | ] 504 | }, 505 | { 506 | "cell_type": "code", 507 | "execution_count": 821, 508 | "metadata": {}, 509 | 
"outputs": [], 510 | "source": [ 511 | "new_hire_by_date.columns = ['company_id','day','new_hire_count']" 512 | ] 513 | }, 514 | { 515 | "cell_type": "code", 516 | "execution_count": 822, 517 | "metadata": {}, 518 | "outputs": [ 519 | { 520 | "data": { 521 | "text/html": [ 522 | "
" 579 | ], 580 | "text/plain": [ 581 | " company_id day new_hire_count\n", 582 | "0 1 2011-01-24 25\n", 583 | "1 1 2011-01-25 2\n", 584 | "2 1 2011-01-26 2\n", 585 | "3 1 2011-01-31 30\n", 586 | "4 1 2011-02-01 7" 587 | ] 588 | }, 589 | "execution_count": 822, 590 | "metadata": {}, 591 | "output_type": "execute_result" 592 | } 593 | ], 594 | "source": [ 595 | "new_hire_by_date.head()" 596 | ] 597 | }, 598 | { 599 | "cell_type": "code", 600 | "execution_count": 823, 601 | "metadata": {}, 602 | "outputs": [ 603 | { 604 | "data": { 605 | "text/html": [ 606 | "
" 663 | ], 664 | "text/plain": [ 665 | " company_id day new_hire_count\n", 666 | "5125 12 2014-05-19 2\n", 667 | "5126 12 2014-10-13 1\n", 668 | "5127 12 2015-03-23 1\n", 669 | "5128 12 2015-07-06 1\n", 670 | "5129 12 2015-07-27 1" 671 | ] 672 | }, 673 | "execution_count": 823, 674 | "metadata": {}, 675 | "output_type": "execute_result" 676 | } 677 | ], 678 | "source": [ 679 | "new_hire_by_date.tail()" 680 | ] 681 | }, 682 | { 683 | "cell_type": "markdown", 684 | "metadata": {}, 685 | "source": [ 686 | "### Get the number of employees who quit per company per day" 687 | ] 688 | }, 689 | { 690 | "cell_type": "code", 691 | "execution_count": 824, 692 | "metadata": {}, 693 | "outputs": [], 694 | "source": [ 695 | "quit_by_date = data.groupby(['company_id','quit_date'],as_index=False).employee_id.count()" 696 | ] 697 | }, 698 | { 699 | "cell_type": "code", 700 | "execution_count": 825, 701 | "metadata": {}, 702 | "outputs": [], 703 | "source": [ 704 | "quit_by_date.columns = ['company_id','day','quit_count']" 705 | ] 706 | }, 707 | { 708 | "cell_type": "code", 709 | "execution_count": 826, 710 | "metadata": {}, 711 | "outputs": [ 712 | { 713 | "data": { 714 | "text/html": [ 715 | "
" 772 | ], 773 | "text/plain": [ 774 | " company_id day quit_count\n", 775 | "0 1 2011-10-21 1\n", 776 | "1 1 2011-11-11 1\n", 777 | "2 1 2011-11-22 1\n", 778 | "3 1 2011-11-25 1\n", 779 | "4 1 2011-12-09 1" 780 | ] 781 | }, 782 | "execution_count": 826, 783 | "metadata": {}, 784 | "output_type": "execute_result" 785 | } 786 | ], 787 | "source": [ 788 | "quit_by_date.head()" 789 | ] 790 | }, 791 | { 792 | "cell_type": "markdown", 793 | "metadata": {}, 794 | "source": [ 795 | "### Create a dataframe holding every date from start to end" 796 | ] 797 | }, 798 | { 799 | "cell_type": "code", 800 | "execution_count": 827, 801 | "metadata": {}, 802 | "outputs": [], 803 | "source": [ 804 | "start_date = '2011-01-23'\n", 805 | "end_date = '2015-12-13'" 806 | ] 807 | }, 808 | { 809 | "cell_type": "code", 810 | "execution_count": 828, 811 | "metadata": {}, 812 | "outputs": [], 813 | "source": [ 814 | "# continuous daily range: one row per calendar day\n", 815 | "d = pd.DataFrame(pd.date_range(start_date, end_date),columns = ['day'])" 816 | ] 817 | }, 818 | { 819 | "cell_type": "code", 820 | "execution_count": 829, 821 | "metadata": {}, 822 | "outputs": [ 823 | { 824 | "data": { 825 | "text/html": [ 826 | "
" 871 | ], 872 | "text/plain": [ 873 | " day\n", 874 | "0 2011-01-23\n", 875 | "1 2011-01-24\n", 876 | "2 2011-01-25\n", 877 | "3 2011-01-26\n", 878 | "4 2011-01-27" 879 | ] 880 | }, 881 | "execution_count": 829, 882 | "metadata": {}, 883 | "output_type": "execute_result" 884 | } 885 | ], 886 | "source": [ 887 | "d.head()" 888 | ] 889 | }, 890 | { 891 | "cell_type": "markdown", 892 | "metadata": {}, 893 | "source": [ 894 | "### Get the company list" 895 | ] 896 | }, 897 | { 898 | "cell_type": "code", 899 | "execution_count": 830, 900 | "metadata": {}, 901 | "outputs": [], 902 | "source": [ 903 | "company_list = data.company_id.unique()" 904 | ] 905 | }, 906 | { 907 | "cell_type": "code", 908 | "execution_count": 831, 909 | "metadata": {}, 910 | "outputs": [ 911 | { 912 | "data": { 913 | "text/plain": [ 914 | "array([ 7, 4, 2, 9, 1, 6, 10, 5, 3, 8, 11, 12], dtype=int64)" 915 | ] 916 | }, 917 | "execution_count": 831, 918 | "metadata": {}, 919 | "output_type": "execute_result" 920 | } 921 | ], 922 | "source": [ 923 | "company_list" 924 | ] 925 | }, 926 | { 927 | "cell_type": "code", 928 | "execution_count": 832, 929 | "metadata": {}, 930 | "outputs": [], 931 | "source": [ 932 | "company_list.sort()" 933 | ] 934 | }, 935 | { 936 | "cell_type": "code", 937 | "execution_count": 833, 938 | "metadata": {}, 939 | "outputs": [ 940 | { 941 | "data": { 942 | "text/plain": [ 943 | "array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], dtype=int64)" 944 | ] 945 | }, 946 | "execution_count": 833, 947 | "metadata": {}, 948 | "output_type": "execute_result" 949 | } 950 | ], 951 | "source": [ 952 | "company_list" 953 | ] 954 | }, 955 | { 956 | "cell_type": "code", 957 | "execution_count": 834, 958 | "metadata": {}, 959 | "outputs": [], 960 | "source": [ 961 | "c = pd.DataFrame(company_list,columns=['company_id'])" 962 | ] 963 | }, 964 | { 965 | "cell_type": "markdown", 966 | "metadata": {}, 967 | "source": [ 968 | "### Cross Join date and company list" 969 | ] 970 | }, 971 | { 972 | 
"cell_type": "code", 973 | "execution_count": 835, 974 | "metadata": {}, 975 | "outputs": [], 976 | "source": [ 977 | "# cross join: merge on a dummy column, then drop it\n", 978 | "headcount = d.assign(foo = 1).merge(c.assign(foo=1)).drop('foo', axis=1)" 979 | ] 980 | }, 981 | { 982 | "cell_type": "code", 983 | "execution_count": 836, 984 | "metadata": {}, 985 | "outputs": [ 986 | { 987 | "data": { 988 | "text/html": [ 989 | "
" 1040 | ], 1041 | "text/plain": [ 1042 | " day company_id\n", 1043 | "21427 2015-12-13 8\n", 1044 | "21428 2015-12-13 9\n", 1045 | "21429 2015-12-13 10\n", 1046 | "21430 2015-12-13 11\n", 1047 | "21431 2015-12-13 12" 1048 | ] 1049 | }, 1050 | "execution_count": 836, 1051 | "metadata": {}, 1052 | "output_type": "execute_result" 1053 | } 1054 | ], 1055 | "source": [ 1056 | "headcount.tail()" 1057 | ] 1058 | }, 1059 | { 1060 | "cell_type": "markdown", 1061 | "metadata": {}, 1062 | "source": [ 1063 | "### merge with new_hire and quit data" 1064 | ] 1065 | }, 1066 | { 1067 | "cell_type": "code", 1068 | "execution_count": 837, 1069 | "metadata": {}, 1070 | "outputs": [], 1071 | "source": [ 1072 | "headcount = (headcount.merge(new_hire_by_date, how='left',\\\n", 1073 | " on=['day','company_id']).fillna(0)).merge(quit_by_date, how='left',\\\n", 1074 | " on =['day','company_id']).fillna(0)" 1075 | ] 1076 | }, 1077 | { 1078 | "cell_type": "code", 1079 | "execution_count": 838, 1080 | "metadata": {}, 1081 | "outputs": [ 1082 | { 1083 | "data": { 1084 | "text/html": [ 1085 | "
" 1281 | ], 1282 | "text/plain": [ 1283 | " day company_id new_hire_count quit_count\n", 1284 | "0 2011-01-23 1 0.0 0.0\n", 1285 | "1 2011-01-23 2 0.0 0.0\n", 1286 | "2 2011-01-23 3 0.0 0.0\n", 1287 | "3 2011-01-23 4 0.0 0.0\n", 1288 | "4 2011-01-23 5 0.0 0.0\n", 1289 | "5 2011-01-23 6 0.0 0.0\n", 1290 | "6 2011-01-23 7 0.0 0.0\n", 1291 | "7 2011-01-23 8 0.0 0.0\n", 1292 | "8 2011-01-23 9 0.0 0.0\n", 1293 | "9 2011-01-23 10 0.0 0.0\n", 1294 | "10 2011-01-23 11 0.0 0.0\n", 1295 | "11 2011-01-23 12 0.0 0.0\n", 1296 | "12 2011-01-24 1 25.0 0.0\n", 1297 | "13 2011-01-24 2 17.0 0.0\n", 1298 | "14 2011-01-24 3 9.0 0.0\n", 1299 | "15 2011-01-24 4 12.0 0.0\n", 1300 | "16 2011-01-24 5 5.0 0.0\n", 1301 | "17 2011-01-24 6 3.0 0.0\n", 1302 | "18 2011-01-24 7 1.0 0.0\n", 1303 | "19 2011-01-24 8 6.0 0.0\n", 1304 | "20 2011-01-24 9 3.0 0.0\n", 1305 | "21 2011-01-24 10 0.0 0.0\n", 1306 | "22 2011-01-24 11 0.0 0.0\n", 1307 | "23 2011-01-24 12 0.0 0.0" 1308 | ] 1309 | }, 1310 | "execution_count": 838, 1311 | "metadata": {}, 1312 | "output_type": "execute_result" 1313 | } 1314 | ], 1315 | "source": [ 1316 | "headcount.head(24)" 1317 | ] 1318 | }, 1319 | { 1320 | "cell_type": "markdown", 1321 | "metadata": {}, 1322 | "source": [ 1323 | "Calculate the net headcount change per day" 1324 | ] 1325 | }, 1326 | { 1327 | "cell_type": "code", 1328 | "execution_count": 839, 1329 | "metadata": {}, 1330 | "outputs": [], 1331 | "source": [ 1332 | "headcount['head_count_net_change']=headcount.new_hire_count - headcount.quit_count" 1333 | ] 1334 | }, 1335 | { 1336 | "cell_type": "markdown", 1337 | "metadata": {}, 1338 | "source": [ 1339 | "### Answer #1: Get the headcount per day per company" 1340 | ] 1341 | }, 1342 | { 1343 | "cell_type": "markdown", 1344 | "metadata": {}, 1345 | "source": [ 1346 | "Get the cumulative sum of the net headcount change per day per company" 1347 | ] 1348 | }, 1349 | { 1350 | "cell_type": "code", 1351 | "execution_count": 840, 1352 | "metadata": {}, 1353 | "outputs": [], 1354 | "source": [ 
1355 | "cumsums = headcount[['company_id','head_count_net_change']].groupby(['company_id']).cumsum()" 1356 | ] 1357 | }, 1358 | { 1359 | "cell_type": "code", 1360 | "execution_count": 841, 1361 | "metadata": {}, 1362 | "outputs": [], 1363 | "source": [ 1364 | "cumsums.columns = ['head_count']" 1365 | ] 1366 | }, 1367 | { 1368 | "cell_type": "code", 1369 | "execution_count": 842, 1370 | "metadata": {}, 1371 | "outputs": [ 1372 | { 1373 | "data": { 1374 | "text/plain": [ 1375 | "21432" 1376 | ] 1377 | }, 1378 | "execution_count": 842, 1379 | "metadata": {}, 1380 | "output_type": "execute_result" 1381 | } 1382 | ], 1383 | "source": [ 1384 | "len(cumsums)" 1385 | ] 1386 | }, 1387 | { 1388 | "cell_type": "code", 1389 | "execution_count": 843, 1390 | "metadata": {}, 1391 | "outputs": [], 1392 | "source": [ 1393 | "headcount = pd.concat([headcount,cumsums], axis = 1)" 1394 | ] 1395 | }, 1396 | { 1397 | "cell_type": "code", 1398 | "execution_count": 844, 1399 | "metadata": {}, 1400 | "outputs": [ 1401 | { 1402 | "data": { 1403 | "text/html": [ 1404 | "
" 1479 | ], 1480 | "text/plain": [ 1481 | " day company_id new_hire_count quit_count \\\n", 1482 | "21427 2015-12-13 8 0.0 0.0 \n", 1483 | "21428 2015-12-13 9 0.0 0.0 \n", 1484 | "21429 2015-12-13 10 0.0 0.0 \n", 1485 | "21430 2015-12-13 11 0.0 0.0 \n", 1486 | "21431 2015-12-13 12 0.0 0.0 \n", 1487 | "\n", 1488 | " head_count_net_change head_count \n", 1489 | "21427 0.0 468.0 \n", 1490 | "21428 0.0 432.0 \n", 1491 | "21429 0.0 385.0 \n", 1492 | "21430 0.0 4.0 \n", 1493 | "21431 0.0 12.0 " 1494 | ] 1495 | }, 1496 | "execution_count": 844, 1497 | "metadata": {}, 1498 | "output_type": "execute_result" 1499 | } 1500 | ], 1501 | "source": [ 1502 | "headcount.tail()" 1503 | ] 1504 | }, 1505 | { 1506 | "cell_type": "markdown", 1507 | "metadata": {}, 1508 | "source": [ 1509 | "### Check the factors that drive employee churn" 1510 | ] 1511 | }, 1512 | { 1513 | "cell_type": "markdown", 1514 | "metadata": {}, 1515 | "source": [ 1516 | "### Check employment length" 1517 | ] 1518 | }, 1519 | { 1520 | "cell_type": "markdown", 1521 | "metadata": {}, 1522 | "source": [ 1523 | "Get the timedelta between the join and quit dates" 1524 | ] 1525 | }, 1526 | { 1527 | "cell_type": "code", 1528 | "execution_count": 845, 1529 | "metadata": {}, 1530 | "outputs": [], 1531 | "source": [ 1532 | "data['emp_length'] = data.quit_date-data.join_date" 1533 | ] 1534 | }, 1535 | { 1536 | "cell_type": "code", 1537 | "execution_count": 846, 1538 | "metadata": {}, 1539 | "outputs": [ 1540 | { 1541 | "data": { 1542 | "text/html": [ 1543 | "
" 1630 | ], 1631 | "text/plain": [ 1632 | " employee_id company_id dept seniority salary join_date \\\n", 1633 | "0 13021.0 7 customer_service 28 89000.0 2014-03-24 \n", 1634 | "1 825355.0 7 marketing 20 183000.0 2013-04-29 \n", 1635 | "2 927315.0 4 marketing 14 101000.0 2014-10-13 \n", 1636 | "3 662910.0 7 customer_service 20 115000.0 2012-05-14 \n", 1637 | "4 256971.0 2 data_science 23 276000.0 2011-10-17 \n", 1638 | "\n", 1639 | " quit_date emp_length \n", 1640 | "0 2015-10-30 585 days \n", 1641 | "1 2014-04-04 340 days \n", 1642 | "2 NaT NaT \n", 1643 | "3 2013-06-07 389 days \n", 1644 | "4 2014-08-22 1040 days " 1645 | ] 1646 | }, 1647 | "execution_count": 846, 1648 | "metadata": {}, 1649 | "output_type": "execute_result" 1650 | } 1651 | ], 1652 | "source": [ 1653 | "data.head()" 1654 | ] 1655 | }, 1656 | { 1657 | "cell_type": "code", 1658 | "execution_count": 847, 1659 | "metadata": {}, 1660 | "outputs": [ 1661 | { 1662 | "data": { 1663 | "text/plain": [ 1664 | "employee_id float64\n", 1665 | "company_id int64\n", 1666 | "dept object\n", 1667 | "seniority int64\n", 1668 | "salary float64\n", 1669 | "join_date datetime64[ns]\n", 1670 | "quit_date datetime64[ns]\n", 1671 | "emp_length timedelta64[ns]\n", 1672 | "dtype: object" 1673 | ] 1674 | }, 1675 | "execution_count": 847, 1676 | "metadata": {}, 1677 | "output_type": "execute_result" 1678 | } 1679 | ], 1680 | "source": [ 1681 | "data.dtypes" 1682 | ] 1683 | }, 1684 | { 1685 | "cell_type": "code", 1686 | "execution_count": 848, 1687 | "metadata": {}, 1688 | "outputs": [ 1689 | { 1690 | "data": { 1691 | "text/plain": [ 1692 | "count 13510\n", 1693 | "mean 613 days 11:41:01.643227\n", 1694 | "std 328 days 14:56:33.800149\n", 1695 | "min 102 days 00:00:00\n", 1696 | "25% 361 days 00:00:00\n", 1697 | "50% 417 days 00:00:00\n", 1698 | "75% 781 days 00:00:00\n", 1699 | "max 1726 days 00:00:00\n", 1700 | "Name: emp_length, dtype: object" 1701 | ] 1702 | }, 1703 | "execution_count": 848, 1704 | "metadata": {}, 1705 | 
"output_type": "execute_result" 1706 | } 1707 | ], 1708 | "source": [ 1709 | "data.emp_length.describe()" 1710 | ] 1711 | }, 1712 | { 1713 | "cell_type": "code", 1714 | "execution_count": 862, 1715 | "metadata": {}, 1716 | "outputs": [ 1717 | { 1718 | "data": { 1719 | "text/plain": [ 1720 | "375 days 370\n", 1721 | "361 days 368\n", 1722 | "354 days 367\n", 1723 | "368 days 333\n", 1724 | "382 days 325\n", 1725 | "Name: emp_length, dtype: int64" 1726 | ] 1727 | }, 1728 | "execution_count": 862, 1729 | "metadata": {}, 1730 | "output_type": "execute_result" 1731 | } 1732 | ], 1733 | "source": [ 1734 | "data.emp_length.value_counts().head()" 1735 | ] 1736 | }, 1737 | { 1738 | "cell_type": "code", 1739 | "execution_count": 860, 1740 | "metadata": {}, 1741 | "outputs": [ 1742 | { 1743 | "data": { 1744 | "text/plain": [ 1745 | "" 1746 | ] 1747 | }, 1748 | "execution_count": 860, 1749 | "metadata": {}, 1750 | "output_type": "execute_result" 1751 | }, 1752 | { 1753 | "data": { 1754 | "image/png": "\n", 1755 | "text/plain": [ 1756 | "" 1757 | ] 1758 | }, 1759 | "metadata": {}, 1760 | "output_type": "display_data" 1761 | } 1762 | ], 1763 | "source": [ 1764 | "# need to convert timedelta datatype to day or hour or min or second before plot\n", 1765 | "((data.emp_length.dropna() / np.timedelta64(1, 'D'))).hist(bins=100)" 1766 | ] 1767 | }, 1768 | { 1769 | "cell_type": "markdown", 1770 | "metadata": {}, 1771 | "source": [ 1772 | "Observation:
\n", 1773 | "- Very high churn rate at the beginning of the second year of employment
\n", 1774 | "- Relatively high churn rate between 1.5 and 2 years of employment" 1775 | ] 1776 | }, 1777 | { 1778 | "cell_type": "markdown", 1779 | "metadata": {}, 1780 | "source": [ 1781 | "### Dig deeper" 1782 | ] 1783 | }, 1784 | { 1785 | "cell_type": "markdown", 1786 | "metadata": {}, 1787 | "source": [ 1788 | "Since the pattern is so clear, let's dig deeper.
\n", 1789 | "Split employees into two groups: those who quit early and those who did not (employees who haven’t been at the current\n", 1790 | "company for at least 13 months are removed)
\n", 1791 | "Let's define early quitters as employees who quit before 13 months" 1792 | ] 1793 | }, 1794 | { 1795 | "cell_type": "code", 1796 | "execution_count": 930, 1797 | "metadata": {}, 1798 | "outputs": [], 1799 | "source": [ 1800 | "# get employees who quit before 13 months (365 + 30 days)\n", 1801 | "early_quitter = data[data.emp_length/np.timedelta64(1,'D') < 365+30]" 1802 | ] 1803 | }, 1804 | { 1805 | "cell_type": "code", 1806 | "execution_count": 929, 1807 | "metadata": {}, 1808 | "outputs": [ 1809 | { 1810 | "data": { 1811 | "text/html": [ 1812 | "
\n", 1813 | "\n", 1826 | "\n", 1827 | " \n", 1828 | " \n", 1829 | " \n", 1830 | " \n", 1831 | " \n", 1832 | " \n", 1833 | " \n", 1834 | " \n", 1835 | " \n", 1836 | " \n", 1837 | " \n", 1838 | " \n", 1839 | " \n", 1840 | " \n", 1841 | " \n", 1842 | " \n", 1843 | " \n", 1844 | " \n", 1845 | " \n", 1846 | " \n", 1847 | " \n", 1848 | " \n", 1849 | " \n", 1850 | " \n", 1851 | " \n", 1852 | " \n", 1853 | " \n", 1854 | " \n", 1855 | " \n", 1856 | " \n", 1857 | " \n", 1858 | " \n", 1859 | " \n", 1860 | " \n", 1861 | " \n", 1862 | " \n", 1863 | " \n", 1864 | " \n", 1865 | " \n", 1866 | " \n", 1867 | " \n", 1868 | " \n", 1869 | " \n", 1870 | " \n", 1871 | " \n", 1872 | " \n", 1873 | " \n", 1874 | " \n", 1875 | " \n", 1876 | " \n", 1877 | " \n", 1878 | " \n", 1879 | " \n", 1880 | " \n", 1881 | " \n", 1882 | " \n", 1883 | " \n", 1884 | " \n", 1885 | " \n", 1886 | " \n", 1887 | " \n", 1888 | " \n", 1889 | " \n", 1890 | " \n", 1891 | " \n", 1892 | " \n", 1893 | " \n", 1894 | " \n", 1895 | " \n", 1896 | " \n", 1897 | "
employee_idcompany_iddeptsenioritysalaryjoin_datequit_dateemp_length
1825355.07marketing20183000.02013-04-292014-04-04340 days
3662910.07customer_service20115000.02012-05-142013-06-07389 days
12939058.01marketing148000.02012-12-102013-11-15340 days
14461248.02sales20201000.02013-09-162014-08-22340 days
21219944.06customer_service1598000.02012-06-252013-05-31340 days
\n", 1898 | "
" 1899 | ], 1900 | "text/plain": [ 1901 | " employee_id company_id dept seniority salary join_date \\\n", 1902 | "1 825355.0 7 marketing 20 183000.0 2013-04-29 \n", 1903 | "3 662910.0 7 customer_service 20 115000.0 2012-05-14 \n", 1904 | "12 939058.0 1 marketing 1 48000.0 2012-12-10 \n", 1905 | "14 461248.0 2 sales 20 201000.0 2013-09-16 \n", 1906 | "21 219944.0 6 customer_service 15 98000.0 2012-06-25 \n", 1907 | "\n", 1908 | " quit_date emp_length \n", 1909 | "1 2014-04-04 340 days \n", 1910 | "3 2013-06-07 389 days \n", 1911 | "12 2013-11-15 340 days \n", 1912 | "14 2014-08-22 340 days \n", 1913 | "21 2013-05-31 340 days " 1914 | ] 1915 | }, 1916 | "execution_count": 929, 1917 | "metadata": {}, 1918 | "output_type": "execute_result" 1919 | } 1920 | ], 1921 | "source": [ 1922 | "early_quitter.head()" 1923 | ] 1924 | }, 1925 | { 1926 | "cell_type": "code", 1927 | "execution_count": 931, 1928 | "metadata": {}, 1929 | "outputs": [], 1930 | "source": [ 1931 | "last_day = pd.to_datetime(\"2015-12-13\")" 1932 | ] 1933 | }, 1934 | { 1935 | "cell_type": "code", 1936 | "execution_count": 932, 1937 | "metadata": {}, 1938 | "outputs": [ 1939 | { 1940 | "data": { 1941 | "text/plain": [ 1942 | "Timestamp('2015-12-13 00:00:00')" 1943 | ] 1944 | }, 1945 | "execution_count": 932, 1946 | "metadata": {}, 1947 | "output_type": "execute_result" 1948 | } 1949 | ], 1950 | "source": [ 1951 | "last_day" 1952 | ] 1953 | }, 1954 | { 1955 | "cell_type": "code", 1956 | "execution_count": 944, 1957 | "metadata": {}, 1958 | "outputs": [], 1959 | "source": [ 1960 | "# get the data not early quitter and exclude the ones employed less than 13 months\n", 1961 | "longer_emp = data[((last_day - data.join_date)/np.timedelta64(1,'D') >365+30)\\\n", 1962 | " &(data.emp_length/np.timedelta64(1,'D') > 365+30)]" 1963 | ] 1964 | }, 1965 | { 1966 | "cell_type": "code", 1967 | "execution_count": 945, 1968 | "metadata": {}, 1969 | "outputs": [ 1970 | { 1971 | "data": { 1972 | "text/html": [ 1973 | "
\n", 1974 | "\n", 1987 | "\n", 1988 | " \n", 1989 | " \n", 1990 | " \n", 1991 | " \n", 1992 | " \n", 1993 | " \n", 1994 | " \n", 1995 | " \n", 1996 | " \n", 1997 | " \n", 1998 | " \n", 1999 | " \n", 2000 | " \n", 2001 | " \n", 2002 | " \n", 2003 | " \n", 2004 | " \n", 2005 | " \n", 2006 | " \n", 2007 | " \n", 2008 | " \n", 2009 | " \n", 2010 | " \n", 2011 | " \n", 2012 | " \n", 2013 | " \n", 2014 | " \n", 2015 | " \n", 2016 | " \n", 2017 | " \n", 2018 | " \n", 2019 | " \n", 2020 | " \n", 2021 | " \n", 2022 | " \n", 2023 | " \n", 2024 | " \n", 2025 | " \n", 2026 | " \n", 2027 | " \n", 2028 | " \n", 2029 | " \n", 2030 | " \n", 2031 | " \n", 2032 | " \n", 2033 | " \n", 2034 | " \n", 2035 | " \n", 2036 | " \n", 2037 | " \n", 2038 | " \n", 2039 | " \n", 2040 | " \n", 2041 | " \n", 2042 | " \n", 2043 | " \n", 2044 | " \n", 2045 | " \n", 2046 | " \n", 2047 | " \n", 2048 | " \n", 2049 | " \n", 2050 | " \n", 2051 | " \n", 2052 | " \n", 2053 | " \n", 2054 | " \n", 2055 | " \n", 2056 | " \n", 2057 | " \n", 2058 | "
employee_idcompany_iddeptsenioritysalaryjoin_datequit_dateemp_length
013021.07customer_service2889000.02014-03-242015-10-30585 days
4256971.02data_science23276000.02011-10-172014-08-221040 days
5509529.04data_science14165000.02012-01-302013-08-30578 days
8172999.09engineer7160000.02012-12-102015-10-231047 days
10892155.06customer_service1372000.02012-11-122015-02-27837 days
\n", 2059 | "
" 2060 | ], 2061 | "text/plain": [ 2062 | " employee_id company_id dept seniority salary join_date \\\n", 2063 | "0 13021.0 7 customer_service 28 89000.0 2014-03-24 \n", 2064 | "4 256971.0 2 data_science 23 276000.0 2011-10-17 \n", 2065 | "5 509529.0 4 data_science 14 165000.0 2012-01-30 \n", 2066 | "8 172999.0 9 engineer 7 160000.0 2012-12-10 \n", 2067 | "10 892155.0 6 customer_service 13 72000.0 2012-11-12 \n", 2068 | "\n", 2069 | " quit_date emp_length \n", 2070 | "0 2015-10-30 585 days \n", 2071 | "4 2014-08-22 1040 days \n", 2072 | "5 2013-08-30 578 days \n", 2073 | "8 2015-10-23 1047 days \n", 2074 | "10 2015-02-27 837 days " 2075 | ] 2076 | }, 2077 | "execution_count": 945, 2078 | "metadata": {}, 2079 | "output_type": "execute_result" 2080 | } 2081 | ], 2082 | "source": [ 2083 | "longer_emp.head()" 2084 | ] 2085 | }, 2086 | { 2087 | "cell_type": "markdown", 2088 | "metadata": {}, 2089 | "source": [ 2090 | "might use decision tree here to model it" 2091 | ] 2092 | }, 2093 | { 2094 | "cell_type": "code", 2095 | "execution_count": 949, 2096 | "metadata": {}, 2097 | "outputs": [ 2098 | { 2099 | "data": { 2100 | "text/plain": [ 2101 | "count 5654.000000\n", 2102 | "mean 131393.880439\n", 2103 | "std 65464.211853\n", 2104 | "min 17000.000000\n", 2105 | "25% 81000.000000\n", 2106 | "50% 122000.000000\n", 2107 | "75% 173000.000000\n", 2108 | "max 372000.000000\n", 2109 | "Name: salary, dtype: float64" 2110 | ] 2111 | }, 2112 | "execution_count": 949, 2113 | "metadata": {}, 2114 | "output_type": "execute_result" 2115 | } 2116 | ], 2117 | "source": [ 2118 | "early_quitter.salary.describe()" 2119 | ] 2120 | }, 2121 | { 2122 | "cell_type": "code", 2123 | "execution_count": 950, 2124 | "metadata": {}, 2125 | "outputs": [ 2126 | { 2127 | "data": { 2128 | "text/plain": [ 2129 | "count 7795.000000\n", 2130 | "mean 138768.313021\n", 2131 | "std 75379.904785\n", 2132 | "min 19000.000000\n", 2133 | "25% 80000.000000\n", 2134 | "50% 123000.000000\n", 2135 | "75% 
187000.000000\n", 2136 | "max 379000.000000\n", 2137 | "Name: salary, dtype: float64" 2138 | ] 2139 | }, 2140 | "execution_count": 950, 2141 | "metadata": {}, 2142 | "output_type": "execute_result" 2143 | } 2144 | ], 2145 | "source": [ 2146 | "longer_emp.salary.describe()" 2147 | ] 2148 | }, 2149 | { 2150 | "cell_type": "markdown", 2151 | "metadata": {}, 2152 | "source": [ 2153 | "### Check week of year-quit time" 2154 | ] 2155 | }, 2156 | { 2157 | "cell_type": "code", 2158 | "execution_count": 850, 2159 | "metadata": {}, 2160 | "outputs": [ 2161 | { 2162 | "data": { 2163 | "text/html": [ 2164 | "
\n", 2165 | "\n", 2178 | "\n", 2179 | " \n", 2180 | " \n", 2181 | " \n", 2182 | " \n", 2183 | " \n", 2184 | " \n", 2185 | " \n", 2186 | " \n", 2187 | " \n", 2188 | " \n", 2189 | " \n", 2190 | " \n", 2191 | " \n", 2192 | " \n", 2193 | " \n", 2194 | " \n", 2195 | " \n", 2196 | " \n", 2197 | " \n", 2198 | " \n", 2199 | " \n", 2200 | " \n", 2201 | " \n", 2202 | " \n", 2203 | " \n", 2204 | " \n", 2205 | " \n", 2206 | " \n", 2207 | " \n", 2208 | " \n", 2209 | " \n", 2210 | " \n", 2211 | " \n", 2212 | " \n", 2213 | " \n", 2214 | " \n", 2215 | " \n", 2216 | " \n", 2217 | " \n", 2218 | " \n", 2219 | " \n", 2220 | " \n", 2221 | " \n", 2222 | " \n", 2223 | " \n", 2224 | " \n", 2225 | " \n", 2226 | " \n", 2227 | " \n", 2228 | " \n", 2229 | " \n", 2230 | " \n", 2231 | " \n", 2232 | " \n", 2233 | " \n", 2234 | " \n", 2235 | " \n", 2236 | " \n", 2237 | " \n", 2238 | " \n", 2239 | " \n", 2240 | " \n", 2241 | " \n", 2242 | " \n", 2243 | " \n", 2244 | " \n", 2245 | " \n", 2246 | " \n", 2247 | " \n", 2248 | " \n", 2249 | "
employee_idcompany_iddeptsenioritysalaryjoin_datequit_dateemp_length
013021.07customer_service2889000.02014-03-242015-10-30585 days
1825355.07marketing20183000.02013-04-292014-04-04340 days
2927315.04marketing14101000.02014-10-13NaTNaT
3662910.07customer_service20115000.02012-05-142013-06-07389 days
4256971.02data_science23276000.02011-10-172014-08-221040 days
\n", 2250 | "
" 2251 | ], 2252 | "text/plain": [ 2253 | " employee_id company_id dept seniority salary join_date \\\n", 2254 | "0 13021.0 7 customer_service 28 89000.0 2014-03-24 \n", 2255 | "1 825355.0 7 marketing 20 183000.0 2013-04-29 \n", 2256 | "2 927315.0 4 marketing 14 101000.0 2014-10-13 \n", 2257 | "3 662910.0 7 customer_service 20 115000.0 2012-05-14 \n", 2258 | "4 256971.0 2 data_science 23 276000.0 2011-10-17 \n", 2259 | "\n", 2260 | " quit_date emp_length \n", 2261 | "0 2015-10-30 585 days \n", 2262 | "1 2014-04-04 340 days \n", 2263 | "2 NaT NaT \n", 2264 | "3 2013-06-07 389 days \n", 2265 | "4 2014-08-22 1040 days " 2266 | ] 2267 | }, 2268 | "execution_count": 850, 2269 | "metadata": {}, 2270 | "output_type": "execute_result" 2271 | } 2272 | ], 2273 | "source": [ 2274 | "data.head()" 2275 | ] 2276 | }, 2277 | { 2278 | "cell_type": "code", 2279 | "execution_count": 868, 2280 | "metadata": {}, 2281 | "outputs": [], 2282 | "source": [ 2283 | "# get week of the year\n", 2284 | "week = data.quit_date.dropna().dt.week" 2285 | ] 2286 | }, 2287 | { 2288 | "cell_type": "code", 2289 | "execution_count": 869, 2290 | "metadata": {}, 2291 | "outputs": [ 2292 | { 2293 | "data": { 2294 | "text/plain": [ 2295 | "" 2296 | ] 2297 | }, 2298 | "execution_count": 869, 2299 | "metadata": {}, 2300 | "output_type": "execute_result" 2301 | }, 2302 | { 2303 | "data": { 2304 | "image/png": "\n", 2305 | "text/plain": [ 2306 | "" 2307 | ] 2308 | }, 2309 | "metadata": {}, 2310 | "output_type": "display_data" 2311 | } 2312 | ], 2313 | "source": [ 2314 | "week.hist(bins = 100)" 2315 | ] 2316 | }, 2317 | { 2318 | "cell_type": "markdown", 2319 | "metadata": {}, 2320 | "source": [ 2321 | "Observation:
\n", 2322 | "No significant pattern" 2323 | ] 2324 | }, 2325 | { 2326 | "cell_type": "markdown", 2327 | "metadata": {}, 2328 | "source": [ 2329 | "### Check whether department matters" 2330 | ] 2331 | }, 2332 | { 2333 | "cell_type": "code", 2334 | "execution_count": 876, 2335 | "metadata": {}, 2336 | "outputs": [], 2337 | "source": [ 2338 | "# departments of the employees who quit\n", 2339 | "dept_q = data[data['quit_date'].notnull()].dept" 2340 | ] 2341 | }, 2342 | { 2343 | "cell_type": "code", 2344 | "execution_count": 879, 2345 | "metadata": {}, 2346 | "outputs": [ 2347 | { 2348 | "data": { 2349 | "text/plain": [ 2350 | "customer_service 0.554902\n", 2351 | "data_science 0.527273\n", 2352 | "design 0.563768\n", 2353 | "engineer 0.512031\n", 2354 | "marketing 0.562993\n", 2355 | "sales 0.570933\n", 2356 | "Name: dept, dtype: float64" 2357 | ] 2358 | }, 2359 | "execution_count": 879, 2360 | "metadata": {}, 2361 | "output_type": "execute_result" 2362 | } 2363 | ], 2364 | "source": [ 2365 | "# fraction of employees who churned in each dept\n", 2366 | "dept_q.value_counts()/data.dept.value_counts()" 2367 | ] 2368 | }, 2369 | { 2370 | "cell_type": "markdown", 2371 | "metadata": {}, 2372 | "source": [ 2373 | "Observation:\n", 2374 | "No significant difference" 2375 | ] 2376 | }, 2377 | { 2378 | "cell_type": "markdown", 2379 | "metadata": {}, 2380 | "source": [ 2381 | "### Seniority" 2382 | ] 2383 | }, 2384 | { 2385 | "cell_type": "code", 2386 | "execution_count": 888, 2387 | "metadata": {}, 2388 | "outputs": [], 2389 | "source": [ 2390 | "s = data[data['quit_date'].notnull()].seniority.value_counts()/data.seniority.value_counts()" 2391 | ] 2392 | }, 2393 | { 2394 | "cell_type": "code", 2395 | "execution_count": 889, 2396 | "metadata": {}, 2397 | "outputs": [ 2398 | { 2399 | "data": { 2400 | "text/plain": [ 2401 | "1 0.499419\n", 2402 | "2 0.530786\n", 2403 | "3 0.507378\n", 2404 | "4 0.471508\n", 2405 | "5 0.569444\n", 2406 | "6 0.601053\n", 2407 | "7 0.550647\n", 2408 | "8 0.581349\n", 2409 | "9 0.552966\n", 2410 
| "10 0.564186\n", 2411 | "11 0.554113\n", 2412 | "12 0.590081\n", 2413 | "13 0.559284\n", 2414 | "14 0.552174\n", 2415 | "15 0.554336\n", 2416 | "16 0.570513\n", 2417 | "17 0.535274\n", 2418 | "18 0.524083\n", 2419 | "19 0.546154\n", 2420 | "20 0.555687\n", 2421 | "21 0.581841\n", 2422 | "22 0.530105\n", 2423 | "23 0.547771\n", 2424 | "24 0.535666\n", 2425 | "25 0.563636\n", 2426 | "26 0.517291\n", 2427 | "27 0.532710\n", 2428 | "28 0.547009\n", 2429 | "29 0.492013\n", 2430 | "98 1.000000\n", 2431 | "99 1.000000\n", 2432 | "Name: seniority, dtype: float64" 2433 | ] 2434 | }, 2435 | "execution_count": 889, 2436 | "metadata": {}, 2437 | "output_type": "execute_result" 2438 | } 2439 | ], 2440 | "source": [ 2441 | "s" 2442 | ] 2443 | }, 2444 | { 2445 | "cell_type": "code", 2446 | "execution_count": 890, 2447 | "metadata": {}, 2448 | "outputs": [], 2449 | "source": [ 2450 | "s_df = pd.DataFrame(s)" 2451 | ] 2452 | }, 2453 | { 2454 | "cell_type": "code", 2455 | "execution_count": 892, 2456 | "metadata": { 2457 | "scrolled": true 2458 | }, 2459 | "outputs": [ 2460 | { 2461 | "data": { 2462 | "text/plain": [ 2463 | "" 2464 | ] 2465 | }, 2466 | "execution_count": 892, 2467 | "metadata": {}, 2468 | "output_type": "execute_result" 2469 | }, 2470 | { 2471 | "data": { 2472 | "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAXcAAAD/CAYAAAAKVJb/AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAFo9JREFUeJzt3X20VfV95/H3V0ABwcTCzYMiuawWGhGf6i1JYxKZqBGiIpOYWZJkSFxJWJmp0tTgqEtHjdOH1Jmpba0xNa0mcZKg0Y6wxqsmqZJqfQKMQRFZpZToHcyEosUkHSs4v/ljb8xhc+4951zO4dz78/1aay/2w/f89u/se/bn7L3PPodIKSFJystB3e6AJKn9DHdJypDhLkkZMtwlKUOGuyRlyHCXpAwZ7pKUIcNdkjJkuEtShgx3ScrQ2G6teOrUqam3t7dbq5ekUWndunX/lFLqaVTXtXDv7e1l7dq13Vq9JI1KEfHjZuq8LCNJGTLcJSlDhrskZahr19zr2bVrFwMDA7zyyivd7sqoMn78eKZNm8a4ceO63RVJI0TDcI+Im4GzgJ+mlObUWR7AnwIfAv4F+FRK6YnhdGZgYIDJkyfT29tL0awaSSmxY8cOBgYGmDFjRre7I2mEaOayzNeA+UMsXwDMLIelwI3D7cwrr7zClClTDPYWRARTpkzxbEfSXhqGe0rpb4EXhyg5B/hGKjwKvDki3j7cDhnsrXObSapqxweqRwLP10wPlPPesNauXcuyZcuG/ZjVq1fz8MMPd6Jrkt4g2vGBar3Dxrr/63ZELKW4dMP06dMbNtx76d371bGqrV86s63tDaavr4++vr6m63fv3r3XY1avXs2kSZN4z3ve06kuSjqA6mXZYHnUSu1Q2nHkPgAcVTM9DdhWrzCldFNKqS+l1NfT0/Dbs13xi1/8gjPPPJPjjz+eOXPmcNttt7Fu3TpOOeUUTjrpJM444wxeeOEFAObNm8cll1zC3LlzmTVrFg8++CBQhPNZZ50FwIsvvsiiRYs47rjjePe738369esBuPrqq1m6dCkf/OAHWbJkyeuP2bp1K1/5yle47rrrOOGEE3jwwQeZMWMGu3btAuDll1+mt7f39WlJqqcdR+6rgAsiYgXwLmBnSumFNrTbFffeey9HHHEEd99dvHvu3LmTBQsWsHLlSnp6erjtttu4/PLLufnmm4HiqPvxxx+nv7+fL37xi3z/+9/fq72rrrqKE088kbvuuov777+fJUuW8OSTTwKwbt06HnroISZMmMDq1auB4mcZPve5zzFp0iSWL18OFG8id999N4sWLWLFihV85CMf8bZHSUNq5lbIbwPzgKkRMQBcBYwDSCl9BeinuA1yM8WtkOd3qrMHwrHHHsvy5cu55JJLOOusszj88MN5+umnOf300wF47bXXePvbf/l58Yc//GEATjrpJLZu3bpPew899BB33nknAB/4wAfYsWMHO3fuBGDhwoVMmDChYZ8+85nPcO2117Jo0SJuueUWvvrVr+7v05SUuYbhnlJa3GB5An67bT3qslmzZrFu3Tr6+/u57LLLOP300znmmGN45JFH6tYfcsghAIwZM4bdu3fvs7zYPHvbc3fLoYce2lSfTj75ZLZu3coPfvADXnvtNebM2efrBpK0F39+oGLbtm1MnDiRT3ziEyxfvpzHHnuM7du3vx7uu3btYsOGDU239/73v59vfvObQHEtfurUqRx22GFDPmby5Mn87Gc/22vekiVLWLx4MeefP6pPjCQdICPq5wdGgqeeeoqLL76Ygw46iHHjxnHjjTcyduxYli1bxs6dO9m9ezef//znOeaYY5pq7+qrr+b888/nuOOOY+LEiXz9619v+Jizzz6bc889l5UrV3L99dfzvve9j49//ONcccUVLF485ImUJAEQ9S4bHAh9fX2p+nvuGzdu5Oijj+5Kf0a6O+64g5UrV3LrrbfWXe62k0audt4KGRHrUkoN77X2yH0UuPDCC7nnnnvo7+/vdlckjRKG+yhw/fXXd7sLkkYZP1CVpAyNuHDv1mcAo5nbTFLViAr38ePHs2PHDsO
qBXt+z338+PHd7oqkEWREXXOfNm0aAwMDbN++vdtdGVX2/E9MkrTHiAr3cePG+b8JSVIbjKjLMpKk9jDcJSlDhrskZchwl6QMGe6SlCHDXZIyZLhLUoYMd0nKkOEuSRky3CUpQ4a7JGXIcJekDBnukpQhw12SMmS4S1KGDHdJypDhLkkZMtwlKUOGuyRlyHCXpAwZ7pKUIcNdkjLUVLhHxPyI2BQRmyPi0jrLp0fEAxHxw4hYHxEfan9XJUnNahjuETEGuAFYAMwGFkfE7ErZFcDtKaUTgfOAL7e7o5Kk5jVz5D4X2JxS2pJSehVYAZxTqUnAYeX4m4Bt7euiJKlVY5uoORJ4vmZ6AHhXpeZq4LsRcSFwKHBaW3onSRqWZo7co868VJleDHwtpTQN+BBwa0Ts03ZELI2ItRGxdvv27a33VpLUlGbCfQA4qmZ6Gvtedvk0cDtASukRYDwwtdpQSummlFJfSqmvp6dneD2WJDXUTLivAWZGxIyIOJjiA9NVlZrngFMBIuJoinD30FySuqRhuKeUdgMXAPcBGynuitkQEddExMKy7AvAZyPiR8C3gU+llKqXbiRJB0gzH6iSUuoH+ivzrqwZfwY4ub1dkyQNl99QlaQMGe6SlCHDXZIyZLhLUoYMd0nKkOEuSRky3CUpQ4a7JGXIcJekDBnukpQhw12SMmS4S1KGDHdJypDhLkkZMtwlKUOGuyRlyHCXpAwZ7pKUIcNdkjJkuEtShgx3ScqQ4S5JGTLcJSlDhrskZchwl6QMGe6SlCHDXZIyZLhLUoYMd0nKkOEuSRky3CUpQ4a7JGXIcJekDDUV7hExPyI2RcTmiLh0kJp/FxHPRMSGiPhWe7spSWrF2EYFETEGuAE4HRgA1kTEqpTSMzU1M4HLgJNTSi9FxFs61WFJUmPNHLnPBTanlLaklF4FVgDnVGo+C9yQUnoJIKX00/Z2U5LUimbC/Ujg+ZrpgXJerVnArIj4u4h4NCLmt6uDkqTWNbwsA0SdealOOzOBecA04MGImJNS+ue9GopYCiwFmD59esudHW16L717n3lbv3RmF3oi6Y2mmSP3AeComulpwLY6NStTSrtSSv8IbKII+72klG5KKfWllPp6enqG22dJUgPNHLmvAWZGxAzgfwPnAR+r1NwFLAa+FhFTKS7TbGlnRzU6efYidUfDI/eU0m7gAuA+YCNwe0ppQ0RcExELy7L7gB0R8QzwAHBxSmlHpzotSRpaM0fupJT6gf7KvCtrxhNwUTmog+odCYNHwxqcZ09vTH5DVZIy1NSRu37Jo6B8+bdVTgx3AQabRgZfh+1juGtE8LMEqb0Md7VsNB1djaa+tiLX56X2MdxHiE7srLkeDef6vKR2MtylEWI0vWmNpr62Iqfn5a2QkpQhj9ylYfCad/NyOhoeTUZduLtTSZ2T6/6V6/MaipdlJClDo+7IXRpNvCShbjHcJY1Kb8RLLa3wsowkZSjbI3dPhyW9kXnkLkkZGhFH7l47k6T2GhHhLkmjzUi/9OtlGUnKkOEuSRky3CUpQ15zZ+RfO5OkVnnkLkkZMtwlKUOGuyRlyHCXpAwZ7pKUIcNdkjJkuEtShgx3ScqQ4S5JGTLcJSlDTYV7RMyPiE0RsTkiLh2i7tyISBHR174uSpJa1TDcI2IMcAOwAJgNLI6I2XXqJgPLgMfa3UlJUmuaOXKfC2xOKW1JKb0KrADOqVP3X4BrgVfa2D9J0jA0E+5HAs/XTA+U814XEScCR6WU/lcb+yZJGqZmwj3qzEuvL4w4CLgO+ELDhiKWRsTaiFi7ffv25nspSWpJM+E+ABxVMz0N2FYzPRmYA6yOiK3Au4FV9T5UTSndlFLqSyn19fT0DL/XkqQhNRPua4CZETEjIg4GzgNW7VmYUtqZUpqaUupNKfUCjwILU0prO9JjSVJDDcM9pbQbuAC4D9gI3J5S2hAR10TEwk53UJLUuqb+m72UUj/QX5l35SC18/a/W5Kk/eE
3VCUpQ4a7JGXIcJekDBnukpQhw12SMmS4S1KGDHdJypDhLkkZMtwlKUOGuyRlyHCXpAwZ7pKUIcNdkjJkuEtShgx3ScqQ4S5JGTLcJSlDhrskZchwl6QMGe6SlCHDXZIyZLhLUoYMd0nKkOEuSRky3CUpQ4a7JGXIcJekDBnukpQhw12SMmS4S1KGDHdJypDhLkkZMtwlKUNNhXtEzI+ITRGxOSIurbP8ooh4JiLWR8TfRMQ72t9VSVKzGoZ7RIwBbgAWALOBxRExu1L2Q6AvpXQccAdwbbs7KklqXjNH7nOBzSmlLSmlV4EVwDm1BSmlB1JK/1JOPgpMa283JUmtaCbcjwSer5keKOcN5tPAPfvTKUnS/hnbRE3UmZfqFkZ8AugDThlk+VJgKcD06dOb7KIkqVXNHLkPAEfVTE8DtlWLIuI04HJgYUrpX+s1lFK6KaXUl1Lq6+npGU5/JUlNaCbc1wAzI2JGRBwMnAesqi2IiBOBv6AI9p+2v5uSpFY0DPeU0m7gAuA+YCNwe0ppQ0RcExELy7L/CkwCvhMRT0bEqkGakyQdAM1ccyel1A/0V+ZdWTN+Wpv7JUnaD35DVZIyZLhLUoYMd0nKkOEuSRky3CUpQ4a7JGXIcJekDBnukpQhw12SMmS4S1KGDHdJypDhLkkZMtwlKUOGuyRlyHCXpAwZ7pKUIcNdkjJkuEtShgx3ScqQ4S5JGTLcJSlDhrskZchwl6QMGe6SlCHDXZIyZLhLUoYMd0nKkOEuSRky3CUpQ4a7JGXIcJekDBnukpShpsI9IuZHxKaI2BwRl9ZZfkhE3FYufywietvdUUlS8xqGe0SMAW4AFgCzgcURMbtS9mngpZTSrwHXAX/U7o5KkprXzJH7XGBzSmlLSulVYAVwTqXmHODr5fgdwKkREe3rpiSpFc2E+5HA8zXTA+W8ujUppd3ATmBKOzooSWpdpJSGLoj4KHBGSukz5fS/B+amlC6sqdlQ1gyU0/9Q1uyotLUUWFpO/jqwqbK6qcA/Ndn30VTb7fV3qrbb6+9UbbfX36nabq+/U7XdXn+nagere0dKqafho1NKQw7AbwH31UxfBlxWqbkP+K1yfGzZoWjUdp11rc2xttvr93n5vEbC+n1enXte9YZmLsusAWZGxIyIOBg4D1hVqVkFfLIcPxe4P5W9kyQdeGMbFaSUdkfEBRRH52OAm1NKGyLiGop3llXAXwG3RsRm4EWKNwBJUpc0DHeAlFI/0F+Zd2XN+CvAR9vQn5syre32+jtV2+31d6q22+vvVG2319+p2m6vv1O1rbS5j4YfqEqSRh9/fkCSMmS4S1KGsgz3iJgbEb9Zjs+OiIsi4kNNPO4bne/d8ETEwRGxJCJOK6c/FhF/HhG/HRHjut0/SSPLqLnmHhHvpPgm7GMppZ/XzJ+fUrq3Zvoqit/BGQt8D3gXsBo4jeJ+/d8v66q3cwbwb4D7AVJKC4foy3spfpbh6ZTSdyvL3gVsTCm9HBETgEuB3wCeAf4gpbSzpnYZ8D9TSrXfAB5snd8sn9NE4J+BScBfA6dS/B0/Wan/VeDfAkcBu4G/B75du37pQIuIt6SUftrmNqekyhcmReMvMXVrAM6vGV9G8W3Wu4CtwDk1y56oPO4pils2JwIvA4eV8ycA62sfB/wPYB5wSvnvC+X4KZU2H68Z/yzwJHAV8HfApZXaDcDYcvwm4E+A95b1f12p3QlsAx4E/iPQM8T2WF/+Oxb4P8CYcjpqn1fN9voecAXwMPBl4Pcp3mDmdftv2+bXyVs61O6Ubj+3On16E/Al4FlgRzlsLOe9uYV27qlMHwb8IXAr8LHKsi9Xpt8G3EjxY4JTgKvLfe524O2V2l+pDFPK/fdw4Fdq6uZXnuNfAeuBbwFvrbT5JWBqOd4HbAE2Az+us98+Ue4Dv9rENukDHigz4ahy/9lJ8T2fEyu1k4Bryn19J7AdeBT4VKfbbOn10u0X7BAb+7ma8aeASeV4L7A
W+J1y+oeVx/2w3ng5/WTN+EHA75Yb/IRy3pZB+lLb5hrKEAYOBZ6q1G6sfXENtv497Zb9+GD5gt4O3EvxhbDJldqngYPLHeNne3YOYHztOmu2157wnwisLsen19kmbQ8MuhwWZW3bA4Puh8V9wCXA2yrb7xLge5Xa3xhkOAl4oVJ7Z7kNFlF8IfFO4JBBXsP3AhdSnJGuL9c9vZy3slL7/4B/rAy7yn+31G7XmvG/BH4PeAfF/nlX9bVdM/4A8Jvl+Cwq3+gs1/PfgOeAx8v2jhjk7/U4xRn/YorfyTq3nH8q8EildiXwKWAacBHwn4GZFD+e+AedbLOVodsBvn6Q4SngX2vqnqmzM9wL/DH7BuZjwMRy/KDKDv5EnT5MA74D/Dk1byiVmh9RBMiUOi+galh+h/KsA7gF6Kt58a2p1FZ3nHHAQuDbwPbKst+lCJ0fUxyZ/w3w1XJbXVXdAfjlznk4sK5m2dOV2rYHBl0Oi+p6aFNg0P2w2DTEvrSpMv0axSXGB+oM/7dSW92HLqc4K51S5+9Ve6DzXIN2lpd/32Nrt2Gdvj8xRBvV6Wf55Znxo4P9Heu0+z6KM9iflNtgaQvPq7qP/6gyvab89yDg2U622crQlpAe7kBxeeGEcqerHXqBbTV191MeXdfMGwt8A3itMv+QQdY1tfZFVmf5mQzyDklxdLiFMkQog5DiTab64nsT8DXgHyjeaHaVj/kBcPxQf+DKsgl15h1BGSbAmyl+6mFunbrfoQjKm8qdYc+bTQ/wt5XatgdGnW1yQMOinN/2wKD7YfFd4D9Rc+YBvJXiDfH7lTaeBmYOsm2er0xvpOZAqJz3SYoziR8P1lfg9wbbVjXz9hw8/TEwmTpnxxS/NHsR8IVyX4maZdVLjheW2+EDFGd5fwK8H/gicOtgr4GaeWOA+cAtlfmPUJxBf5TiAGpROf8U9j2gexh4bzl+Nnv/9tamNrS5cLA2WxlafkA7B4pT5fcOsuxblRfI2wapO7mL/Z8IzBhk2WTgeIqj2rcOUjOrg307hiL839mgru2B0e2wKOvaHhgjICwOp/iPcJ4FXqL4qY+N5bzqZalzgV8fZNssqkxfC5xWp24+8PeVeddQXiKtzP814I4hXmdnU1xq+kmdZVdVhj2XPd8GfKNO/TzgNorLmk9RfHt+KTCuUreihf3leIqz2HuAdwJ/SnHjwgbgPXVqHy+XP7RnO1McPC1r0OZLZZsnN2hzVr02Wxk6EiwOo2eoBMaLlcA4vFLbVGB0OyzK5e0KjLE1NZ0Ki+Oa3bHLtk6rbjNqPmOo1J66n7UL2tUuxU0Nc+rVtqmv+1t7dIu1Df8O5ev44vLv/9+B/wC8aZDXzJ7aPytrPzdYbVOvweE+0CH/gZo7ltpV2842K2HR9r5263kNVktrd421Unthh2qb6kMn2hxmu8+2s7as+y5N3LXWSm3Tr53hPMjhjTEwyAfM+1PbiTZHQu2BWD+t3TU2amq7vf4OP69m71prurbZoalfhVS+ImL9YIsorr23XNuJNkdCbbfXT7Hz/xwgpbQ1IuYBd0TEO8paRmltt9ffydqxFDciHELxGREppecG+VZ5K7UNGe56K3AGxQc9tYLi9HA4tZ1ocyTUdnv9P4mIE1JKTwKklH4eEWcBNwPHVh47mmq7vf5O1f4lsCYiHqX4gP6PACKih+KzLYZZ25zhHO475DPQ5B1LrdR2os2RUDsC1t/0XWOjqbbb6+9wbVN3rbVa28wwan5bRpLUvCx/FVKS3ugMd0nKkOEuSRky3CUpQ4a7JGXo/wN76dTXKFbruQAAAABJRU5ErkJggg==\n", 2473 | "text/plain": [ 2474 | "" 2475 | ] 2476 | }, 2477 | "metadata": {}, 2478 | "output_type": "display_data" 2479 | } 2480 | ], 2481 | "source": [ 2482 | 
"s_df.plot(kind = 'bar')" 2483 | ] 2484 | }, 2485 | { 2486 | "cell_type": "markdown", 2487 | "metadata": {}, 2488 | "source": [ 2489 | "Observation:\n", 2490 | "No significant diff" 2491 | ] 2492 | }, 2493 | { 2494 | "cell_type": "markdown", 2495 | "metadata": {}, 2496 | "source": [ 2497 | "## Conclusions:" 2498 | ] 2499 | }, 2500 | { 2501 | "cell_type": "markdown", 2502 | "metadata": {}, 2503 | "source": [ 2504 | "1. Empolyee quit at their working anniversaries and has a extremly high churn rate at the first and second year.
\n", 2505 | "2. Salary is an important factor (need to dig deeper). Employees with low and high salaries are less likely to quit, probably because employees with high\n", 2506 | "salaries are happy there and employees with low salaries are not that marketable, so they have a\n", 2507 | "hard time finding a new job." 2508 | ] 2509 | } 2510 | ], 2511 | "metadata": { 2512 | "kernelspec": { 2513 | "display_name": "Python 3", 2514 | "language": "python", 2515 | "name": "python3" 2516 | }, 2517 | "language_info": { 2518 | "codemirror_mode": { 2519 | "name": "ipython", 2520 | "version": 3 2521 | }, 2522 | "file_extension": ".py", 2523 | "mimetype": "text/x-python", 2524 | "name": "python", 2525 | "nbconvert_exporter": "python", 2526 | "pygments_lexer": "ipython3", 2527 | "version": "3.5.4" 2528 | } 2529 | }, 2530 | "nbformat": 4, 2531 | "nbformat_minor": 2 2532 | } 2533 | -------------------------------------------------------------------------------- /Machine_Learning_Algorithms_Python.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Machine Learning Algorithms" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "3 Types of ML Algorithms
\n", 15 | "1. Supervised Learning - consists of a target variable and predictors. Regression and Classification. Models: Regression, KNN, Decision Tree, Random Forest, Logistic Regression
\n", 16 | "2. Unsupervised Learning - does not have any outcome variable to predict. Clustering, e.g. segmenting customers or pictures. Models: Apriori, K-means
\n", 17 | "3. Reinforcement learning - The machine is trained to make specific decisions. Markov Decision Process" 18 | ] 19 | }, 20 | { 21 | "cell_type": "markdown", 22 | "metadata": {}, 23 | "source": [ 24 | "- Training data: data used to fit the model\n", 25 | "- Test data: data held out to evaluate the fitted model" 26 | ] 27 | }, 28 | { 29 | "cell_type": "markdown", 30 | "metadata": {}, 31 | "source": [ 32 | "Split data into training and test data sets" 33 | ] 34 | }, 35 | { 36 | "cell_type": "code", 37 | "execution_count": 3, 38 | "metadata": {}, 39 | "outputs": [], 40 | "source": [ 41 | "from sklearn.model_selection import train_test_split\n", 42 | "# X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)" 43 | ] 44 | }, 45 | { 46 | "cell_type": "markdown", 47 | "metadata": {}, 48 | "source": [ 49 | "## Linear Regression" 50 | ] 51 | }, 52 | { 53 | "cell_type": "markdown", 54 | "metadata": {}, 55 | "source": [ 56 | "Minimize the sum of squared differences between the observed values and the estimated values" 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": 1, 62 | "metadata": {}, 63 | "outputs": [], 64 | "source": [ 65 | "from sklearn import linear_model" 66 | ] 67 | }, 68 | { 69 | "cell_type": "markdown", 70 | "metadata": {}, 71 | "source": [ 72 | " Identify the feature and response variable(s); values must be numeric and stored as NumPy arrays
\n", 73 | " Load train and test data sets" 74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": 3, 79 | "metadata": {}, 80 | "outputs": [], 81 | "source": [ 82 | "# instantiate a model--make an instance\n", 83 | "lr = linear_model.LinearRegression()" 84 | ] 85 | }, 86 | { 87 | "cell_type": "code", 88 | "execution_count": 5, 89 | "metadata": {}, 90 | "outputs": [], 91 | "source": [ 92 | "# Train the model with the training set and make predictions\n", 93 | "\n", 94 | "# lr.fit(X_train, y_train)\n", 95 | "# y_pred = lr.predict(X_test)" 96 | ] 97 | }, 98 | { 99 | "cell_type": "code", 100 | "execution_count": 6, 101 | "metadata": {}, 102 | "outputs": [], 103 | "source": [ 104 | "# Equation coefficients and intercept\n", 105 | "\n", 106 | "# print('Coefficient: \\n', lr.coef_)\n", 107 | "# print('Intercept: \\n', lr.intercept_)" 108 | ] 109 | }, 110 | { 111 | "cell_type": "markdown", 112 | "metadata": {}, 113 | "source": [ 114 | "## Logistic Regression" 115 | ] 116 | }, 117 | { 118 | "cell_type": "markdown", 119 | "metadata": {}, 120 | "source": [ 121 | "Classification: used to estimate discrete values, typically binary values like 0/1, yes/no, true/false" 122 | ] 123 | }, 124 | { 125 | "cell_type": "markdown", 126 | "metadata": {}, 127 | "source": [ 128 | "Fits the data to a logistic (sigmoid) function and predicts the probability of each class" 129 | ] 130 | }, 131 | { 132 | "cell_type": "code", 133 | "execution_count": 9, 134 | "metadata": {}, 135 | "outputs": [], 136 | "source": [ 137 | "from sklearn.linear_model import LogisticRegression" 138 | ] 139 | }, 140 | { 141 | "cell_type": "code", 142 | "execution_count": 10, 143 | "metadata": {}, 144 | "outputs": [], 145 | "source": [ 146 | "# instantiate a model--make an instance\n", 147 | "logreg = LogisticRegression()" 148 | ] 149 | }, 150 | { 151 | "cell_type": "code", 152 | "execution_count": 11, 153 | "metadata": {}, 154 | "outputs": [], 155 | "source": [ 156 | "# logreg.fit(X_train, y_train)\n", 157 | "# y_pred = logreg.predict(X_test)" 
158 | ] 159 | }, 160 | { 161 | "cell_type": "code", 162 | "execution_count": 12, 163 | "metadata": {}, 164 | "outputs": [], 165 | "source": [ 166 | "# Equation coefficients and intercept\n", 167 | "# print('Coefficient: \\n', logreg.coef_)\n", 168 | "# print('Intercept: \\n', logreg.intercept_)" 169 | ] 170 | }, 171 | { 172 | "cell_type": "markdown", 173 | "metadata": {}, 174 | "source": [ 175 | "## Decision Tree" 176 | ] 177 | }, 178 | { 179 | "cell_type": "markdown", 180 | "metadata": {}, 181 | "source": [ 182 | "Classification" 183 | ] 184 | }, 185 | { 186 | "cell_type": "markdown", 187 | "metadata": {}, 188 | "source": [ 189 | "Split the population into two or more homogeneous sets. The splits are based on the most significant attributes/independent variables, so that the resulting groups are as distinct as possible.
 \n", 190 | "Tree depth: how many splits the tree makes before reaching a prediction" 191 | ] 192 | }, 193 | { 194 | "cell_type": "code", 195 | "execution_count": 17, 196 | "metadata": {}, 197 | "outputs": [], 198 | "source": [ 199 | "from sklearn import tree" 200 | ] 201 | }, 202 | { 203 | "cell_type": "code", 204 | "execution_count": 23, 205 | "metadata": {}, 206 | "outputs": [], 207 | "source": [ 208 | "# For Classification\n", 209 | "# the default criterion is gini; entropy is another option\n", 210 | "dt = tree.DecisionTreeClassifier(criterion = 'gini')" 211 | ] 212 | }, 213 | { 214 | "cell_type": "code", 215 | "execution_count": 24, 216 | "metadata": {}, 217 | "outputs": [], 218 | "source": [ 219 | "# For Regression\n", 220 | "dt = tree.DecisionTreeRegressor()" 221 | ] 222 | }, 223 | { 224 | "cell_type": "markdown", 225 | "metadata": {}, 226 | "source": [ 227 | "## KNN - K-nearest Neighbors" 228 | ] 229 | }, 230 | { 231 | "cell_type": "markdown", 232 | "metadata": {}, 233 | "source": [ 234 | "Both Classification (more widely used) and Regression" 235 | ] 236 | }, 237 | { 238 | "cell_type": "markdown", 239 | "metadata": {}, 240 | "source": [ 241 | "- KNN is computationally expensive
 \n", 242 | "- Variables should be normalized; otherwise variables with larger ranges dominate the distance calculation
 \n", 243 | "- Remove outliers and noise before applying KNN" 244 | ] 245 | }, 246 | { 247 | "cell_type": "code", 248 | "execution_count": 25, 249 | "metadata": {}, 250 | "outputs": [], 251 | "source": [ 252 | "from sklearn.neighbors import KNeighborsClassifier" 253 | ] 254 | }, 255 | { 256 | "cell_type": "code", 257 | "execution_count": 26, 258 | "metadata": {}, 259 | "outputs": [], 260 | "source": [ 261 | "# default value of n_neighbors is 5\n", 262 | "knn = KNeighborsClassifier(n_neighbors=6)" 263 | ] 264 | }, 265 | { 266 | "cell_type": "markdown", 267 | "metadata": {}, 268 | "source": [ 269 | "## Random Forest" 270 | ] 271 | }, 272 | { 273 | "cell_type": "markdown", 274 | "metadata": {}, 275 | "source": [ 276 | "A Random Forest is a collection of decision trees. " 277 | ] 278 | }, 279 | { 280 | "cell_type": "code", 281 | "execution_count": 46, 282 | "metadata": {}, 283 | "outputs": [], 284 | "source": [ 285 | "from sklearn.ensemble import RandomForestClassifier" 286 | ] 287 | }, 288 | { 289 | "cell_type": "code", 290 | "execution_count": 47, 291 | "metadata": {}, 292 | "outputs": [], 293 | "source": [ 294 | "rf = RandomForestClassifier()" 295 | ] 296 | }, 297 | { 298 | "cell_type": "code", 299 | "execution_count": null, 300 | "metadata": {}, 301 | "outputs": [], 302 | "source": [ 303 | "from sklearn.ensemble import RandomForestRegressor\n", 304 | "rf = RandomForestRegressor()" 305 | ] 306 | }, 307 | { 308 | "cell_type": "code", 309 | "execution_count": 51, 310 | "metadata": {}, 311 | "outputs": [], 312 | "source": [ 313 | "if 0:\n", 314 | " print ('Lin')" 315 | ] 316 | }, 317 | { 318 | "cell_type": "markdown", 319 | "metadata": {}, 320 | "source": [ 321 | "## K-Means" 322 | ] 323 | }, 324 | { 325 | "cell_type": "markdown", 326 | "metadata": {}, 327 | "source": [ 328 | "Unsupervised learning - Clustering" 329 | ] 330 | }, 331 | { 332 | "cell_type": "markdown", 333 | "metadata": {}, 334 | "source": [ 335 | "How K-means forms clusters:
 \n", 336 | "1. K-means picks k initial points, known as centroids, one for each cluster.
 \n", 337 | "2. Each data point is assigned to its closest centroid, which forms k clusters.
 \n", 338 | "3. The centroid of each cluster is recomputed from its current members, which gives k new centroids.
 \n", 339 | "4. With the new centroids, repeat steps 2 and 3: reassign each data point to its closest new centroid and recompute the centroids. Repeat until convergence, i.e. until the centroids no longer change." 340 | ] 341 | }, 342 | { 343 | "cell_type": "markdown", 344 | "metadata": {}, 345 | "source": [ 346 | "How to determine the value of K:
 \n", 347 | "Compute the sum of squared distances between each data point and its centroid. As the number of clusters increases, this value keeps decreasing. If we plot it against k, it decreases sharply up to some value of k and much more slowly afterwards. That point is a reasonable choice for the number of clusters." 348 | ] 349 | }, 350 | { 351 | "cell_type": "code", 352 | "execution_count": 27, 353 | "metadata": {}, 354 | "outputs": [], 355 | "source": [ 356 | "from sklearn.cluster import KMeans" 357 | ] 358 | }, 359 | { 360 | "cell_type": "code", 361 | "execution_count": 28, 362 | "metadata": {}, 363 | "outputs": [], 364 | "source": [ 365 | "kmeans = KMeans(n_clusters = 3, random_state = 0)" 366 | ] 367 | }, 368 | { 369 | "cell_type": "code", 370 | "execution_count": 4, 371 | "metadata": {}, 372 | "outputs": [], 373 | "source": [ 374 | "# kmeans.fit(X_train) # unsupervised: no labels are passed\n", 375 | "# kmeans.predict(X_test)" 376 | ] 377 | }, 378 | { 379 | "cell_type": "code", 380 | "execution_count": null, 381 | "metadata": {}, 382 | "outputs": [], 383 | "source": [] 384 | }, 385 | { 386 | "cell_type": "markdown", 387 | "metadata": {}, 388 | "source": [ 389 | "## Model Validation" 390 | ] 391 | }, 392 | { 393 | "cell_type": "markdown", 394 | "metadata": {}, 395 | "source": [ 396 | "- MAE (Mean Absolute Error): the average absolute difference between predicted and actual values" 397 | ] 398 | }, 399 | { 400 | "cell_type": "code", 401 | "execution_count": 1, 402 | "metadata": {}, 403 | "outputs": [], 404 | "source": [ 405 | "from sklearn.metrics import mean_absolute_error\n", 406 | "# mean_absolute_error(y,predicted_y)" 407 | ] 408 | }, 409 | { 410 | "cell_type": "markdown", 411 | "metadata": {}, 412 | "source": [ 413 | "## Cross Validation" 414 | ] 415 | }, 416 | { 417 | "cell_type": "markdown", 418 | "metadata": {}, 419 | "source": [ 420 | "Optimizing only for the accuracy score can lead to over-fitting. Cross validation helps check that the model generalizes to unseen data." 
421 | ] 422 | }, 423 | { 424 | "cell_type": "markdown", 425 | "metadata": {}, 426 | "source": [ 427 | "#### Method: k-fold cross validation" 428 | ] 429 | }, 430 | { 431 | "cell_type": "markdown", 432 | "metadata": {}, 433 | "source": [ 434 | "Steps:
\n", 435 | "1. Randomly split data into k folds.
 \n", 436 | "2. For each fold, train the model on the other k-1 folds of the data set, and test it on the held-out fold.
\n", 437 | "3. Record error/accuracy.
 \n", 438 | "4. Repeat until each of the k folds has served as the test set.
 \n", 439 | "5. The average of your k recorded errors is called the cross-validation error and will serve as your performance metric for the model." 440 | ] 441 | }, 442 | { 443 | "cell_type": "markdown", 444 | "metadata": {}, 445 | "source": [ 446 | "How to choose the value of k: k = 10 is a common choice
 \n", 447 | "A lower value of k gives a more biased estimate.
 \n", 448 | "A larger value of k is less biased, but the estimate can suffer from high variance." 449 | ] 450 | }, 451 | { 452 | "cell_type": "code", 453 | "execution_count": 2, 454 | "metadata": {}, 455 | "outputs": [], 456 | "source": [ 457 | "from sklearn.model_selection import KFold\n", 458 | "from sklearn.model_selection import cross_val_score" 459 | ] 460 | }, 461 | { 462 | "cell_type": "code", 463 | "execution_count": 5, 464 | "metadata": {}, 465 | "outputs": [], 466 | "source": [ 467 | "kf = KFold(n_splits = 10, shuffle=True, random_state=0) # random_state only takes effect when shuffle=True\n", 468 | "modelCV = RandomForestClassifier()\n", 469 | "scoring = \"accuracy\"\n", 470 | "# results = cross_val_score(modelCV, X_train, y_train, cv=kf,scoring = scoring)\n", 471 | "# print ('10-fold cross validation average accuracy: {}'.format(results.mean()))" 472 | ] 473 | }, 474 | { 475 | "cell_type": "code", 476 | "execution_count": null, 477 | "metadata": {}, 478 | "outputs": [], 479 | "source": [] 480 | }, 481 | { 482 | "cell_type": "code", 483 | "execution_count": null, 484 | "metadata": {}, 485 | "outputs": [], 486 | "source": [] 487 | }, 488 | { 489 | "cell_type": "code", 490 | "execution_count": null, 491 | "metadata": {}, 492 | "outputs": [], 493 | "source": [] 494 | }, 495 | { 496 | "cell_type": "code", 497 | "execution_count": null, 498 | "metadata": {}, 499 | "outputs": [], 500 | "source": [] 501 | }, 502 | { 503 | "cell_type": "markdown", 504 | "metadata": {}, 505 | "source": [ 506 | "## Underfitting, Overfitting and Model Optimization" 507 | ] 508 | }, 509 | { 510 | "cell_type": "markdown", 511 | "metadata": {}, 512 | "source": [ 513 | "Now that we have a way to measure model accuracy, we can experiment with alternative models and see which gives the best predictions (for example, different options of the same model)." 514 | ] 515 | }, 516 | { 517 | "cell_type": "markdown", 518 | "metadata": {}, 519 | "source": [ 520 | "Overfitting: the model matches the training data almost perfectly, but does poorly on validation and other new data.
 \n", 521 | "Underfitting: the model fails to capture important distinctions and patterns in the data, so it does poorly even on the training data.
 \n", 522 | "e.g. for the decision tree model, the larger max_leaf_nodes is, the further the model moves from underfitting toward overfitting." 523 | ] 524 | } 525 | ], 526 | "metadata": { 527 | "kernelspec": { 528 | "display_name": "Python 3", 529 | "language": "python", 530 | "name": "python3" 531 | }, 532 | "language_info": { 533 | "codemirror_mode": { 534 | "name": "ipython", 535 | "version": 3 536 | }, 537 | "file_extension": ".py", 538 | "mimetype": "text/x-python", 539 | "name": "python", 540 | "nbconvert_exporter": "python", 541 | "pygments_lexer": "ipython3", 542 | "version": "3.6.4" 543 | } 544 | }, 545 | "nbformat": 4, 546 | "nbformat_minor": 2 547 | } 548 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Data-Analysis-Machine-Learning-with-Python 2 | Data Analysis and Machine Learning with Python to Solve Business Problems 3 | -------------------------------------------------------------------------------- /raw-data/readme: -------------------------------------------------------------------------------- 1 | 2 | --------------------------------------------------------------------------------