├── AB_Testing.ipynb
├── Conversion Rate.ipynb
├── Employee_Retention_PeopleAnalytics.ipynb
├── Identify_Fraudulent_Activities.ipynb
├── Machine_Learning_Algorithms_Python.ipynb
├── README.md
└── raw-data
    └── readme
/AB_Testing.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# A/B Test"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "A/B testing is a controlled experiment with two variants, A and B: a control group and an experiment group. It is a hypothesis test that checks whether there is a statistically or practically significant difference between the control and experiment groups.\n",
15 | "A/B testing plays a vital role in website optimization."
16 | ]
17 | },
18 | {
19 | "cell_type": "markdown",
20 | "metadata": {},
21 | "source": [
22 | "## Goal:\n",
23 | "### 1. Analyze results from an A/B Test\n",
24 | "### 2. Design an algorithm to automate some steps"
25 | ]
26 | },
27 | {
28 | "cell_type": "markdown",
29 | "metadata": {},
30 | "source": [
31 | "#### Problem description:\n",
32 | "Company XYZ is a worldwide e-commerce company, and its Spain-based users have a much higher conversion rate than users in any other Spanish-speaking country. The websites for all Spanish-speaking countries were translated by a Spaniard.\n",
33 | "The company hypothesizes that a website translated by a local will have a higher conversion rate in that country. Therefore, they designed an A/B test to check this hypothesis."
34 | ]
35 | },
36 | {
37 | "cell_type": "code",
38 | "execution_count": 348,
39 | "metadata": {},
40 | "outputs": [],
41 | "source": [
42 | "import pandas as pd\n",
43 | "import numpy as np\n",
44 | "from scipy import stats\n",
45 | "import matplotlib.pyplot as plt\n",
46 | "%matplotlib inline\n",
47 | "import seaborn as sns"
48 | ]
49 | },
50 | {
51 | "cell_type": "code",
52 | "execution_count": 288,
53 | "metadata": {},
54 | "outputs": [],
55 | "source": [
56 | "# load two tables into pandas data frame\n",
57 | "test = pd.read_csv(r'C:\\Users\\lshen\\Downloads\\Translation_Test\\test_table.csv')\n",
58 | "user = pd.read_csv(r'C:\\Users\\lshen\\Downloads\\Translation_Test\\user_table.csv')"
59 | ]
60 | },
61 | {
62 | "cell_type": "markdown",
63 | "metadata": {},
64 | "source": [
65 | "### Step 1: Data Exploration"
66 | ]
67 | },
68 | {
69 | "cell_type": "code",
70 | "execution_count": 289,
71 | "metadata": {},
72 | "outputs": [
73 | {
74 | "data": {
170 | "text/plain": [
171 | " user_id date source device browser_language ads_channel \\\n",
172 | "0 315281 2015-12-03 Direct Web ES NaN \n",
173 | "1 497851 2015-12-04 Ads Web ES Google \n",
174 | "2 848402 2015-12-04 Ads Web ES Facebook \n",
175 | "3 290051 2015-12-03 Ads Mobile Other Facebook \n",
176 | "4 548435 2015-11-30 Ads Web ES Google \n",
177 | "\n",
178 | " browser conversion test \n",
179 | "0 IE 1 0 \n",
180 | "1 IE 0 1 \n",
181 | "2 Chrome 0 0 \n",
182 | "3 Android_App 0 1 \n",
183 | "4 FireFox 0 1 "
184 | ]
185 | },
186 | "execution_count": 289,
187 | "metadata": {},
188 | "output_type": "execute_result"
189 | }
190 | ],
191 | "source": [
192 | "test.head()"
193 | ]
194 | },
195 | {
196 | "cell_type": "code",
197 | "execution_count": 290,
198 | "metadata": {},
199 | "outputs": [
200 | {
201 | "data": {
267 | "text/plain": [
268 | " user_id sex age country\n",
269 | "0 765821 M 20 Mexico\n",
270 | "1 343561 F 27 Nicaragua\n",
271 | "2 118744 M 23 Colombia\n",
272 | "3 987753 F 27 Venezuela\n",
273 | "4 554597 F 20 Spain"
274 | ]
275 | },
276 | "execution_count": 290,
277 | "metadata": {},
278 | "output_type": "execute_result"
279 | }
280 | ],
281 | "source": [
282 | "user.head()"
283 | ]
284 | },
285 | {
286 | "cell_type": "code",
287 | "execution_count": 291,
288 | "metadata": {},
289 | "outputs": [
290 | {
291 | "name": "stdout",
292 | "output_type": "stream",
293 | "text": [
294 | "Total number of user_id: 453321\n",
295 | "Number of unique user_id: 453321\n"
296 | ]
297 | }
298 | ],
299 | "source": [
300 | "# check whether user_id is unique in the test table: yes, each user_id has exactly one record\n",
301 | "print('Total number of user_id: {}'.format(test.user_id.size))\n",
302 | "print('Number of unique user_id: {}'.format(test.user_id.nunique()))"
303 | ]
304 | },
305 | {
306 | "cell_type": "code",
307 | "execution_count": 292,
308 | "metadata": {},
309 | "outputs": [
310 | {
311 | "name": "stdout",
312 | "output_type": "stream",
313 | "text": [
314 | "Total records in test table: 453321\n",
315 | "Total records in user table: 452867\n"
316 | ]
317 | }
318 | ],
319 | "source": [
320 | "print ('Total records in test table: {}'.format(len(test)))\n",
321 | "print ('Total records in user table: {}'.format(len(user)))"
322 | ]
323 | },
324 | {
325 | "cell_type": "markdown",
326 | "metadata": {},
327 | "source": [
328 | "The counts above show that some user_id values in the test table have no matching record in the user table. Because the analysis is broken down by country, which is a very important variable, we drop the records that lack demographic information."
329 | ]
330 | },
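{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick check before the merge, the sketch below counts the test users that have no demographic record. It assumes the `test` and `user` frames loaded above; `missing_mask` is a name introduced here for illustration only."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# True for test users whose user_id has no matching row in the user table\n",
"missing_mask = ~test['user_id'].isin(user['user_id'])\n",
"print('Test users without demographic info: {}'.format(missing_mask.sum()))"
]
},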
331 | {
332 | "cell_type": "code",
333 | "execution_count": 293,
334 | "metadata": {},
335 | "outputs": [],
336 | "source": [
337 | "# merge two tables based on user_id, which will return the records with demographic info.\n",
338 | "data = test.merge(user,how = 'inner', on='user_id')"
339 | ]
340 | },
341 | {
342 | "cell_type": "code",
343 | "execution_count": 294,
344 | "metadata": {},
345 | "outputs": [
346 | {
347 | "data": {
461 | "text/plain": [
462 | " user_id date source device browser_language ads_channel \\\n",
463 | "0 315281 2015-12-03 Direct Web ES NaN \n",
464 | "1 497851 2015-12-04 Ads Web ES Google \n",
465 | "2 848402 2015-12-04 Ads Web ES Facebook \n",
466 | "3 290051 2015-12-03 Ads Mobile Other Facebook \n",
467 | "4 548435 2015-11-30 Ads Web ES Google \n",
468 | "\n",
469 | " browser conversion test sex age country \n",
470 | "0 IE 1 0 M 32 Spain \n",
471 | "1 IE 0 1 M 21 Mexico \n",
472 | "2 Chrome 0 0 M 34 Spain \n",
473 | "3 Android_App 0 1 F 22 Mexico \n",
474 | "4 FireFox 0 1 M 19 Mexico "
475 | ]
476 | },
477 | "execution_count": 294,
478 | "metadata": {},
479 | "output_type": "execute_result"
480 | }
481 | ],
482 | "source": [
483 | "data.head()"
484 | ]
485 | },
486 | {
487 | "cell_type": "code",
488 | "execution_count": 295,
489 | "metadata": {},
490 | "outputs": [
491 | {
492 | "data": {
493 | "text/plain": [
494 | "(452867, 12)"
495 | ]
496 | },
497 | "execution_count": 295,
498 | "metadata": {},
499 | "output_type": "execute_result"
500 | }
501 | ],
502 | "source": [
503 | "data.shape"
504 | ]
505 | },
506 | {
507 | "cell_type": "code",
508 | "execution_count": 296,
509 | "metadata": {},
510 | "outputs": [
511 | {
512 | "data": {
513 | "text/plain": [
514 | "user_id int64\n",
515 | "date object\n",
516 | "source object\n",
517 | "device object\n",
518 | "browser_language object\n",
519 | "ads_channel object\n",
520 | "browser object\n",
521 | "conversion int64\n",
522 | "test int64\n",
523 | "sex object\n",
524 | "age int64\n",
525 | "country object\n",
526 | "dtype: object"
527 | ]
528 | },
529 | "execution_count": 296,
530 | "metadata": {},
531 | "output_type": "execute_result"
532 | }
533 | ],
534 | "source": [
535 | "# check columns' data types\n",
536 | "data.dtypes"
537 | ]
538 | },
539 | {
540 | "cell_type": "code",
541 | "execution_count": 297,
542 | "metadata": {},
543 | "outputs": [
544 | {
545 | "data": {
749 | "text/plain": [
750 | " user_id date source device browser_language \\\n",
751 | "count 452867.000000 452867 452867 452867 452867 \n",
752 | "unique NaN 5 3 2 3 \n",
753 | "top NaN 2015-12-04 Ads Web ES \n",
754 | "freq NaN 141024 181693 251316 377160 \n",
755 | "mean 499944.805166 NaN NaN NaN NaN \n",
756 | "std 288676.264784 NaN NaN NaN NaN \n",
757 | "min 1.000000 NaN NaN NaN NaN \n",
758 | "25% 249819.000000 NaN NaN NaN NaN \n",
759 | "50% 500019.000000 NaN NaN NaN NaN \n",
760 | "75% 749543.000000 NaN NaN NaN NaN \n",
761 | "max 1000000.000000 NaN NaN NaN NaN \n",
762 | "\n",
763 | " ads_channel browser conversion test sex \\\n",
764 | "count 181693 452867 452867.000000 452867.000000 452867 \n",
765 | "unique 5 7 NaN NaN 2 \n",
766 | "top Facebook Android_App NaN NaN M \n",
767 | "freq 68358 154977 NaN NaN 264485 \n",
768 | "mean NaN NaN 0.049560 0.476462 NaN \n",
769 | "std NaN NaN 0.217034 0.499446 NaN \n",
770 | "min NaN NaN 0.000000 0.000000 NaN \n",
771 | "25% NaN NaN 0.000000 0.000000 NaN \n",
772 | "50% NaN NaN 0.000000 0.000000 NaN \n",
773 | "75% NaN NaN 0.000000 1.000000 NaN \n",
774 | "max NaN NaN 1.000000 1.000000 NaN \n",
775 | "\n",
776 | " age country \n",
777 | "count 452867.000000 452867 \n",
778 | "unique NaN 17 \n",
779 | "top NaN Mexico \n",
780 | "freq NaN 128484 \n",
781 | "mean 27.130740 NaN \n",
782 | "std 6.776678 NaN \n",
783 | "min 18.000000 NaN \n",
784 | "25% 22.000000 NaN \n",
785 | "50% 26.000000 NaN \n",
786 | "75% 31.000000 NaN \n",
787 | "max 70.000000 NaN "
788 | ]
789 | },
790 | "execution_count": 297,
791 | "metadata": {},
792 | "output_type": "execute_result"
793 | }
794 | ],
795 | "source": [
796 | "data.describe(include = 'all')"
797 | ]
798 | },
799 | {
800 | "cell_type": "code",
801 | "execution_count": 298,
802 | "metadata": {},
803 | "outputs": [
804 | {
805 | "data": {
806 | "text/plain": [
807 | "user_id 0\n",
808 | "date 0\n",
809 | "source 0\n",
810 | "device 0\n",
811 | "browser_language 0\n",
812 | "ads_channel 271174\n",
813 | "browser 0\n",
814 | "conversion 0\n",
815 | "test 0\n",
816 | "sex 0\n",
817 | "age 0\n",
818 | "country 0\n",
819 | "dtype: int64"
820 | ]
821 | },
822 | "execution_count": 298,
823 | "metadata": {},
824 | "output_type": "execute_result"
825 | }
826 | ],
827 | "source": [
828 | "# check for null values\n",
829 | "# about 60% of ads_channel values are missing; the count matches the SEO + Direct rows, i.e. users who did not arrive via ads\n",
830 | "data.isnull().sum()"
831 | ]
832 | },
833 | {
834 | "cell_type": "code",
835 | "execution_count": 299,
836 | "metadata": {},
837 | "outputs": [
838 | {
839 | "data": {
840 | "text/plain": [
841 | "2015-12-04 141024\n",
842 | "2015-12-03 99399\n",
843 | "2015-11-30 70948\n",
844 | "2015-12-01 70915\n",
845 | "2015-12-02 70581\n",
846 | "Name: date, dtype: int64"
847 | ]
848 | },
849 | "execution_count": 299,
850 | "metadata": {},
851 | "output_type": "execute_result"
852 | }
853 | ],
854 | "source": [
855 | "data.date.value_counts()"
856 | ]
857 | },
858 | {
859 | "cell_type": "code",
860 | "execution_count": 300,
861 | "metadata": {},
862 | "outputs": [
863 | {
864 | "data": {
865 | "text/plain": [
866 | "Ads 181693\n",
867 | "SEO 180436\n",
868 | "Direct 90738\n",
869 | "Name: source, dtype: int64"
870 | ]
871 | },
872 | "execution_count": 300,
873 | "metadata": {},
874 | "output_type": "execute_result"
875 | }
876 | ],
877 | "source": [
878 | "data.source.value_counts()"
879 | ]
880 | },
881 | {
882 | "cell_type": "code",
883 | "execution_count": 301,
884 | "metadata": {},
885 | "outputs": [
886 | {
887 | "data": {
888 | "text/plain": [
889 | "Web 251316\n",
890 | "Mobile 201551\n",
891 | "Name: device, dtype: int64"
892 | ]
893 | },
894 | "execution_count": 301,
895 | "metadata": {},
896 | "output_type": "execute_result"
897 | }
898 | ],
899 | "source": [
900 | "data.device.value_counts()"
901 | ]
902 | },
903 | {
904 | "cell_type": "code",
905 | "execution_count": 302,
906 | "metadata": {},
907 | "outputs": [
908 | {
909 | "data": {
910 | "text/plain": [
911 | "ES 377160\n",
912 | "EN 63079\n",
913 | "Other 12628\n",
914 | "Name: browser_language, dtype: int64"
915 | ]
916 | },
917 | "execution_count": 302,
918 | "metadata": {},
919 | "output_type": "execute_result"
920 | }
921 | ],
922 | "source": [
923 | "data.browser_language.value_counts()"
924 | ]
925 | },
926 | {
927 | "cell_type": "code",
928 | "execution_count": 303,
929 | "metadata": {},
930 | "outputs": [
931 | {
932 | "data": {
933 | "text/plain": [
934 | "Facebook 68358\n",
935 | "Google 68113\n",
936 | "Yahoo 27409\n",
937 | "Bing 13670\n",
938 | "Other 4143\n",
939 | "Name: ads_channel, dtype: int64"
940 | ]
941 | },
942 | "execution_count": 303,
943 | "metadata": {},
944 | "output_type": "execute_result"
945 | }
946 | ],
947 | "source": [
948 | "data.ads_channel.value_counts()"
949 | ]
950 | },
951 | {
952 | "cell_type": "code",
953 | "execution_count": 304,
954 | "metadata": {},
955 | "outputs": [
956 | {
957 | "data": {
958 | "text/plain": [
959 | "Android_App 154977\n",
960 | "Chrome 101822\n",
961 | "IE 61656\n",
962 | "Iphone_App 46574\n",
963 | "Safari 41033\n",
964 | "FireFox 40721\n",
965 | "Opera 6084\n",
966 | "Name: browser, dtype: int64"
967 | ]
968 | },
969 | "execution_count": 304,
970 | "metadata": {},
971 | "output_type": "execute_result"
972 | }
973 | ],
974 | "source": [
975 | "data.browser.value_counts()"
976 | ]
977 | },
978 | {
979 | "cell_type": "code",
980 | "execution_count": 305,
981 | "metadata": {},
982 | "outputs": [
983 | {
984 | "data": {
985 | "text/plain": [
986 | "0 430423\n",
987 | "1 22444\n",
988 | "Name: conversion, dtype: int64"
989 | ]
990 | },
991 | "execution_count": 305,
992 | "metadata": {},
993 | "output_type": "execute_result"
994 | }
995 | ],
996 | "source": [
997 | "data.conversion.value_counts()"
998 | ]
999 | },
1000 | {
1001 | "cell_type": "code",
1002 | "execution_count": 306,
1003 | "metadata": {},
1004 | "outputs": [
1005 | {
1006 | "data": {
1007 | "text/plain": [
1008 | "0 237093\n",
1009 | "1 215774\n",
1010 | "Name: test, dtype: int64"
1011 | ]
1012 | },
1013 | "execution_count": 306,
1014 | "metadata": {},
1015 | "output_type": "execute_result"
1016 | }
1017 | ],
1018 | "source": [
1019 | "data.test.value_counts()"
1020 | ]
1021 | },
1022 | {
1023 | "cell_type": "code",
1024 | "execution_count": 307,
1025 | "metadata": {},
1026 | "outputs": [
1027 | {
1028 | "data": {
1029 | "text/plain": [
1030 | "Mexico 128484\n",
1031 | "Colombia 54060\n",
1032 | "Spain 51782\n",
1033 | "Argentina 46733\n",
1034 | "Peru 33666\n",
1035 | "Venezuela 32054\n",
1036 | "Chile 19737\n",
1037 | "Ecuador 15895\n",
1038 | "Guatemala 15125\n",
1039 | "Bolivia 11124\n",
1040 | "Honduras 8568\n",
1041 | "El Salvador 8175\n",
1042 | "Paraguay 7347\n",
1043 | "Nicaragua 6723\n",
1044 | "Costa Rica 5309\n",
1045 | "Uruguay 4134\n",
1046 | "Panama 3951\n",
1047 | "Name: country, dtype: int64"
1048 | ]
1049 | },
1050 | "execution_count": 307,
1051 | "metadata": {},
1052 | "output_type": "execute_result"
1053 | }
1054 | ],
1055 | "source": [
1056 | "data.country.value_counts()"
1057 | ]
1058 | },
1059 | {
1060 | "cell_type": "markdown",
1061 | "metadata": {},
1062 | "source": [
1063 | "Let's first confirm, using the control group (test == 0), that Spain converts better than the other countries."
1064 | ]
1065 | },
1066 | {
1067 | "cell_type": "code",
1068 | "execution_count": 358,
1069 | "metadata": {},
1070 | "outputs": [
1071 | {
1072 | "data": {
1073 | "text/plain": [
1074 | "country\n",
1075 | "Spain 0.079719\n",
1076 | "El Salvador 0.053554\n",
1077 | "Nicaragua 0.052647\n",
1078 | "Costa Rica 0.052256\n",
1079 | "Colombia 0.052089\n",
1080 | "Honduras 0.050906\n",
1081 | "Guatemala 0.050643\n",
1082 | "Venezuela 0.050344\n",
1083 | "Peru 0.049914\n",
1084 | "Mexico 0.049495\n",
1085 | "Bolivia 0.049369\n",
1086 | "Ecuador 0.049154\n",
1087 | "Paraguay 0.048493\n",
1088 | "Chile 0.048107\n",
1089 | "Panama 0.046796\n",
1090 | "Argentina 0.015071\n",
1091 | "Uruguay 0.012048\n",
1092 | "Name: conversion, dtype: float64"
1093 | ]
1094 | },
1095 | "execution_count": 358,
1096 | "metadata": {},
1097 | "output_type": "execute_result"
1098 | }
1099 | ],
1100 | "source": [
1101 | "# Yes, Spain has the highest conversion rate.\n",
1102 | "data[data['test']==0].groupby('country').conversion.mean().sort_values(ascending = False)"
1103 | ]
1104 | },
1105 | {
1106 | "cell_type": "code",
1107 | "execution_count": 362,
1108 | "metadata": {},
1109 | "outputs": [],
1110 | "source": [
1111 | "# group by country, and do NOT set country as index\n",
1112 | "data_country = data[data['test']==0].groupby('country', as_index = False).conversion.mean()"
1113 | ]
1114 | },
1115 | {
1116 | "cell_type": "code",
1117 | "execution_count": 368,
1118 | "metadata": {},
1119 | "outputs": [
1120 | {
1121 | "data": {
1122 | "text/plain": [
1123 | ""
1124 | ]
1125 | },
1126 | "execution_count": 368,
1127 | "metadata": {},
1128 | "output_type": "execute_result"
1129 | },
1130 | {
1131 | "data": {
1132 | "image/png": "\n",
1133 | "text/plain": [
1134 | ""
1135 | ]
1136 | },
1137 | "metadata": {},
1138 | "output_type": "display_data"
1139 | }
1140 | ],
1141 | "source": [
1142 | "g = sns.catplot(x = 'country', y = 'conversion',\n",
1143 | " data = data_country,\n",
1144 | " kind = 'bar', height = 3, aspect = 5)\n",
1145 | "g.set_xticklabels(rotation=30)"
1146 | ]
1147 | },
1148 | {
1149 | "cell_type": "markdown",
1150 | "metadata": {},
1151 | "source": [
1152 | "### Step 2: Calculate the t statistic"
1153 | ]
1154 | },
1155 | {
1156 | "cell_type": "markdown",
1157 | "metadata": {},
1158 | "source": [
1159 | "#### Hypothesis:\n",
1160 | "Null hypothesis: the population mean (conversion rate) under local translation is the same as under the Spaniard translation: mu1 = mu2\n",
1161 | "Alternative hypothesis: mu1 != mu2\n",
1162 | "We use a significance level of alpha = 0.05 and a two-tailed test."
1163 | ]
1164 | },
1165 | {
1166 | "cell_type": "markdown",
1167 | "metadata": {},
1168 | "source": [
1169 | "Break the data into two groups, control and experiment, excluding the Spain data."
1170 | ]
1171 | },
1172 | {
1173 | "cell_type": "code",
1174 | "execution_count": 309,
1175 | "metadata": {},
1176 | "outputs": [],
1177 | "source": [
1178 | "controll = data[(data['test']==0)&(data['country']!='Spain')]\n",
1179 | "exp = data[(data['test']==1)&(data['country']!='Spain')]"
1180 | ]
1181 | },
1182 | {
1183 | "cell_type": "markdown",
1184 | "metadata": {},
1185 | "source": [
1186 | "#### I will use the example below to explain Simpson's paradox"
1187 | ]
1188 | },
1189 | {
1190 | "cell_type": "markdown",
1191 | "metadata": {},
1192 | "source": [
1193 | "Compare the control and test groups across all countries as a whole"
1194 | ]
1195 | },
1196 | {
1197 | "cell_type": "code",
1198 | "execution_count": 314,
1199 | "metadata": {},
1200 | "outputs": [
1201 | {
1202 | "name": "stdout",
1203 | "output_type": "stream",
1204 | "text": [
1205 | "The avg conversion rate of control group: 0.04829179055749524\n",
1206 | "The avg conversion rate of exp group: 0.043411161678422794\n"
1207 | ]
1208 | }
1209 | ],
1210 | "source": [
1211 | "# calculate the mean conversion rate for both groups\n",
1212 | "print('The avg conversion rate of control group: {}'.format(controll.conversion.mean()))\n",
1213 | "print('The avg conversion rate of exp group: {}'.format(exp.conversion.mean()))"
1214 | ]
1215 | },
1216 | {
1217 | "cell_type": "code",
1218 | "execution_count": 315,
1219 | "metadata": {},
1220 | "outputs": [],
1221 | "source": [
1222 | "# Welch's t-test (equal_var = False): t statistic and two-sided p-value\n",
1223 | "t,p = stats.ttest_ind(a=controll['conversion'], b=exp['conversion'],equal_var = False)"
1224 | ]
1225 | },
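{
"cell_type": "markdown",
"metadata": {},
"source": [
"For intuition, Welch's t statistic can also be computed by hand. This is a sketch that assumes the `controll` and `exp` frames defined above; `t_manual` and the other names are illustrative, and the result should agree with the `t` returned by `stats.ttest_ind(..., equal_var = False)` up to floating-point error."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Welch's t: difference in means over the unpooled standard error\n",
"m_c, m_e = controll['conversion'].mean(), exp['conversion'].mean()\n",
"v_c, v_e = controll['conversion'].var(ddof=1), exp['conversion'].var(ddof=1)\n",
"se = np.sqrt(v_c / len(controll) + v_e / len(exp))\n",
"t_manual = (m_c - m_e) / se\n",
"print(t_manual)"
]
},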
1226 | {
1227 | "cell_type": "code",
1228 | "execution_count": 316,
1229 | "metadata": {},
1230 | "outputs": [
1231 | {
1232 | "name": "stdout",
1233 | "output_type": "stream",
1234 | "text": [
1235 | "7.35389520308 1.92891785778e-13\n"
1236 | ]
1237 | }
1238 | ],
1239 | "source": [
1240 | "print (t,p)"
1241 | ]
1242 | },
1243 | {
1244 | "cell_type": "markdown",
1245 | "metadata": {},
1246 | "source": [
1247 | "According to the analysis above, the test group performs significantly worse than the control group: the conversion rate appears to drop after the change.\n",
1248 | "When a result runs this strongly against expectations, it is worth checking whether something is wrong with the experiment itself.\n",
1249 | "Let's dive deeper into the sample."
1250 | ]
1251 | },
1252 | {
1253 | "cell_type": "code",
1254 | "execution_count": 319,
1255 | "metadata": {},
1256 | "outputs": [
1257 | {
1258 | "data": {
1259 | "text/plain": [
1260 | "date\n",
1261 | "2015-11-30 0.051204\n",
1262 | "2015-12-01 0.046249\n",
1263 | "2015-12-02 0.048472\n",
1264 | "2015-12-03 0.049255\n",
1265 | "2015-12-04 0.047085\n",
1266 | "Name: conversion, dtype: float64"
1267 | ]
1268 | },
1269 | "execution_count": 319,
1270 | "metadata": {},
1271 | "output_type": "execute_result"
1272 | }
1273 | ],
1274 | "source": [
1275 | "# daily conversion rate in the control group; the test group (next cell) is consistently lower on every day\n",
1276 | "controll.groupby('date').conversion.mean()"
1277 | ]
1278 | },
1279 | {
1280 | "cell_type": "code",
1281 | "execution_count": 320,
1282 | "metadata": {},
1283 | "outputs": [
1284 | {
1285 | "data": {
1286 | "text/plain": [
1287 | "date\n",
1288 | "2015-11-30 0.043878\n",
1289 | "2015-12-01 0.041371\n",
1290 | "2015-12-02 0.044216\n",
1291 | "2015-12-03 0.043898\n",
1292 | "2015-12-04 0.043459\n",
1293 | "Name: conversion, dtype: float64"
1294 | ]
1295 | },
1296 | "execution_count": 320,
1297 | "metadata": {},
1298 | "output_type": "execute_result"
1299 | }
1300 | ],
1301 | "source": [
1302 | "exp.groupby('date').conversion.mean()"
1303 | ]
1304 | },
1305 | {
1306 | "cell_type": "code",
1307 | "execution_count": 321,
1308 | "metadata": {},
1309 | "outputs": [],
1310 | "source": [
1311 | "c_country = pd.Series(controll.groupby('country').size(), name = 'controll')"
1312 | ]
1313 | },
1314 | {
1315 | "cell_type": "code",
1316 | "execution_count": 322,
1317 | "metadata": {},
1318 | "outputs": [],
1319 | "source": [
1320 | "e_country = pd.Series(exp.groupby('country').size(), name = 'exp')"
1321 | ]
1322 | },
1323 | {
1324 | "cell_type": "code",
1325 | "execution_count": 323,
1326 | "metadata": {},
1327 | "outputs": [
1328 | {
1329 | "data": {
1330 | "text/html": [
1331 | "\n",
1332 | "\n",
1345 | "
\n",
1346 | " \n",
1347 | " \n",
1348 | " | \n",
1349 | " controll | \n",
1350 | " exp | \n",
1351 | "
\n",
1352 | " \n",
1353 | " country | \n",
1354 | " | \n",
1355 | " | \n",
1356 | "
\n",
1357 | " \n",
1358 | " \n",
1359 | " \n",
1360 | " Argentina | \n",
1361 | " 9356 | \n",
1362 | " 37377 | \n",
1363 | "
\n",
1364 | " \n",
1365 | " Bolivia | \n",
1366 | " 5550 | \n",
1367 | " 5574 | \n",
1368 | "
\n",
1369 | " \n",
1370 | " Chile | \n",
1371 | " 9853 | \n",
1372 | " 9884 | \n",
1373 | "
\n",
1374 | " \n",
1375 | " Colombia | \n",
1376 | " 27088 | \n",
1377 | " 26972 | \n",
1378 | "
\n",
1379 | " \n",
1380 | " Costa Rica | \n",
1381 | " 2660 | \n",
1382 | " 2649 | \n",
1383 | "
\n",
1384 | " \n",
1385 | " Ecuador | \n",
1386 | " 8036 | \n",
1387 | " 7859 | \n",
1388 | "
\n",
1389 | " \n",
1390 | " El Salvador | \n",
1391 | " 4108 | \n",
1392 | " 4067 | \n",
1393 | "
\n",
1394 | " \n",
1395 | " Guatemala | \n",
1396 | " 7622 | \n",
1397 | " 7503 | \n",
1398 | "
\n",
1399 | " \n",
1400 | " Honduras | \n",
1401 | " 4361 | \n",
1402 | " 4207 | \n",
1403 | "
\n",
1404 | " \n",
1405 | " Mexico | \n",
1406 | " 64209 | \n",
1407 | " 64275 | \n",
1408 | "
\n",
1409 | " \n",
1410 | " Nicaragua | \n",
1411 | " 3419 | \n",
1412 | " 3304 | \n",
1413 | "
\n",
1414 | " \n",
1415 | " Panama | \n",
1416 | " 1966 | \n",
1417 | " 1985 | \n",
1418 | "
\n",
1419 | " \n",
1420 | " Paraguay | \n",
1421 | " 3650 | \n",
1422 | " 3697 | \n",
1423 | "
\n",
1424 | " \n",
1425 | " Peru | \n",
1426 | " 16869 | \n",
1427 | " 16797 | \n",
1428 | "
\n",
1429 | " \n",
1430 | " Uruguay | \n",
1431 | " 415 | \n",
1432 | " 3719 | \n",
1433 | "
\n",
1434 | " \n",
1435 | " Venezuela | \n",
1436 | " 16149 | \n",
1437 | " 15905 | \n",
1438 | "
\n",
1439 | " \n",
1440 | "
\n",
1441 | "
"
1442 | ],
1443 | "text/plain": [
1444 | " controll exp\n",
1445 | "country \n",
1446 | "Argentina 9356 37377\n",
1447 | "Bolivia 5550 5574\n",
1448 | "Chile 9853 9884\n",
1449 | "Colombia 27088 26972\n",
1450 | "Costa Rica 2660 2649\n",
1451 | "Ecuador 8036 7859\n",
1452 | "El Salvador 4108 4067\n",
1453 | "Guatemala 7622 7503\n",
1454 | "Honduras 4361 4207\n",
1455 | "Mexico 64209 64275\n",
1456 | "Nicaragua 3419 3304\n",
1457 | "Panama 1966 1985\n",
1458 | "Paraguay 3650 3697\n",
1459 | "Peru 16869 16797\n",
1460 | "Uruguay 415 3719\n",
1461 | "Venezuela 16149 15905"
1462 | ]
1463 | },
1464 | "execution_count": 323,
1465 | "metadata": {},
1466 | "output_type": "execute_result"
1467 | }
1468 | ],
1469 | "source": [
1470 | "pd.concat([c_country,e_country],axis = 1)"
1471 | ]
1472 | },
1473 | {
1474 | "cell_type": "code",
1475 | "execution_count": 324,
1476 | "metadata": {},
1477 | "outputs": [
1478 | {
1479 | "data": {
1480 | "text/plain": [
1481 | ""
1482 | ]
1483 | },
1484 | "execution_count": 324,
1485 | "metadata": {},
1486 | "output_type": "execute_result"
1487 | },
1488 | {
1489 | "data": {
1490 | "image/png": "\n",
1491 | "text/plain": [
1492 | ""
1493 | ]
1494 | },
1495 | "metadata": {},
1496 | "output_type": "display_data"
1497 | }
1498 | ],
1499 | "source": [
1500 | "(pd.concat([c_country,e_country],axis = 1)).plot(kind='bar')"
1501 | ]
1502 | },
1503 | {
1504 | "cell_type": "markdown",
1505 | "metadata": {},
1506 | "source": [
1507 | "The sample is unbalanced between the two groups. For example, for Argentina and Uruguay the experiment group is much larger than the control group (37377 vs. 9356 and 3719 vs. 415)."
1508 | ]
1509 | },
1510 | {
1511 | "cell_type": "markdown",
1512 | "metadata": {},
1513 | "source": [
1514 | "#### We should compare the two groups within each segment (country)"
1515 | ]
1516 | },
1517 | {
1518 | "cell_type": "code",
1519 | "execution_count": 327,
1520 | "metadata": {},
1521 | "outputs": [],
1522 | "source": [
1523 | "# get the conversion rate for each country in the control group\n",
1524 | "c_cr = pd.Series(controll.groupby('country').conversion.mean(),name = 'controll conversion rate')"
1525 | ]
1526 | },
1527 | {
1528 | "cell_type": "code",
1529 | "execution_count": 328,
1530 | "metadata": {},
1531 | "outputs": [],
1532 | "source": [
1533 | "# get the conversion rate for each country in experiment group\n",
1534 | "e_cr = pd.Series(exp.groupby('country').conversion.mean(), name = 'exp conversion rate')"
1535 | ]
1536 | },
1537 | {
1538 | "cell_type": "code",
1539 | "execution_count": 329,
1540 | "metadata": {},
1541 | "outputs": [
1542 | {
1543 | "data": {
1544 | "text/plain": [
1545 | "country\n",
1546 | "Argentina 0.015071\n",
1547 | "Bolivia 0.049369\n",
1548 | "Chile 0.048107\n",
1549 | "Colombia 0.052089\n",
1550 | "Costa Rica 0.052256\n",
1551 | "Ecuador 0.049154\n",
1552 | "El Salvador 0.053554\n",
1553 | "Guatemala 0.050643\n",
1554 | "Honduras 0.050906\n",
1555 | "Mexico 0.049495\n",
1556 | "Nicaragua 0.052647\n",
1557 | "Panama 0.046796\n",
1558 | "Paraguay 0.048493\n",
1559 | "Peru 0.049914\n",
1560 | "Uruguay 0.012048\n",
1561 | "Venezuela 0.050344\n",
1562 | "Name: controll conversion rate, dtype: float64"
1563 | ]
1564 | },
1565 | "execution_count": 329,
1566 | "metadata": {},
1567 | "output_type": "execute_result"
1568 | }
1569 | ],
1570 | "source": [
1571 | "c_cr"
1572 | ]
1573 | },
1574 | {
1575 | "cell_type": "markdown",
1576 | "metadata": {},
1577 | "source": [
1578 | "Get the t statistic and p-value for each country"
1579 | ]
1580 | },
1581 | {
1582 | "cell_type": "code",
1583 | "execution_count": 330,
1584 | "metadata": {},
1585 | "outputs": [],
1586 | "source": [
1587 | "country_list = list(controll.country.unique())"
1588 | ]
1589 | },
1590 | {
1591 | "cell_type": "code",
1592 | "execution_count": 331,
1593 | "metadata": {},
1594 | "outputs": [
1595 | {
1596 | "data": {
1597 | "text/plain": [
1598 | "['Mexico',\n",
1599 | " 'Colombia',\n",
1600 | " 'El Salvador',\n",
1601 | " 'Nicaragua',\n",
1602 | " 'Peru',\n",
1603 | " 'Chile',\n",
1604 | " 'Argentina',\n",
1605 | " 'Ecuador',\n",
1606 | " 'Venezuela',\n",
1607 | " 'Guatemala',\n",
1608 | " 'Honduras',\n",
1609 | " 'Panama',\n",
1610 | " 'Paraguay',\n",
1611 | " 'Costa Rica',\n",
1612 | " 'Bolivia',\n",
1613 | " 'Uruguay']"
1614 | ]
1615 | },
1616 | "execution_count": 331,
1617 | "metadata": {},
1618 | "output_type": "execute_result"
1619 | }
1620 | ],
1621 | "source": [
1622 | "country_list"
1623 | ]
1624 | },
1625 | {
1626 | "cell_type": "code",
1627 | "execution_count": 332,
1628 | "metadata": {},
1629 | "outputs": [],
1630 | "source": [
1631 | "lin = []\n",
1632 | "for c in country_list:\n",
1633 | "    # Welch's t-test (equal_var=False): control vs. experiment within country c\n",
1634 | "    t, p = stats.ttest_ind(a=controll[controll['country'] == c].conversion,\n",
1635 | "                           b=exp[exp['country'] == c].conversion, equal_var=False)\n",
1636 | "    # collect one [t, p] pair per country, in country_list order\n",
1637 | "    lin.append([t, p])"
1638 | ]
1639 | },
1640 | {
1641 | "cell_type": "code",
1642 | "execution_count": 333,
1643 | "metadata": {},
1644 | "outputs": [
1645 | {
1646 | "data": {
1647 | "text/plain": [
1648 | "[[-1.3866735952325449, 0.16554372211039645],\n",
1649 | " [0.79999178223708245, 0.42371907413141141],\n",
1650 | " [1.1549940887832975, 0.2481266743266678],\n",
1651 | " [-0.27880850314757355, 0.78040038589047944],\n",
1652 | " [-0.28982358545511927, 0.77195298851535477],\n",
1653 | " [-1.0303728644383661, 0.30284764308444695],\n",
1654 | " [0.9638326839451179, 0.33514654687468659],\n",
1655 | " [0.048257426198918048, 0.96151169060066222],\n",
1656 | " [0.56261424690935702, 0.57370152343872549],\n",
1657 | " [0.56496315146205101, 0.57210720819120686],\n",
1658 | " [0.72013284328217941, 0.47146285652575859],\n",
1659 | " [-0.378167043801935, 0.70532683727258894],\n",
1660 | " [-0.14628996329799995, 0.88369650349623641],\n",
1661 | " [-0.40176067651471453, 0.68787635370739864],\n",
1662 | " [0.35995817724402418, 0.71888524684510746],\n",
1663 | " [-0.15134316107212104, 0.87976397365142245]]"
1664 | ]
1665 | },
1666 | "execution_count": 333,
1667 | "metadata": {},
1668 | "output_type": "execute_result"
1669 | }
1670 | ],
1671 | "source": [
1672 | "lin"
1673 | ]
1674 | },
1675 | {
1676 | "cell_type": "code",
1677 | "execution_count": 334,
1678 | "metadata": {},
1679 | "outputs": [],
1680 | "source": [
1681 | "stats = pd.DataFrame(lin, columns=['t', 'p'], index = country_list)  # NOTE: this rebinds `stats`, shadowing the scipy.stats module imported above"
1682 | ]
1683 | },
1684 | {
1685 | "cell_type": "code",
1686 | "execution_count": 335,
1687 | "metadata": {},
1688 | "outputs": [
1689 | {
1690 | "data": {
1691 | "text/html": [
1692 | "\n",
1693 | "\n",
1706 | "<table border=\"1\" class=\"dataframe\">\n",
1707 | " <thead>\n",
1708 | " <tr style=\"text-align: right;\"><th></th><th>t</th><th>p</th></tr>\n",
1709 | " </thead>\n",
1710 | " <tbody>\n",
1711 | " <tr><th>Mexico</th><td>-1.386674</td><td>0.165544</td></tr>\n",
1712 | " <tr><th>Colombia</th><td>0.799992</td><td>0.423719</td></tr>\n",
1713 | " <tr><th>El Salvador</th><td>1.154994</td><td>0.248127</td></tr>\n",
1714 | " <tr><th>Nicaragua</th><td>-0.278809</td><td>0.780400</td></tr>\n",
1715 | " <tr><th>Peru</th><td>-0.289824</td><td>0.771953</td></tr>\n",
1716 | " <tr><th>Chile</th><td>-1.030373</td><td>0.302848</td></tr>\n",
1717 | " <tr><th>Argentina</th><td>0.963833</td><td>0.335147</td></tr>\n",
1718 | " <tr><th>Ecuador</th><td>0.048257</td><td>0.961512</td></tr>\n",
1719 | " <tr><th>Venezuela</th><td>0.562614</td><td>0.573702</td></tr>\n",
1720 | " <tr><th>Guatemala</th><td>0.564963</td><td>0.572107</td></tr>\n",
1721 | " <tr><th>Honduras</th><td>0.720133</td><td>0.471463</td></tr>\n",
1722 | " <tr><th>Panama</th><td>-0.378167</td><td>0.705327</td></tr>\n",
1723 | " <tr><th>Paraguay</th><td>-0.146290</td><td>0.883697</td></tr>\n",
1724 | " <tr><th>Costa Rica</th><td>-0.401761</td><td>0.687876</td></tr>\n",
1725 | " <tr><th>Bolivia</th><td>0.359958</td><td>0.718885</td></tr>\n",
1726 | " <tr><th>Uruguay</th><td>-0.151343</td><td>0.879764</td></tr>\n",
1727 | " </tbody>\n",
1728 | "</table>"
1798 | ],
1799 | "text/plain": [
1800 | " t p\n",
1801 | "Mexico -1.386674 0.165544\n",
1802 | "Colombia 0.799992 0.423719\n",
1803 | "El Salvador 1.154994 0.248127\n",
1804 | "Nicaragua -0.278809 0.780400\n",
1805 | "Peru -0.289824 0.771953\n",
1806 | "Chile -1.030373 0.302848\n",
1807 | "Argentina 0.963833 0.335147\n",
1808 | "Ecuador 0.048257 0.961512\n",
1809 | "Venezuela 0.562614 0.573702\n",
1810 | "Guatemala 0.564963 0.572107\n",
1811 | "Honduras 0.720133 0.471463\n",
1812 | "Panama -0.378167 0.705327\n",
1813 | "Paraguay -0.146290 0.883697\n",
1814 | "Costa Rica -0.401761 0.687876\n",
1815 | "Bolivia 0.359958 0.718885\n",
1816 | "Uruguay -0.151343 0.879764"
1817 | ]
1818 | },
1819 | "execution_count": 335,
1820 | "metadata": {},
1821 | "output_type": "execute_result"
1822 | }
1823 | ],
1824 | "source": [
1825 | "stats"
1826 | ]
1827 | },
1828 | {
1829 | "cell_type": "code",
1830 | "execution_count": 336,
1831 | "metadata": {},
1832 | "outputs": [
1833 | {
1834 | "data": {
1835 | "text/html": [
1836 | "\n",
1837 | "\n",
1850 | "<table border=\"1\" class=\"dataframe\">\n",
1851 | " <thead>\n",
1852 | " <tr style=\"text-align: right;\"><th></th><th>controll conversion rate</th><th>exp conversion rate</th><th>t</th><th>p</th></tr>\n",
1853 | " </thead>\n",
1854 | " <tbody>\n",
1855 | " <tr><th>Argentina</th><td>0.015071</td><td>0.013725</td><td>0.963833</td><td>0.335147</td></tr>\n",
1856 | " <tr><th>Bolivia</th><td>0.049369</td><td>0.047901</td><td>0.359958</td><td>0.718885</td></tr>\n",
1857 | " <tr><th>Chile</th><td>0.048107</td><td>0.051295</td><td>-1.030373</td><td>0.302848</td></tr>\n",
1858 | " <tr><th>Colombia</th><td>0.052089</td><td>0.050571</td><td>0.799992</td><td>0.423719</td></tr>\n",
1859 | " <tr><th>Costa Rica</th><td>0.052256</td><td>0.054738</td><td>-0.401761</td><td>0.687876</td></tr>\n",
1860 | " <tr><th>Ecuador</th><td>0.049154</td><td>0.048988</td><td>0.048257</td><td>0.961512</td></tr>\n",
1861 | " <tr><th>El Salvador</th><td>0.053554</td><td>0.047947</td><td>1.154994</td><td>0.248127</td></tr>\n",
1862 | " <tr><th>Guatemala</th><td>0.050643</td><td>0.048647</td><td>0.564963</td><td>0.572107</td></tr>\n",
1863 | " <tr><th>Honduras</th><td>0.050906</td><td>0.047540</td><td>0.720133</td><td>0.471463</td></tr>\n",
1864 | " <tr><th>Mexico</th><td>0.049495</td><td>0.051186</td><td>-1.386674</td><td>0.165544</td></tr>\n",
1865 | " <tr><th>Nicaragua</th><td>0.052647</td><td>0.054177</td><td>-0.278809</td><td>0.780400</td></tr>\n",
1866 | " <tr><th>Panama</th><td>0.046796</td><td>0.049370</td><td>-0.378167</td><td>0.705327</td></tr>\n",
1867 | " <tr><th>Paraguay</th><td>0.048493</td><td>0.049229</td><td>-0.146290</td><td>0.883697</td></tr>\n",
1868 | " <tr><th>Peru</th><td>0.049914</td><td>0.050604</td><td>-0.289824</td><td>0.771953</td></tr>\n",
1869 | " <tr><th>Uruguay</th><td>0.012048</td><td>0.012907</td><td>-0.151343</td><td>0.879764</td></tr>\n",
1870 | " <tr><th>Venezuela</th><td>0.050344</td><td>0.048978</td><td>0.562614</td><td>0.573702</td></tr>\n",
1871 | " </tbody>\n",
1872 | "</table>"
1976 | ],
1977 | "text/plain": [
1978 | " controll conversion rate exp conversion rate t p\n",
1979 | "Argentina 0.015071 0.013725 0.963833 0.335147\n",
1980 | "Bolivia 0.049369 0.047901 0.359958 0.718885\n",
1981 | "Chile 0.048107 0.051295 -1.030373 0.302848\n",
1982 | "Colombia 0.052089 0.050571 0.799992 0.423719\n",
1983 | "Costa Rica 0.052256 0.054738 -0.401761 0.687876\n",
1984 | "Ecuador 0.049154 0.048988 0.048257 0.961512\n",
1985 | "El Salvador 0.053554 0.047947 1.154994 0.248127\n",
1986 | "Guatemala 0.050643 0.048647 0.564963 0.572107\n",
1987 | "Honduras 0.050906 0.047540 0.720133 0.471463\n",
1988 | "Mexico 0.049495 0.051186 -1.386674 0.165544\n",
1989 | "Nicaragua 0.052647 0.054177 -0.278809 0.780400\n",
1990 | "Panama 0.046796 0.049370 -0.378167 0.705327\n",
1991 | "Paraguay 0.048493 0.049229 -0.146290 0.883697\n",
1992 | "Peru 0.049914 0.050604 -0.289824 0.771953\n",
1993 | "Uruguay 0.012048 0.012907 -0.151343 0.879764\n",
1994 | "Venezuela 0.050344 0.048978 0.562614 0.573702"
1995 | ]
1996 | },
1997 | "execution_count": 336,
1998 | "metadata": {},
1999 | "output_type": "execute_result"
2000 | }
2001 | ],
2002 | "source": [
2003 | "pd.concat([c_cr,e_cr,stats],axis = 1)"
2004 | ]
2005 | },
2006 | {
2007 | "cell_type": "markdown",
2008 | "metadata": {},
2009 | "source": [
2010 | "### Conclusion:\n",
2011 | "Looking at the A/B test results within each segment (country), none of the p-values is below alpha = 0.05, so we cannot reject the null hypothesis in any country.\n",
2012 | "Therefore, there is no statistically significant improvement in the conversion rate after the change.\n",
2013 | "There is also no statistically significant evidence that the conversion rate became worse after the change."
2014 | ]
2015 | },
2016 | {
2017 | "cell_type": "markdown",
2018 | "metadata": {},
2019 | "source": [
2020 | "#### Some extra\n",
2021 | "Below is a step-by-step calculation of the t statistic without using the stats.ttest_ind() function"
2022 | ]
2023 | },
2024 | {
2025 | "cell_type": "code",
2026 | "execution_count": 337,
2027 | "metadata": {},
2028 | "outputs": [],
2029 | "source": [
2030 | "# take Mexico as an example\n",
2031 | "controll_m = controll[controll['country']=='Mexico']\n",
2032 | "exp_m = exp[exp['country']=='Mexico']"
2033 | ]
2034 | },
2035 | {
2036 | "cell_type": "markdown",
2037 | "metadata": {},
2038 | "source": [
2039 | "Calculate sample size"
2040 | ]
2041 | },
2042 | {
2043 | "cell_type": "code",
2044 | "execution_count": 338,
2045 | "metadata": {},
2046 | "outputs": [],
2047 | "source": [
2048 | "na = len(controll_m)\n",
2049 | "nb = len(exp_m)"
2050 | ]
2051 | },
2052 | {
2053 | "cell_type": "code",
2054 | "execution_count": 339,
2055 | "metadata": {},
2056 | "outputs": [
2057 | {
2058 | "name": "stdout",
2059 | "output_type": "stream",
2060 | "text": [
2061 | "Sample size of controll group: 64209\n",
2062 | "Sample size of experiment group: 64275\n"
2063 | ]
2064 | }
2065 | ],
2066 | "source": [
2067 | "print ('Sample size of controll group: {}'.format(na))\n",
2068 | "print ('Sample size of experiment group: {}'.format(nb))"
2069 | ]
2070 | },
2071 | {
2072 | "cell_type": "markdown",
2073 | "metadata": {},
2074 | "source": [
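2075 | "Degrees of freedom (pooled approximation na + nb - 2; Welch's test uses the Welch-Satterthwaite formula, but with samples this large the critical value is the same)"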
2076 | ]
2077 | },
2078 | {
2079 | "cell_type": "code",
2080 | "execution_count": 340,
2081 | "metadata": {},
2082 | "outputs": [
2083 | {
2084 | "name": "stdout",
2085 | "output_type": "stream",
2086 | "text": [
2087 | "128482\n"
2088 | ]
2089 | }
2090 | ],
2091 | "source": [
2092 | "df = na+nb-2\n",
2093 | "print (df)"
2094 | ]
2095 | },
2096 | {
2097 | "cell_type": "markdown",
2098 | "metadata": {},
2099 | "source": [
2100 | "Calculate the conversion rate (sample mean) of the control and experiment groups for Mexico"
2101 | ]
2102 | },
2103 | {
2104 | "cell_type": "code",
2105 | "execution_count": 341,
2106 | "metadata": {},
2107 | "outputs": [],
2108 | "source": [
2109 | "xa = controll_m.conversion.mean()\n",
2110 | "xb = exp_m.conversion.mean()"
2111 | ]
2112 | },
2113 | {
2114 | "cell_type": "code",
2115 | "execution_count": 343,
2116 | "metadata": {},
2117 | "outputs": [
2118 | {
2119 | "name": "stdout",
2120 | "output_type": "stream",
2121 | "text": [
2122 | "Conversion rate of controll group: 0.04949461913438926\n",
2123 | "Conversion rate of experiment group: 0.05118630882924932\n"
2124 | ]
2125 | }
2126 | ],
2127 | "source": [
2128 | "# the conversion rate of the experiment group is 0.17 percentage points higher, but is that significant, or is it due to chance?\n",
2129 | "print ('Conversion rate of controll group: {}'.format(xa))\n",
2130 | "print ('Conversion rate of experiment group: {}'.format(xb))"
2131 | ]
2132 | },
2133 | {
2134 | "cell_type": "markdown",
2135 | "metadata": {},
2136 | "source": [
2137 | "Calculate standard deviation"
2138 | ]
2139 | },
2140 | {
2141 | "cell_type": "code",
2142 | "execution_count": 344,
2143 | "metadata": {},
2144 | "outputs": [],
2145 | "source": [
2146 | "# in a Jupyter notebook, press shift+tab to see function details (more presses show more details)\n",
2147 | "# pandas .std() defaults to ddof=1 (sample std); ddof=0 gives the population std\n",
2148 | "sa = controll_m.conversion.std()\n",
2149 | "sb = exp_m.conversion.std()"
2150 | ]
2151 | },
2152 | {
2153 | "cell_type": "code",
2154 | "execution_count": 345,
2155 | "metadata": {},
2156 | "outputs": [],
2157 | "source": [
2158 | "# squared standard error of the difference in means (the square root is taken when computing t)\n",
2159 | "se = pow(sa,2)/na+pow(sb,2)/nb"
2160 | ]
2161 | },
2162 | {
2163 | "cell_type": "code",
2164 | "execution_count": 346,
2165 | "metadata": {},
2166 | "outputs": [],
2167 | "source": [
2168 | "t = (xb-xa)/np.sqrt(se)  # sign is flipped vs. stats.ttest_ind(a=controll, b=exp), which uses xa - xb"
2169 | ]
2170 | },
2171 | {
2172 | "cell_type": "code",
2173 | "execution_count": 347,
2174 | "metadata": {},
2175 | "outputs": [
2176 | {
2177 | "name": "stdout",
2178 | "output_type": "stream",
2179 | "text": [
2180 | "1.38667359523\n"
2181 | ]
2182 | }
2183 | ],
2184 | "source": [
2185 | "print (t)"
2186 | ]
2187 | },
2188 | {
2189 | "cell_type": "markdown",
2190 | "metadata": {},
2191 | "source": [
2192 | "Look up the t-table for a two-sided test with alpha = 0.05 and df = 128482 to get the critical value: t = 1.96.\n",
2193 | "Since |t| = 1.39 is not greater than the critical value 1.96, we cannot reject the null hypothesis."
2194 | ]
2195 | }
2196 | ],
2197 | "metadata": {
2198 | "kernelspec": {
2199 | "display_name": "Python 3",
2200 | "language": "python",
2201 | "name": "python3"
2202 | },
2203 | "language_info": {
2204 | "codemirror_mode": {
2205 | "name": "ipython",
2206 | "version": 3
2207 | },
2208 | "file_extension": ".py",
2209 | "mimetype": "text/x-python",
2210 | "name": "python",
2211 | "nbconvert_exporter": "python",
2212 | "pygments_lexer": "ipython3",
2213 | "version": "3.5.4"
2214 | }
2215 | },
2216 | "nbformat": 4,
2217 | "nbformat_minor": 2
2218 | }
2219 |
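Note on the manual calculation at the end of AB_Testing.ipynb: it stops at the t statistic and looks the critical value up in a table. A minimal sketch of the full Welch's t-test follows, including the Welch-Satterthwaite degrees of freedom and a two-sided p-value, checked against `scipy.stats.ttest_ind(..., equal_var=False)`. The helper name `welch_t_test` and the synthetic 0/1 conversion data are assumptions made for illustration; they are not part of the original notebook or its dataset.

```python
import numpy as np
from scipy import stats


def welch_t_test(a, b):
    """Two-sided Welch's t-test for mean(a) - mean(b).

    Returns (t, dof, p). Hypothetical helper, written for illustration only.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    na, nb = len(a), len(b)
    # per-group variance of the sample mean (sample variance, ddof=1)
    va = a.var(ddof=1) / na
    vb = b.var(ddof=1) / nb
    # t statistic: difference in means over its standard error
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom
    dof = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    # two-sided p-value from the t distribution's survival function
    p = 2 * stats.t.sf(abs(t), dof)
    return t, dof, p


# synthetic conversion flags (assumed rates, not the notebook's data)
rng = np.random.default_rng(0)
control = rng.binomial(1, 0.050, size=5000)
experiment = rng.binomial(1, 0.052, size=5200)

t, dof, p = welch_t_test(control, experiment)
t_ref, p_ref = stats.ttest_ind(control, experiment, equal_var=False)
print("manual: t={:.6f}, dof={:.1f}, p={:.6f}".format(t, dof, p))
print("scipy : t={:.6f}, p={:.6f}".format(t_ref, p_ref))
```

Computing the p-value directly avoids the table lookup, and passing the control group first keeps the sign of t consistent with the `stats.ttest_ind(a=controll, b=exp)` calls in the notebook.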
--------------------------------------------------------------------------------
/Employee_Retention_PeopleAnalytics.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Employee Retention - People Analytics"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "## Goal:\n",
15 | "### 1. Predict Employee Retention\n",
16 | "#### ----create a table with 3 columns, day, employee_headcount, company_id\n",
17 | "### 2. What are the main factors that drive employee churn?"
18 | ]
19 | },
20 | {
21 | "cell_type": "code",
22 | "execution_count": 811,
23 | "metadata": {},
24 | "outputs": [],
25 | "source": [
26 | "import pandas as pd\n",
27 | "import numpy as np\n",
28 | "import matplotlib.pyplot as plt\n",
29 | "%matplotlib inline\n",
30 | "import datetime\n",
31 | "from ggplot import *"
32 | ]
33 | },
34 | {
35 | "cell_type": "code",
36 | "execution_count": 812,
37 | "metadata": {},
38 | "outputs": [],
39 | "source": [
40 | "data = pd.read_csv(r'C:\\Users\\lshen\\Downloads\\employee_retention_data.csv')"
41 | ]
42 | },
43 | {
44 | "cell_type": "code",
45 | "execution_count": 813,
46 | "metadata": {},
47 | "outputs": [
48 | {
49 | "data": {
50 | "text/html": [
51 | "\n",
52 | "\n",
65 | "<table border=\"1\" class=\"dataframe\">\n",
66 | " <thead>\n",
67 | " <tr style=\"text-align: right;\"><th></th><th>employee_id</th><th>company_id</th><th>dept</th><th>seniority</th><th>salary</th><th>join_date</th><th>quit_date</th></tr>\n",
68 | " </thead>\n",
69 | " <tbody>\n",
70 | " <tr><th>0</th><td>13021.0</td><td>7</td><td>customer_service</td><td>28</td><td>89000.0</td><td>2014-03-24</td><td>2015-10-30</td></tr>\n",
71 | " <tr><th>1</th><td>825355.0</td><td>7</td><td>marketing</td><td>20</td><td>183000.0</td><td>2013-04-29</td><td>2014-04-04</td></tr>\n",
72 | " <tr><th>2</th><td>927315.0</td><td>4</td><td>marketing</td><td>14</td><td>101000.0</td><td>2014-10-13</td><td>NaN</td></tr>\n",
73 | " <tr><th>3</th><td>662910.0</td><td>7</td><td>customer_service</td><td>20</td><td>115000.0</td><td>2012-05-14</td><td>2013-06-07</td></tr>\n",
74 | " <tr><th>4</th><td>256971.0</td><td>2</td><td>data_science</td><td>23</td><td>276000.0</td><td>2011-10-17</td><td>2014-08-22</td></tr>\n",
75 | " </tbody>\n",
76 | "</table>"
132 | ],
133 | "text/plain": [
134 | " employee_id company_id dept seniority salary join_date \\\n",
135 | "0 13021.0 7 customer_service 28 89000.0 2014-03-24 \n",
136 | "1 825355.0 7 marketing 20 183000.0 2013-04-29 \n",
137 | "2 927315.0 4 marketing 14 101000.0 2014-10-13 \n",
138 | "3 662910.0 7 customer_service 20 115000.0 2012-05-14 \n",
139 | "4 256971.0 2 data_science 23 276000.0 2011-10-17 \n",
140 | "\n",
141 | " quit_date \n",
142 | "0 2015-10-30 \n",
143 | "1 2014-04-04 \n",
144 | "2 NaN \n",
145 | "3 2013-06-07 \n",
146 | "4 2014-08-22 "
147 | ]
148 | },
149 | "execution_count": 813,
150 | "metadata": {},
151 | "output_type": "execute_result"
152 | }
153 | ],
154 | "source": [
155 | "data.head()"
156 | ]
157 | },
158 | {
159 | "cell_type": "code",
160 | "execution_count": 814,
161 | "metadata": {},
162 | "outputs": [
163 | {
164 | "data": {
165 | "text/plain": [
166 | "employee_id float64\n",
167 | "company_id int64\n",
168 | "dept object\n",
169 | "seniority int64\n",
170 | "salary float64\n",
171 | "join_date object\n",
172 | "quit_date object\n",
173 | "dtype: object"
174 | ]
175 | },
176 | "execution_count": 814,
177 | "metadata": {},
178 | "output_type": "execute_result"
179 | }
180 | ],
181 | "source": [
182 | "data.dtypes"
183 | ]
184 | },
185 | {
186 | "cell_type": "code",
187 | "execution_count": 815,
188 | "metadata": {},
189 | "outputs": [
190 | {
191 | "data": {
192 | "text/plain": [
193 | "(24702, 7)"
194 | ]
195 | },
196 | "execution_count": 815,
197 | "metadata": {},
198 | "output_type": "execute_result"
199 | }
200 | ],
201 | "source": [
202 | "data.shape"
203 | ]
204 | },
205 | {
206 | "cell_type": "code",
207 | "execution_count": 816,
208 | "metadata": {},
209 | "outputs": [
210 | {
211 | "data": {
212 | "text/plain": [
213 | "employee_id 0\n",
214 | "company_id 0\n",
215 | "dept 0\n",
216 | "seniority 0\n",
217 | "salary 0\n",
218 | "join_date 0\n",
219 | "quit_date 11192\n",
220 | "dtype: int64"
221 | ]
222 | },
223 | "execution_count": 816,
224 | "metadata": {},
225 | "output_type": "execute_result"
226 | }
227 | ],
228 | "source": [
229 | "data.isnull().sum()"
230 | ]
231 | },
232 | {
233 | "cell_type": "markdown",
234 | "metadata": {},
235 | "source": [
236 | "### Convert columns to proper data types"
237 | ]
238 | },
239 | {
240 | "cell_type": "code",
241 | "execution_count": 817,
242 | "metadata": {},
243 | "outputs": [],
244 | "source": [
245 | "# convert join_date and quit_date to datetime (missing quit dates become NaT)\n",
246 | "# one way -----data['join_date'] = data.join_date.astype(datetime.datetime)\n",
247 | "data['join_date'] = pd.to_datetime(data.join_date)\n",
248 | "data['quit_date'] = pd.to_datetime(data.quit_date)"
249 | ]
250 | },
251 | {
252 | "cell_type": "code",
253 | "execution_count": 818,
254 | "metadata": {},
255 | "outputs": [
256 | {
257 | "data": {
258 | "text/plain": [
259 | "employee_id float64\n",
260 | "company_id int64\n",
261 | "dept object\n",
262 | "seniority int64\n",
263 | "salary float64\n",
264 | "join_date datetime64[ns]\n",
265 | "quit_date datetime64[ns]\n",
266 | "dtype: object"
267 | ]
268 | },
269 | "execution_count": 818,
270 | "metadata": {},
271 | "output_type": "execute_result"
272 | }
273 | ],
274 | "source": [
275 | "data.dtypes"
276 | ]
277 | },
278 | {
279 | "cell_type": "code",
280 | "execution_count": 819,
281 | "metadata": {},
282 | "outputs": [
283 | {
284 | "data": {
285 | "text/html": [
286 | "\n",
287 | "\n",
300 | "<table border=\"1\" class=\"dataframe\">\n",
301 | " <thead>\n",
302 | " <tr style=\"text-align: right;\"><th></th><th>employee_id</th><th>company_id</th><th>dept</th><th>seniority</th><th>salary</th><th>join_date</th><th>quit_date</th></tr>\n",
303 | " </thead>\n",
304 | " <tbody>\n",
305 | " <tr><th>count</th><td>24702.000000</td><td>24702.000000</td><td>24702</td><td>24702.000000</td><td>24702.000000</td><td>24702</td><td>13510</td></tr>\n",
306 | " <tr><th>unique</th><td>NaN</td><td>NaN</td><td>6</td><td>NaN</td><td>NaN</td><td>995</td><td>664</td></tr>\n",
307 | " <tr><th>top</th><td>NaN</td><td>NaN</td><td>customer_service</td><td>NaN</td><td>NaN</td><td>2012-01-03 00:00:00</td><td>2015-05-08 00:00:00</td></tr>\n",
308 | " <tr><th>freq</th><td>NaN</td><td>NaN</td><td>9180</td><td>NaN</td><td>NaN</td><td>105</td><td>111</td></tr>\n",
309 | " <tr><th>first</th><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>2011-01-24 00:00:00</td><td>2011-10-13 00:00:00</td></tr>\n",
310 | " <tr><th>last</th><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>2015-12-10 00:00:00</td><td>2015-12-09 00:00:00</td></tr>\n",
311 | " <tr><th>mean</th><td>501604.403530</td><td>3.426969</td><td>NaN</td><td>14.127803</td><td>138183.345478</td><td>NaN</td><td>NaN</td></tr>\n",
312 | " <tr><th>std</th><td>288909.026101</td><td>2.700011</td><td>NaN</td><td>8.089520</td><td>76058.184573</td><td>NaN</td><td>NaN</td></tr>\n",
313 | " <tr><th>min</th><td>36.000000</td><td>1.000000</td><td>NaN</td><td>1.000000</td><td>17000.000000</td><td>NaN</td><td>NaN</td></tr>\n",
314 | " <tr><th>25%</th><td>250133.750000</td><td>1.000000</td><td>NaN</td><td>7.000000</td><td>79000.000000</td><td>NaN</td><td>NaN</td></tr>\n",
315 | " <tr><th>50%</th><td>500793.000000</td><td>2.000000</td><td>NaN</td><td>14.000000</td><td>123000.000000</td><td>NaN</td><td>NaN</td></tr>\n",
316 | " <tr><th>75%</th><td>753137.250000</td><td>5.000000</td><td>NaN</td><td>21.000000</td><td>187000.000000</td><td>NaN</td><td>NaN</td></tr>\n",
317 | " <tr><th>max</th><td>999969.000000</td><td>12.000000</td><td>NaN</td><td>99.000000</td><td>408000.000000</td><td>NaN</td><td>NaN</td></tr>\n",
318 | " </tbody>\n",
319 | "</table>"
447 | ],
448 | "text/plain": [
449 | " employee_id company_id dept seniority \\\n",
450 | "count 24702.000000 24702.000000 24702 24702.000000 \n",
451 | "unique NaN NaN 6 NaN \n",
452 | "top NaN NaN customer_service NaN \n",
453 | "freq NaN NaN 9180 NaN \n",
454 | "first NaN NaN NaN NaN \n",
455 | "last NaN NaN NaN NaN \n",
456 | "mean 501604.403530 3.426969 NaN 14.127803 \n",
457 | "std 288909.026101 2.700011 NaN 8.089520 \n",
458 | "min 36.000000 1.000000 NaN 1.000000 \n",
459 | "25% 250133.750000 1.000000 NaN 7.000000 \n",
460 | "50% 500793.000000 2.000000 NaN 14.000000 \n",
461 | "75% 753137.250000 5.000000 NaN 21.000000 \n",
462 | "max 999969.000000 12.000000 NaN 99.000000 \n",
463 | "\n",
464 | " salary join_date quit_date \n",
465 | "count 24702.000000 24702 13510 \n",
466 | "unique NaN 995 664 \n",
467 | "top NaN 2012-01-03 00:00:00 2015-05-08 00:00:00 \n",
468 | "freq NaN 105 111 \n",
469 | "first NaN 2011-01-24 00:00:00 2011-10-13 00:00:00 \n",
470 | "last NaN 2015-12-10 00:00:00 2015-12-09 00:00:00 \n",
471 | "mean 138183.345478 NaN NaN \n",
472 | "std 76058.184573 NaN NaN \n",
473 | "min 17000.000000 NaN NaN \n",
474 | "25% 79000.000000 NaN NaN \n",
475 | "50% 123000.000000 NaN NaN \n",
476 | "75% 187000.000000 NaN NaN \n",
477 | "max 408000.000000 NaN NaN "
478 | ]
479 | },
480 | "execution_count": 819,
481 | "metadata": {},
482 | "output_type": "execute_result"
483 | }
484 | ],
485 | "source": [
486 | "data.describe(include = 'all')"
487 | ]
488 | },
489 | {
490 | "cell_type": "markdown",
491 | "metadata": {},
492 | "source": [
493 | "### Get the number of new hires for each company on each day"
494 | ]
495 | },
496 | {
497 | "cell_type": "code",
498 | "execution_count": 820,
499 | "metadata": {},
500 | "outputs": [],
501 | "source": [
502 | "new_hire_by_date = data.groupby(['company_id','join_date'], as_index = False).employee_id.count()"
503 | ]
504 | },
505 | {
506 | "cell_type": "code",
507 | "execution_count": 821,
508 | "metadata": {},
509 | "outputs": [],
510 | "source": [
511 | "new_hire_by_date.columns = ['company_id','day','new_hire_count']"
512 | ]
513 | },
514 | {
515 | "cell_type": "code",
516 | "execution_count": 822,
517 | "metadata": {},
518 | "outputs": [
519 | {
520 | "data": {
521 | "text/html": [
522 | "\n",
523 | "\n",
536 | "<table border=\"1\" class=\"dataframe\">\n",
537 | " <thead>\n",
538 | " <tr style=\"text-align: right;\"><th></th><th>company_id</th><th>day</th><th>new_hire_count</th></tr>\n",
539 | " </thead>\n",
540 | " <tbody>\n",
541 | " <tr><th>0</th><td>1</td><td>2011-01-24</td><td>25</td></tr>\n",
542 | " <tr><th>1</th><td>1</td><td>2011-01-25</td><td>2</td></tr>\n",
543 | " <tr><th>2</th><td>1</td><td>2011-01-26</td><td>2</td></tr>\n",
544 | " <tr><th>3</th><td>1</td><td>2011-01-31</td><td>30</td></tr>\n",
545 | " <tr><th>4</th><td>1</td><td>2011-02-01</td><td>7</td></tr>\n",
546 | " </tbody>\n",
547 | "</table>"
579 | ],
580 | "text/plain": [
581 | " company_id day new_hire_count\n",
582 | "0 1 2011-01-24 25\n",
583 | "1 1 2011-01-25 2\n",
584 | "2 1 2011-01-26 2\n",
585 | "3 1 2011-01-31 30\n",
586 | "4 1 2011-02-01 7"
587 | ]
588 | },
589 | "execution_count": 822,
590 | "metadata": {},
591 | "output_type": "execute_result"
592 | }
593 | ],
594 | "source": [
595 | "new_hire_by_date.head()"
596 | ]
597 | },
598 | {
599 | "cell_type": "code",
600 | "execution_count": 823,
601 | "metadata": {},
602 | "outputs": [
603 | {
604 | "data": {
605 | "text/html": [
606 | "\n",
607 | "\n",
620 | "<table border=\"1\" class=\"dataframe\">\n",
621 | " <thead>\n",
622 | " <tr style=\"text-align: right;\"><th></th><th>company_id</th><th>day</th><th>new_hire_count</th></tr>\n",
623 | " </thead>\n",
624 | " <tbody>\n",
625 | " <tr><th>5125</th><td>12</td><td>2014-05-19</td><td>2</td></tr>\n",
626 | " <tr><th>5126</th><td>12</td><td>2014-10-13</td><td>1</td></tr>\n",
627 | " <tr><th>5127</th><td>12</td><td>2015-03-23</td><td>1</td></tr>\n",
628 | " <tr><th>5128</th><td>12</td><td>2015-07-06</td><td>1</td></tr>\n",
629 | " <tr><th>5129</th><td>12</td><td>2015-07-27</td><td>1</td></tr>\n",
630 | " </tbody>\n",
631 | "</table>"
663 | ],
664 | "text/plain": [
665 | " company_id day new_hire_count\n",
666 | "5125 12 2014-05-19 2\n",
667 | "5126 12 2014-10-13 1\n",
668 | "5127 12 2015-03-23 1\n",
669 | "5128 12 2015-07-06 1\n",
670 | "5129 12 2015-07-27 1"
671 | ]
672 | },
673 | "execution_count": 823,
674 | "metadata": {},
675 | "output_type": "execute_result"
676 | }
677 | ],
678 | "source": [
679 | "new_hire_by_date.tail()"
680 | ]
681 | },
682 | {
683 | "cell_type": "markdown",
684 | "metadata": {},
685 | "source": [
686 | "### Get the number of employees who quit, per company per day"
687 | ]
688 | },
689 | {
690 | "cell_type": "code",
691 | "execution_count": 824,
692 | "metadata": {},
693 | "outputs": [],
694 | "source": [
695 | "quit_by_date = data.groupby(['company_id','quit_date'],as_index=False).employee_id.count()"
696 | ]
697 | },
698 | {
699 | "cell_type": "code",
700 | "execution_count": 825,
701 | "metadata": {},
702 | "outputs": [],
703 | "source": [
704 | "quit_by_date.columns = ['company_id','day','quit_count']"
705 | ]
706 | },
707 | {
708 | "cell_type": "code",
709 | "execution_count": 826,
710 | "metadata": {},
711 | "outputs": [
712 | {
713 | "data": {
773 | "text/plain": [
774 | " company_id day quit_count\n",
775 | "0 1 2011-10-21 1\n",
776 | "1 1 2011-11-11 1\n",
777 | "2 1 2011-11-22 1\n",
778 | "3 1 2011-11-25 1\n",
779 | "4 1 2011-12-09 1"
780 | ]
781 | },
782 | "execution_count": 826,
783 | "metadata": {},
784 | "output_type": "execute_result"
785 | }
786 | ],
787 | "source": [
788 | "quit_by_date.head()"
789 | ]
790 | },
791 | {
792 | "cell_type": "markdown",
793 | "metadata": {},
794 | "source": [
795 | "### Create a DataFrame holding every date from start to end"
796 | ]
797 | },
798 | {
799 | "cell_type": "code",
800 | "execution_count": 827,
801 | "metadata": {},
802 | "outputs": [],
803 | "source": [
804 | "start_date = '2011-01-23'\n",
805 | "end_date = '2015-12-13'"
806 | ]
807 | },
808 | {
809 | "cell_type": "code",
810 | "execution_count": 828,
811 | "metadata": {},
812 | "outputs": [],
813 | "source": [
814 | "# continuous day dataframe\n",
815 | "d = pd.DataFrame(pd.date_range(start_date, end_date),columns = ['day'])"
816 | ]
817 | },
818 | {
819 | "cell_type": "code",
820 | "execution_count": 829,
821 | "metadata": {},
822 | "outputs": [
823 | {
824 | "data": {
872 | "text/plain": [
873 | " day\n",
874 | "0 2011-01-23\n",
875 | "1 2011-01-24\n",
876 | "2 2011-01-25\n",
877 | "3 2011-01-26\n",
878 | "4 2011-01-27"
879 | ]
880 | },
881 | "execution_count": 829,
882 | "metadata": {},
883 | "output_type": "execute_result"
884 | }
885 | ],
886 | "source": [
887 | "d.head()"
888 | ]
889 | },
890 | {
891 | "cell_type": "markdown",
892 | "metadata": {},
893 | "source": [
894 | "### Get the company list"
895 | ]
896 | },
897 | {
898 | "cell_type": "code",
899 | "execution_count": 830,
900 | "metadata": {},
901 | "outputs": [],
902 | "source": [
903 | "company_list = data.company_id.unique()"
904 | ]
905 | },
906 | {
907 | "cell_type": "code",
908 | "execution_count": 831,
909 | "metadata": {},
910 | "outputs": [
911 | {
912 | "data": {
913 | "text/plain": [
914 | "array([ 7, 4, 2, 9, 1, 6, 10, 5, 3, 8, 11, 12], dtype=int64)"
915 | ]
916 | },
917 | "execution_count": 831,
918 | "metadata": {},
919 | "output_type": "execute_result"
920 | }
921 | ],
922 | "source": [
923 | "company_list"
924 | ]
925 | },
926 | {
927 | "cell_type": "code",
928 | "execution_count": 832,
929 | "metadata": {},
930 | "outputs": [],
931 | "source": [
932 | "company_list.sort()"
933 | ]
934 | },
935 | {
936 | "cell_type": "code",
937 | "execution_count": 833,
938 | "metadata": {},
939 | "outputs": [
940 | {
941 | "data": {
942 | "text/plain": [
943 | "array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], dtype=int64)"
944 | ]
945 | },
946 | "execution_count": 833,
947 | "metadata": {},
948 | "output_type": "execute_result"
949 | }
950 | ],
951 | "source": [
952 | "company_list"
953 | ]
954 | },
955 | {
956 | "cell_type": "code",
957 | "execution_count": 834,
958 | "metadata": {},
959 | "outputs": [],
960 | "source": [
961 | "c = pd.DataFrame(company_list,columns=['company_id'])"
962 | ]
963 | },
964 | {
965 | "cell_type": "markdown",
966 | "metadata": {},
967 | "source": [
968 | "### Cross join the date and company lists"
969 | ]
970 | },
971 | {
972 | "cell_type": "code",
973 | "execution_count": 835,
974 | "metadata": {},
975 | "outputs": [],
976 | "source": [
977 | "# cross join: merge on a constant dummy column, then drop it\n",
978 | "headcount = d.assign(foo=1).merge(c.assign(foo=1), on='foo').drop(columns='foo')"
979 | ]
980 | },
981 | {
982 | "cell_type": "code",
983 | "execution_count": 836,
984 | "metadata": {},
985 | "outputs": [
986 | {
987 | "data": {
1041 | "text/plain": [
1042 | " day company_id\n",
1043 | "21427 2015-12-13 8\n",
1044 | "21428 2015-12-13 9\n",
1045 | "21429 2015-12-13 10\n",
1046 | "21430 2015-12-13 11\n",
1047 | "21431 2015-12-13 12"
1048 | ]
1049 | },
1050 | "execution_count": 836,
1051 | "metadata": {},
1052 | "output_type": "execute_result"
1053 | }
1054 | ],
1055 | "source": [
1056 | "headcount.tail()"
1057 | ]
1058 | },
1059 | {
1060 | "cell_type": "markdown",
1061 | "metadata": {},
1062 | "source": [
1063 | "### Merge with the new-hire and quit data"
1064 | ]
1065 | },
1066 | {
1067 | "cell_type": "code",
1068 | "execution_count": 837,
1069 | "metadata": {},
1070 | "outputs": [],
1071 | "source": [
1072 | "headcount = (headcount.merge(new_hire_by_date, how='left',\\\n",
1073 | " on=['day','company_id']).fillna(0)).merge(quit_by_date, how='left',\\\n",
1074 | " on =['day','company_id']).fillna(0)"
1075 | ]
1076 | },
1077 | {
1078 | "cell_type": "code",
1079 | "execution_count": 838,
1080 | "metadata": {},
1081 | "outputs": [
1082 | {
1083 | "data": {
1282 | "text/plain": [
1283 | " day company_id new_hire_count quit_count\n",
1284 | "0 2011-01-23 1 0.0 0.0\n",
1285 | "1 2011-01-23 2 0.0 0.0\n",
1286 | "2 2011-01-23 3 0.0 0.0\n",
1287 | "3 2011-01-23 4 0.0 0.0\n",
1288 | "4 2011-01-23 5 0.0 0.0\n",
1289 | "5 2011-01-23 6 0.0 0.0\n",
1290 | "6 2011-01-23 7 0.0 0.0\n",
1291 | "7 2011-01-23 8 0.0 0.0\n",
1292 | "8 2011-01-23 9 0.0 0.0\n",
1293 | "9 2011-01-23 10 0.0 0.0\n",
1294 | "10 2011-01-23 11 0.0 0.0\n",
1295 | "11 2011-01-23 12 0.0 0.0\n",
1296 | "12 2011-01-24 1 25.0 0.0\n",
1297 | "13 2011-01-24 2 17.0 0.0\n",
1298 | "14 2011-01-24 3 9.0 0.0\n",
1299 | "15 2011-01-24 4 12.0 0.0\n",
1300 | "16 2011-01-24 5 5.0 0.0\n",
1301 | "17 2011-01-24 6 3.0 0.0\n",
1302 | "18 2011-01-24 7 1.0 0.0\n",
1303 | "19 2011-01-24 8 6.0 0.0\n",
1304 | "20 2011-01-24 9 3.0 0.0\n",
1305 | "21 2011-01-24 10 0.0 0.0\n",
1306 | "22 2011-01-24 11 0.0 0.0\n",
1307 | "23 2011-01-24 12 0.0 0.0"
1308 | ]
1309 | },
1310 | "execution_count": 838,
1311 | "metadata": {},
1312 | "output_type": "execute_result"
1313 | }
1314 | ],
1315 | "source": [
1316 | "headcount.head(24)"
1317 | ]
1318 | },
1319 | {
1320 | "cell_type": "markdown",
1321 | "metadata": {},
1322 | "source": [
1323 | "Calculate net headcount change per day"
1324 | ]
1325 | },
1326 | {
1327 | "cell_type": "code",
1328 | "execution_count": 839,
1329 | "metadata": {},
1330 | "outputs": [],
1331 | "source": [
1332 | "headcount['head_count_net_change']=headcount.new_hire_count - headcount.quit_count"
1333 | ]
1334 | },
1335 | {
1336 | "cell_type": "markdown",
1337 | "metadata": {},
1338 | "source": [
1339 | "### Answer #1: Get the headcount per day per company"
1340 | ]
1341 | },
1342 | {
1343 | "cell_type": "markdown",
1344 | "metadata": {},
1345 | "source": [
1346 | "Get the cumulative sum of the net headcount change per day for each company"
1347 | ]
1348 | },
1349 | {
1350 | "cell_type": "code",
1351 | "execution_count": 840,
1352 | "metadata": {},
1353 | "outputs": [],
1354 | "source": [
1355 | "cumsums = headcount[['company_id','head_count_net_change']].groupby(['company_id']).cumsum()"
1356 | ]
1357 | },
1358 | {
1359 | "cell_type": "code",
1360 | "execution_count": 841,
1361 | "metadata": {},
1362 | "outputs": [],
1363 | "source": [
1364 | "cumsums.columns = ['head_count']"
1365 | ]
1366 | },
1367 | {
1368 | "cell_type": "code",
1369 | "execution_count": 842,
1370 | "metadata": {},
1371 | "outputs": [
1372 | {
1373 | "data": {
1374 | "text/plain": [
1375 | "21432"
1376 | ]
1377 | },
1378 | "execution_count": 842,
1379 | "metadata": {},
1380 | "output_type": "execute_result"
1381 | }
1382 | ],
1383 | "source": [
1384 | "len(cumsums)"
1385 | ]
1386 | },
1387 | {
1388 | "cell_type": "code",
1389 | "execution_count": 843,
1390 | "metadata": {},
1391 | "outputs": [],
1392 | "source": [
1393 | "headcount = pd.concat([headcount,cumsums], axis = 1)"
1394 | ]
1395 | },
1396 | {
1397 | "cell_type": "code",
1398 | "execution_count": 844,
1399 | "metadata": {},
1400 | "outputs": [
1401 | {
1402 | "data": {
1480 | "text/plain": [
1481 | " day company_id new_hire_count quit_count \\\n",
1482 | "21427 2015-12-13 8 0.0 0.0 \n",
1483 | "21428 2015-12-13 9 0.0 0.0 \n",
1484 | "21429 2015-12-13 10 0.0 0.0 \n",
1485 | "21430 2015-12-13 11 0.0 0.0 \n",
1486 | "21431 2015-12-13 12 0.0 0.0 \n",
1487 | "\n",
1488 | " head_count_net_change head_count \n",
1489 | "21427 0.0 468.0 \n",
1490 | "21428 0.0 432.0 \n",
1491 | "21429 0.0 385.0 \n",
1492 | "21430 0.0 4.0 \n",
1493 | "21431 0.0 12.0 "
1494 | ]
1495 | },
1496 | "execution_count": 844,
1497 | "metadata": {},
1498 | "output_type": "execute_result"
1499 | }
1500 | ],
1501 | "source": [
1502 | "headcount.tail()"
1503 | ]
1504 | },
1505 | {
1506 | "cell_type": "markdown",
1507 | "metadata": {},
1508 | "source": [
1509 | "### Check the factors that drive employee churn"
1510 | ]
1511 | },
1512 | {
1513 | "cell_type": "markdown",
1514 | "metadata": {},
1515 | "source": [
1516 | "### Check employment length"
1517 | ]
1518 | },
1519 | {
1520 | "cell_type": "markdown",
1521 | "metadata": {},
1522 | "source": [
1523 | "Get the timedelta between the join date and the quit date"
1524 | ]
1525 | },
1526 | {
1527 | "cell_type": "code",
1528 | "execution_count": 845,
1529 | "metadata": {},
1530 | "outputs": [],
1531 | "source": [
1532 | "data['emp_length'] = data.quit_date-data.join_date"
1533 | ]
1534 | },
1535 | {
1536 | "cell_type": "code",
1537 | "execution_count": 846,
1538 | "metadata": {},
1539 | "outputs": [
1540 | {
1541 | "data": {
1631 | "text/plain": [
1632 | " employee_id company_id dept seniority salary join_date \\\n",
1633 | "0 13021.0 7 customer_service 28 89000.0 2014-03-24 \n",
1634 | "1 825355.0 7 marketing 20 183000.0 2013-04-29 \n",
1635 | "2 927315.0 4 marketing 14 101000.0 2014-10-13 \n",
1636 | "3 662910.0 7 customer_service 20 115000.0 2012-05-14 \n",
1637 | "4 256971.0 2 data_science 23 276000.0 2011-10-17 \n",
1638 | "\n",
1639 | " quit_date emp_length \n",
1640 | "0 2015-10-30 585 days \n",
1641 | "1 2014-04-04 340 days \n",
1642 | "2 NaT NaT \n",
1643 | "3 2013-06-07 389 days \n",
1644 | "4 2014-08-22 1040 days "
1645 | ]
1646 | },
1647 | "execution_count": 846,
1648 | "metadata": {},
1649 | "output_type": "execute_result"
1650 | }
1651 | ],
1652 | "source": [
1653 | "data.head()"
1654 | ]
1655 | },
1656 | {
1657 | "cell_type": "code",
1658 | "execution_count": 847,
1659 | "metadata": {},
1660 | "outputs": [
1661 | {
1662 | "data": {
1663 | "text/plain": [
1664 | "employee_id float64\n",
1665 | "company_id int64\n",
1666 | "dept object\n",
1667 | "seniority int64\n",
1668 | "salary float64\n",
1669 | "join_date datetime64[ns]\n",
1670 | "quit_date datetime64[ns]\n",
1671 | "emp_length timedelta64[ns]\n",
1672 | "dtype: object"
1673 | ]
1674 | },
1675 | "execution_count": 847,
1676 | "metadata": {},
1677 | "output_type": "execute_result"
1678 | }
1679 | ],
1680 | "source": [
1681 | "data.dtypes"
1682 | ]
1683 | },
1684 | {
1685 | "cell_type": "code",
1686 | "execution_count": 848,
1687 | "metadata": {},
1688 | "outputs": [
1689 | {
1690 | "data": {
1691 | "text/plain": [
1692 | "count 13510\n",
1693 | "mean 613 days 11:41:01.643227\n",
1694 | "std 328 days 14:56:33.800149\n",
1695 | "min 102 days 00:00:00\n",
1696 | "25% 361 days 00:00:00\n",
1697 | "50% 417 days 00:00:00\n",
1698 | "75% 781 days 00:00:00\n",
1699 | "max 1726 days 00:00:00\n",
1700 | "Name: emp_length, dtype: object"
1701 | ]
1702 | },
1703 | "execution_count": 848,
1704 | "metadata": {},
1705 | "output_type": "execute_result"
1706 | }
1707 | ],
1708 | "source": [
1709 | "data.emp_length.describe()"
1710 | ]
1711 | },
1712 | {
1713 | "cell_type": "code",
1714 | "execution_count": 862,
1715 | "metadata": {},
1716 | "outputs": [
1717 | {
1718 | "data": {
1719 | "text/plain": [
1720 | "375 days 370\n",
1721 | "361 days 368\n",
1722 | "354 days 367\n",
1723 | "368 days 333\n",
1724 | "382 days 325\n",
1725 | "Name: emp_length, dtype: int64"
1726 | ]
1727 | },
1728 | "execution_count": 862,
1729 | "metadata": {},
1730 | "output_type": "execute_result"
1731 | }
1732 | ],
1733 | "source": [
1734 | "data.emp_length.value_counts().head()"
1735 | ]
1736 | },
1737 | {
1738 | "cell_type": "code",
1739 | "execution_count": 860,
1740 | "metadata": {},
1741 | "outputs": [
1742 | {
1743 | "data": {
1744 | "text/plain": [
1745 | ""
1746 | ]
1747 | },
1748 | "execution_count": 860,
1749 | "metadata": {},
1750 | "output_type": "execute_result"
1751 | },
1752 | {
1753 | "data": {
1754 | "image/png": "\n",
1755 | "text/plain": [
1756 | ""
1757 | ]
1758 | },
1759 | "metadata": {},
1760 | "output_type": "display_data"
1761 | }
1762 | ],
1763 | "source": [
1764 | "# convert the timedelta to a plain number (days here) before plotting\n",
1765 | "((data.emp_length.dropna() / np.timedelta64(1, 'D'))).hist(bins=100)"
1766 | ]
1767 | },
1768 | {
1769 | "cell_type": "markdown",
1770 | "metadata": {},
1771 | "source": [
1772 | "Observation:\n",
1773 | "- Very high churn rate at the beginning of the second year of employment\n",
1774 | "- Relatively high churn rate between 1.5 and 2 years of employment"
1775 | ]
1776 | },
1777 | {
1778 | "cell_type": "markdown",
1779 | "metadata": {},
1780 | "source": [
1781 | "### Dig deeper"
1782 | ]
1783 | },
1784 | {
1785 | "cell_type": "markdown",
1786 | "metadata": {},
1787 | "source": [
1788 | "Since the pattern is so clear, let's dig deeper.\n",
1789 | "Break the employees into two groups: those who quit early and those who did not (employees who have not been at their current\n",
1790 | "company for at least 13 months are removed).\n",
1791 | "Let's define early quitters as those who quit before 13 months."
1792 | ]
1793 | },
1794 | {
1795 | "cell_type": "code",
1796 | "execution_count": 930,
1797 | "metadata": {},
1798 | "outputs": [],
1799 | "source": [
1800 | "# employees who quit before 13 months (approximated as 365 + 30 days)\n",
1801 | "early_quitter = data[data.emp_length/np.timedelta64(1,'D') < 365+30]"
1802 | ]
1803 | },
1804 | {
1805 | "cell_type": "code",
1806 | "execution_count": 929,
1807 | "metadata": {},
1808 | "outputs": [
1809 | {
1810 | "data": {
1900 | "text/plain": [
1901 | " employee_id company_id dept seniority salary join_date \\\n",
1902 | "1 825355.0 7 marketing 20 183000.0 2013-04-29 \n",
1903 | "3 662910.0 7 customer_service 20 115000.0 2012-05-14 \n",
1904 | "12 939058.0 1 marketing 1 48000.0 2012-12-10 \n",
1905 | "14 461248.0 2 sales 20 201000.0 2013-09-16 \n",
1906 | "21 219944.0 6 customer_service 15 98000.0 2012-06-25 \n",
1907 | "\n",
1908 | " quit_date emp_length \n",
1909 | "1 2014-04-04 340 days \n",
1910 | "3 2013-06-07 389 days \n",
1911 | "12 2013-11-15 340 days \n",
1912 | "14 2014-08-22 340 days \n",
1913 | "21 2013-05-31 340 days "
1914 | ]
1915 | },
1916 | "execution_count": 929,
1917 | "metadata": {},
1918 | "output_type": "execute_result"
1919 | }
1920 | ],
1921 | "source": [
1922 | "early_quitter.head()"
1923 | ]
1924 | },
1925 | {
1926 | "cell_type": "code",
1927 | "execution_count": 931,
1928 | "metadata": {},
1929 | "outputs": [],
1930 | "source": [
1931 | "last_day = pd.to_datetime(\"2015-12-13\")"
1932 | ]
1933 | },
1934 | {
1935 | "cell_type": "code",
1936 | "execution_count": 932,
1937 | "metadata": {},
1938 | "outputs": [
1939 | {
1940 | "data": {
1941 | "text/plain": [
1942 | "Timestamp('2015-12-13 00:00:00')"
1943 | ]
1944 | },
1945 | "execution_count": 932,
1946 | "metadata": {},
1947 | "output_type": "execute_result"
1948 | }
1949 | ],
1950 | "source": [
1951 | "last_day"
1952 | ]
1953 | },
1954 | {
1955 | "cell_type": "code",
1956 | "execution_count": 944,
1957 | "metadata": {},
1958 | "outputs": [],
1959 | "source": [
1960 | "# employees who quit after more than 13 months (rows with no quit_date give NaN here, so current employees are dropped)\n",
1961 | "longer_emp = data[((last_day - data.join_date)/np.timedelta64(1,'D') >365+30)\\\n",
1962 | " &(data.emp_length/np.timedelta64(1,'D') > 365+30)]"
1963 | ]
1964 | },
1965 | {
1966 | "cell_type": "code",
1967 | "execution_count": 945,
1968 | "metadata": {},
1969 | "outputs": [
1970 | {
1971 | "data": {
2061 | "text/plain": [
2062 | " employee_id company_id dept seniority salary join_date \\\n",
2063 | "0 13021.0 7 customer_service 28 89000.0 2014-03-24 \n",
2064 | "4 256971.0 2 data_science 23 276000.0 2011-10-17 \n",
2065 | "5 509529.0 4 data_science 14 165000.0 2012-01-30 \n",
2066 | "8 172999.0 9 engineer 7 160000.0 2012-12-10 \n",
2067 | "10 892155.0 6 customer_service 13 72000.0 2012-11-12 \n",
2068 | "\n",
2069 | " quit_date emp_length \n",
2070 | "0 2015-10-30 585 days \n",
2071 | "4 2014-08-22 1040 days \n",
2072 | "5 2013-08-30 578 days \n",
2073 | "8 2015-10-23 1047 days \n",
2074 | "10 2015-02-27 837 days "
2075 | ]
2076 | },
2077 | "execution_count": 945,
2078 | "metadata": {},
2079 | "output_type": "execute_result"
2080 | }
2081 | ],
2082 | "source": [
2083 | "longer_emp.head()"
2084 | ]
2085 | },
2086 | {
2087 | "cell_type": "markdown",
2088 | "metadata": {},
2089 | "source": [
2090 | "A decision tree might be used here to model it."
2091 | ]
2092 | },
2093 | {
2094 | "cell_type": "code",
2095 | "execution_count": 949,
2096 | "metadata": {},
2097 | "outputs": [
2098 | {
2099 | "data": {
2100 | "text/plain": [
2101 | "count 5654.000000\n",
2102 | "mean 131393.880439\n",
2103 | "std 65464.211853\n",
2104 | "min 17000.000000\n",
2105 | "25% 81000.000000\n",
2106 | "50% 122000.000000\n",
2107 | "75% 173000.000000\n",
2108 | "max 372000.000000\n",
2109 | "Name: salary, dtype: float64"
2110 | ]
2111 | },
2112 | "execution_count": 949,
2113 | "metadata": {},
2114 | "output_type": "execute_result"
2115 | }
2116 | ],
2117 | "source": [
2118 | "early_quitter.salary.describe()"
2119 | ]
2120 | },
2121 | {
2122 | "cell_type": "code",
2123 | "execution_count": 950,
2124 | "metadata": {},
2125 | "outputs": [
2126 | {
2127 | "data": {
2128 | "text/plain": [
2129 | "count 7795.000000\n",
2130 | "mean 138768.313021\n",
2131 | "std 75379.904785\n",
2132 | "min 19000.000000\n",
2133 | "25% 80000.000000\n",
2134 | "50% 123000.000000\n",
2135 | "75% 187000.000000\n",
2136 | "max 379000.000000\n",
2137 | "Name: salary, dtype: float64"
2138 | ]
2139 | },
2140 | "execution_count": 950,
2141 | "metadata": {},
2142 | "output_type": "execute_result"
2143 | }
2144 | ],
2145 | "source": [
2146 | "longer_emp.salary.describe()"
2147 | ]
2148 | },
2149 | {
2150 | "cell_type": "markdown",
2151 | "metadata": {},
2152 | "source": [
2153 | "### Check the week of the year in which employees quit"
2154 | ]
2155 | },
2156 | {
2157 | "cell_type": "code",
2158 | "execution_count": 850,
2159 | "metadata": {},
2160 | "outputs": [
2161 | {
2162 | "data": {
2252 | "text/plain": [
2253 | " employee_id company_id dept seniority salary join_date \\\n",
2254 | "0 13021.0 7 customer_service 28 89000.0 2014-03-24 \n",
2255 | "1 825355.0 7 marketing 20 183000.0 2013-04-29 \n",
2256 | "2 927315.0 4 marketing 14 101000.0 2014-10-13 \n",
2257 | "3 662910.0 7 customer_service 20 115000.0 2012-05-14 \n",
2258 | "4 256971.0 2 data_science 23 276000.0 2011-10-17 \n",
2259 | "\n",
2260 | " quit_date emp_length \n",
2261 | "0 2015-10-30 585 days \n",
2262 | "1 2014-04-04 340 days \n",
2263 | "2 NaT NaT \n",
2264 | "3 2013-06-07 389 days \n",
2265 | "4 2014-08-22 1040 days "
2266 | ]
2267 | },
2268 | "execution_count": 850,
2269 | "metadata": {},
2270 | "output_type": "execute_result"
2271 | }
2272 | ],
2273 | "source": [
2274 | "data.head()"
2275 | ]
2276 | },
2277 | {
2278 | "cell_type": "code",
2279 | "execution_count": 868,
2280 | "metadata": {},
2281 | "outputs": [],
2282 | "source": [
2283 | "# get ISO week of the year (Series.dt.week was removed in pandas 2.0)\n",
2284 | "week = data.quit_date.dropna().dt.isocalendar().week.astype(int)"
2285 | ]
2286 | },
2287 | {
2288 | "cell_type": "code",
2289 | "execution_count": 869,
2290 | "metadata": {},
2291 | "outputs": [
2292 | {
2293 | "data": {
2294 | "text/plain": [
2295 | ""
2296 | ]
2297 | },
2298 | "execution_count": 869,
2299 | "metadata": {},
2300 | "output_type": "execute_result"
2301 | },
2302 | {
2303 | "data": {
2304 | "image/png": "\n",
2305 | "text/plain": [
2306 | ""
2307 | ]
2308 | },
2309 | "metadata": {},
2310 | "output_type": "display_data"
2311 | }
2312 | ],
2313 | "source": [
2314 | "week.hist(bins = 100)"
2315 | ]
2316 | },
2317 | {
2318 | "cell_type": "markdown",
2319 | "metadata": {},
2320 | "source": [
2321 | "Observation:\n",
2322 | "No significant pattern"
2323 | ]
2324 | },
2325 | {
2326 | "cell_type": "markdown",
2327 | "metadata": {},
2328 | "source": [
2329 | "### Check if different dept matters"
2330 | ]
2331 | },
2332 | {
2333 | "cell_type": "code",
2334 | "execution_count": 876,
2335 | "metadata": {},
2336 | "outputs": [],
2337 | "source": [
2338 | "# departments of employees who have quit\n",
2339 | "dept_q = data[data['quit_date'].notnull()].dept"
2340 | ]
2341 | },
2342 | {
2343 | "cell_type": "code",
2344 | "execution_count": 879,
2345 | "metadata": {},
2346 | "outputs": [
2347 | {
2348 | "data": {
2349 | "text/plain": [
2350 | "customer_service 0.554902\n",
2351 | "data_science 0.527273\n",
2352 | "design 0.563768\n",
2353 | "engineer 0.512031\n",
2354 | "marketing 0.562993\n",
2355 | "sales 0.570933\n",
2356 | "Name: dept, dtype: float64"
2357 | ]
2358 | },
2359 | "execution_count": 879,
2360 | "metadata": {},
2361 | "output_type": "execute_result"
2362 | }
2363 | ],
2364 | "source": [
2365 | "# fraction of employees who quit in each dept\n",
2366 | "dept_q.value_counts()/data.dept.value_counts()"
2367 | ]
2368 | },
2369 | {
2370 | "cell_type": "markdown",
2371 | "metadata": {},
2372 | "source": [
2373 | "Observation:\n",
2374 | "No significant difference between departments"
2375 | ]
2376 | },
2377 | {
2378 | "cell_type": "markdown",
2379 | "metadata": {},
2380 | "source": [
2381 | "### Seniority"
2382 | ]
2383 | },
2384 | {
2385 | "cell_type": "code",
2386 | "execution_count": 888,
2387 | "metadata": {},
2388 | "outputs": [],
2389 | "source": [
2390 | "s = data[data['quit_date'].notnull()].seniority.value_counts()/data.seniority.value_counts()"
2391 | ]
2392 | },
2393 | {
2394 | "cell_type": "code",
2395 | "execution_count": 889,
2396 | "metadata": {},
2397 | "outputs": [
2398 | {
2399 | "data": {
2400 | "text/plain": [
2401 | "1 0.499419\n",
2402 | "2 0.530786\n",
2403 | "3 0.507378\n",
2404 | "4 0.471508\n",
2405 | "5 0.569444\n",
2406 | "6 0.601053\n",
2407 | "7 0.550647\n",
2408 | "8 0.581349\n",
2409 | "9 0.552966\n",
2410 | "10 0.564186\n",
2411 | "11 0.554113\n",
2412 | "12 0.590081\n",
2413 | "13 0.559284\n",
2414 | "14 0.552174\n",
2415 | "15 0.554336\n",
2416 | "16 0.570513\n",
2417 | "17 0.535274\n",
2418 | "18 0.524083\n",
2419 | "19 0.546154\n",
2420 | "20 0.555687\n",
2421 | "21 0.581841\n",
2422 | "22 0.530105\n",
2423 | "23 0.547771\n",
2424 | "24 0.535666\n",
2425 | "25 0.563636\n",
2426 | "26 0.517291\n",
2427 | "27 0.532710\n",
2428 | "28 0.547009\n",
2429 | "29 0.492013\n",
2430 | "98 1.000000\n",
2431 | "99 1.000000\n",
2432 | "Name: seniority, dtype: float64"
2433 | ]
2434 | },
2435 | "execution_count": 889,
2436 | "metadata": {},
2437 | "output_type": "execute_result"
2438 | }
2439 | ],
2440 | "source": [
2441 | "s"
2442 | ]
2443 | },
2444 | {
2445 | "cell_type": "code",
2446 | "execution_count": 890,
2447 | "metadata": {},
2448 | "outputs": [],
2449 | "source": [
2450 | "s_df = pd.DataFrame(s)"
2451 | ]
2452 | },
2453 | {
2454 | "cell_type": "code",
2455 | "execution_count": 892,
2456 | "metadata": {
2457 | "scrolled": true
2458 | },
2459 | "outputs": [
2460 | {
2461 | "data": {
2462 | "text/plain": [
2463 | ""
2464 | ]
2465 | },
2466 | "execution_count": 892,
2467 | "metadata": {},
2468 | "output_type": "execute_result"
2469 | },
2470 | {
2471 | "data": {
2472 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXcAAAD/CAYAAAAKVJb/AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAFo9JREFUeJzt3X20VfV95/H3V0ABwcTCzYMiuawWGhGf6i1JYxKZqBGiIpOYWZJkSFxJWJmp0tTgqEtHjdOH1Jmpba0xNa0mcZKg0Y6wxqsmqZJqfQKMQRFZpZToHcyEosUkHSs4v/ljb8xhc+4951zO4dz78/1aay/2w/f89u/se/bn7L3PPodIKSFJystB3e6AJKn9DHdJypDhLkkZMtwlKUOGuyRlyHCXpAwZ7pKUIcNdkjJkuEtShgx3ScrQ2G6teOrUqam3t7dbq5ekUWndunX/lFLqaVTXtXDv7e1l7dq13Vq9JI1KEfHjZuq8LCNJGTLcJSlDhrskZahr19zr2bVrFwMDA7zyyivd7sqoMn78eKZNm8a4ceO63RVJI0TDcI+Im4GzgJ+mlObUWR7AnwIfAv4F+FRK6YnhdGZgYIDJkyfT29tL0awaSSmxY8cOBgYGmDFjRre7I2mEaOayzNeA+UMsXwDMLIelwI3D7cwrr7zClClTDPYWRARTpkzxbEfSXhqGe0rpb4EXhyg5B/hGKjwKvDki3j7cDhnsrXObSapqxweqRwLP10wPlPPesNauXcuyZcuG/ZjVq1fz8MMPd6Jrkt4g2vGBar3Dxrr/63ZELKW4dMP06dMbNtx76d371bGqrV86s63tDaavr4++vr6m63fv3r3XY1avXs2kSZN4z3ve06kuSjqA6mXZYHnUSu1Q2nHkPgAcVTM9DdhWrzCldFNKqS+l1NfT0/Dbs13xi1/8gjPPPJPjjz+eOXPmcNttt7Fu3TpOOeUUTjrpJM444wxeeOEFAObNm8cll1zC3LlzmTVrFg8++CBQhPNZZ50FwIsvvsiiRYs47rjjePe738369esBuPrqq1m6dCkf/OAHWbJkyeuP2bp1K1/5yle47rrrOOGEE3jwwQeZMWMGu3btAuDll1+mt7f39WlJqqcdR+6rgAsiYgXwLmBnSumFNrTbFffeey9HHHEEd99dvHvu3LmTBQsWsHLlSnp6erjtttu4/PLLufnmm4HiqPvxxx+nv7+fL37xi3z/+9/fq72rrrqKE088kbvuuov777+fJUuW8OSTTwKwbt06HnroISZMmMDq1auB4mcZPve5zzFp0iSWL18OFG8id999N4sWLWLFihV85CMf8bZHSUNq5lbIbwPzgKkRMQBcBYwDSCl9BeinuA1yM8WtkOd3qrMHwrHHHsvy5cu55JJLOOusszj88MN5+umnOf300wF47bXXePvbf/l58Yc//GEATjrpJLZu3bpPew899BB33nknAB/4wAfYsWMHO3fuBGDhwoVMmDChYZ8+85nPcO2117Jo0SJuueUWvvrVr+7v05SUuYbhnlJa3GB5An67bT3qslmzZrFu3Tr6+/u57LLLOP300znmmGN45JFH6tYfcsghAIwZM4bdu3fvs7zYPHvbc3fLoYce2lSfTj75ZLZu3coPfvADXnvtNebM2efrBpK0F39+oGLbtm1MnDiRT3ziEyxfvpzHHnuM7du3vx7uu3btYsOGDU239/73v59vfvObQHEtfurUqRx22GFDPmby5Mn87Gc/22vekiVLWLx4MeefP6pPjCQdICPq5wdGgqeeeoqLL76Ygw46iHHjxnHjjTcyduxYli1bxs6dO9m9ezef//znOeaYY5pq7+qrr+b888/nuOOOY+LEiXz9619v+Jizzz6bc889l5UrV3L99dfzvve9j49//ONcccUVLF485ImUJAEQ9S4bHAh9fX2p+nvuGzdu5Oijj+5Kf0a6O+64g5UrV3LrrbfWXe62k0audt4KGRHrUkoN77X2yH0UuPDCC7nnnnvo7+/vdlckjRKG+yhw/fXXd7sLkkYZP1CVpAyNuHDv1mcAo5n
bTFLViAr38ePHs2PHDsOqBXt+z338+PHd7oqkEWREXXOfNm0aAwMDbN++vdtdGVX2/E9MkrTHiAr3cePG+b8JSVIbjKjLMpKk9jDcJSlDhrskZchwl6QMGe6SlCHDXZIyZLhLUoYMd0nKkOEuSRky3CUpQ4a7JGXIcJekDBnukpQhw12SMmS4S1KGDHdJypDhLkkZMtwlKUOGuyRlyHCXpAwZ7pKUIcNdkjLUVLhHxPyI2BQRmyPi0jrLp0fEAxHxw4hYHxEfan9XJUnNahjuETEGuAFYAMwGFkfE7ErZFcDtKaUTgfOAL7e7o5Kk5jVz5D4X2JxS2pJSehVYAZxTqUnAYeX4m4Bt7euiJKlVY5uoORJ4vmZ6AHhXpeZq4LsRcSFwKHBaW3onSRqWZo7co868VJleDHwtpTQN+BBwa0Ts03ZELI2ItRGxdvv27a33VpLUlGbCfQA4qmZ6Gvtedvk0cDtASukRYDwwtdpQSummlFJfSqmvp6dneD2WJDXUTLivAWZGxIyIOJjiA9NVlZrngFMBIuJoinD30FySuqRhuKeUdgMXAPcBGynuitkQEddExMKy7AvAZyPiR8C3gU+llKqXbiRJB0gzH6iSUuoH+ivzrqwZfwY4ub1dkyQNl99QlaQMGe6SlCHDXZIyZLhLUoYMd0nKkOEuSRky3CUpQ4a7JGXIcJekDBnukpQhw12SMmS4S1KGDHdJypDhLkkZMtwlKUOGuyRlyHCXpAwZ7pKUIcNdkjJkuEtShgx3ScqQ4S5JGTLcJSlDhrskZchwl6QMGe6SlCHDXZIyZLhLUoYMd0nKkOEuSRky3CUpQ4a7JGXIcJekDDUV7hExPyI2RcTmiLh0kJp/FxHPRMSGiPhWe7spSWrF2EYFETEGuAE4HRgA1kTEqpTSMzU1M4HLgJNTSi9FxFs61WFJUmPNHLnPBTanlLaklF4FVgDnVGo+C9yQUnoJIKX00/Z2U5LUimbC/Ujg+ZrpgXJerVnArIj4u4h4NCLmt6uDkqTWNbwsA0SdealOOzOBecA04MGImJNS+ue9GopYCiwFmD59esudHW16L717n3lbv3RmF3oi6Y2mmSP3AeComulpwLY6NStTSrtSSv8IbKII+72klG5KKfWllPp6enqG22dJUgPNHLmvAWZGxAzgfwPnAR+r1NwFLAa+FhFTKS7TbGlnRzU6efYidUfDI/eU0m7gAuA+YCNwe0ppQ0RcExELy7L7gB0R8QzwAHBxSmlHpzotSRpaM0fupJT6gf7KvCtrxhNwUTmog+odCYNHwxqcZ09vTH5DVZIy1NSRu37Jo6B8+bdVTgx3AQabRgZfh+1juGtE8LMEqb0Md7VsNB1djaa+tiLX56X2MdxHiE7srLkeDef6vKR2MtylEWI0vWmNpr62Iqfn5a2QkpQhj9ylYfCad/NyOhoeTUZduLtTSZ2T6/6V6/MaipdlJClDo+7IXRpNvCShbjHcJY1Kb8RLLa3wsowkZSjbI3dPhyW9kXnkLkkZGhFH7l47k6T2GhHhLkmjzUi/9OtlGUnKkOEuSRky3CUpQ15zZ+RfO5OkVnnkLkkZMtwlKUOGuyRlyHCXpAwZ7pKUIcNdkjJkuEtShgx3ScqQ4S5JGTLcJSlDTYV7RMyPiE0RsTkiLh2i7tyISBHR174uSpJa1TDcI2IMcAOwAJgNLI6I2XXqJgPLgMfa3UlJUmuaOXKfC2xOKW1JKb0KrADOqVP3X4BrgVfa2D9J0jA0E+5HAs/XTA+U814XEScCR6WU/lcb+yZJGqZmwj3qzEuvL4w4CLgO+ELDhiKWRsTaiFi7ffv25nspSWpJM+E+ABxVMz0N2FYzPRmYA6yOiK3Au4FV9T5UTSndlFLqSyn19fT0DL/XkqQhNRPua4CZETEjIg4GzgNW7VmYUtqZUpqaUupNKfUCjwILU0prO9JjSVJDDcM9pbQbuAC4D9gI3J5S2hAR10TEwk53UJLUuqb+m72UUj/
QX5l35SC18/a/W5Kk/eE3VCUpQ4a7JGXIcJekDBnukpQhw12SMmS4S1KGDHdJypDhLkkZMtwlKUOGuyRlyHCXpAwZ7pKUIcNdkjJkuEtShgx3ScqQ4S5JGTLcJSlDhrskZchwl6QMGe6SlCHDXZIyZLhLUoYMd0nKkOEuSRky3CUpQ4a7JGXIcJekDBnukpQhw12SMmS4S1KGDHdJypDhLkkZMtwlKUNNhXtEzI+ITRGxOSIurbP8ooh4JiLWR8TfRMQ72t9VSVKzGoZ7RIwBbgAWALOBxRExu1L2Q6AvpXQccAdwbbs7KklqXjNH7nOBzSmlLSmlV4EVwDm1BSmlB1JK/1JOPgpMa283JUmtaCbcjwSer5keKOcN5tPAPfvTKUnS/hnbRE3UmZfqFkZ8AugDThlk+VJgKcD06dOb7KIkqVXNHLkPAEfVTE8DtlWLIuI04HJgYUrpX+s1lFK6KaXUl1Lq6+npGU5/JUlNaCbc1wAzI2JGRBwMnAesqi2IiBOBv6AI9p+2v5uSpFY0DPeU0m7gAuA+YCNwe0ppQ0RcExELy7L/CkwCvhMRT0bEqkGakyQdAM1ccyel1A/0V+ZdWTN+Wpv7JUnaD35DVZIyZLhLUoYMd0nKkOEuSRky3CUpQ4a7JGXIcJekDBnukpQhw12SMmS4S1KGDHdJypDhLkkZMtwlKUOGuyRlyHCXpAwZ7pKUIcNdkjJkuEtShgx3ScqQ4S5JGTLcJSlDhrskZchwl6QMGe6SlCHDXZIyZLhLUoYMd0nKkOEuSRky3CUpQ4a7JGXIcJekDBnukpShpsI9IuZHxKaI2BwRl9ZZfkhE3FYufywietvdUUlS8xqGe0SMAW4AFgCzgcURMbtS9mngpZTSrwHXAX/U7o5KkprXzJH7XGBzSmlLSulVYAVwTqXmHODr5fgdwKkREe3rpiSpFc2E+5HA8zXTA+W8ujUppd3ATmBKOzooSWpdpJSGLoj4KHBGSukz5fS/B+amlC6sqdlQ1gyU0/9Q1uyotLUUWFpO/jqwqbK6qcA/Ndn30VTb7fV3qrbb6+9UbbfX36nabq+/U7XdXn+nagere0dKqafho1NKQw7AbwH31UxfBlxWqbkP+K1yfGzZoWjUdp11rc2xttvr93n5vEbC+n1enXte9YZmLsusAWZGxIyIOBg4D1hVqVkFfLIcPxe4P5W9kyQdeGMbFaSUdkfEBRRH52OAm1NKGyLiGop3llXAXwG3RsRm4EWKNwBJUpc0DHeAlFI/0F+Zd2XN+CvAR9vQn5syre32+jtV2+31d6q22+vvVG2319+p2m6vv1O1rbS5j4YfqEqSRh9/fkCSMmS4S1KGsgz3iJgbEb9Zjs+OiIsi4kNNPO4bne/d8ETEwRGxJCJOK6c/FhF/HhG/HRHjut0/SSPLqLnmHhHvpPgm7GMppZ/XzJ+fUrq3Zvoqit/BGQt8D3gXsBo4jeJ+/d8v66q3cwbwb4D7AVJKC4foy3spfpbh6ZTSdyvL3gVsTCm9HBETgEuB3wCeAf4gpbSzpnYZ8D9TSrXfAB5snd8sn9NE4J+BScBfA6dS/B0/Wan/VeDfAkcBu4G/B75du37pQIuIt6SUftrmNqekyhcmReMvMXVrAM6vGV9G8W3Wu4CtwDk1y56oPO4pils2JwIvA4eV8ycA62sfB/wPYB5wSvnvC+X4KZU2H68Z/yzwJHAV8HfApZXaDcDYcvwm4E+A95b1f12p3QlsAx4E/iPQM8T2WF/+Oxb4P8CYcjpqn1fN9voecAXwMPBl4Pcp3mDmdftv2+bXyVs61O6Ubj+3On16E/Al4FlgRzlsLOe9uYV27qlMHwb8IXAr8LHKsi9Xpt8G3EjxY4JTgKvLfe524O2V2l+pDFPK/fdw4Fdq6uZXnuNfAeuBbwFvrbT5JWBqOd4HbAE2Az+us98+Ue4Dv9rENukDHigz4ahy/9lJ8T2fEyu1k4Bryn19J7AdeBT4VKfbbOn10u0
X7BAb+7ma8aeASeV4L7AW+J1y+oeVx/2w3ng5/WTN+EHA75Yb/IRy3pZB+lLb5hrKEAYOBZ6q1G6sfXENtv497Zb9+GD5gt4O3EvxhbDJldqngYPLHeNne3YOYHztOmu2157wnwisLsen19kmbQ8MuhwWZW3bA4Puh8V9wCXA2yrb7xLge5Xa3xhkOAl4oVJ7Z7kNFlF8IfFO4JBBXsP3AhdSnJGuL9c9vZy3slL7/4B/rAy7yn+31G7XmvG/BH4PeAfF/nlX9bVdM/4A8Jvl+Cwq3+gs1/PfgOeAx8v2jhjk7/U4xRn/YorfyTq3nH8q8EildiXwKWAacBHwn4GZFD+e+AedbLOVodsBvn6Q4SngX2vqnqmzM9wL/DH7BuZjwMRy/KDKDv5EnT5MA74D/Dk1byiVmh9RBMiUOi+galh+h/KsA7gF6Kt58a2p1FZ3nHHAQuDbwPbKst+lCJ0fUxyZ/w3w1XJbXVXdAfjlznk4sK5m2dOV2rYHBl0Oi+p6aFNg0P2w2DTEvrSpMv0axSXGB+oM/7dSW92HLqc4K51S5+9Ve6DzXIN2lpd/32Nrt2Gdvj8xRBvV6Wf55Znxo4P9Heu0+z6KM9iflNtgaQvPq7qP/6gyvab89yDg2U622crQlpAe7kBxeeGEcqerHXqBbTV191MeXdfMGwt8A3itMv+QQdY1tfZFVmf5mQzyDklxdLiFMkQog5DiTab64nsT8DXgHyjeaHaVj/kBcPxQf+DKsgl15h1BGSbAmyl+6mFunbrfoQjKm8qdYc+bTQ/wt5XatgdGnW1yQMOinN/2wKD7YfFd4D9Rc+YBvJXiDfH7lTaeBmYOsm2er0xvpOZAqJz3SYoziR8P1lfg9wbbVjXz9hw8/TEwmTpnxxS/NHsR8IVyX4maZdVLjheW2+EDFGd5fwK8H/gicOtgr4GaeWOA+cAtlfmPUJxBf5TiAGpROf8U9j2gexh4bzl+Nnv/9tamNrS5cLA2WxlafkA7B4pT5fcOsuxblRfI2wapO7mL/Z8IzBhk2WTgeIqj2rcOUjOrg307hiL839mgru2B0e2wKOvaHhgjICwOp/iPcJ4FXqL4qY+N5bzqZalzgV8fZNssqkxfC5xWp24+8PeVeddQXiKtzP814I4hXmdnU1xq+kmdZVdVhj2XPd8GfKNO/TzgNorLmk9RfHt+KTCuUreihf3leIqz2HuAdwJ/SnHjwgbgPXVqHy+XP7RnO1McPC1r0OZLZZsnN2hzVr02Wxk6EiwOo2eoBMaLlcA4vFLbVGB0OyzK5e0KjLE1NZ0Ki+Oa3bHLtk6rbjNqPmOo1J66n7UL2tUuxU0Nc+rVtqmv+1t7dIu1Df8O5ev44vLv/9+B/wC8aZDXzJ7aPytrPzdYbVOvweE+0CH/gZo7ltpV2842K2HR9r5263kNVktrd421Unthh2qb6kMn2hxmu8+2s7as+y5N3LXWSm3Tr53hPMjhjTEwyAfM+1PbiTZHQu2BWD+t3TU2amq7vf4OP69m71prurbZoalfhVS+ImL9YIsorr23XNuJNkdCbbfXT7Hz/xwgpbQ1IuYBd0TEO8paRmltt9ffydqxFDciHELxGREppecG+VZ5K7UNGe56K3AGxQc9tYLi9HA4tZ1ocyTUdnv9P4mIE1JKTwKklH4eEWcBNwPHVh47mmq7vf5O1f4lsCYiHqX4gP6PACKih+KzLYZZ25zhHO475DPQ5B1LrdR2os2RUDsC1t/0XWOjqbbb6+9wbVN3rbVa28wwan5bRpLUvCx/FVKS3ugMd0nKkOEuSRky3CUpQ4a7JGXo/wN76dTXKFbruQAAAABJRU5ErkJggg==\n",
2473 | "text/plain": [
2474 | ""
2475 | ]
2476 | },
2477 | "metadata": {},
2478 | "output_type": "display_data"
2479 | }
2480 | ],
2481 | "source": [
2482 | "s_df.plot(kind = 'bar')"
2483 | ]
2484 | },
2485 | {
2486 | "cell_type": "markdown",
2487 | "metadata": {},
2488 | "source": [
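2489 | "Observation:\n",
2490 | "No significant difference across seniority levels (the values 98 and 99, with a quit rate of 1.0, are probably data errors)"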
2491 | ]
2492 | },
2493 | {
2494 | "cell_type": "markdown",
2495 | "metadata": {},
2496 | "source": [
2497 | "## Conclusions:"
2498 | ]
2499 | },
2500 | {
2501 | "cell_type": "markdown",
2502 | "metadata": {},
2503 | "source": [
2504 | "1. Employees tend to quit around their work anniversaries, and the churn rate is extremely high at the first and second year.\n",
2505 | "2. Salary is an important factor (needs deeper analysis). Employees with low and high salaries are less likely to quit, probably because employees with high\n",
2506 | "salaries are happy there and employees with low salaries are not that marketable, so they have a\n",
2507 | "hard time finding a new job."
2509 | }
2510 | ],
2511 | "metadata": {
2512 | "kernelspec": {
2513 | "display_name": "Python 3",
2514 | "language": "python",
2515 | "name": "python3"
2516 | },
2517 | "language_info": {
2518 | "codemirror_mode": {
2519 | "name": "ipython",
2520 | "version": 3
2521 | },
2522 | "file_extension": ".py",
2523 | "mimetype": "text/x-python",
2524 | "name": "python",
2525 | "nbconvert_exporter": "python",
2526 | "pygments_lexer": "ipython3",
2527 | "version": "3.5.4"
2528 | }
2529 | },
2530 | "nbformat": 4,
2531 | "nbformat_minor": 2
2532 | }
2533 |
--------------------------------------------------------------------------------
/Machine_Learning_Algorithms_Python.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Machine Learning Algorithms"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "3 Types of ML Algorithms\n",
15 | "1. Supervised Learning - consists of a target variable and predictors. Tasks: Regression and Classification. Models: Linear Regression, KNN, Decision Tree, Random Forest, Logistic Regression\n",
16 | "2. Unsupervised Learning - there is no outcome variable to predict. Task: Clustering, e.g. segmenting customers or pictures. Models: Apriori, K-means\n",
17 | "3. Reinforcement Learning - the machine is trained to make specific decisions. Example: Markov Decision Process"
18 | ]
19 | },
20 | {
21 | "cell_type": "markdown",
22 | "metadata": {},
23 | "source": [
24 | "- Training data: data used to fit the model\n",
25 | "- Test data: data held out to evaluate the fitted model"
26 | ]
27 | },
28 | {
29 | "cell_type": "markdown",
30 | "metadata": {},
31 | "source": [
32 | "Split the data into training and test sets"
33 | ]
34 | },
35 | {
36 | "cell_type": "code",
37 | "execution_count": 3,
38 | "metadata": {},
39 | "outputs": [],
40 | "source": [
41 | "from sklearn.model_selection import train_test_split\n",
42 | "# X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)"
43 | ]
44 | },
45 | {
46 | "cell_type": "markdown",
47 | "metadata": {},
48 | "source": [
49 | "## Linear Regression"
50 | ]
51 | },
52 | {
53 | "cell_type": "markdown",
54 | "metadata": {},
55 | "source": [
56 | "Minimize the sum of squared differences between the observed values and the estimated values"
57 | ]
58 | },
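The least-squares idea above can be sketched without any library for the one-feature case. This is a minimal illustration, not what `LinearRegression` does internally; the helper names `fit_line` and `sum_squared_error` and the toy data are assumptions made for the example.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) that minimize the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form solution for simple (one-feature) linear regression.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_squared_error(xs, ys, slope, intercept):
    """Sum of squared differences between observed and estimated values."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Toy data generated from y = 2x + 1 (hypothetical, for illustration only).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
sse = sum_squared_error(xs, ys, slope, intercept)
print(slope, intercept, sse)
```

Any other line through the same points would give a larger `sse`, which is the sense in which the fit is "least squares".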
59 | {
60 | "cell_type": "code",
61 | "execution_count": 1,
62 | "metadata": {},
63 | "outputs": [],
64 | "source": [
65 | "from sklearn import linear_model"
66 | ]
67 | },
68 | {
69 | "cell_type": "markdown",
70 | "metadata": {},
71 | "source": [
72 | "Identify the feature and response variable(s); values must be numeric and stored as numpy arrays.\n",
73 | "Load the train and test data sets."
74 | ]
75 | },
76 | {
77 | "cell_type": "code",
78 | "execution_count": 3,
79 | "metadata": {},
80 | "outputs": [],
81 | "source": [
82 | "# instantiate a model--make an instance\n",
83 | "lr = linear_model.LinearRegression()"
84 | ]
85 | },
86 | {
87 | "cell_type": "code",
88 | "execution_count": 5,
89 | "metadata": {},
90 | "outputs": [],
91 | "source": [
92 | "# Train the model on the training set and make predictions\n",
93 | "\n",
94 | "# lr.fit(X_train, y_train)\n",
95 | "# y_pred = lr.predict(X_test)"
96 | ]
97 | },
98 | {
99 | "cell_type": "code",
100 | "execution_count": 6,
101 | "metadata": {},
102 | "outputs": [],
103 | "source": [
104 | "# Equation coefficients and intercept (the fitted model above is named lr)\n",
105 | "\n",
106 | "# print('Coefficient: \\n', lr.coef_)\n",
107 | "# print('Intercept: \\n', lr.intercept_)"
108 | ]
109 | },
110 | {
111 | "cell_type": "markdown",
112 | "metadata": {},
113 | "source": [
114 | "## Logistic Regression"
115 | ]
116 | },
117 | {
118 | "cell_type": "markdown",
119 | "metadata": {},
120 | "source": [
121 | "Classification: used to estimate discrete values, typically binary ones like 0/1, yes/no, true/false"
122 | ]
123 | },
124 | {
125 | "cell_type": "markdown",
126 | "metadata": {},
127 | "source": [
128 | "Fits the data to a logistic (sigmoid) function and predicts the probability of the positive class"
129 | ]
130 | },
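How a fitted logistic regression turns features into a probability can be sketched as follows. The coefficient and intercept values here are made up for illustration; in practice they would come from `logreg.coef_` and `logreg.intercept_` after fitting.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(features, coefs, intercept):
    """Probability of the positive class for one sample."""
    z = intercept + sum(c * x for c, x in zip(coefs, features))
    return sigmoid(z)


def predict_label(features, coefs, intercept, threshold=0.5):
    """Turn the probability into a 0/1 label using a cut-off."""
    return 1 if predict_proba(features, coefs, intercept) >= threshold else 0


# Hypothetical fitted parameters for a single-feature model.
coefs, intercept = [1.5], -1.0
print(predict_proba([2.0], coefs, intercept))  # probability for x = 2.0
print(predict_label([2.0], coefs, intercept), predict_label([0.0], coefs, intercept))
```

The 0.5 threshold is the usual default; moving it trades false positives against false negatives.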
131 | {
132 | "cell_type": "code",
133 | "execution_count": 9,
134 | "metadata": {},
135 | "outputs": [],
136 | "source": [
137 | "from sklearn.linear_model import LogisticRegression"
138 | ]
139 | },
140 | {
141 | "cell_type": "code",
142 | "execution_count": 10,
143 | "metadata": {},
144 | "outputs": [],
145 | "source": [
146 | "# instantiate a model--make an instance\n",
147 | "logreg = LogisticRegression()"
148 | ]
149 | },
150 | {
151 | "cell_type": "code",
152 | "execution_count": 11,
153 | "metadata": {},
154 | "outputs": [],
155 | "source": [
156 | "# logreg.fit(X_train, y_train)\n",
157 | "# y_pred = logreg.predict(X_test)"
158 | ]
159 | },
160 | {
161 | "cell_type": "code",
162 | "execution_count": 12,
163 | "metadata": {},
164 | "outputs": [],
165 | "source": [
166 | "#Equation coefficient and Intercept\n",
167 | "# print('Coefficient: \\n', logreg.coef_)\n",
168 | "# print('Intercept: \\n', logreg.intercept_)"
169 | ]
170 | },
171 | {
172 | "cell_type": "markdown",
173 | "metadata": {},
174 | "source": [
175 | "## Decision Tree"
176 | ]
177 | },
178 | {
179 | "cell_type": "markdown",
180 | "metadata": {},
181 | "source": [
182 | "Used for both classification and regression"
183 | ]
184 | },
185 | {
186 | "cell_type": "markdown",
187 | "metadata": {},
188 | "source": [
189 | "Split the population into two or more homogeneous sets. The splits are based on the most significant attributes (independent variables), chosen to make the groups as distinct as possible.\n",
190 | "Tree's depth - how many splits it makes before coming to a prediction"
191 | ]
192 | },
193 | {
194 | "cell_type": "code",
195 | "execution_count": 17,
196 | "metadata": {},
197 | "outputs": [],
198 | "source": [
199 | "from sklearn import tree"
200 | ]
201 | },
202 | {
203 | "cell_type": "code",
204 | "execution_count": 23,
205 | "metadata": {},
206 | "outputs": [],
207 | "source": [
208 | "# For Classification\n",
209 | "# the default split criterion is 'gini'; 'entropy' is another option\n",
210 | "dt = tree.DecisionTreeClassifier(criterion = 'gini')"
211 | ]
212 | },
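The two split criteria named above measure how mixed the labels in a node are. A minimal sketch of both, with hypothetical helper names `gini` and `entropy` (these are not scikit-learn functions):

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy (base 2) of the class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


mixed = [0, 0, 1, 1]   # evenly mixed node
pure = [1, 1, 1]       # homogeneous node
print(gini(mixed), entropy(mixed))
print(gini(pure), entropy(pure))
```

Both measures are zero for a homogeneous node and largest for an evenly mixed one, so a good split is one that lowers the weighted impurity of the child nodes.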
213 | {
214 | "cell_type": "code",
215 | "execution_count": 24,
216 | "metadata": {},
217 | "outputs": [],
218 | "source": [
219 | "# For Regression\n",
220 | "dt = tree.DecisionTreeRegressor()"
221 | ]
222 | },
223 | {
224 | "cell_type": "markdown",
225 | "metadata": {},
226 | "source": [
227 | "## KNN - K-nearest Neighbors"
228 | ]
229 | },
230 | {
231 | "cell_type": "markdown",
232 | "metadata": {},
233 | "source": [
234 | "Used for both classification (more common) and regression"
235 | ]
236 | },
237 | {
238 | "cell_type": "markdown",
239 | "metadata": {},
240 | "source": [
241 | "- KNN is computationally expensive\n",
242 | "- Variables should be normalized, otherwise variables with a larger range can bias the distances\n",
243 | "- Remove outliers and noise before running KNN"
244 | ]
245 | },
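The normalization point above can be sketched with a small library-free KNN classifier. The helpers `min_max_scale` and `knn_predict`, and the age/salary toy data, are assumptions made for the example; `KNeighborsClassifier` has its own implementation.

```python
import math
from collections import Counter


def min_max_scale(rows):
    """Scale every column to [0, 1]; return scaled rows plus the scaling parameters."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    spans = [(max(col) - min(col)) or 1.0 for col in columns]
    scaled = [[(v - lo) / sp for v, lo, sp in zip(row, lows, spans)] for row in rows]
    return scaled, lows, spans


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def knn_predict(train_X, train_y, query, k=3):
    """Majority vote among the k training points closest to the query."""
    nearest = sorted(zip(train_X, train_y), key=lambda pair: euclidean(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical data: [age, salary]. Unscaled, salary would dominate the distance.
rows = [[25, 40000], [30, 42000], [45, 90000], [50, 95000]]
labels = ["junior", "junior", "senior", "senior"]
scaled_rows, lows, spans = min_max_scale(rows)

# The query must be scaled with the same parameters as the training data.
query = [(v - lo) / sp for v, lo, sp in zip([28, 45000], lows, spans)]
prediction = knn_predict(scaled_rows, labels, query, k=3)
print(prediction)
```

Every prediction scans all training points, which is why KNN gets expensive on large data sets.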
246 | {
247 | "cell_type": "code",
248 | "execution_count": 25,
249 | "metadata": {},
250 | "outputs": [],
251 | "source": [
252 | "from sklearn.neighbors import KNeighborsClassifier"
253 | ]
254 | },
255 | {
256 | "cell_type": "code",
257 | "execution_count": 26,
258 | "metadata": {},
259 | "outputs": [],
260 | "source": [
261 | "# default value of n_neighbors is 5\n",
262 | "knn = KNeighborsClassifier(n_neighbors=6)"
263 | ]
264 | },
265 | {
266 | "cell_type": "markdown",
267 | "metadata": {},
268 | "source": [
269 | "## Random Forest"
270 | ]
271 | },
272 | {
273 | "cell_type": "markdown",
274 | "metadata": {},
275 | "source": [
276 | "A Random Forest is a collection (ensemble) of decision trees whose predictions are combined."
277 | ]
278 | },
279 | {
280 | "cell_type": "code",
281 | "execution_count": 46,
282 | "metadata": {},
283 | "outputs": [],
284 | "source": [
285 | "from sklearn.ensemble import RandomForestClassifier"
286 | ]
287 | },
288 | {
289 | "cell_type": "code",
290 | "execution_count": 47,
291 | "metadata": {},
292 | "outputs": [],
293 | "source": [
294 | "rf = RandomForestClassifier()"
295 | ]
296 | },
297 | {
298 | "cell_type": "code",
299 | "execution_count": null,
300 | "metadata": {},
301 | "outputs": [],
302 | "source": [
303 | "from sklearn.ensemble import RandomForestRegressor\n",
304 | "rf = RandomForestRegressor()"
305 | ]
306 | },
307 | {
308 | "cell_type": "code",
309 | "execution_count": 51,
310 | "metadata": {},
311 | "outputs": [],
312 | "source": [
313 | "if 0:\n",
314 | " print ('Lin')"
315 | ]
316 | },
317 | {
318 | "cell_type": "markdown",
319 | "metadata": {},
320 | "source": [
321 | "## K-Means"
322 | ]
323 | },
324 | {
325 | "cell_type": "markdown",
326 | "metadata": {},
327 | "source": [
328 | "Unsupervised learning - Clustering"
329 | ]
330 | },
331 | {
332 | "cell_type": "markdown",
333 | "metadata": {},
334 | "source": [
335 | "How K-means forms clusters:\n",
336 | "1. K-means picks k points, one per cluster, known as centroids.\n",
337 | "2. Each data point is assigned to its closest centroid, which forms k clusters.\n",
338 | "3. The centroid of each cluster is recomputed from its current members. These are the new centroids.\n",
339 | "4. With the new centroids, repeat steps 2 and 3: find the closest new centroid for each data point and reassign it. Repeat until convergence, i.e. until the centroids no longer change."
340 | ]
341 | },
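The four steps above can be sketched in plain Python. This is a simplified illustration (random initial centroids, exact-equality convergence check), not the algorithm used by `sklearn.cluster.KMeans`, which adds smarter initialization; the function names and toy points are assumptions.

```python
import random


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(cluster):
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)              # step 1: pick k starting centroids
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                           # step 2: assign to closest centroid
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        updated = [mean_point(c) if c else centroids[i]   # step 3: recompute centroids
                   for i, c in enumerate(clusters)]
        if updated == centroids:                   # step 4: stop when nothing moves
            break
        centroids = updated
    return centroids, clusters


# Two well-separated groups of hypothetical 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, k=2)
print(sorted(centroids))
```

On less clearly separated data the result depends on the starting centroids, which is one reason library implementations run several initializations.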
342 | {
343 | "cell_type": "markdown",
344 | "metadata": {},
345 | "source": [
346 | "How to determine the value of k:\n",
347 | "Compute the sum of squared distances between each data point and its centroid. As the number of clusters increases, this value keeps decreasing. If you plot it against k, it drops sharply up to some value of k and then flattens out; that point (the elbow) is a good choice for the number of clusters."
348 | ]
349 | },
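The elbow idea can be sketched by computing the within-cluster sum of squares for several values of k. This is a minimal sketch under stated assumptions: `inertia_for_k` is a hypothetical helper that uses a simple deterministic initialization (evenly spaced picks from the sorted points), and the toy points form two obvious groups.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(cluster):
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


def inertia_for_k(points, k, max_iter=100):
    """Run a basic k-means and return the sum of squared distances to the centroids."""
    ordered = sorted(points)
    n = len(ordered)
    # Deterministic start: k evenly spaced points from the sorted data.
    centroids = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        updated = [mean_point(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
inertias = {k: inertia_for_k(points, k) for k in range(1, 5)}
for k, value in inertias.items():
    print(k, round(value, 3))
```

For these points the value falls steeply from k = 1 to k = 2 and only slightly afterwards, so the elbow is at k = 2. With scikit-learn, the same quantity is available as the `inertia_` attribute of a fitted `KMeans`.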
350 | {
351 | "cell_type": "code",
352 | "execution_count": 27,
353 | "metadata": {},
354 | "outputs": [],
355 | "source": [
356 | "from sklearn.cluster import KMeans"
357 | ]
358 | },
359 | {
360 | "cell_type": "code",
361 | "execution_count": 28,
362 | "metadata": {},
363 | "outputs": [],
364 | "source": [
365 | "kmeans = KMeans(n_clusters = 3, random_state = 0)"
366 | ]
367 | },
368 | {
369 | "cell_type": "code",
370 | "execution_count": 4,
371 | "metadata": {},
372 | "outputs": [],
373 | "source": [
374 | "# kmeans.fit(X_train)  # unsupervised: no y is needed\n",
375 | "# kmeans.predict(X_test)"
376 | ]
377 | },
378 | {
379 | "cell_type": "code",
380 | "execution_count": null,
381 | "metadata": {},
382 | "outputs": [],
383 | "source": []
384 | },
385 | {
386 | "cell_type": "markdown",
387 | "metadata": {},
388 | "source": [
389 | "## Model Validation"
390 | ]
391 | },
392 | {
393 | "cell_type": "markdown",
394 | "metadata": {},
395 | "source": [
396 | "- MAE (Mean Absolute Error): the average absolute difference between predicted and actual values"
397 | ]
398 | },
399 | {
400 | "cell_type": "code",
401 | "execution_count": 1,
402 | "metadata": {},
403 | "outputs": [],
404 | "source": [
405 | "from sklearn.metrics import mean_absolute_error\n",
406 | "# mean_absolute_error(y,predicted_y)"
407 | ]
408 | },
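What `mean_absolute_error` computes can be written out by hand. The function name `mae` and the numbers below are assumptions made for the example.

```python
def mae(y_true, y_pred):
    """Average of |actual - predicted| over all samples."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical actual and predicted values.
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]
error = mae(y_true, y_pred)
print(error)  # absolute errors are 0.5, 0.0, 1.5, 1.0, so the mean is 0.75
```

MAE is in the same units as the target, which makes it easy to interpret.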
409 | {
410 | "cell_type": "markdown",
411 | "metadata": {},
412 | "source": [
413 | "## Cross Validation"
414 | ]
415 | },
416 | {
417 | "cell_type": "markdown",
418 | "metadata": {},
419 | "source": [
420 | "Pushing the accuracy score on the training data higher can lead to overfitting. Cross-validation helps the model capture more general relationships."
421 | ]
422 | },
423 | {
424 | "cell_type": "markdown",
425 | "metadata": {},
426 | "source": [
427 | "#### Method: k-fold cross validation"
428 | ]
429 | },
430 | {
431 | "cell_type": "markdown",
432 | "metadata": {},
433 | "source": [
434 | "Steps:\n",
435 | "1. Randomly split the data into k folds.\n",
436 | "2. For each of the k folds, build and train the model on the other k-1 folds, and test it on the held-out fold.\n",
437 | "3. Record the error/accuracy.\n",
438 | "4. Repeat until each of the k folds has served as the test set.\n",
439 | "5. The average of the k recorded errors is called the cross-validation error and serves as the performance metric for the model."
440 | ]
441 | },
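The steps above can be sketched without scikit-learn. This is a minimal illustration: the helpers `k_fold_indices` and `cross_val_error` are hypothetical, and the "model" is a trivial predictor that always returns the training mean, chosen only to keep the example short.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Step 1: shuffle the sample indices and deal them into k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_val_error(xs, ys, k, fit, predict, error):
    folds = k_fold_indices(len(xs), k)
    fold_errors = []
    for test_idx in folds:                          # steps 2 to 4: each fold is the test set once
        held_out = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        preds = [predict(model, xs[i]) for i in test_idx]
        fold_errors.append(error([ys[i] for i in test_idx], preds))
    return sum(fold_errors) / len(fold_errors), fold_errors   # step 5: average


# Trivial stand-in model: always predict the mean of the training targets.
def fit_mean(train_x, train_y):
    return sum(train_y) / len(train_y)


def predict_mean(model, x):
    return model


def mean_abs_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


xs = list(range(10))
ys = [2.0 * x + 1.0 for x in xs]
folds = k_fold_indices(len(xs), 5)
cv_error, fold_errors = cross_val_error(xs, ys, 5, fit_mean, predict_mean, mean_abs_error)
print(cv_error)
```

With scikit-learn, `KFold` plus `cross_val_score` cover the same loop; the sketch only makes the fold bookkeeping visible.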
442 | {
443 | "cell_type": "markdown",
444 | "metadata": {},
445 | "source": [
446 | "How to choose the value of k: k = 10 is a common choice.\n",
447 | "A lower value of k is more biased.\n",
448 | "A larger value of k is less biased, but can suffer from higher variance."
449 | ]
450 | },
451 | {
452 | "cell_type": "code",
453 | "execution_count": 2,
454 | "metadata": {},
455 | "outputs": [],
456 | "source": [
457 | "from sklearn.model_selection import KFold\n",
458 | "from sklearn.model_selection import cross_val_score"
459 | ]
460 | },
461 | {
462 | "cell_type": "code",
463 | "execution_count": 5,
464 | "metadata": {},
465 | "outputs": [],
466 | "source": [
467 | "kf = KFold(n_splits=10, shuffle=True, random_state=0)  # random_state only takes effect with shuffle=True\n",
468 | "modelCV = RandomForestClassifier()\n",
469 | "scoring = \"accuracy\"\n",
470 | "# results = cross_val_score(modelCV, X_train, y_train, cv=kf,scoring = scoring)\n",
471 | "# print ('10-fold cross validation average accuracy: {}'.format(results.mean()))"
472 | ]
473 | },
474 | {
475 | "cell_type": "code",
476 | "execution_count": null,
477 | "metadata": {},
478 | "outputs": [],
479 | "source": []
480 | },
481 | {
482 | "cell_type": "code",
483 | "execution_count": null,
484 | "metadata": {},
485 | "outputs": [],
486 | "source": []
487 | },
488 | {
489 | "cell_type": "code",
490 | "execution_count": null,
491 | "metadata": {},
492 | "outputs": [],
493 | "source": []
494 | },
495 | {
496 | "cell_type": "code",
497 | "execution_count": null,
498 | "metadata": {},
499 | "outputs": [],
500 | "source": []
501 | },
502 | {
503 | "cell_type": "markdown",
504 | "metadata": {},
505 | "source": [
506 | "## Underfitting, Overfitting and Model Optimization"
507 | ]
508 | },
509 | {
510 | "cell_type": "markdown",
511 | "metadata": {},
512 | "source": [
513 | "Now that we have a way to measure model accuracy, we can experiment with alternative models (for example, different options of the same model) and see which gives the best predictions."
514 | ]
515 | },
516 | {
517 | "cell_type": "markdown",
518 | "metadata": {},
519 | "source": [
520 | "Overfitting: the model matches the training data almost perfectly, but does poorly on validation and other new data.\n",
521 | "Underfitting: the model fails to capture important distinctions and patterns in the data.\n",
522 | "E.g. for a decision tree, the larger max_leaf_nodes is, the further the model moves from underfitting toward overfitting."
523 | ]
524 | }
525 | ],
526 | "metadata": {
527 | "kernelspec": {
528 | "display_name": "Python 3",
529 | "language": "python",
530 | "name": "python3"
531 | },
532 | "language_info": {
533 | "codemirror_mode": {
534 | "name": "ipython",
535 | "version": 3
536 | },
537 | "file_extension": ".py",
538 | "mimetype": "text/x-python",
539 | "name": "python",
540 | "nbconvert_exporter": "python",
541 | "pygments_lexer": "ipython3",
542 | "version": "3.6.4"
543 | }
544 | },
545 | "nbformat": 4,
546 | "nbformat_minor": 2
547 | }
548 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Data-Analysis-Machine-Learning-with-Python
2 | Data Analysis and Machine Learning with Python to Solve Business Problems
3 |
--------------------------------------------------------------------------------
/raw-data/readme:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------