├── images ├── no_keyword.png ├── results_2.png ├── output_108_0.png ├── output_110_1.png ├── output_118_0.png ├── output_63_0.png ├── output_68_0.png ├── output_70_0.png ├── output_72_0.png ├── output_77_1.png ├── output_79_1.png ├── output_92_0.png ├── output_94_0.png ├── simulated_words.png ├── location_features.png ├── feature_importance_1.png ├── feature_importance_2.png ├── test_set_performance.png └── location_features_example.png ├── README.md └── insurance_card_text_classification.md /images/no_keyword.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hatemr/Classifying-Insurance-Card-Text/master/images/no_keyword.png -------------------------------------------------------------------------------- /images/results_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hatemr/Classifying-Insurance-Card-Text/master/images/results_2.png -------------------------------------------------------------------------------- /images/output_108_0.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hatemr/Classifying-Insurance-Card-Text/master/images/output_108_0.png -------------------------------------------------------------------------------- /images/output_110_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hatemr/Classifying-Insurance-Card-Text/master/images/output_110_1.png -------------------------------------------------------------------------------- /images/output_118_0.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hatemr/Classifying-Insurance-Card-Text/master/images/output_118_0.png -------------------------------------------------------------------------------- /images/output_63_0.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hatemr/Classifying-Insurance-Card-Text/master/images/output_63_0.png -------------------------------------------------------------------------------- /images/output_68_0.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hatemr/Classifying-Insurance-Card-Text/master/images/output_68_0.png -------------------------------------------------------------------------------- /images/output_70_0.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hatemr/Classifying-Insurance-Card-Text/master/images/output_70_0.png -------------------------------------------------------------------------------- /images/output_72_0.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hatemr/Classifying-Insurance-Card-Text/master/images/output_72_0.png -------------------------------------------------------------------------------- /images/output_77_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hatemr/Classifying-Insurance-Card-Text/master/images/output_77_1.png -------------------------------------------------------------------------------- /images/output_79_1.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/hatemr/Classifying-Insurance-Card-Text/master/images/output_79_1.png
--------------------------------------------------------------------------------
/images/output_92_0.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hatemr/Classifying-Insurance-Card-Text/master/images/output_92_0.png
--------------------------------------------------------------------------------
/images/output_94_0.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hatemr/Classifying-Insurance-Card-Text/master/images/output_94_0.png
--------------------------------------------------------------------------------
/images/simulated_words.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hatemr/Classifying-Insurance-Card-Text/master/images/simulated_words.png
--------------------------------------------------------------------------------
/images/location_features.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hatemr/Classifying-Insurance-Card-Text/master/images/location_features.png
--------------------------------------------------------------------------------
/images/feature_importance_1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hatemr/Classifying-Insurance-Card-Text/master/images/feature_importance_1.png
--------------------------------------------------------------------------------
/images/feature_importance_2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hatemr/Classifying-Insurance-Card-Text/master/images/feature_importance_2.png
--------------------------------------------------------------------------------
/images/test_set_performance.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hatemr/Classifying-Insurance-Card-Text/master/images/test_set_performance.png
--------------------------------------------------------------------------------
/images/location_features_example.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hatemr/Classifying-Insurance-Card-Text/master/images/location_features_example.png
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Classifying-Insurance-Card-Text
2 | A machine learning project to identify names, Group IDs, and Member IDs from insurance cards.
3 | 
4 | The code can be found in [insurance_card_text_classification.ipynb](insurance_card_text_classification.ipynb). Since `.ipynb` files often don't render in Github, I also made a `.md` version [here](insurance_card_text_classification.md).
5 | 
6 | # Business Problem
7 | Patients scan their insurance card at the clinic, but the administrative assistant still must manually enter the insurance information into the computer. This manual data entry costs the clinic time and degrades the patient experience. If we could automatically extract information from the scanned insurance card, we could save that time and avoid the frustration of manual entry.
8 | 
9 | # Solution
10 | I built a text classifier to identify which words on an insurance card are the 1) member name, 2) Member ID, and 3) Group number.
11 | 
12 | # Data
13 | The company that builds the scanners does not save the scanned text, because doing so would violate health privacy laws. So we had to produce insurance cards ourselves. We found about five real cards and 20 generic cards online. The company extracted the text using OCR software; I then took the resulting XML and extracted the words from each card.
14 | 
15 | I then hand-labelled every word as 1) member name, 2) group ID, 3) member ID, or 4) none. The classes were heavily imbalanced; there is only one name on each card and a couple hundred other words. To overcome the class imbalance, I simulated new Group IDs:
16 | * letter -> random letter
17 | * digit -> random digit
18 | * punctuation -> same
19 | 
20 | ![simulated words](images/simulated_words.png)
21 | 
22 | The resulting simulated words looked similar, just with different digits and letters.
23 | 
24 | I also simulated names, sampling from a list of most-common names I found online. This balanced the classes, even though most of the data was now simulated.
25 | 
26 | # Modeling
27 | I trained a random forest using a 60/40 stratified split, to keep the classes balanced. Here is the test-set performance:
28 | 
29 | ![test_set_performance.png](images/test_set_performance.png)
30 | 
31 | I also measured the feature importances to see which features were highly predictive:
32 | 
33 | ![feature importance 1](images/feature_importance_1.png)
34 | 
35 | The fraction of alphabetic characters was the most predictive feature, while word length was not predictive.
36 | 
37 | ## Take-2
38 | Next, I switched to multiclass classification to predict 1) group ID, 2) member ID, or 3) none. I also added an indicator variable for whether the word was a "keyword", such as "Member" or "Group". The results are shown below, broken down by simulated and real data:
39 | 
40 | ![results 2](images/results_2.png)
41 | 
42 | The results look mostly good, with only a few off-diagonal errors. However, the model performs worse on the _real_ data, which is a concern. This would have to be improved later, but for a proof of concept the performance isn't bad.
43 | 
44 | Now, length becomes a very important feature:
45 | ![feature importance 2](images/feature_importance_2.png)
46 | 
47 | # Text-based features
48 | I engineered the following text-based features to represent each word:
49 | 1. Length of the word
50 | 2. Fraction of characters that are letters
51 | 3. Fraction of characters that are digits
52 | 4. Fraction of characters that are uppercase letters
53 | 5. Fraction of characters that are lowercase letters
54 | 6. Fraction of characters that are punctuation (e.g. `.`, `,`, `:`)
55 | 7. Fraction of characters that are periods (`.`)
56 | 8. Fraction of characters that are dashes (`-`)
57 | 
58 | # Location features
59 | We can also use location information as features to identify Group numbers and other fields. For example, Group numbers often sit next to text saying "Group No.:". We encode the location features manually: the vertical location is the line number, normalized to lie between 0 and 1, and the horizontal location is left/middle/right (0/0.5/1):
60 | 
61 | ![location features](images/location_features.png)
62 | 
63 | ![location features example](images/location_features_example.png)
64 | 
65 | Instead of feeding the location features into a machine learning model, we use a simple rule-based algorithm:
66 | * Iterate through the words.
67 | * If the word is a keyword (e.g.
_Member_ is a keyword for Member IDs), measure the distance to every other word.
68 | * Iterate through the other words, starting with the closest one.
69 | * If the word is more than 50% digits and longer than three characters, predict it as a Member ID (or Group ID). This works because IDs are typically longer than three characters and contain many digits and few letters.
70 | 
71 | # Results
72 | This simple algorithm correctly identified 8 of 10 Member IDs and 7 of 10 Group IDs. The failure cases are all known cases where the algorithm's assumptions break down:
73 | * there was no group ID to find
74 | * the group ID had fewer than 50% digits
75 | * there was no keyword on the card. Here's an example:
76 | 
77 | ![no keyword](images/no_keyword.png)
78 | 
79 | # Future work
80 | This project gives a compelling proof-of-concept for automatic text entry. Indeed, some companies like Zocdoc have built CNN-based models to identify group IDs when you scan your card with your phone. Another similar offering is Textract from Amazon, which scans an image and extracts key-value pairs. The approach used here, feature engineering plus ML prediction, can serve as a good solution, especially if commercial tools aren't flexible enough for this particular task. Further work should be done to engineer new features, refine the models, and deploy a model to production.
81 | 
--------------------------------------------------------------------------------
/insurance_card_text_classification.md:
--------------------------------------------------------------------------------
1 | 
2 | 
3 | ```python
4 | import xml.etree.ElementTree as ET
5 | import os, sys
6 | import nltk  # needed later for nltk.word_tokenize
7 | #from nltk.corpus import names
8 | #nltk.download('punkt')
9 | #nltk.download('names')
10 | import string
11 | import random
12 | import matplotlib.pyplot as plt
13 | import numpy as np
14 | import pandas as pd
15 | import seaborn as sn
16 | %matplotlib inline
17 | ```
18 | 
19 | 
20 | ```python
21 | # import data
22 | df = pd.read_excel('Card Samples/group ids not identified.xlsx', index_col=None, header=0, dtype={'word': str})
23 | 
24 | # fill NAs
25 | df1 = df.fillna(0)
26 | 
27 | # remove rows with include=0
28 | df1 = df1.loc[df1.include==1]
29 | ```
30 | 
31 | Split the lines into individual words.
32 | 33 | 34 | ```python 35 | df1_values = df1.values.copy() 36 | new_values = np.empty((1,12)) 37 | 38 | # split rows to individual words 39 | for i in range(df1_values[:,1].shape[0]): 40 | for word in df1_values[i,1].split(): 41 | new_row = df1_values[i,:].copy() 42 | new_row[1] = word 43 | new_row = new_row.reshape(1,-1) 44 | new_values = np.append(new_values, new_row, axis=0) 45 | 46 | # delete first row 47 | new_values = new_values[1:,:] 48 | 49 | # create dataframe 50 | df2 = pd.DataFrame(columns=df1.columns.tolist(), data=new_values) 51 | 52 | # horizontal location 53 | df2 = df2.assign(x_loc=0.5*df2.middle + df2.right_side) 54 | df2.x_loc = df2.x_loc.astype(float) 55 | 56 | # vertical location 57 | df2 = df2.assign(y_loc=1-df2.line/df2.total_lines_on_card) 58 | df2.y_loc = df2.y_loc.astype(float) 59 | 60 | # delete uneeded columns 61 | df2 = df2.drop(columns=['include', 'left_side', 'middle', 'right_side', 'line', 'total_lines_on_card']) 62 | ``` 63 | 64 | 65 | ```python 66 | # get location of keyword 67 | df2_values = df2.values 68 | dist_memberid = np.empty((1,2)) 69 | dist_groupid = np.empty((1,2)) 70 | 71 | for card in np.unique(df2_values[:,0]): 72 | d1 = df2_values[df2_values[:,0]==card] # 73 | 74 | d2 = d1[d1[:,4]==1] # member id keyword 75 | if d2.shape[0] == 0: 76 | memberid_loc = np.array([[0,0]]) 77 | else: 78 | memberid_loc = d2[0,6:8].reshape(1,-1) 79 | memberid_loc = np.repeat(memberid_loc, d1.shape[0], axis=0) 80 | dist_memberid = np.append(dist_memberid, memberid_loc, axis=0) 81 | 82 | d3 = d1[d1[:,5]==1] # group id keyword 83 | if d3.shape[0] == 0: 84 | groupid_loc = np.array([[0,0]]) 85 | else: 86 | groupid_loc = d3[0,6:8].reshape(1,-1) # x and y 87 | groupid_loc = np.repeat(groupid_loc, d1.shape[0], axis=0) 88 | dist_groupid = np.append(dist_groupid, groupid_loc, axis=0) 89 | 90 | dist_memberid = dist_memberid[1:].astype(float) 91 | dist_groupid = dist_groupid[1:].astype(float) 92 | 93 | # group id keyword locations 94 | df3 = df2.assign(x_loc_group_keyword=dist_groupid[:,0]) 95 | df3 = df3.assign(y_loc_group_keyword=dist_groupid[:,1]) 96 | 97 | # member id keyword locations 98 | df3 = df3.assign(x_loc_member_keyword=dist_memberid[:,0]) 99 | df3 = df3.assign(y_loc_member_keyword=dist_memberid[:,1]) 100 | 101 | # calc distances 102 | dist_group_id = np.linalg.norm(df3[['x_loc','y_loc']].values.astype(float) - df3[['x_loc_group_keyword','y_loc_group_keyword']].values.astype(float), axis=1) 103 | dist_member_id = np.linalg.norm(df3[['x_loc','y_loc']].values.astype(float) - df3[['x_loc_member_keyword','y_loc_member_keyword']].values.astype(float), axis=1) 104 | 105 | df3 = df3.assign(dist_member_id=dist_member_id) 106 | df3 = df3.assign(dist_group_id=dist_group_id) 107 | 108 | frac_digit = [] 109 | for index, row in df3.iterrows(): 110 | frac_digit.append(sum([1 for char in row.word if char.isdigit()]) / len(row.word)) 111 | 112 | df3 = df3.assign(frac_digit = frac_digit) 113 | df3 = df3.assign(pred_member=df3.shape[0]*[0]) 114 | df3 = df3.assign(pred_group=df3.shape[0]*[0]) 115 | # drop extra columns 116 | #df3 = df3.drop(columns=['x_loc','y_loc','x_loc_group_keyword','y_loc_group_keyword', 'x_loc_member_keyword','y_loc_member_keyword']) 117 | ``` 118 | 119 | 120 | ```python 121 | n,m = df3.iloc[0:1,:].shape 122 | cols = df3.columns.tolist() 123 | #df4 = pd.DataFrame(data=d4, columns=cols) 124 | 125 | d0 = np.empty((1,m)) 126 | 127 | for card in np.unique(df3.card): 128 | if card>3: 129 | pass 130 | d1 = df3.loc[df3.card==card] 131 | 132 | for i, row in 
d1.sort_values('dist_member_id').iterrows(): 133 | if row.frac_digit > 0.5 and len(row.word) >= 4: 134 | #i_memberid_pred = i 135 | d5 = row.values.copy().reshape(1,-1) 136 | d5[:,-2] = 1 137 | break 138 | #if 'd5' not in locals(): 139 | # print("No prediction for member id for card", card) 140 | 141 | 142 | for i, row in d1.sort_values('dist_group_id').iterrows(): 143 | if row.frac_digit > 0.5 and len(row.word) >= 4: 144 | i_groupid_pred = i 145 | d10 = row.values.copy().reshape(1,-1) 146 | d10[:,-1] = 1 147 | break 148 | #if 'd10' not in locals(): 149 | # print("No prediction for group id for card", card) 150 | 151 | # member id 152 | d3 = d1.loc[d1.member_id==1].values 153 | d4 = d1.loc[d1.memberid_keyword==1].values 154 | #d5 = d2.loc[d1.index==i_memberid_pred].values.reshape(1,-1) 155 | d6 = np.append(d3, d4, axis=0) 156 | d6 = np.append(d6, d5, axis=0) 157 | 158 | # group id 159 | d8 = d1.loc[d1.group_id==1].values 160 | d9 = d1.loc[d1.groupid_keyword==1].values 161 | # d10 = d7.loc[d7.index==i_groupid_pred].values.reshape(1,-1) 162 | d11 = np.append(d8, d9, axis=0) 163 | d11 = np.append(d11, d10, axis=0) 164 | 165 | # combine member id and group id 166 | d12 = np.append(d6, d11, axis=0) 167 | 168 | 169 | d0 = np.append(d0, d12, axis=0) 170 | #break 171 | 172 | df4 = pd.DataFrame(data=d0, columns=cols) 173 | df4.drop([0], inplace=True) 174 | ``` 175 | 176 | 177 | ```python 178 | #pd.DataFrame(data=d0, columns=cols).drop([0]) 179 | ``` 180 | 181 | 182 | ```python 183 | cards = np.unique(df4.card) 184 | i=6 185 | df4.loc[df4.card==cards[i]][['word','member_id','memberid_keyword','x_loc','y_loc', 'x_loc_member_keyword', 'y_loc_member_keyword', 'dist_member_id','frac_digit','pred_member']] 186 | ``` 187 | 188 | 189 | 190 | 191 |
192 | 205 | 206 | 207 | 208 | 209 | 210 | 211 | 212 | 213 | 214 | 215 | 216 | 217 | 218 | 219 | 220 | 221 | 222 | 223 | 224 | 225 | 226 | 227 | 228 | 229 | 230 | 231 | 232 | 233 | 234 | 235 | 236 | 237 | 238 | 239 | 240 | 241 | 242 | 243 | 244 | 245 | 246 | 247 | 248 | 249 | 250 | 251 | 252 | 253 | 254 | 255 | 256 | 257 | 258 | 259 | 260 | 261 | 262 | 263 | 264 | 265 | 266 | 267 | 268 | 269 | 270 | 271 | 272 | 273 | 274 | 275 | 276 | 277 | 278 | 279 | 280 | 281 | 282 | 283 | 284 | 285 | 286 | 287 | 288 | 289 | 290 | 291 | 292 | 293 | 294 | 295 | 296 | 297 | 298 | 299 | 300 | 301 |
wordmember_idmemberid_keywordx_locy_locx_loc_member_keywordy_loc_member_keyworddist_member_idfrac_digitpred_member
381234567891000.72727300.727273010
39Member0100.72727300.727273000
401234567891000.72727300.727273011
411234560010.72727300.727273110
42Group0010.72727300.727273100
431234560010.72727300.727273110
302 |
303 | 304 | 305 | 306 | 307 | ```python 308 | filt = df4.loc[df4.pred_member==1] 309 | filt1 = filt.loc[filt.member_id==1] 310 | TP = filt1.shape[0] 311 | print(filt.member_id.sum()/filt.shape[0]) 312 | 313 | ``` 314 | 315 | 0.8 316 | 317 | 318 | 319 | ```python 320 | filt = df4.loc[df4.pred_group==1] 321 | filt1 = filt.loc[filt.group_id==1] 322 | TP = filt1.shape[0] 323 | print(filt.group_id.sum()/filt.shape[0]) 324 | ``` 325 | 326 | 0.7 327 | 328 | 329 | 330 | ```python 331 | i=8 332 | df4.loc[df4.card==cards[i]][['word','group_id','groupid_keyword','x_loc','y_loc', 'x_loc_group_keyword', 'y_loc_group_keyword', 'dist_group_id','frac_digit','pred_group']] 333 | ``` 334 | 335 | 336 | 337 | 338 |
339 | 352 | 353 | 354 | 355 | 356 | 357 | 358 | 359 | 360 | 361 | 362 | 363 | 364 | 365 | 366 | 367 | 368 | 369 | 370 | 371 | 372 | 373 | 374 | 375 | 376 | 377 | 378 | 379 | 380 | 381 | 382 | 383 | 384 | 385 | 386 | 387 | 388 | 389 | 390 | 391 | 392 | 393 | 394 | 395 | 396 | 397 | 398 | 399 | 400 | 401 | 402 | 403 | 404 | 405 | 406 | 407 | 408 | 409 | 410 | 411 | 412 | 413 | 414 | 415 | 416 | 417 | 418 | 419 | 420 | 421 | 422 | 423 | 424 | 425 | 426 | 427 | 428 | 429 | 430 | 431 | 432 | 433 | 434 | 435 | 436 | 437 | 438 | 439 | 440 | 441 | 442 | 443 | 444 | 445 | 446 | 447 | 448 | 449 | 450 | 451 | 452 | 453 | 454 | 455 | 456 | 457 | 458 | 459 | 460 | 461 |
wordgroup_idgroupid_keywordx_locy_locx_loc_group_keywordy_loc_group_keyworddist_group_idfrac_digitpred_group
501123450000000.71428610.8571431.0101510
51Member0000.71428610.8571431.0101500
52ID0000.71428610.8571431.0101500
531123450000000.71428610.8571431.0101510
54NEOOOOOO1010.85714310.857143000
55Group0110.85714310.857143000
56000000010.14285710.8571430.71428611
462 |
463 | 464 | 465 | 466 | 467 | ```python 468 | df3.loc[df3.card==cards[i]] 469 | ``` 470 | 471 | 472 | 473 | 474 |
475 | 488 | 489 | 490 | 491 | 492 | 493 | 494 | 495 | 496 | 497 | 498 | 499 | 500 | 501 | 502 | 503 | 504 | 505 | 506 | 507 | 508 | 509 | 510 | 511 | 512 | 513 | 514 | 515 | 516 | 517 | 518 | 519 | 520 | 521 | 522 | 523 | 524 | 525 | 526 | 527 | 528 | 529 | 530 | 531 | 532 | 533 | 534 | 535 | 536 | 537 | 538 | 539 | 540 | 541 | 542 | 543 | 544 | 545 | 546 | 547 | 548 | 549 | 550 | 551 | 552 | 553 | 554 | 555 | 556 | 557 | 558 | 559 | 560 | 561 | 562 | 563 | 564 | 565 | 566 | 567 | 568 | 569 | 570 | 571 | 572 | 573 | 574 | 575 | 576 | 577 | 578 | 579 | 580 | 581 | 582 | 583 | 584 | 585 | 586 | 587 | 588 | 589 | 590 | 591 | 592 | 593 | 594 | 595 | 596 | 597 | 598 | 599 | 600 | 601 | 602 | 603 | 604 | 605 | 606 | 607 | 608 | 609 | 610 | 611 | 612 | 613 | 614 | 615 | 616 | 617 | 618 | 619 | 620 | 621 | 622 | 623 | 624 | 625 | 626 | 627 | 628 | 629 | 630 | 631 | 632 | 633 | 634 | 635 | 636 | 637 | 638 | 639 | 640 | 641 | 642 | 643 | 644 | 645 | 646 | 647 | 648 | 649 | 650 | 651 | 652 | 653 | 654 | 655 | 656 | 657 | 658 | 659 | 660 | 661 | 662 | 663 | 664 | 665 | 666 | 667 | 668 | 669 | 670 | 671 | 672 | 673 | 674 | 675 | 676 | 677 | 678 | 679 | 680 | 681 | 682 | 683 | 684 | 685 | 686 | 687 | 688 | 689 | 690 | 691 | 692 | 693 | 694 | 695 | 696 | 697 | 698 | 699 | 700 | 701 | 702 | 703 | 704 | 705 | 706 | 707 | 708 | 709 | 710 | 711 | 712 | 713 | 714 | 715 | 716 | 717 | 718 | 719 | 720 | 721 | 722 | 723 | 724 | 725 | 726 | 727 | 728 | 729 | 730 | 731 | 732 | 733 | 734 | 735 | 736 | 737 | 738 | 739 | 740 | 741 | 742 | 743 | 744 | 745 | 746 | 747 | 748 | 749 | 750 | 751 | 752 | 753 | 754 | 755 | 756 | 757 | 758 | 759 | 760 | 761 | 762 | 763 | 764 | 765 | 766 | 767 | 768 | 769 | 770 | 771 | 772 | 773 | 774 | 775 | 776 | 777 | 778 | 779 | 780 | 781 | 782 | 783 | 784 | 785 | 786 | 787 | 788 | 789 | 790 | 791 | 792 | 793 | 794 | 795 | 796 | 797 | 798 | 799 | 800 | 801 | 802 | 803 | 804 | 805 | 806 | 807 | 808 | 809 | 810 | 811 | 812 | 813 | 814 | 815 | 816 | 817 | 818 | 819 | 820 | 821 | 822 | 823 | 824 | 825 | 826 | 827 | 828 | 829 | 830 | 831 | 832 | 833 | 834 | 835 | 836 | 837 | 838 | 839 | 840 | 841 | 842 | 843 | 844 | 845 | 846 | 847 | 848 | 849 | 850 | 851 | 852 | 853 | 854 | 855 | 856 | 857 | 858 | 859 | 860 | 861 | 862 | 863 | 864 | 865 | 866 | 867 | 868 | 869 | 870 | 871 | 872 | 873 | 874 | 875 | 876 | 877 | 878 | 879 | 880 | 881 | 882 | 883 | 884 | 885 | 886 | 887 | 888 | 889 | 890 | 891 | 892 | 893 | 894 | 895 | 896 | 897 | 898 | 899 | 900 | 901 | 902 | 903 | 904 | 905 | 906 | 907 | 908 | 909 | 910 | 911 | 912 | 913 | 914 | 915 | 916 | 917 | 918 | 919 | 920 | 921 | 922 | 923 | 924 | 925 | 926 | 927 | 928 | 929 | 930 | 931 | 932 | 933 | 934 | 935 | 936 | 937 | 938 | 939 | 940 | 941 | 942 | 943 | 944 | 945 | 946 | 947 | 948 | 949 | 950 | 951 | 952 | 953 | 954 | 955 | 956 | 957 | 958 | 959 | 960 | 961 | 962 | 963 | 964 | 965 | 966 | 967 | 968 | 969 | 970 | 971 | 972 | 973 | 974 | 975 | 976 | 977 | 978 | 979 | 980 | 981 | 982 | 983 | 984 | 985 | 986 | 987 | 988 | 989 | 990 | 991 | 992 | 993 | 994 | 995 | 996 | 997 | 998 | 999 | 1000 | 1001 | 1002 | 1003 | 1004 | 1005 | 1006 | 1007 | 1008 | 1009 | 1010 | 1011 | 1012 | 1013 | 1014 | 1015 | 1016 | 1017 | 1018 | 1019 | 1020 | 1021 | 1022 | 1023 | 1024 | 1025 | 1026 | 1027 | 1028 | 1029 | 1030 | 1031 | 1032 | 1033 | 1034 | 1035 | 1036 | 1037 | 1038 | 1039 | 1040 | 1041 | 1042 | 1043 | 1044 | 1045 | 1046 | 1047 | 1048 | 1049 | 1050 | 1051 | 1052 | 1053 | 1054 | 1055 | 1056 | 1057 | 1058 | 1059 | 1060 | 1061 | 1062 | 1063 | 1064 | 1065 | 1066 | 1067 | 
1068 | 1069 | 1070 | 1071 | 1072 | 1073 | 1074 | 1075 | 1076 | 1077 | 1078 | 1079 | 1080 | 1081 | 1082 | 1083 | 1084 | 1085 | 1086 | 1087 | 1088 | 1089 | 1090 | 1091 | 1092 | 1093 | 1094 | 1095 | 1096 | 1097 | 1098 | 1099 | 1100 | 1101 | 1102 | 1103 | 1104 | 1105 | 1106 | 1107 | 1108 | 1109 | 1110 | 1111 | 1112 | 1113 | 1114 | 1115 | 1116 | 1117 | 1118 | 1119 | 1120 | 1121 | 1122 | 1123 | 1124 | 1125 | 1126 | 1127 | 1128 | 1129 | 1130 | 1131 | 1132 | 1133 | 1134 | 1135 | 1136 | 1137 | 1138 | 1139 | 1140 | 1141 | 1142 | 1143 | 1144 | 1145 | 1146 | 1147 | 1148 | 1149 | 1150 | 1151 | 1152 | 1153 | 1154 | 1155 | 1156 | 1157 | 1158 | 1159 | 1160 | 1161 | 1162 | 1163 | 1164 | 1165 | 1166 | 1167 | 1168 | 1169 | 1170 | 1171 | 1172 | 1173 | 1174 | 1175 | 1176 | 1177 | 1178 | 1179 | 1180 | 1181 | 1182 | 1183 | 1184 | 1185 | 1186 | 1187 | 1188 | 1189 | 1190 | 1191 | 1192 | 1193 | 1194 | 1195 | 1196 | 1197 | 1198 | 1199 | 1200 | 1201 | 1202 | 1203 | 1204 | 1205 | 1206 | 1207 | 1208 | 1209 | 1210 | 1211 | 1212 | 1213 | 1214 | 1215 | 1216 | 1217 | 1218 | 1219 | 1220 | 1221 | 1222 | 1223 | 1224 | 1225 | 1226 | 1227 | 1228 | 1229 | 1230 | 1231 | 1232 | 1233 | 1234 | 1235 | 1236 | 1237 | 1238 | 1239 | 1240 | 1241 | 1242 | 1243 | 1244 | 1245 | 1246 | 1247 | 1248 | 1249 | 1250 | 1251 | 1252 | 1253 | 1254 | 1255 | 1256 | 1257 | 1258 | 1259 | 1260 | 1261 | 1262 | 1263 | 1264 | 1265 | 1266 | 1267 | 1268 | 1269 | 1270 | 1271 | 1272 | 1273 | 1274 | 1275 | 1276 | 1277 | 1278 | 1279 | 1280 | 1281 | 1282 | 1283 | 1284 | 1285 | 1286 | 1287 | 1288 | 1289 | 1290 | 1291 | 1292 | 1293 | 1294 | 1295 | 1296 | 1297 | 1298 | 1299 | 1300 | 1301 | 1302 | 1303 | 1304 | 1305 | 1306 | 1307 | 1308 | 1309 | 1310 | 1311 | 1312 | 1313 | 1314 | 1315 | 1316 | 1317 | 1318 | 1319 | 1320 | 1321 | 1322 | 1323 | 1324 | 1325 | 1326 | 1327 | 1328 | 1329 | 1330 | 1331 | 1332 | 1333 | 1334 | 1335 | 1336 | 1337 | 1338 | 1339 | 1340 | 1341 | 1342 | 1343 | 1344 | 1345 | 1346 | 1347 | 1348 | 1349 | 1350 | 1351 | 1352 | 1353 | 1354 | 1355 | 1356 | 1357 | 1358 | 1359 | 1360 | 1361 | 1362 | 1363 | 1364 | 1365 | 1366 | 1367 | 1368 | 1369 | 1370 | 1371 | 1372 | 1373 | 1374 | 1375 | 1376 | 1377 | 1378 | 1379 | 1380 | 1381 | 1382 | 1383 | 1384 | 1385 | 1386 | 1387 | 1388 | 1389 | 1390 | 1391 | 1392 | 1393 | 1394 | 1395 | 1396 | 1397 | 1398 | 1399 | 1400 | 1401 | 1402 | 1403 | 1404 | 1405 | 1406 | 1407 | 1408 | 1409 | 1410 | 1411 | 1412 | 1413 | 1414 | 1415 | 1416 | 1417 | 1418 | 1419 | 1420 | 1421 | 1422 | 1423 | 1424 | 1425 | 1426 | 1427 | 1428 | 1429 | 1430 | 1431 | 1432 | 1433 | 1434 | 1435 | 1436 | 1437 | 1438 | 1439 | 1440 | 1441 | 1442 | 1443 | 1444 | 1445 | 1446 | 1447 | 1448 | 1449 | 1450 | 1451 | 1452 | 1453 |
cardwordmember_idgroup_idmemberid_keywordgroupid_keywordx_locy_locx_loc_group_keywordy_loc_group_keywordx_loc_member_keywordy_loc_member_keyworddist_member_iddist_group_idfrac_digitpred_memberpred_group
28224PacificSource00000.00.9285711.00.8571430.00.7142860.2142861.0025480.0000
28324Group00001.00.9285711.00.8571430.00.7142861.0227020.0714290.0000
28424Name00001.00.9285711.00.8571430.00.7142861.0227020.0714290.0000
28524Here00001.00.9285711.00.8571430.00.7142861.0227020.0714290.0000
28624HEALTH00001.00.8571431.00.8571430.00.7142861.0101530.0000000.0000
28724PLANS00001.00.8571431.00.8571430.00.7142861.0101530.0000000.0000
28824Group00011.00.8571431.00.8571430.00.7142861.0101530.0000000.0000
28924#:00001.00.8571431.00.8571430.00.7142861.0101530.0000000.0000
29024NEOOOOOO01001.00.8571431.00.8571430.00.7142861.0101530.0000000.0000
29124Subscriber00000.00.7857141.00.8571430.00.7142860.0714291.0025480.0000
29224Name:00000.00.7857141.00.8571430.00.7142860.0714291.0025480.0000
29324John00000.00.7857141.00.8571430.00.7142860.0714291.0025480.0000
29424Smith00000.00.7857141.00.8571430.00.7142860.0714291.0025480.0000
29524Member00100.00.7142861.00.8571430.00.7142860.0000001.0101530.0000
29624ID00100.00.7142861.00.8571430.00.7142860.0000001.0101530.0000
2972411234500010000.00.7142861.00.8571430.00.7142860.0000001.0101531.0000
29824Network:00000.00.6428571.00.8571430.00.7142860.0714291.0227020.0000
29924SmartHeaIth00000.00.6428571.00.8571430.00.7142860.0714291.0227020.0000
30024(Referral00000.00.6428571.00.8571430.00.7142860.0714291.0227020.0000
30124Required)00000.00.6428571.00.8571430.00.7142860.0714291.0227020.0000
30224Card00000.00.5714291.00.8571430.00.7142860.1428571.0400160.0000
30324Issued:00000.00.5714291.00.8571430.00.7142860.1428571.0400160.0000
3042401/01/1400000.00.5714291.00.8571430.00.7142860.1428571.0400160.7500
30524ID00000.00.5000001.00.8571430.00.7142860.2142861.0618620.0000
306240000000.00.4285711.00.8571430.00.7142860.2857141.0879681.0000
307240100000.00.3571431.00.8571430.00.7142860.3571431.1180341.0000
308240200000.00.2857141.00.8571430.00.7142860.4285711.1517511.0000
30924Member00000.00.5000001.00.8571430.00.7142860.2142861.0618620.0000
31024PCP00000.50.5000001.00.8571430.00.7142860.5439840.6144520.0000
31124John00000.00.4285711.00.8571430.00.7142860.2857141.0879680.0000
31224Susie00000.00.3571431.00.8571430.00.7142860.3571431.1180340.0000
31324David00000.00.2857141.00.8571430.00.7142860.4285711.1517510.0000
31424D.00000.50.4285711.00.8571430.00.7142860.5758760.6585390.0000
31524Jones00000.50.4285711.00.8571430.00.7142860.5758760.6585390.0000
31624D.00000.50.3571431.00.8571430.00.7142860.6144520.7071070.0000
31724Jones00000.50.3571431.00.8571430.00.7142860.6144520.7071070.0000
31824D.00000.50.2857141.00.8571430.00.7142860.6585390.7592960.0000
31924Jones00000.50.2857141.00.8571430.00.7142860.6585390.7592960.0000
32024Drug00001.00.2142861.00.8571430.00.7142861.1180340.6428570.0000
32124List00001.00.2142861.00.8571430.00.7142861.1180340.6428570.0000
32224RxBin00001.00.1428571.00.8571430.00.7142861.1517510.7142860.0000
32324RxGroup00001.00.0714291.00.8571430.00.7142861.1888080.7857140.0000
32424RxPCN00001.00.0000001.00.8571430.00.7142861.2289040.8571430.0000
32524XX00001.00.2142861.00.8571430.00.7142861.1180340.6428570.0000
326240000000001.00.1428571.00.8571430.00.7142861.1517510.7142861.0000
327240000000000001.00.0714291.00.8571430.00.7142861.1888080.7857141.0000
328240000000001.00.0000001.00.8571430.00.7142861.2289040.8571431.0000
1454 |
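The keyword-distance rule described in the README can be sketched directly against a single card's rows in `df3`. This is an illustrative helper only, not from the original notebook; the actual selection happens in the earlier loop over `d1.sort_values('dist_member_id')`.

```python
# Illustrative sketch of the keyword-distance rule for one card:
# walk the words in order of distance to the "Member" keyword and take the
# first one that is mostly digits and at least four characters long.
def pick_member_id(card_df):
    for _, row in card_df.sort_values('dist_member_id').iterrows():
        if row.frac_digit > 0.5 and len(row.word) >= 4:
            return row.word
    return None

# e.g. for the card shown in the table above
print(pick_member_id(df3.loc[df3.card == 24]))
```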
1455 | 1456 | 1457 | 1458 | 1459 | ```python 1460 | 1461 | ``` 1462 | 1463 | 1464 | ```python 1465 | df3.loc[df3.card==2][['word','member_id','memberid_keyword','x_loc','y_loc', 'x_loc_member_keyword', 'y_loc_member_keyword', 'dist_member_id','frac_digit','pred_member']] 1466 | ``` 1467 | 1468 | 1469 | 1470 | 1471 |
1472 | 1485 | 1486 | 1487 | 1488 | 1489 | 1490 | 1491 | 1492 | 1493 | 1494 | 1495 | 1496 | 1497 | 1498 | 1499 | 1500 | 1501 | 1502 | 1503 | 1504 | 1505 | 1506 | 1507 | 1508 | 1509 | 1510 | 1511 | 1512 | 1513 | 1514 | 1515 | 1516 | 1517 | 1518 | 1519 | 1520 | 1521 | 1522 | 1523 | 1524 | 1525 | 1526 | 1527 | 1528 | 1529 | 1530 | 1531 | 1532 | 1533 | 1534 | 1535 | 1536 | 1537 | 1538 | 1539 | 1540 | 1541 | 1542 | 1543 | 1544 | 1545 | 1546 | 1547 | 1548 | 1549 | 1550 | 1551 | 1552 | 1553 | 1554 | 1555 | 1556 | 1557 | 1558 | 1559 | 1560 | 1561 | 1562 | 1563 | 1564 | 1565 | 1566 | 1567 | 1568 | 1569 | 1570 | 1571 | 1572 | 1573 | 1574 | 1575 | 1576 | 1577 | 1578 | 1579 | 1580 | 1581 | 1582 | 1583 | 1584 | 1585 | 1586 | 1587 | 1588 | 1589 | 1590 | 1591 | 1592 | 1593 | 1594 | 1595 | 1596 | 1597 | 1598 | 1599 | 1600 | 1601 | 1602 | 1603 | 1604 | 1605 | 1606 | 1607 | 1608 | 1609 | 1610 | 1611 | 1612 | 1613 | 1614 | 1615 | 1616 | 1617 | 1618 | 1619 | 1620 | 1621 | 1622 | 1623 | 1624 | 1625 | 1626 | 1627 | 1628 | 1629 | 1630 | 1631 | 1632 | 1633 | 1634 | 1635 | 1636 | 1637 | 1638 | 1639 | 1640 | 1641 | 1642 | 1643 | 1644 | 1645 | 1646 | 1647 | 1648 | 1649 | 1650 | 1651 | 1652 | 1653 | 1654 | 1655 | 1656 | 1657 | 1658 | 1659 | 1660 | 1661 | 1662 | 1663 | 1664 | 1665 | 1666 | 1667 | 1668 | 1669 | 1670 | 1671 | 1672 | 1673 | 1674 | 1675 | 1676 | 1677 | 1678 | 1679 | 1680 | 1681 | 1682 | 1683 | 1684 | 1685 | 1686 | 1687 | 1688 | 1689 | 1690 | 1691 | 1692 | 1693 | 1694 | 1695 | 1696 | 1697 | 1698 | 1699 | 1700 | 1701 | 1702 | 1703 | 1704 | 1705 | 1706 | 1707 | 1708 | 1709 | 1710 | 1711 | 1712 | 1713 | 1714 | 1715 | 1716 | 1717 | 1718 | 1719 | 1720 | 1721 | 1722 | 1723 | 1724 | 1725 | 1726 | 1727 | 1728 | 1729 | 1730 | 1731 | 1732 | 1733 | 1734 | 1735 | 1736 | 1737 | 1738 | 1739 | 1740 | 1741 | 1742 | 1743 | 1744 | 1745 | 1746 | 1747 | 1748 | 1749 | 1750 | 1751 | 1752 | 1753 | 1754 | 1755 | 1756 | 1757 | 1758 | 1759 | 1760 | 1761 | 1762 | 1763 | 1764 | 1765 | 1766 | 1767 | 1768 | 1769 | 1770 | 1771 | 1772 | 1773 | 1774 | 1775 | 1776 | 1777 | 1778 | 1779 | 1780 | 1781 | 1782 | 1783 | 1784 | 1785 | 1786 | 1787 | 1788 | 1789 | 1790 | 1791 | 1792 | 1793 | 1794 | 1795 | 1796 | 1797 | 1798 | 1799 | 1800 | 1801 | 1802 | 1803 | 1804 | 1805 | 1806 | 1807 | 1808 | 1809 | 1810 | 1811 | 1812 | 1813 | 1814 | 1815 | 1816 | 1817 | 1818 | 1819 | 1820 | 1821 | 1822 | 1823 | 1824 | 1825 | 1826 | 1827 | 1828 | 1829 | 1830 | 1831 | 1832 | 1833 | 1834 | 1835 | 1836 | 1837 | 1838 | 1839 | 1840 | 1841 | 1842 | 1843 | 1844 | 1845 | 1846 | 1847 | 1848 | 1849 | 1850 | 1851 | 1852 | 1853 | 1854 | 1855 | 1856 | 1857 | 1858 | 1859 | 1860 | 1861 | 1862 | 1863 | 1864 | 1865 | 1866 | 1867 | 1868 | 1869 | 1870 | 1871 | 1872 | 1873 | 1874 | 1875 | 1876 | 1877 | 1878 | 1879 | 1880 | 1881 | 1882 | 1883 | 1884 | 1885 | 1886 | 1887 | 1888 | 1889 | 1890 | 1891 | 1892 | 1893 | 1894 | 1895 | 1896 | 1897 | 1898 | 1899 | 1900 | 1901 | 1902 | 1903 | 1904 | 1905 | 1906 | 1907 | 1908 | 1909 | 1910 | 1911 | 1912 | 1913 | 1914 | 1915 | 1916 | 1917 | 1918 | 1919 | 1920 | 1921 | 1922 | 1923 | 1924 | 1925 | 1926 | 1927 | 1928 | 1929 | 1930 | 1931 | 1932 | 1933 | 1934 | 1935 | 1936 | 1937 | 1938 | 1939 | 1940 | 1941 | 1942 | 1943 | 1944 | 1945 | 1946 | 1947 | 1948 | 1949 | 1950 | 1951 | 1952 | 1953 | 1954 | 1955 | 1956 | 1957 | 1958 | 1959 | 1960 | 1961 | 1962 | 1963 | 1964 | 1965 | 1966 | 1967 | 1968 | 1969 | 1970 | 1971 | 1972 | 1973 | 1974 | 1975 | 1976 | 1977 | 1978 | 1979 | 1980 | 1981 | 1982 | 1983 | 1984 | 1985 | 1986 | 1987 | 1988 | 1989 | 1990 | 1991 
| 1992 | 1993 | 1994 | 1995 | 1996 | 1997 | 1998 | 1999 | 2000 | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 |
wordmember_idmemberid_keywordx_locy_locx_loc_member_keywordy_loc_member_keyworddist_member_idfrac_digitpred_member
0BlueCross000.00.93750.00.68750.2500000.0000000
1BlueShield000.00.93750.00.68750.2500000.0000000
2Subscriber000.00.81250.00.68750.1250000.0000000
3(O):000.00.81250.00.68750.1250000.0000000
4SMITH,000.00.75000.00.68750.0625000.0000000
5JOHN000.00.75000.00.68750.0625000.0000000
6Identification010.00.68750.00.68750.0000000.0000000
7Number(3.5):000.00.68750.00.68750.0000000.1666670
8ZGP123456789100.00.62500.00.68750.0625000.7500000
9Group000.00.56250.00.68750.1250000.0000000
10No:000.00.56250.00.68750.1250000.0000000
11123456000.00.56250.00.68750.1250001.0000000
12Effective000.00.50000.00.68750.1875000.0000000
1301/01/13000.00.50000.00.68750.1875000.7500000
14Plan000.00.43750.00.68750.2500000.0000000
15Code:000.00.43750.00.68750.2500000.0000000
16BC000.00.43750.00.68750.2500000.0000000
17400000.00.43750.00.68750.2500001.0000000
18BS000.00.43750.00.68750.2500000.0000000
19900000.00.43750.00.68750.2500001.0000000
20Rx001.00.31250.00.68751.0680000.0000000
21PCN001.00.31250.00.68751.0680000.0000000
22OV/SPC001.00.25000.00.68751.0915160.0000000
23Emergency001.00.18750.00.68751.1180340.0000000
24Rx001.00.12500.00.68751.1473470.0000000
25Deductible001.00.12500.00.68751.1473470.0000000
26Rx001.00.06250.00.68751.1792480.0000000
27Copay001.00.06250.00.68751.1792480.0000000
28Gen001.00.06250.00.68751.1792480.0000000
29Rx001.00.00000.00.68751.2135300.0000000
30copay001.00.00000.00.68751.2135300.0000000
31Br001.00.00000.00.68751.2135300.0000000
3211552001.00.37500.00.68751.0476911.0000000
33BCIL001.00.31250.00.68751.0680000.0000000
34$20/$40001.00.25000.00.68751.0915160.5714290
35200001.00.18750.00.68751.1180341.0000000
3650001.00.12500.00.68751.1473471.0000000
37$100/120001.00.06250.00.68751.1792480.7500000
38$100/200/300001.00.00000.00.68751.2135300.7500000
2011 |
2012 | 2013 | 2014 | 2015 | 2016 | ```python 2017 | 2018 | ``` 2019 | 2020 | 2021 | ```python 2022 | 2023 | ``` 2024 | 2025 | 2026 | ```python 2027 | 2028 | ``` 2029 | 2030 | > “For decades, machine learning approaches targeting Natural Language Processing problems have been based on shallow models (e.g., SVM and logistic regression) trained on very high dimensional and sparse features. In the last few years, neural networks based on dense vector representations have been producing superior results on various NLP tasks. This trend is sparked by the success of word embeddings and deep learning methods.” [1] 2031 | 2032 | We are using the old technique due to: 2033 | 1. little data 2034 | 2. data isn't really "natural language." It involves text but is less natural, and less fluid, more structured. 2035 | 2036 | Source: [here](https://medium.com/@martinpella/how-to-use-pre-trained-word-embeddings-in-pytorch-71ca59249f76) 2037 | 2038 | ### Upload Data 2039 | 2040 | 2041 | ```python 2042 | # deprecated 2043 | # first four sample 2044 | #tree = ET.parse('Card Samples/Scan02-15-2019 11 44 21.xml') 2045 | #root = tree.getroot() 2046 | 2047 | # second set of samples 2048 | #tree2 = ET.parse('Card Samples/Scan02-15-2019 11 44 21.xml') 2049 | #root2 = tree2.getroot() 2050 | ``` 2051 | 2052 | ### Extract Text 2053 | 2054 | 2055 | ```python 2056 | # findall looks only one level down 2057 | # root[1][0][0][0][0][3].findall('w:t', ns)[0].text 2058 | ``` 2059 | 2060 | This function returns a dictionary of the namespaces in the xml file. The namespace map is needed to extract the text later. 2061 | 2062 | 2063 | ```python 2064 | # Example 2: https://www.programcreek.com/python/example/77333/xml.etree.ElementTree.iterparse 2065 | 2066 | # get namespaces 2067 | def xml_parse(xml_file): 2068 | """ 2069 | Parse an XML file, returns a tree of nodes and a dict of namespaces 2070 | :param xml_file: the input XML file 2071 | :returns: (doc, ns_map) 2072 | """ 2073 | root = None 2074 | ns_map = {} # prefix -> ns_uri 2075 | for event, elem in ET.iterparse(xml_file, ['start-ns', 'start', 'end']): 2076 | if event == 'start-ns': 2077 | # elem = (prefix, ns_uri) 2078 | ns_map[elem[0]] = elem[1] 2079 | elif event == 'start': 2080 | if root is None: 2081 | root = elem 2082 | for prefix, uri in ns_map.items(): 2083 | ET.register_namespace(prefix, uri) 2084 | 2085 | return (ET.ElementTree(root), ns_map) 2086 | ``` 2087 | 2088 | This function extracts the text. 2089 | 2090 | 2091 | ```python 2092 | def words_from_root(file_path, xml_file): 2093 | """ 2094 | Extract text from xml file 2095 | Params: 2096 | file_path: path to files 2097 | xml_file: xml file to be parsed 2098 | Returns: 2099 | List of text in the xml 2100 | """ 2101 | # create ElementTree object 2102 | tree = ET.parse(file_path + '/' + xml_file) 2103 | root = tree.getroot() 2104 | 2105 | # create namespace map, for parsing 2106 | doc, ns_map = xml_parse(file_path + '/' + xml_file) 2107 | 2108 | # initialize output (list of text) 2109 | words = [] 2110 | 2111 | # iterate recursivley over current element and all elements below it 2112 | for elem in root.iter(): 2113 | #find elements of with tag name "w:t" 2114 | hts = elem.findall('w:t', ns_map) 2115 | 2116 | # if any found, append 2117 | if hts: 2118 | words.append(hts[0].text) 2119 | return words 2120 | ``` 2121 | 2122 | Extract text from the four sample cards. 
2123 | 2124 | 2125 | ```python 2126 | #from insurance_card_prediction import xml_parse, words_from_root 2127 | ``` 2128 | 2129 | 2130 | ```python 2131 | # List of files 2132 | file_path = "Card Samples/renamed copies" 2133 | dirs = os.listdir(file_path) 2134 | 2135 | words_on_card = {} 2136 | 2137 | # Extract text 2138 | for file in dirs: 2139 | words_on_card[file] = words_from_root(file_path, file) 2140 | 2141 | # remove card6.xml. OCR is not good enough 2142 | del words_on_card['card6.xml'] 2143 | ``` 2144 | 2145 | ### Tokenization 2146 | * These strings are all from the same line. Let's separate words into a bag-of-words for each card. 2147 | * Link: https://www.nltk.org/book/ch03.html 2148 | 2149 | Lines of words to bag-of-words 2150 | 2151 | 2152 | ```python 2153 | bag_of_words = {} 2154 | 2155 | # turn lines-of-words into words 2156 | 2157 | # for each xml file 2158 | for key in sorted(words_on_card.keys()): 2159 | bag = [] 2160 | # for each line in the XML 2161 | for i, line in enumerate(words_on_card[key]): 2162 | if i == 0: 2163 | l = words_on_card[key][i:i+3] # choose two nearest lines (five total) 2164 | elif i == 1: 2165 | l = words_on_card[key][i-1:i+3] 2166 | else: 2167 | l = words_on_card[key][i-2:i+3] 2168 | l = ' '.join(l) 2169 | #print(i, l) 2170 | 2171 | list_of_words_on_line = nltk.word_tokenize(line) 2172 | for word in list_of_words_on_line: 2173 | # save word with all words on its line 2174 | bag.append((word, l)) 2175 | #if 'Group' in word: 2176 | #print(l) 2177 | bag_of_words[key] = bag 2178 | ``` 2179 | 2180 | * Separating the words throws away information about the surrounding words. For example, 'Eric Davidson' turns into 'Eric' and 'Davidson', and they can never be rejoined. Similary, the words "Delta Dental of Illinois" should all be together, but they will be separated. 2181 | * To start, I will ignore this complication and simply try to build a classifier to tell if a word is a _name_ or not. 2182 | 2183 | ## Name Classifier 2184 | 1. Combine words from all four samples into one big bag-of-words. 2185 | 2. Label words as "name" or not. 2186 | 3. Augment the dataset with extra words and names. 2187 | 2188 | 2189 | ```python 2190 | # 1. Combine words from all four samples 2191 | word_bag = [] 2192 | for key in bag_of_words: 2193 | word_bag += bag_of_words[key] 2194 | ``` 2195 | 2196 | ### Target Variables 2197 | * Member name 2198 | * Group ID 2199 | * Member ID 2200 | 2201 | Label the data. 
2202 | 2203 | 2204 | ```python 2205 | # Label words (name/not name) 2206 | for i, word in enumerate(word_bag): 2207 | pass #if i in ind_list_group_ids: 2208 | #print(i, word[0], word[1]) 2209 | ``` 2210 | 2211 | 2212 | ```python 2213 | # list of names 2214 | ind_list_names = [153,154,156,158,269,270,455,456,605,607,609,610,611,612,613,614,647,649,698,701,719,720,721, \ 2215 | 722,723,724, 738,740,815,816,864,865,886,887,888,890,892,894,952,1032,1033,1034,1094,1095,1096, \ 2216 | 1128,1129,1130,1190,1191,1192,1240,1241,1276,1277,1278,1326,1327,1328,1402,14031438,1439,1440, \ 2217 | 1441,1462,1463,1468,1498,1500,1577,1579,1580,1641,1642,1694,1696,1749,1750,1790,1791,1845,1846, \ 2218 | 1961,1962,2005,2006,2034,2041,2043,2092,2094] # Name, Member, ID 2219 | #ind_list_names_suspect = [647,649,701,719,720,721,722,723,724,740] 2220 | target_name = [1 if i in ind_list_names else 0 for i in range(len(word_bag))] 2221 | 2222 | # list of group IDs 2223 | ind_list_group_ids = [142,483,619,662,683,770,804,860,972,1029,1197,1229,1425,1762,1805,1870,1982,2034,2050,2132] # Group, Number 2224 | #ind_list_group_ids_suspect = [662,683,804,860,972] 2225 | target_group_id = [1 if i in ind_list_group_ids else 0 for i in range(len(word_bag))] 2226 | 2227 | # list of member IDs 2228 | ind_list_member_ids = [149,150,274,460,602,653,704,708,710,715,743,744,812,868,915,947,1025,1154,1155,1236,1237,1318,1405,1444, \ 2229 | 1524,1525,1603,1754,1755,1788,1839,1912,1958,2003,2047,2086,2115] # Member , ID 2230 | #ind_list_member_ids_suspect = [653,704,708,701,715,743,744,812,868,915,947] 2231 | target_member_id = [1 if i in ind_list_member_ids else 0 for i in range(len(word_bag))] 2232 | ``` 2233 | 2234 | The names from the dataset come as all UPPERCASE. This makes the model predict _name_ for any word with an uppercase letter. To avoid given this easy tell, change the original names to the same case as the simulated names: uppercase first letter and rest lower. 2235 | 2236 | Fix case of the true names. 2237 | 2238 | 2239 | ```python 2240 | # Change names from ALL UPPERCASE to Capitalize (only first letter) 2241 | word_bag_cap = [] 2242 | for i, tup in enumerate(word_bag): 2243 | if i in ind_list_names: 2244 | name_cap = tup[0].capitalize() 2245 | tup = (name_cap, tup[1]) 2246 | word_bag_cap.append(tup) 2247 | else: 2248 | word_bag_cap.append(tup) 2249 | ``` 2250 | 2251 | Turn the data into a Pandas dataframe. 2252 | 2253 | 2254 | ```python 2255 | # create dataframe 2256 | df = pd.DataFrame(index=[tup[0] for tup in word_bag_cap]) 2257 | 2258 | df = df.assign(target_name=target_name) 2259 | df = df.assign(target_group_id=target_group_id) 2260 | df = df.assign(target_member_id=target_member_id) 2261 | df = df.assign(words_in_line=[tup[1] for tup in word_bag_cap]) 2262 | ``` 2263 | 2264 | 2265 | ```python 2266 | df.head() 2267 | ``` 2268 | 2269 | 2270 | 2271 | 2272 |
2273 | 2286 | 2287 | 2288 | 2289 | 2290 | 2291 | 2292 | 2293 | 2294 | 2295 | 2296 | 2297 | 2298 | 2299 | 2300 | 2301 | 2302 | 2303 | 2304 | 2305 | 2306 | 2307 | 2308 | 2309 | 2310 | 2311 | 2312 | 2313 | 2314 | 2315 | 2316 | 2317 | 2318 | 2319 | 2320 | 2321 | 2322 | 2323 | 2324 | 2325 | 2326 | 2327 | 2328 | 2329 | 2330 | 2331 | 2332 | 2333 |
target_nametarget_group_idtarget_member_idwords_in_line
www.aetna.com000www.aetna.com PAYER NUMBER 60054 0735 Informed...
PAYER000www.aetna.com PAYER NUMBER 60054 0735 Informed...
NUMBER000www.aetna.com PAYER NUMBER 60054 0735 Informed...
60054000www.aetna.com PAYER NUMBER 60054 0735 Informed...
0735000www.aetna.com PAYER NUMBER 60054 0735 Informed...
2334 |
2335 | 2336 | 2337 | 2338 | 2339 | ```python 2340 | # number of "names" 2341 | print('There are', df.target_name.sum(), 'names out of', df.shape[0], 'words.', round(df.target_name.sum()/df.shape[0],3), 'percent.') 2342 | ``` 2343 | 2344 | There are 90 names out of 2204 words. 0.041 percent. 2345 | 2346 | 2347 | # Features 2348 | 2349 | 2350 | ```python 2351 | def create_features(df): 2352 | """ 2353 | Creates features from words 2354 | Args: dataframe with words as the indices 2355 | Returns: dataframe with the new features 2356 | """ 2357 | 2358 | length = [] 2359 | frac_alpha = [] 2360 | frac_alpha_upper = [] 2361 | frac_alpha_lower = [] 2362 | frac_digit = [] 2363 | frac_punc = [] 2364 | frac_punc_dashes = [] 2365 | frac_punc_periods = [] 2366 | name_keywords_ind = [] 2367 | groupid_keywords_ind = [] 2368 | memberid_keywords_ind = [] 2369 | five_or_more_digits = [] 2370 | 2371 | # iterate down rows 2372 | for index, row in df.iterrows(): 2373 | 2374 | leng = len(index) 2375 | length.append(leng) 2376 | frac_alpha.append(sum([1 for char in index if char.isalpha()]) / leng) 2377 | frac_alpha_upper.append(sum([1 for char in index if (char.isalpha() and char.isupper())]) / leng) 2378 | frac_alpha_lower.append(sum([1 for char in index if (char.isalpha() and char.islower())]) / leng) 2379 | frac_digit.append(sum([1 for char in index if char.isdigit()]) / leng) 2380 | 2381 | count = lambda l1,l2: sum([1 for x in l1 if x in l2]) 2382 | frac_punc.append( count(index,set(string.punctuation)) / leng) 2383 | frac_punc_dashes.append( count(index,set(["-"])) / leng) 2384 | frac_punc_periods.append( count(index,set(["."])) / leng) 2385 | 2386 | words_in_line = row.words_in_line.split() 2387 | words_in_line_wo_punc = [word.translate(str.maketrans('', '', string.punctuation)) for word in words_in_line] 2388 | 2389 | name_keywords_ind.append( sum([1 for word in words_in_line_wo_punc if word.lower() in ['name','member','id']]) >= 1 ) 2390 | groupid_keywords_ind.append( sum([1 for word in words_in_line_wo_punc if word.lower() in ['group', 'grp']]) >=1 ) 2391 | memberid_keywords_ind.append( sum([1 for word in words_in_line_wo_punc if word.lower() in ['member', 'id']]) >=1 ) 2392 | 2393 | five_or_more_digits.append(sum([1 for char in index if char.isdigit()]) >=5) 2394 | 2395 | # add simulated=0 if not there already 2396 | if 'simulated' not in df.columns: 2397 | df = df.assign(simulated = df.shape[0]*[0]) 2398 | 2399 | # find length of each string 2400 | df = df.assign(length=length); 2401 | 2402 | # add new columns 2403 | df = df.assign(frac_alpha=frac_alpha) 2404 | df = df.assign(frac_alpha_upper=frac_alpha_upper) 2405 | df = df.assign(frac_alpha_lower=frac_alpha_lower) 2406 | df = df.assign(frac_digit=frac_digit) 2407 | df = df.assign(frac_punc=frac_punc) 2408 | df = df.assign(frac_punc_dashes=frac_punc_dashes) 2409 | df = df.assign(frac_punc_periods=frac_punc_periods) 2410 | df = df.assign(name_keywords_ind=name_keywords_ind) 2411 | df = df.assign(groupid_keywords_ind=groupid_keywords_ind) 2412 | df = df.assign(memberid_keywords_ind=memberid_keywords_ind) 2413 | df = df.assign(five_or_more_digits=five_or_more_digits) 2414 | 2415 | # check NLTK's corpus of names: https://www.cs.cmu.edu/Groups/AI/areas/nlp/corpora/names/0.html 2416 | # THIS IS CHEATING 2417 | #df = df.assign(in_nltk_corpus=[1 if word.capitalize() in names.words() else 0 for word in df.index.values]) 2418 | 2419 | return df 2420 | ``` 2421 | 2422 | 2423 | ```python 2424 | #from insurance_card_prediction import create_features 2425 | ``` 
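To make the character-level fractions above concrete, here is a small illustrative sketch (not from the original notebook) that applies the same logic as `create_features` to two hypothetical tokens, one ID-like and one name-like:

```python
# Illustrative only: compute a few of the character-level fractions from
# create_features() by hand for two hypothetical tokens.
for token in ["ZGP12345678", "Smith"]:
    n = len(token)
    frac_alpha = sum(c.isalpha() for c in token) / n
    frac_digit = sum(c.isdigit() for c in token) / n
    frac_upper = sum(c.isalpha() and c.isupper() for c in token) / n
    print(f"{token}: frac_alpha={frac_alpha:.2f}, frac_digit={frac_digit:.2f}, frac_alpha_upper={frac_upper:.2f}")
```

An ID-like token scores high on `frac_digit`, while a name-like token is almost entirely alphabetic with a single uppercase letter, which is the separation the classifier exploits.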
2426 | 2427 | 2428 | ```python 2429 | # create features 2430 | df = create_features(df) 2431 | ``` 2432 | 2433 | # Simulate Data 2434 | 2435 | 1) Names 2436 | 2) Group IDs 2437 | 3) Member IDs 2438 | 2439 | * The labels are highly imbalanced; only 0.025% of examples are "names." I will add more names to the dataset. Names are very easy to sample from since we know what realistic names are, unlike some other variables. 2440 | * Sample uniformly from top-10 names from 1960s 2441 | 2442 | 2443 | ```python 2444 | def simulate_data(df, targets=['group IDs']): 2445 | """ 2446 | Simulates names by sampling uniformly from top-10 baby names from 1960s 2447 | Args: 2448 | df: dataframe 2449 | targets: list of strings of the target variables to simulate 2450 | 2451 | Returns: dataframe augmented with more names 2452 | """ 2453 | 2454 | # SIMULATE NAMES 2455 | if 'names' in targets: 2456 | print('Simulating names') 2457 | # names https://www.ssa.gov/oact/babynames/decades/names1960s.html 2458 | male_names = ['Michael','David','John','James','Robert' ,'Mark','William','Richard','Thomas','Jeffrey'] 2459 | female_names = ['Lisa','Mary','Susan','Karen','Kimberly','Patricia','Linda','Donna','Michelle','Cynthia'] 2460 | all_names = male_names+female_names 2461 | 2462 | # generate samples 2463 | num_samples = 611 2464 | np.random.seed(102) 2465 | new_names = np.random.choice(a=all_names, size=num_samples) 2466 | new_names1 = [] 2467 | 2468 | # randomly change the capitalization (UPPER, lower, Capital) 2469 | for i, name in enumerate(new_names): 2470 | j = np.random.choice(2) 2471 | if j == 0: 2472 | new_names1.append(name.lower()) 2473 | elif j == 1: 2474 | new_names1.append(name.upper()) 2475 | else: 2476 | new_names1.append(name) 2477 | 2478 | # dataframe with new samples 2479 | df2 = pd.DataFrame(index=new_names1) 2480 | df2 = df2.assign(target_name=num_samples*[1.]) 2481 | df2 = df2.assign(target_group_id=num_samples*[0.]) 2482 | df2 = df2.assign(target_member_id=num_samples*[0.]) 2483 | 2484 | df = df.append(df2) 2485 | 2486 | 2487 | # SIMULATE GROUP IDS 2488 | if 'group IDs' in targets: 2489 | print('Simulating Group IDs') 2490 | 2491 | # list group IDs 2492 | grp_ids = list(df.loc[df.target_group_id==1].index) 2493 | 2494 | # bring ratio to 40% balance of group IDs 2495 | num_new_grp_ids = int((2*df.shape[0] - 5*len(grp_ids))/3) 2496 | 2497 | # for new words 2498 | new_grp_ids = [] 2499 | 2500 | np.random.seed(102) 2501 | # to replace alpha character randomly 2502 | replace_word = lambda w: random.choice(string.ascii_uppercase) if w.isupper() else random.choice(string.ascii_lowercase) 2503 | 2504 | # enough to reach 40% 2505 | for i in range(int(num_new_grp_ids)): 2506 | 2507 | # randomly select Group ID to copy 2508 | grp_id_to_copy = random.choice(grp_ids) 2509 | 2510 | # copy Group ID 2511 | new_grp_ids.append(''.join([random.choice(string.digits) if char.isdigit() else replace_word(char) if char.isalpha() else char for char in grp_id_to_copy])) 2512 | 2513 | # create new dataframe 2514 | df3 = pd.DataFrame(index=new_grp_ids) 2515 | df3 = df3.assign(target_name=num_new_grp_ids*[0.]) 2516 | df3 = df3.assign(target_group_id=num_new_grp_ids*[1.]) # all ones 2517 | df3 = df3.assign(target_member_id=num_new_grp_ids*[0.]) 2518 | df3 = df3.assign(words_in_line=new_grp_ids) # lines by themselves (no neighbors) 2519 | df3 = df3.assign(simulated=df3.shape[0]*[1.]) # simulated=1 2520 | 2521 | # append new df to old df 2522 | df = df.append(df3)[df.columns.tolist()] 2523 | 2524 | 2525 | # SIMULATE MEMBER IDS 
2526 | if 'member IDs' in targets: 2527 | print('Simulating Member IDs') 2528 | # list member IDs 2529 | member_ids = list(df.loc[df.target_member_id==1].index) 2530 | 2531 | # bring ratio to 40% balance of group IDs 2532 | num_new_member_ids = int((2*df.shape[0] - 5*len(member_ids))/3) 2533 | 2534 | # for new words 2535 | new_member_ids = [] 2536 | 2537 | np.random.seed(102) 2538 | # to replace alpha character randomly 2539 | replace_word = lambda w: random.choice(string.ascii_uppercase) if w.isupper() else random.choice(string.ascii_lowercase) 2540 | 2541 | # enough to reach 40% 2542 | for i in range(int(num_new_member_ids)): 2543 | 2544 | # randomly select member ID to copy 2545 | member_id_to_copy = random.choice(member_ids) 2546 | 2547 | # copy Group ID 2548 | new_member_ids.append(''.join([random.choice(string.digits) if char.isdigit() else replace_word(char) if char.isalpha() else char for char in member_id_to_copy])) 2549 | 2550 | # create new dataframe 2551 | df4 = pd.DataFrame(index=new_member_ids) 2552 | df4 = df4.assign(target_name=num_new_member_ids*[0.]) 2553 | df4 = df4.assign(target_group_id=num_new_member_ids*[0.]) 2554 | df4 = df4.assign(target_member_id=num_new_member_ids*[1.]) # all ones 2555 | df4 = df4.assign(words_in_line=new_member_ids) # lines by themselves (no neighbors) 2556 | df4 = df4.assign(simulated=df4.shape[0]*[1.]) # simulated=1 2557 | 2558 | # append new df to old df 2559 | df = df.append(df4)[df.columns.tolist()] 2560 | 2561 | return df 2562 | ``` 2563 | 2564 | Simulate desired data. 2565 | 2566 | 2567 | ```python 2568 | #from insurance_card_prediction import simulate_data_1 2569 | 2570 | #simulate data (BOTH group IDs and member IDs) 2571 | df1 = simulate_data(df, ['group IDs','member IDs']) 2572 | #df1 = df.copy() 2573 | ``` 2574 | 2575 | Simulating Group IDs 2576 | Simulating Member IDs 2577 | 2578 | 2579 | Create features for the new rows. 2580 | 2581 | 2582 | ```python 2583 | # create features for new rows 2584 | df2 = create_features(df1) 2585 | ``` 2586 | 2587 | # Modeling 2588 | 2589 | ## Prepare Data 2590 | 2591 | 1. Standardize the numeric cariables 2592 | 2. 
One-hot encode the categorical variables 2593 | 2594 | 2595 | ```python 2596 | # https://jorisvandenbossche.github.io/blog/2018/05/28/scikit-learn-columntransformer/ 2597 | from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder, LabelBinarizer 2598 | from sklearn.compose import ColumnTransformer, make_column_transformer 2599 | 2600 | numerical_columns = df2.columns[5:13].tolist() # add LENGTH (5) 2601 | categorical_columns = df2.columns[[4,13,14,15,16]].tolist() # remove LENGTH (5) 2602 | 2603 | from sklearn.base import TransformerMixin #gives fit_transform method for free 2604 | 2605 | class MyLabelBinarizer(TransformerMixin): 2606 | def __init__(self, *args, **kwargs): 2607 | self.encoder = LabelBinarizer(*args, **kwargs) 2608 | def fit(self, x, y=0): 2609 | self.encoder.fit(x) 2610 | return self 2611 | def transform(self, x, y=0): 2612 | return self.encoder.transform(x) 2613 | 2614 | preprocess = make_column_transformer( 2615 | (StandardScaler(), numerical_columns) 2616 | #(MyLabelBinarizer(), categorical_columns) 2617 | #OneHotEncoder(categories='auto'), categorical_columns) 2618 | ) 2619 | 2620 | df_cat = pd.DataFrame(index=df2.index) 2621 | 2622 | # one-hot encode categorical variables 2623 | for col in categorical_columns: 2624 | #df_temp = df2[col].astype('category') 2625 | #df_temp_2 = pd.get_dummies(df_temp, prefix=col) 2626 | #df_cat = pd.concat([df_cat, df_temp_2], axis=1) 2627 | le = LabelEncoder() 2628 | X = le.fit_transform(df2[col]) 2629 | df_temp = pd.DataFrame(data=X, index=df2.index.values , columns=[col]) 2630 | df_cat = pd.concat([df_cat, df_temp], axis=1) 2631 | 2632 | #for col in categorical_columns: 2633 | # df_temp 2634 | 2635 | # transform. returns numpy array 2636 | X = preprocess.fit_transform(df2) 2637 | df_num = pd.DataFrame(index=df2.index, data=X, columns=numerical_columns) 2638 | 2639 | # transform. returns numpy array 2640 | #X = preprocess.fit_transform(df2) 2641 | 2642 | # combine numerical and concatenated 2643 | df3 = pd.concat([df_num, df_cat], axis=1) 2644 | 2645 | # true label - now member IDs 2646 | y = [] 2647 | for index, row in df2.iterrows(): 2648 | if row.target_name == 1: 2649 | y.append(0) 2650 | elif row.target_group_id == 1: 2651 | y.append(1) 2652 | elif row.target_member_id == 1: 2653 | y.append(0) 2654 | else: 2655 | y.append(0) 2656 | 2657 | # add target variable 2658 | df3 = df3.assign(y=y) 2659 | ``` 2660 | 2661 | C:\Users\Emile\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py:625: DataConversionWarning: Data with input dtype int64, float64 were all converted to float64 by StandardScaler. 2662 | return self.partial_fit(X, y) 2663 | C:\Users\Emile\Anaconda3\lib\site-packages\sklearn\base.py:462: DataConversionWarning: Data with input dtype int64, float64 were all converted to float64 by StandardScaler. 
2664 | return self.fit(X, **fit_params).transform(X) 2665 | 2666 | 2667 | Split into training and test sets 2668 | 2669 | 2670 | ```python 2671 | from sklearn.model_selection import StratifiedShuffleSplit 2672 | from sklearn.model_selection import train_test_split 2673 | pd.options.mode.chained_assignment = None # default='warn' 2674 | 2675 | X = df3.iloc[:,:-1] 2676 | y = df3.iloc[:,-1] 2677 | 2678 | X_train, X_test, y_train, y_test = train_test_split(X , y, 2679 | stratify=y, 2680 | test_size=0.4, 2681 | random_state=102) 2682 | 2683 | X_train_simulated = pd.DataFrame(X_train.loc[:, ('simulated')].copy()) 2684 | X_test_simulated = pd.DataFrame(X_test.loc[:, ('simulated')].copy()) 2685 | 2686 | X_train.drop(columns=['simulated'], inplace=True) 2687 | X_test.drop(columns=['simulated'], inplace=True) 2688 | 2689 | y_train = pd.DataFrame(y_train) 2690 | y_test = pd.DataFrame(y_test) 2691 | ``` 2692 | 2693 | ## Gradient Boosting 2694 | 2695 | 2696 | ```python 2697 | import xgboost as xgb 2698 | from xgboost import XGBClassifier 2699 | from xgboost import plot_importance 2700 | from sklearn.metrics import accuracy_score 2701 | from sklearn.metrics import confusion_matrix 2702 | ``` 2703 | 2704 | 2705 | ```python 2706 | # specify parameters via map 2707 | param = {'max_depth':2, 'eta':1, 'silent':1, 'objective':'multi:softmax', 'num_class':3, 'random_state':102} 2708 | num_round = 2 2709 | 2710 | # initialize model 2711 | gb_clf = XGBClassifier(max_depth=3, objective='binary:logistic') 2712 | 2713 | # fit model 2714 | eval_set = [(X_test, y_test.y.values)] 2715 | gb_clf.fit(X_train, y_train.y.values, eval_metric="error", eval_set=eval_set, verbose=True, early_stopping_rounds=None) 2716 | 2717 | # make prediction 2718 | y_pred = gb_clf.predict(X_test) 2719 | 2720 | # predict probabilties 2721 | y_pred_prob = gb_clf.predict_proba(X_test) 2722 | 2723 | print("Accuracy on training set: {:.3f}".format(gb_clf.score(X_train, y_train))) 2724 | print("Accuracy on test set: {:.3f}".format(gb_clf.score(X_test, y_test))) 2725 | ``` 2726 | 2727 | [0] validation_0-error:0.193669 2728 | [1] validation_0-error:0.193669 2729 | [2] validation_0-error:0.157018 2730 | [3] validation_0-error:0.143274 2731 | [4] validation_0-error:0.143274 2732 | [5] validation_0-error:0.143274 2733 | [6] validation_0-error:0.143274 2734 | [7] validation_0-error:0.154519 2735 | [8] validation_0-error:0.154519 2736 | [9] validation_0-error:0.127447 2737 | [10] validation_0-error:0.127447 2738 | [11] validation_0-error:0.127447 2739 | [12] validation_0-error:0.127447 2740 | [13] validation_0-error:0.127447 2741 | [14] validation_0-error:0.103707 2742 | [15] validation_0-error:0.103707 2743 | [16] validation_0-error:0.103707 2744 | [17] validation_0-error:0.103707 2745 | [18] validation_0-error:0.103707 2746 | [19] validation_0-error:0.103707 2747 | [20] validation_0-error:0.103707 2748 | [21] validation_0-error:0.103707 2749 | [22] validation_0-error:0.098709 2750 | [23] validation_0-error:0.098292 2751 | [24] validation_0-error:0.098292 2752 | [25] validation_0-error:0.098292 2753 | [26] validation_0-error:0.098292 2754 | [27] validation_0-error:0.098292 2755 | [28] validation_0-error:0.098292 2756 | [29] validation_0-error:0.098292 2757 | [30] validation_0-error:0.098292 2758 | [31] validation_0-error:0.098292 2759 | [32] validation_0-error:0.098709 2760 | [33] validation_0-error:0.098709 2761 | [34] validation_0-error:0.099125 2762 | [35] validation_0-error:0.099125 2763 | [36] validation_0-error:0.099542 2764 | [37] 
validation_0-error:0.099542 2765 | [38] validation_0-error:0.099542 2766 | [39] validation_0-error:0.099542 2767 | [40] validation_0-error:0.099542 2768 | [41] validation_0-error:0.099542 2769 | [42] validation_0-error:0.099542 2770 | [43] validation_0-error:0.099542 2771 | [44] validation_0-error:0.099542 2772 | [45] validation_0-error:0.099125 2773 | [46] validation_0-error:0.098292 2774 | [47] validation_0-error:0.098292 2775 | [48] validation_0-error:0.098292 2776 | [49] validation_0-error:0.098292 2777 | [50] validation_0-error:0.098292 2778 | [51] validation_0-error:0.098292 2779 | [52] validation_0-error:0.098292 2780 | [53] validation_0-error:0.098292 2781 | [54] validation_0-error:0.087047 2782 | [55] validation_0-error:0.087047 2783 | [56] validation_0-error:0.087047 2784 | [57] validation_0-error:0.087047 2785 | [58] validation_0-error:0.08788 2786 | [59] validation_0-error:0.08788 2787 | [60] validation_0-error:0.087047 2788 | [61] validation_0-error:0.086214 2789 | [62] validation_0-error:0.086214 2790 | [63] validation_0-error:0.086214 2791 | [64] validation_0-error:0.078301 2792 | [65] validation_0-error:0.078301 2793 | [66] validation_0-error:0.078301 2794 | [67] validation_0-error:0.078301 2795 | [68] validation_0-error:0.078301 2796 | [69] validation_0-error:0.077884 2797 | [70] validation_0-error:0.077884 2798 | [71] validation_0-error:0.077884 2799 | [72] validation_0-error:0.077884 2800 | [73] validation_0-error:0.064556 2801 | [74] validation_0-error:0.064556 2802 | [75] validation_0-error:0.064556 2803 | [76] validation_0-error:0.064556 2804 | [77] validation_0-error:0.064556 2805 | [78] validation_0-error:0.064556 2806 | [79] validation_0-error:0.064556 2807 | [80] validation_0-error:0.064556 2808 | [81] validation_0-error:0.064556 2809 | [82] validation_0-error:0.064556 2810 | [83] validation_0-error:0.064556 2811 | [84] validation_0-error:0.064556 2812 | [85] validation_0-error:0.064556 2813 | [86] validation_0-error:0.064556 2814 | [87] validation_0-error:0.062474 2815 | [88] validation_0-error:0.062474 2816 | [89] validation_0-error:0.062474 2817 | [90] validation_0-error:0.062474 2818 | [91] validation_0-error:0.062474 2819 | [92] validation_0-error:0.062474 2820 | [93] validation_0-error:0.062474 2821 | [94] validation_0-error:0.062474 2822 | [95] validation_0-error:0.062474 2823 | [96] validation_0-error:0.062474 2824 | [97] validation_0-error:0.062474 2825 | [98] validation_0-error:0.062474 2826 | [99] validation_0-error:0.062474 2827 | Accuracy on training set: 0.930 2828 | Accuracy on test set: 0.938 2829 | 2830 | 2831 | Plot feature importances 2832 | 2833 | 2834 | ```python 2835 | plot_importance(gb_clf); 2836 | ``` 2837 | 2838 | 2839 | ![png](images/output_63_0.png) 2840 | 2841 | 2842 | Combine actuals with predictions. 
2843 | 2844 | 2845 | ```python 2846 | # combine actual with predicted 2847 | y_test_combined = y_test.rename(index=str, columns={"y": "y_true"}).assign(y_pred=y_pred).assign(simulated=X_test_simulated['simulated'].values.astype(int)) 2848 | 2849 | y_test_combined = y_test_combined.assign(got_right=(y_test_combined.y_true == y_test_combined.y_pred).astype(int)) 2850 | 2851 | y_test_combined = y_test_combined.assign(y_pred_prob_0=y_pred_prob[:,0]) 2852 | y_test_combined = y_test_combined.assign(y_pred_prob_1=y_pred_prob[:,1]) 2853 | #y_test_combined = y_test_combined.assign(y_pred_prob_2=y_pred_prob[:,2]) 2854 | 2855 | ``` 2856 | 2857 | 2858 | ```python 2859 | from insurance_card_prediction import plot_confusion_matrix 2860 | ``` 2861 | 2862 | Plot the confusion matrix. 2863 | 2864 | 2865 | ```python 2866 | cm = confusion_matrix(y_test_combined.y_true.values, y_test_combined.y_pred.values) 2867 | 2868 | # plot it 2869 | plot_confusion_matrix(cm, 2870 | target_names=['Not Group ID','Group ID'], 2871 | title='Total (real and simulated)', 2872 | cmap=None, 2873 | normalize=True) 2874 | 2875 | print(cm) 2876 | ``` 2877 | 2878 | 2879 | ![png](images/output_68_0.png) 2880 | 2881 | 2882 | [[1724 96] 2883 | [ 54 527]] 2884 | 2885 | 2886 | Split into real and simulated. 2887 | 2888 | 2889 | ```python 2890 | # cm1 is simulated 2891 | cm1 = confusion_matrix(y_test_combined.loc[y_test_combined.simulated.values==1].y_true.values, y_test_combined.loc[y_test_combined.simulated.values==1].y_pred.values) 2892 | 2893 | if cm1.shape == (2,2): 2894 | newrow = np.array([[0,0]]) 2895 | cm1 = np.vstack((newrow, cm1)) 2896 | 2897 | newcol = np.array([[0],[0],[0]]) 2898 | cm1 = np.hstack((newcol, cm1)) 2899 | 2900 | # plot it 2901 | plot_confusion_matrix(cm1, 2902 | target_names=['Neither','Group ID', 'Member ID'], 2903 | title='Simulated', 2904 | cmap=None, 2905 | normalize=True) 2906 | 2907 | print(cm1) 2908 | ``` 2909 | 2910 | 2911 | ![png](images/output_70_0.png) 2912 | 2913 | 2914 | [[ 0 0 0] 2915 | [ 0 527 46] 2916 | [ 1 73 875]] 2917 | 2918 | 2919 | Now look at reals. 2920 | 2921 | 2922 | ```python 2923 | # cm2 is real 2924 | cm2 = confusion_matrix(y_test_combined.loc[y_test_combined.simulated.values==0].y_true.values, y_test_combined.loc[y_test_combined.simulated.values==0].y_pred.values) 2925 | 2926 | #newrow = np.array([[0,0]]) 2927 | #cm1 = np.vstack((newrow, cm1)) 2928 | 2929 | #newcol = np.array([[0],[0],[0]]) 2930 | #cm1 = np.hstack((newcol, cm1)) 2931 | 2932 | # plot it 2933 | plot_confusion_matrix(cm2, 2934 | target_names=['Neither','Group ID', 'Member ID'], 2935 | title='Real', 2936 | cmap=None, 2937 | normalize=True) 2938 | 2939 | print(cm2) 2940 | ``` 2941 | 2942 | 2943 | ![png](images/output_72_0.png) 2944 | 2945 | 2946 | [[778 37 45] 2947 | [ 4 4 0] 2948 | [ 3 0 8]] 2949 | 2950 | 2951 | 2952 | ```python 2953 | a = y_test_combined.loc[y_test_combined.simulated==0] 2954 | a = y_test_combined.loc[y_test_combined.y_true==0] 2955 | a = a.loc[a.y_pred!=0] 2956 | a.loc[:,['y_true','y_pred']] 2957 | ``` 2958 | 2959 | 2960 | 2961 | 2962 |
|  | y_true | y_pred |
| --- | --- | --- |
| 30374-0800 | 0 | 2 |
| 43 | 0 | 2 |
| 35 | 0 | 2 |
| WASHINGTON | 0 | 1 |
| 20 | 0 | 2 |
| 87726 | 0 | 2 |
| 05000 | 0 | 2 |
| 100000001 | 0 | 2 |
| 610011 | 0 | 1 |
| CHILDI | 0 | 1 |
| 91 | 0 | 2 |
| HOSPITAL | 0 | 1 |
| ME | 0 | 2 |
| IL | 0 | 2 |
| REQUIRED | 0 | 1 |
| MO | 0 | 2 |
| TRINET | 0 | 1 |
| HOSPITAL | 0 | 1 |
| ID | 0 | 2 |
| NO | 0 | 2 |
| 003586 | 0 | 1 |
| 76342 | 0 | 2 |
| DENTAL | 0 | 1 |
| NA | 0 | 2 |
| 004915 | 0 | 1 |
| CMS-H3832 | 0 | 2 |
| 00000000 | 0 | 1 |
| OF | 0 | 2 |
| 00699999 | 0 | 1 |
| PO | 0 | 2 |
| ... | ... | ... |
| 23735125 | 0 | 1 |
| PROVIDERS | 0 | 2 |
| GA | 0 | 2 |
| Independent | 0 | 2 |
| HEALTH | 0 | 1 |
| DR | 0 | 2 |
| 004336 | 0 | 1 |
| A000012334456 | 0 | 1 |
| HOSPITAL | 0 | 1 |
| OH | 0 | 2 |
| 11 | 0 | 2 |
| 01 | 0 | 2 |
| 004336 | 0 | 1 |
| 7952304120 | 0 | 2 |
| CHILD2SMITH | 0 | 1 |
| 999999999 | 0 | 2 |
| 23735125 | 0 | 1 |
| H0432 | 0 | 1 |
| 80840 | 0 | 2 |
| RX | 0 | 2 |
| 122222222 | 0 | 2 |
| XX | 0 | 2 |
| 123456789 | 0 | 2 |
| 00000 | 0 | 2 |
| KANSAS | 0 | 1 |
| WG | 0 | 2 |
| 610342 | 0 | 1 |
| HAWAII | 0 | 1 |
| 017010 | 0 | 1 |
| 232-1164 | 0 | 1 |

82 rows × 2 columns
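The table above lists test lines whose true class is Neither (0) but which the model assigned to Group ID (1) or Member ID (2); many are short state abbreviations or common dictionary words that superficially resemble IDs. Per-class precision and recall summarize these errors more compactly than the confusion matrices. A minimal sketch, assuming `y_test_combined` from the cells above and the class ordering 0 = Neither, 1 = Group ID, 2 = Member ID used in the plots:

```python
from sklearn.metrics import classification_report

# Precision, recall, and F1 per class on the combined (real + simulated) test set
print(classification_report(y_test_combined.y_true,
                            y_test_combined.y_pred,
                            labels=[0, 1, 2],
                            target_names=['Neither', 'Group ID', 'Member ID']))
```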
3294 | 3295 | 3296 | 3297 | 3298 | ```python 3299 | y_test_combined.loc[y_test_combined.y_true!=0].head(10) 3300 | ``` 3301 | 3302 | 3303 | 3304 | 3305 |
|  | y_true | y_pred | simulated | got_right | y_pred_prob_0 | y_pred_prob_1 | y_pred_prob_2 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| PHOAUWTIK | 2 | 2 | 1 | 1 | 0.099076 | 0.096246 | 0.804678 |
| 987057200179 | 1 | 1 | 1 | 1 | 0.016479 | 0.885554 | 0.097966 |
| 87019363638 | 2 | 2 | 1 | 1 | 0.011182 | 0.065865 | 0.922952 |
| 891100251 | 2 | 2 | 1 | 1 | 0.005606 | 0.019466 | 0.974928 |
| 763537830378 | 1 | 1 | 1 | 1 | 0.016479 | 0.885554 | 0.097966 |
| 966337739-12 | 2 | 2 | 1 | 1 | 0.016701 | 0.052785 | 0.930514 |
| 08889719 | 1 | 1 | 1 | 1 | 0.037798 | 0.690810 | 0.271392 |
| K8535 | 2 | 1 | 1 | 0 | 0.051245 | 0.678274 | 0.270481 |
| 53755132 | 1 | 1 | 1 | 1 | 0.037798 | 0.690810 | 0.271392 |
| 377869079-31 | 2 | 2 | 1 | 1 | 0.016701 | 0.052785 | 0.930514 |
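The `y_pred_prob_*` columns are the class probabilities from `predict_proba`, and the predicted class is the column with the highest probability, so that maximum can be read as the model's confidence in each line; the next cell uses a 0.57 cutoff on it to pull out low-confidence predictions. A quick sanity check along those lines, assuming `y_pred` and `y_pred_prob` from the XGBoost cells above:

```python
import numpy as np

# The predicted class should agree with the highest-probability column
print((np.argmax(y_pred_prob, axis=1) == y_pred).mean())

# Maximum class probability, used below as a confidence score
confidence = y_pred_prob.max(axis=1)
print(confidence[:5])
```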
3436 | 3437 | 3438 | 3439 | 3440 | ```python 3441 | low_prob_mask = y_test_combined.iloc[:,4:7].max(axis=1) < 0.57 3442 | y_test_combined.loc[low_prob_mask][115:126] 3443 | ``` 3444 | 3445 | 3446 | 3447 | 3448 |
|  | y_true | y_pred | simulated | got_right | y_pred_prob_0 | y_pred_prob_1 | y_pred_prob_2 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 39250 | 2 | 2 | 1 | 1 | 0.049223 | 0.381607 | 0.569170 |
| 10527 | 2 | 2 | 1 | 1 | 0.049223 | 0.381607 | 0.569170 |
| 00577 | 1 | 2 | 1 | 0 | 0.049223 | 0.381607 | 0.569170 |
| 26709 | 2 | 2 | 1 | 1 | 0.049223 | 0.381607 | 0.569170 |
| 92161 | 2 | 2 | 1 | 1 | 0.049223 | 0.381607 | 0.569170 |
| 57854 | 1 | 2 | 1 | 0 | 0.049223 | 0.381607 | 0.569170 |
| 93095 | 1 | 2 | 1 | 0 | 0.049223 | 0.381607 | 0.569170 |
| 08328 | 2 | 2 | 1 | 1 | 0.049223 | 0.381607 | 0.569170 |
| 85563 | 1 | 2 | 1 | 0 | 0.049223 | 0.381607 | 0.569170 |
| 7952304120 | 0 | 2 | 0 | 0 | 0.410342 | 0.130900 | 0.458758 |
| CHILD2SMITH | 0 | 1 | 0 | 0 | 0.225169 | 0.429728 | 0.345103 |
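These low-confidence rows (maximum class probability below 0.57) are natural candidates for manual review in the clinic workflow rather than automatic entry. A minimal sketch of that idea, assuming `gb_clf` and `X_test` from above; the cutoff and the `-1` "needs review" label are illustrative choices, not part of the original pipeline:

```python
def predict_with_review(clf, X, threshold=0.57):
    """Predict classes, marking low-confidence lines with -1 for manual review."""
    proba = clf.predict_proba(X)
    preds = proba.argmax(axis=1)
    preds[proba.max(axis=1) < threshold] = -1  # route uncertain lines to a human
    return preds

preds = predict_with_review(gb_clf, X_test)
print((preds == -1).sum(), "of", len(preds), "test lines flagged for review")
```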
3589 | 3590 | 3591 | 3592 | 3593 | ```python 3594 | 3595 | ``` 3596 | 3597 | 3598 | ```python 3599 | y_test_combined.iloc[:,4:7].max(axis=1).hist(bins=20); 3600 | plt.xlabel('Predicted probability of max class') 3601 | plt.ylabel('Frequency') 3602 | ``` 3603 | 3604 | 3605 | 3606 | 3607 | Text(0, 0.5, 'Frequency') 3608 | 3609 | 3610 | 3611 | 3612 | ![png](images/output_77_1.png) 3613 | 3614 | 3615 | 3616 | ```python 3617 | import seaborn as sns 3618 | ``` 3619 | 3620 | 3621 | ```python 3622 | sns.set_style('darkgrid') 3623 | sns.distplot(y_test_combined.iloc[:,4:7].max(axis=1), norm_hist=False); 3624 | plt.xlabel('Predicted probability of max class') 3625 | plt.ylabel('Frequency in percentages') 3626 | ``` 3627 | 3628 | 3629 | 3630 | 3631 | Text(0, 0.5, 'Frequency in percentages') 3632 | 3633 | 3634 | 3635 | 3636 | ![png](images/output_79_1.png) 3637 | 3638 | 3639 | 3640 | ```python 3641 | sns.set_style('darkgrid') 3642 | ``` 3643 | 3644 | 3645 | ```python 3646 | 3647 | ``` 3648 | 3649 | 3650 | ```python 3651 | 3652 | ``` 3653 | 3654 | 3655 | ```python 3656 | 3657 | ``` 3658 | 3659 | 3660 | ```python 3661 | 3662 | ``` 3663 | 3664 | 3665 | ```python 3666 | 3667 | ``` 3668 | 3669 | ### Random Forest 3670 | Used for Group IDs classification. 3671 | 3672 | 3673 | ```python 3674 | import warnings 3675 | warnings.filterwarnings("ignore", category=FutureWarning) 3676 | 3677 | from sklearn.ensemble import RandomForestClassifier 3678 | 3679 | forest_clf = RandomForestClassifier(random_state=102) 3680 | forest_clf.fit(X_train, y_train.y.values) 3681 | 3682 | y_pred = forest_clf.predict(X_test) 3683 | print("%d out of %d exmaples were wrong." 3684 | % ((y_test.y != y_pred).sum(), X_test.shape[0])) 3685 | ``` 3686 | 3687 | 187 out of 2401 exmaples were wrong. 3688 | 3689 | 3690 | Confusion matrix 3691 | 3692 | 3693 | ```python 3694 | from sklearn.metrics import confusion_matrix 3695 | 3696 | cm = confusion_matrix(y_test.y.values, y_pred) 3697 | print("Confusion matrix: \n",cm, "\n") 3698 | 3699 | tn, fp, fn, tp = cm.ravel() 3700 | print(' TN:',tn, '\n FP:',fp, '\n FN:',fn, '\n TP',tp) 3701 | ``` 3702 | 3703 | Confusion matrix: 3704 | [[1344 97] 3705 | [ 90 870]] 3706 | 3707 | TN: 1344 3708 | FP: 97 3709 | FN: 90 3710 | TP 870 3711 | 3712 | 3713 | Plot the confusion matrix 3714 | 3715 | 3716 | ```python 3717 | # https://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html 3718 | def plot_confusion_matrix(cm, 3719 | target_names, 3720 | title='Confusion matrix', 3721 | cmap=None, 3722 | normalize=True): 3723 | """ 3724 | given a sklearn confusion matrix (cm), make a nice plot 3725 | 3726 | Arguments 3727 | --------- 3728 | cm: confusion matrix from sklearn.metrics.confusion_matrix 3729 | 3730 | target_names: given classification classes such as [0, 1, 2] 3731 | the class names, for example: ['high', 'medium', 'low'] 3732 | 3733 | title: the text to display at the top of the matrix 3734 | 3735 | cmap: the gradient of the values displayed from matplotlib.pyplot.cm 3736 | see http://matplotlib.org/examples/color/colormaps_reference.html 3737 | plt.get_cmap('jet') or plt.cm.Blues 3738 | 3739 | normalize: If False, plot the raw numbers 3740 | If True, plot the proportions 3741 | 3742 | Usage 3743 | ----- 3744 | plot_confusion_matrix(cm = cm, # confusion matrix created by 3745 | # sklearn.metrics.confusion_matrix 3746 | normalize = True, # show proportions 3747 | target_names = y_labels_vals, # list of names of the classes 3748 | title = best_estimator_name) # title of graph 3749 | 
3750 | Citiation 3751 | --------- 3752 | http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html 3753 | 3754 | """ 3755 | import matplotlib.pyplot as plt 3756 | import numpy as np 3757 | import itertools 3758 | 3759 | accuracy = np.trace(cm) / float(np.sum(cm)) 3760 | misclass = 1 - accuracy 3761 | 3762 | if cmap is None: 3763 | cmap = plt.get_cmap('Blues') 3764 | 3765 | plt.figure(figsize=(8, 6)) 3766 | plt.imshow(cm, interpolation='nearest', cmap=cmap) 3767 | plt.title(title) 3768 | plt.colorbar() 3769 | 3770 | if target_names is not None: 3771 | tick_marks = np.arange(len(target_names)) 3772 | plt.xticks(tick_marks, target_names, rotation=45) 3773 | plt.yticks(tick_marks, target_names) 3774 | 3775 | if normalize: 3776 | cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis] 3777 | 3778 | 3779 | thresh = cm.max() / 1.5 if normalize else cm.max() / 2 3780 | for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])): 3781 | if normalize: 3782 | plt.text(j, i, "{:0.2f}".format(cm[i, j],2), 3783 | horizontalalignment="center", 3784 | color="black" if cm[i, j] > thresh else "black", fontsize=30) 3785 | else: 3786 | plt.text(j, i, "{:,}".format(cm[i, j]), 3787 | horizontalalignment="center", 3788 | color="white" if cm[i, j] > thresh else "black", fontsize=30) 3789 | 3790 | 3791 | plt.tight_layout() 3792 | plt.ylabel('True label') 3793 | plt.xlabel('Predicted label') 3794 | #plt.xlabel('Predicted label\naccuracy={:0.4f}; misclass={:0.4f}'.format(accuracy, misclass)) 3795 | plt.show() 3796 | ``` 3797 | 3798 | 3799 | ```python 3800 | # plot it 3801 | plot_confusion_matrix(cm, 3802 | target_names=['Not a Group ID','Group ID'], 3803 | title='Confusion matrix', 3804 | cmap=None, 3805 | normalize=True) 3806 | 3807 | print(cm) 3808 | ``` 3809 | 3810 | 3811 | ![png](images/output_92_0.png) 3812 | 3813 | 3814 | [[825 48] 3815 | [ 2 580]] 3816 | 3817 | 3818 | 3819 | ```python 3820 | # combine actual with predicted 3821 | y_test_combined = y_test.rename(index=str, columns={"y": "y_true"}).assign(y_pred=y_pred).assign(simulated=X_test_simulated['simulated_1'].values.astype(int)) 3822 | #y_test_combined = y_test_combined.assign(simulated_0=X_test_simulated['simulated_1'].values) 3823 | y_test_combined = y_test_combined.assign(got_right=(y_test_combined.y_true == y_test_combined.y_pred).astype(int)) 3824 | 3825 | y_test_combined.head() 3826 | ``` 3827 | 3828 | 3829 | 3830 | 3831 |
|  | y_true | y_pred | simulated | got_right |
| --- | --- | --- | --- | --- |
| INSURANCE | 0 | 0 | 0 | 1 |
| John | 0 | 0 | 0 | 1 |
| 967462-020-14860 | 1 | 1 | 1 | 1 |
| AT | 0 | 0 | 0 | 1 |
| 636498 | 1 | 1 | 1 | 1 |
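Since the `simulated` flag is carried through to `y_test_combined`, accuracy on real versus simulated lines can be compared directly before looking at the separate confusion matrices. A small sketch, assuming `y_test_combined` from the cell above:

```python
# Mean of got_right = accuracy within each group (simulated = 0 is real data, 1 is simulated)
print(y_test_combined.groupby('simulated')['got_right'].agg(['mean', 'count']))
```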
3894 | 3895 | 3896 | 3897 | 3898 | ```python 3899 | mask = (y_test_combined.simulated==0)# & (y_test_combined.y_true==0) 3900 | 3901 | cm2 = confusion_matrix(y_test_combined[mask].y_true.values, y_test_combined[mask].y_pred.values) 3902 | 3903 | if cm2.shape == (1,1): 3904 | cm2 = np.array([[0,0],[0,cm2[0,0]]]) 3905 | 3906 | 3907 | plot_confusion_matrix(cm2, 3908 | target_names=['Not a Group ID','Group ID'], 3909 | title='Confusion matrix', 3910 | cmap=None, 3911 | normalize=True); 3912 | 3913 | print('Real data') 3914 | print("Confusion matrix: \n",cm2, "\n") 3915 | 3916 | ``` 3917 | 3918 | 3919 | ![png](images/output_94_0.png) 3920 | 3921 | 3922 | Real data 3923 | Confusion matrix: 3924 | [[825 48] 3925 | [ 2 8]] 3926 | 3927 | 3928 | 3929 | 3930 | ```python 3931 | 3932 | ``` 3933 | 3934 | 3935 | ```python 3936 | 3937 | ``` 3938 | 3939 | 3940 | ```python 3941 | 3942 | ``` 3943 | 3944 | 3945 | ```python 3946 | 3947 | ``` 3948 | 3949 | 3950 | ```python 3951 | 3952 | ``` 3953 | 3954 | 3955 | ```python 3956 | 3957 | ``` 3958 | 3959 | 3960 | ```python 3961 | 3962 | ``` 3963 | 3964 | Find the area under the curve. 3965 | 3966 | 3967 | ```python 3968 | """from sklearn.metrics import roc_auc_score 3969 | 3970 | y_scores = forest_clf.predict_proba(X) 3971 | print('AUC:', round(roc_auc_score(y, y_scores[:,1]),3))""" 3972 | ``` 3973 | 3974 | 3975 | 3976 | 3977 | "from sklearn.metrics import roc_auc_score\n\ny_scores = forest_clf.predict_proba(X)\nprint('AUC:', round(roc_auc_score(y, y_scores[:,1]),3))" 3978 | 3979 | 3980 | 3981 | Find feature importances 3982 | 3983 | 3984 | ```python 3985 | """# feature importance 3986 | df_fi = pd.DataFrame(index = df.columns.tolist()[3:]) 3987 | df_fi = df_fi.assign(importance=forest_clf.feature_importances_) 3988 | df_fi = df_fi.sort_values(by=['importance'], ascending=False)""" 3989 | ``` 3990 | 3991 | 3992 | 3993 | 3994 | "# feature importance\ndf_fi = pd.DataFrame(index = df.columns.tolist()[3:])\ndf_fi = df_fi.assign(importance=forest_clf.feature_importances_)\ndf_fi = df_fi.sort_values(by=['importance'], ascending=False)" 3995 | 3996 | 3997 | 3998 | 3999 | ```python 4000 | 4001 | ``` 4002 | 4003 | Plot feature importances 4004 | 4005 | 4006 | ```python 4007 | # Plot feature importances 4008 | cols = X_train.columns.values 4009 | importances = forest_clf.feature_importances_ 4010 | indices = np.argsort(importances)[::-1] 4011 | 4012 | plt.figure() 4013 | plt.title("Feature importances") 4014 | plt.bar(range(X_train.shape[1]), importances[indices], 4015 | color="b", align="center") 4016 | plt.xticks(range(X_train.shape[1]), cols) 4017 | plt.xlim([-1, X.shape[1]]) 4018 | plt.xticks(rotation=45) 4019 | plt.show(); 4020 | ``` 4021 | 4022 | 4023 | ![png](images/output_108_0.png) 4024 | 4025 | 4026 | #### Plot decision boundaries 4027 | 4028 | https://scikit-learn.org/stable/auto_examples/ensemble/plot_voting_decision_regions.html 4029 | 4030 | 4031 | ```python 4032 | print(__doc__) 4033 | 4034 | from itertools import product 4035 | 4036 | import numpy as np 4037 | import matplotlib.pyplot as plt 4038 | 4039 | from sklearn import datasets 4040 | from sklearn.tree import DecisionTreeClassifier 4041 | from sklearn.neighbors import KNeighborsClassifier 4042 | from sklearn.svm import SVC 4043 | from sklearn.ensemble import VotingClassifier, RandomForestClassifier 4044 | from sklearn.linear_model import LogisticRegression 4045 | 4046 | # Loading some example data 4047 | iris = datasets.load_iris() 4048 | X = X_train.loc[:,['frac_alpha','frac_digit']].values 4049 | y = 
y_train.y.values 4050 | #X = iris.data[:, [0, 2]] 4051 | #y = iris.target 4052 | 4053 | # Training classifiers 4054 | clf1 = DecisionTreeClassifier(max_depth=4) 4055 | clf2 = KNeighborsClassifier(n_neighbors=7) 4056 | clf3 = LogisticRegression() 4057 | #clf3 = SVC(gamma=.1, kernel='rbf', probability=True) 4058 | eclf = RandomForestClassifier(random_state=102) 4059 | #eclf = VotingClassifier(estimators=[('dt', clf1), ('knn', clf2), 4060 | # ('svc', clf3)], 4061 | # voting='soft', weights=[2, 1, 2]) 4062 | 4063 | clf1.fit(X, y) 4064 | clf2.fit(X, y) 4065 | clf3.fit(X, y) 4066 | eclf.fit(X, y) 4067 | 4068 | # Plotting decision regions 4069 | x_min, x_max = X[:, 0].min() - .1, X[:, 0].max() + .1 4070 | y_min, y_max = X[:, 1].min() - .1, X[:, 1].max() + .1 4071 | xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1), 4072 | np.arange(y_min, y_max, 0.1)) 4073 | 4074 | f, axarr = plt.subplots(2, 2, sharex='col', sharey='row', figsize=(10, 8)) 4075 | 4076 | for idx, clf, tt in zip(product([0, 1], [0, 1]), 4077 | [clf1, clf2, clf3, eclf], 4078 | ['Decision Tree (depth=4)', 'KNN (k=7)', 4079 | 'Logistic', 'Random Forest']): 4080 | 4081 | Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]) 4082 | Z = Z.reshape(xx.shape) 4083 | 4084 | axarr[idx[0], idx[1]].contourf(xx, yy, Z, alpha=0.4) 4085 | axarr[idx[0], idx[1]].scatter(X[:, 0], X[:, 1], c=y, 4086 | s=20, edgecolor='k') 4087 | axarr[idx[0], idx[1]].set_title(tt) 4088 | 4089 | plt.show() 4090 | ``` 4091 | 4092 | Automatically created module for IPython interactive environment 4093 | 4094 | 4095 | 4096 | ![png](images/output_110_1.png) 4097 | 4098 | 4099 | 4100 | ```python 4101 | 4102 | ``` 4103 | 4104 | 4105 | ```python 4106 | 4107 | ``` 4108 | 4109 | 4110 | ```python 4111 | 4112 | ``` 4113 | 4114 | 4115 | ```python 4116 | # AUC 4117 | from sklearn.metrics import roc_auc_score 4118 | 4119 | y_test_scores = forest_clf.predict_proba(X_test) 4120 | print('AUC:', round(roc_auc_score(y_test, y_test_scores[:,1]),3)) 4121 | ``` 4122 | 4123 | AUC: 0.988 4124 | 4125 | 4126 | Train model on simulated data and test on real data. 4127 | 4128 | 4129 | ```python 4130 | # put all real group ids in test set. 4131 | np.random.seed(103) 4132 | ind_test = np.where(np.logical_and(df3.simulated_1==0, df3.y==1))[0] 4133 | sample = [i for i in range(df3.shape[0]) if i not in ind_test] 4134 | more_indices = np.random.choice(sample, int(len(ind_test)*20), replace=False) 4135 | more_indices1 = [i for i in more_indices if df3.simulated_1[i]==0] # throw away if it's simulated 4136 | ind_test = np.concatenate((ind_test, more_indices1)) 4137 | 4138 | ind_train = np.array([i for i in range(df3.shape[0]) if i not in ind_test]) 4139 | 4140 | X_train = df3.iloc[ind_train,:-1].copy() 4141 | X_test = df3.iloc[ind_test,:-1].copy() 4142 | 4143 | y_train = df3.iloc[ind_train,-1].copy() 4144 | y_test = df3.iloc[ind_test,-1].copy() 4145 | 4146 | X_train_simulated = pd.DataFrame(X_train['simulated_1'].copy()) 4147 | X_test_simulated = pd.DataFrame(X_test['simulated_1'].copy()) 4148 | 4149 | X_train.drop(columns=['simulated_0', 'simulated_1'], inplace=True) 4150 | X_test.drop(columns=['simulated_0', 'simulated_1'], inplace=True) 4151 | 4152 | y_train = pd.DataFrame(y_train) 4153 | y_test = pd.DataFrame(y_test) 4154 | ``` 4155 | 4156 | 4157 | ```python 4158 | forest_clf1 = RandomForestClassifier(random_state=102) 4159 | forest_clf1.fit(X_train, y_train.y.values) 4160 | 4161 | y_pred = forest_clf1.predict(X_test) 4162 | print("%d out of %d exmaples were wrong." 
4163 | % ((y_test.y != y_pred).sum(), X_test.shape[0])) 4164 | ``` 4165 | 4166 | 15 out of 287 exmaples were wrong. 4167 | 4168 | 4169 | 4170 | ```python 4171 | cm3 = confusion_matrix(y_test.y.values, y_pred) 4172 | 4173 | if cm3.shape == (1,1): 4174 | cm3 = np.array([[0,0],[0,cm2[0,0]]]) 4175 | 4176 | 4177 | plot_confusion_matrix(cm3, 4178 | target_names=['Not a Group ID','Group ID'], 4179 | title='Confusion matrix', 4180 | cmap=None, 4181 | normalize=True); 4182 | 4183 | print('Test set has all the real group IDs') 4184 | print("Confusion matrix: \n",cm3, "\n") 4185 | ``` 4186 | 4187 | 4188 | ![png](images/output_118_0.png) 4189 | 4190 | 4191 | Test set has all the real group IDs 4192 | Confusion matrix: 4193 | [[254 11] 4194 | [ 4 18]] 4195 | 4196 | 4197 | 4198 | 4199 | ```python 4200 | 4201 | ``` 4202 | 4203 | 4204 | ```python 4205 | 4206 | ``` 4207 | 4208 | 4209 | ```python 4210 | 4211 | ``` 4212 | --------------------------------------------------------------------------------