├── Demo Notebook.ipynb ├── README.md ├── RadReportAnnotator.py ├── environment.yml └── pseudodata ├── labels └── labeled_reports.xlsx └── reports └── words.xlsx /Demo Notebook.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# RadReportAnnotator Demo\n", 8 | "\n", 9 | "We demonstrate on data from the [Indiana University Chest X-ray Dataset (Demner-Fushman et al.)](https://www.ncbi.nlm.nih.gov/pubmed/26133894)\n", 10 | "\n", 11 | "This example can be adapted to your own collection of radiology reports exported from Montage \n", 12 | "and a manually-generated set of classification labels" 13 | ] 14 | }, 15 | { 16 | "cell_type": "markdown", 17 | "metadata": {}, 18 | "source": [ 19 | "Import library:" 20 | ] 21 | }, 22 | { 23 | "cell_type": "code", 24 | "execution_count": 1, 25 | "metadata": { 26 | "collapsed": true 27 | }, 28 | "outputs": [], 29 | "source": [ 30 | "import RadReportAnnotator as ra\n", 31 | "import os.path" 32 | ] 33 | }, 34 | { 35 | "cell_type": "markdown", 36 | "metadata": {}, 37 | "source": [ 38 | "Instantiate RadReportAnnotator object with paths to demo `reports` and `labels`. \n", 39 | "\n", 40 | "`Reports` contains 3,666 deidentified chest x-ray radiology reports. \n", 41 | "\n", 42 | "`Labels` contains binary labels for `Normal`, `Opacity`, `Cardiomegaly`, `Nodule`, and `Fibrosis` for 1,500 of these reports." 
43 | ] 44 | }, 45 | { 46 | "cell_type": "code", 47 | "execution_count": 2, 48 | "metadata": { 49 | "collapsed": true 50 | }, 51 | "outputs": [], 52 | "source": [ 53 | "CXRAnnotator = ra.RadReportAnnotator(report_dir_path=os.path.join(\"pseudodata\",\"reports\"), \n", 54 | " validation_file_path=os.path.join(\"pseudodata\",\"labels\",\"labeled_reports.xlsx\"))" 55 | ] 56 | }, 57 | { 58 | "cell_type": "markdown", 59 | "metadata": {}, 60 | "source": [ 61 | "Set arguments for RadReportAnnotator here in define_config - see documentation in RadReportAnnotator for more information.\n", 62 | "\n", 63 | "Models that use only bag of words (`DO_BOW=True,DO_WORD2VEC=False`) have been competitive in our experience with those that use both bag of words and word embeddings (`DO_BOW=True, DO_WORD2VEC=True`). Word embeddings can take considerable time to train on larger datasets. \n", 64 | "\n", 65 | "In the below demo, we use bag of words features (`DO_BOW=True`) with 1, 2, and 3-grams (`N_GRAM_SIZES=[1,2,3]`)." 
66 | ] 67 | }, 68 | { 69 | "cell_type": "code", 70 | "execution_count": 3, 71 | "metadata": { 72 | "collapsed": true 73 | }, 74 | "outputs": [], 75 | "source": [ 76 | "CXRAnnotator.define_config(DO_BOW=True,\n", 77 | "\tDO_WORD2VEC=False,\n", 78 | "\tDO_PARAGRAPH_VECTOR=False,\n", 79 | "\tN_GRAM_SIZES=[1,2,3],\n", 80 | "\tSILVER_THRESHOLD=\"fiftypct\",\n", 81 | "\tNAME_UNID_REPORTS = \"ACCID\", \n", 82 | "\tNAME_TEXT_REPORTS =\"REPORT\", \n", 83 | "\tN_THRESH_CORPUS=10,\n", 84 | "\tN_THRESH_OUTCOMES=50)" 85 | ] 86 | }, 87 | { 88 | "cell_type": "markdown", 89 | "metadata": {}, 90 | "source": [ 91 | "Build corpus from reports" 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": 4, 97 | "metadata": {}, 98 | "outputs": [ 99 | { 100 | "name": "stdout", 101 | "output_type": "stream", 102 | "text": [ 103 | "building pre-corpus\n", 104 | "pre-corpus built\n", 105 | "preprocessing reports\n" 106 | ] 107 | }, 108 | { 109 | "name": "stderr", 110 | "output_type": "stream", 111 | "text": [ 112 | "100%|█████████████████████████████████████████████████████████████████████████████| 3666/3666 [00:07<00:00, 473.86it/s]\n" 113 | ] 114 | }, 115 | { 116 | "name": "stdout", 117 | "output_type": "stream", 118 | "text": [ 119 | "creating n-grams\n" 120 | ] 121 | }, 122 | { 123 | "name": "stderr", 124 | "output_type": "stream", 125 | "text": [ 126 | "100%|████████████████████████████████████████████████████████████████████████████| 3666/3666 [00:00<00:00, 6268.53it/s]\n" 127 | ] 128 | }, 129 | { 130 | "name": "stdout", 131 | "output_type": "stream", 132 | "text": [ 133 | "number of unique n-grams: 33865\n", 134 | "number of unique n-grams after filtering out low frequency tokens: 2425\n" 135 | ] 136 | } 137 | ], 138 | "source": [ 139 | "CXRAnnotator.build_corpus()" 140 | ] 141 | }, 142 | { 143 | "cell_type": "markdown", 144 | "metadata": {}, 145 | "source": [ 146 | "We can examine how the preprocessing works. 
Let's look at the original input text for the report at index 500:" 147 | ] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "execution_count": 5, 152 | "metadata": {}, 153 | "outputs": [ 154 | { 155 | "data": { 156 | "text/plain": [ 157 | "' Comparison: None Indication: Central line placement Findings: The heart is borderline in size. The aorta is mildly tortuous. XXXX right IJ catheter is in XXXX with tip in proximal right atrium/cavoatrial junction. There is no pneumothorax. Lungs are grossly clear. There is no large effusion. Impression: Right IJ catheter tip in proximal right atrium. No pneumothorax. '" 158 | ] 159 | }, 160 | "execution_count": 5, 161 | "metadata": {}, 162 | "output_type": "execute_result" 163 | } 164 | ], 165 | "source": [ 166 | "CXRAnnotator.df_data['Report Text'].iloc[500]" 167 | ] 168 | }, 169 | { 170 | "cell_type": "markdown", 171 | "metadata": {}, 172 | "source": [ 173 | "Let's look at this report after preprocessing:" 174 | ] 175 | }, 176 | { 177 | "cell_type": "code", 178 | "execution_count": 6, 179 | "metadata": {}, 180 | "outputs": [ 181 | { 182 | "name": "stdout", 183 | "output_type": "stream", 184 | "text": [ 185 | "['comparison', 'none', 'indic', 'central', 'line', 'placement', 'find', 'the', 'heart', 'is', 'borderlin', 'in', 'size', 'sentenceend', 'the', 'aorta', 'is', 'mildli', 'tortuou', 'sentenceend', 'xxxx', 'right', 'ij', 'cathet', 'is', 'in', 'xxxx', 'with', 'tip', 'in', 'proxim', 'right', 'atrium', 'cavoatri', 'junction', 'sentenceend', 'there', 'is', 'no', 'pneumothorax', 'sentenceend', 'lung', 'are', 'grossli', 'clear', 'sentenceend', 'there', 'is', 'no', 'larg', 'effus', 'sentenceend', 'impress', 'right', 'ij', 'cathet', 'tip', 'in', 'proxim', 'right', 'atrium', 'sentenceend', 'no', 'pneumothorax', 'sentenceend', 'sentenceend', 'sentenceend', 'sentenceend']\n" 186 | ] 187 | } 188 | ], 189 | "source": [ 190 | "print(CXRAnnotator.processed_reports[500])" 191 | ] 192 | }, 193 | { 194 | "cell_type": "markdown", 195 | "metadata":
{}, 196 | "source": [ 197 | "Words were stemmed (\"indication\"-->\"indic\"), extra punctuation was removed, and periods were replaced with the special end character. Word2vec takes input in a format like this to learn word embeddings.\n", 198 | "\n", 199 | "Let's look at the n-gram features for this report, which will be used for bag of words modeling:" 200 | ] 201 | }, 202 | { 203 | "cell_type": "code", 204 | "execution_count": 7, 205 | "metadata": {}, 206 | "outputs": [ 207 | { 208 | "name": "stdout", 209 | "output_type": "stream", 210 | "text": [ 211 | "['find_the_heart', 'the_heart_is', 'the_aorta_is', 'lung_are_grossli', 'are_grossli_clear', 'no_larg_effus', 'comparison_none', 'find_the', 'the_heart', 'heart_is', 'in_size', 'the_aorta', 'aorta_is', 'is_mildli', 'xxxx_right', 'is_in', 'in_xxxx', 'xxxx_with', 'with_tip', 'tip_in', 'right_atrium', 'there_is', 'no_pneumothorax', 'lung_are', 'are_grossli', 'grossli_clear', 'there_is', 'no_larg', 'larg_effus', 'impress_right', 'cathet_tip', 'tip_in', 'right_atrium', 'no_pneumothorax', 'comparison', 'none', 'indic', 'central', 'line', 'placement', 'find', 'the', 'heart', 'is', 'borderlin', 'in', 'size', 'the', 'aorta', 'is', 'mildli', 'tortuou', 'xxxx', 'right', 'cathet', 'is', 'in', 'xxxx', 'with', 'tip', 'in', 'right', 'atrium', 'junction', 'there', 'is', 'pneumothorax', 'lung', 'are', 'grossli', 'clear', 'there', 'is', 'larg', 'effus', 'impress', 'right', 'cathet', 'tip', 'in', 'right', 'atrium', 'pneumothorax']\n" 212 | ] 213 | } 214 | ], 215 | "source": [ 216 | "print(CXRAnnotator.ngram_reports[500])" 217 | ] 218 | }, 219 | { 220 | "cell_type": "markdown", 221 | "metadata": {}, 222 | "source": [ 223 | "Since we have `N_GRAM_SIZES=[1,2,3]` in this demo, we see individual words (1-grams), each 2 consecutive words (2-grams; e.g., 'comparison_none'), and each 3 consecutive words ('no_larg_effus') available as features. 
Sometimes these 2- and 3-grams are uninformative ('comparison_none'), at other times they may be useful ('no_pneumothorax'). Note that only n-grams appearing `N_THRESH_CORPUS` times in training data (10 in this demo) are included. " 224 | ] 225 | }, 226 | { 227 | "cell_type": "markdown", 228 | "metadata": {}, 229 | "source": [ 230 | "Train Lasso logistic regression models using features from 60% of labeled reports and infer labels for 40% of labeled reports (for performance evaluation) and unlabeled reports (for ultimate application):" 231 | ] 232 | }, 233 | { 234 | "cell_type": "code", 235 | "execution_count": 8, 236 | "metadata": {}, 237 | "outputs": [ 238 | { 239 | "name": "stdout", 240 | "output_type": "stream", 241 | "text": [ 242 | "generating features\n" 243 | ] 244 | }, 245 | { 246 | "name": "stderr", 247 | "output_type": "stream", 248 | "text": [ 249 | "100%|████████████████████████████████████████████████████████████████████████████| 1500/1500 [00:00<00:00, 4099.24it/s]\n" 250 | ] 251 | }, 252 | { 253 | "name": "stdout", 254 | "output_type": "stream", 255 | "text": [ 256 | "total labels:6\n", 257 | "labels eligible for inference:4\n", 258 | "dimensionality of predictor matrix:(1500, 2425)\n", 259 | "n_train in modeling=900\n", 260 | "n_test in modeling=600\n", 261 | "i=0\n" 262 | ] 263 | }, 264 | { 265 | "name": "stderr", 266 | "output_type": "stream", 267 | "text": [ 268 | "100%|███████████████████████████████████████████████████████████████████████████| 2000/2000 [00:00<00:00, 26965.13it/s]\n", 269 | "100%|███████████████████████████████████████████████████████████████████████████| 1666/1666 [00:00<00:00, 19683.19it/s]\n" 270 | ] 271 | } 272 | ], 273 | "source": [ 274 | "binary_labels, proba_labels = CXRAnnotator.infer_labels()" 275 | ] 276 | }, 277 | { 278 | "cell_type": "markdown", 279 | "metadata": {}, 280 | "source": [ 281 | "Examine quality of predictions on held out 40% of labeled data." 
282 | ] 283 | }, 284 | { 285 | "cell_type": "code", 286 | "execution_count": 9, 287 | "metadata": {}, 288 | "outputs": [ 289 | { 290 | "data": { 291 | "text/html": [ 292 | "
" 361 | ], 362 | "text/plain": [ 363 | " AUC True + False + True - \\\n", 364 | "Label (with calcs on held out 40 pct) \n", 365 | "Normal 0.956679 208 53 324 \n", 366 | "Opacity 0.981869 62 17 517 \n", 367 | "Cardiomegaly 0.993979 41 18 541 \n", 368 | "Nodule 0.991759 16 36 548 \n", 369 | "\n", 370 | " False - \n", 371 | "Label (with calcs on held out 40 pct) \n", 372 | "Normal 15 \n", 373 | "Opacity 4 \n", 374 | "Cardiomegaly 0 \n", 375 | "Nodule 0 " 376 | ] 377 | }, 378 | "execution_count": 9, 379 | "metadata": {}, 380 | "output_type": "execute_result" 381 | } 382 | ], 383 | "source": [ 384 | "CXRAnnotator.accuracy" 385 | ] 386 | }, 387 | { 388 | "cell_type": "markdown", 389 | "metadata": {}, 390 | "source": [ 391 | "Notice `Fibrosis` was filtered out despite appearing in input data as we had very few positive observations. It is important to ensure that sufficient positive and negative cases for each label exist in your labeled data.\n", 392 | "\n", 393 | "Rare labels with high AUC may still have a significant number of false positives (`Nodule`: 16 true positives against 36 false positives on the held-out set, a PPV of roughly 31%). Be aware of noise introduced by your labeling process before using inferred labels to train convolutional neural networks or other algorithms, and consider the positive predictive value (PPV) of a positive label. Additional labeled examples, particularly of rare pathology, may help improve accuracy. \n", 394 | "\n", 395 | "Recent results ([Ghafoorian et al.](https://arxiv.org/abs/1801.05040), [Rajpurkar et al.](https://arxiv.org/abs/1711.05225)) demonstrate that deep learning can achieve impressive results when trained on a large noisily labeled radiological imaging dataset." 396 | ] 397 | }, 398 | { 399 | "cell_type": "markdown", 400 | "metadata": {}, 401 | "source": [ 402 | "Examine a few probabilistic predictions:" 403 | ] 404 | }, 405 | { 406 | "cell_type": "code", 407 | "execution_count": 10, 408 | "metadata": {}, 409 | "outputs": [ 410 | { 411 | "data": { 412 | "text/html": [ 413 | "
" 483 | ], 484 | "text/plain": [ 485 | " Normal Opacity Cardiomegaly Nodule\n", 486 | "Accession Number \n", 487 | "103661 0.113953 0.007305 0.022156 0.009483\n", 488 | "103662 0.283203 0.007305 0.022156 0.009483\n", 489 | "103663 0.283203 0.007305 0.022156 0.009483\n", 490 | "103664 0.000129 0.060547 0.058807 0.037109\n", 491 | "103665 0.020233 0.999512 0.011406 0.019058" 492 | ] 493 | }, 494 | "execution_count": 10, 495 | "metadata": {}, 496 | "output_type": "execute_result" 497 | } 498 | ], 499 | "source": [ 500 | "proba_labels.tail()" 501 | ] 502 | }, 503 | { 504 | "cell_type": "markdown", 505 | "metadata": {}, 506 | "source": [ 507 | "Examine a few binary predictions - manual labels override these predictions when available:" 508 | ] 509 | }, 510 | { 511 | "cell_type": "code", 512 | "execution_count": 11, 513 | "metadata": {}, 514 | "outputs": [ 515 | { 516 | "data": { 517 | "text/html": [ 518 | "
" 588 | ], 589 | "text/plain": [ 590 | " Normal Opacity Cardiomegaly Nodule\n", 591 | "Accession Number \n", 592 | "103661 0 0 0 0\n", 593 | "103662 0 0 0 0\n", 594 | "103663 0 0 0 0\n", 595 | "103664 0 0 0 0\n", 596 | "103665 0 1 0 0" 597 | ] 598 | }, 599 | "execution_count": 11, 600 | "metadata": {}, 601 | "output_type": "execute_result" 602 | } 603 | ], 604 | "source": [ 605 | "binary_labels.tail()" 606 | ] 607 | }, 608 | { 609 | "cell_type": "markdown", 610 | "metadata": {}, 611 | "source": [ 612 | "You can examine individual report predictions; here are the report text and predictions for a report that manual reviewers coded as `Normal`:" 613 | ] 614 | }, 615 | { 616 | "cell_type": "code", 617 | "execution_count": 12, 618 | "metadata": {}, 619 | "outputs": [ 620 | { 621 | "name": "stdout", 622 | "output_type": "stream", 623 | "text": [ 624 | " Comparison: None. Indication: XXXX, chest pain and XXXX x2 weeks. Findings: The cardiomediastinal silhouette and pulmonary vasculature are within normal limits in size. The lungs are clear of focal airspace disease, pneumothorax, or pleural effusion. There are no acute bony findings. Impression: No acute cardiopulmonary findings. 
\n", 625 | "\n", 626 | "\n", 627 | "Normal 0.969727\n", 628 | "Opacity 0.001776\n", 629 | "Cardiomegaly 0.000642\n", 630 | "Nodule 0.000948\n", 631 | "Name: 101700, dtype: float64\n", 632 | "\n", 633 | "\n", 634 | "Normal 1\n", 635 | "Opacity 0\n", 636 | "Cardiomegaly 0\n", 637 | "Nodule 0\n", 638 | "Name: 101700, dtype: int32\n" 639 | ] 640 | } 641 | ], 642 | "source": [ 643 | "#normal report\n", 644 | "print(CXRAnnotator.df_data['Report Text'].iloc[1700])\n", 645 | "print(\"\\n\")\n", 646 | "print(proba_labels.iloc[1700])\n", 647 | "print(\"\\n\")\n", 648 | "print(binary_labels.iloc[1700])" 649 | ] 650 | }, 651 | { 652 | "cell_type": "markdown", 653 | "metadata": {}, 654 | "source": [ 655 | "Here are the report text and predictions for a report that manual reviewers coded as positive for `Cardiomegaly`:" 656 | ] 657 | }, 658 | { 659 | "cell_type": "code", 660 | "execution_count": 13, 661 | "metadata": {}, 662 | "outputs": [ 663 | { 664 | "name": "stdout", 665 | "output_type": "stream", 666 | "text": [ 667 | " Comparison: PA and lateral chest x-XXXX dated XXXX. Indication: XXXX-year-old female with chest pain. Findings: The heart size is enlarged. Tortuous aorta. Otherwise the mediastinal contour is within normal limits. The lungs are free of any focal infiltrates. There are no nodules or masses. No visible pneumothorax. No visible pleural fluid. The XXXX are grossly normal. There is no visible free intraperitoneal air under the diaphragm. Impression: 1. Cardiomegaly without lung infiltrates. 
\n", 668 | "\n", 669 | "\n", 670 | "Normal 0.008018\n", 671 | "Opacity 0.001008\n", 672 | "Cardiomegaly 0.981445\n", 673 | "Nodule 0.056152\n", 674 | "Name: 102100, dtype: float64\n", 675 | "\n", 676 | "\n", 677 | "Normal 0\n", 678 | "Opacity 0\n", 679 | "Cardiomegaly 1\n", 680 | "Nodule 0\n", 681 | "Name: 102100, dtype: int32\n" 682 | ] 683 | } 684 | ], 685 | "source": [ 686 | "print(CXRAnnotator.df_data['Report Text'].iloc[2100])\n", 687 | "print(\"\\n\")\n", 688 | "print(proba_labels.iloc[2100])\n", 689 | "print(\"\\n\")\n", 690 | "print(binary_labels.iloc[2100])" 691 | ] 692 | }, 693 | { 694 | "cell_type": "markdown", 695 | "metadata": {}, 696 | "source": [ 697 | "Here are the report text and predictions for a report that manual reviewers coded as positive for `Opacity`:" 698 | ] 699 | }, 700 | { 701 | "cell_type": "code", 702 | "execution_count": 14, 703 | "metadata": {}, 704 | "outputs": [ 705 | { 706 | "name": "stdout", 707 | "output_type": "stream", 708 | "text": [ 709 | " Comparison: XXXX, XXXX Indication: XXXX-year-old XXXX with chest pain. Findings: The heart size is stable. The aorta is ectatic and atherosclerotic but stable. XXXX sternotomy XXXX are again noted. The scarring in the left lower lobe is again noted and unchanged from prior exam. There are mild bilateral prominent lung interstitial opacities consistent with emphysematous disease. The calcified granulomas are stable. Impression: 1. Changes of emphysema and left lower lobe scarring, both stable. 2. Unchanged degenerative and atherosclerotic changes of the thoracic aorta. 
\n", 710 | "\n", 711 | "\n", 712 | "Normal 0.000000\n", 713 | "Opacity 0.981445\n", 714 | "Cardiomegaly 0.125977\n", 715 | "Nodule 0.234497\n", 716 | "Name: 102770, dtype: float64\n", 717 | "\n", 718 | "\n", 719 | "Normal 0\n", 720 | "Opacity 1\n", 721 | "Cardiomegaly 0\n", 722 | "Nodule 0\n", 723 | "Name: 102770, dtype: int32\n" 724 | ] 725 | } 726 | ], 727 | "source": [ 728 | "#opacity\n", 729 | "print(CXRAnnotator.df_data['Report Text'].iloc[2770])\n", 730 | "print(\"\\n\")\n", 731 | "print(proba_labels.iloc[2770])\n", 732 | "print(\"\\n\")\n", 733 | "print(binary_labels.iloc[2770])" 734 | ] 735 | } 736 | ], 737 | "metadata": { 738 | "kernelspec": { 739 | "display_name": "Python 3", 740 | "language": "python", 741 | "name": "python3" 742 | }, 743 | "language_info": { 744 | "codemirror_mode": { 745 | "name": "ipython", 746 | "version": 3 747 | }, 748 | "file_extension": ".py", 749 | "mimetype": "text/x-python", 750 | "name": "python", 751 | "nbconvert_exporter": "python", 752 | "pygments_lexer": "ipython3", 753 | "version": "3.6.2" 754 | } 755 | }, 756 | "nbformat": 4, 757 | "nbformat_minor": 2 758 | } 759 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # RadReportAnnotator 2 | 3 | Authors: jrzech, eko 4 | 5 | Provides a library of methods for automatically inferring labels for a corpus of radiological reports given a set of manually-labeled data. These methods are described in our publication [Natural Language–based Machine Learning Models for the Annotation of Clinical Radiology Reports](https://doi.org/10.1148/radiol.2018171093). 
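
As a rough illustration of the bag-of-words features involved, the sketch below mirrors the n-gram rule used by `create_ngrams` in `RadReportAnnotator.py`: an n-gram never crosses a `sentenceend` token, and `no` may only appear as the first token of an n-gram (a bare `no` unigram is dropped). The function name `ngram_features` and the sample tokens are hypothetical and exist only for this example; the library's own function additionally applies a frequency threshold (`N_THRESH_CORPUS` in the demo), which is omitted here.

```python
def ngram_features(tokens, sizes=(1, 2, 3)):
    """Simplified, illustrative version of the n-gram rule (not the library API)."""
    features = []
    # Larger n-grams first, matching the order used in create_ngrams.
    for n in sorted(sizes, reverse=True):
        for i in range(len(tokens) - n + 1):
            window = tokens[i:i + n]
            # Never build an n-gram across a sentence boundary.
            if "sentenceend" in window:
                continue
            # "no" is allowed only as the first token; a lone "no" is dropped.
            if "no" in window[1:] or (n == 1 and window[0] == "no"):
                continue
            features.append("_".join(window))
    return features


# Hypothetical preprocessed report (stemmed, lowercased, sentence ends marked).
tokens = ["there", "is", "no", "pneumothorax", "sentenceend",
          "lung", "are", "clear", "sentenceend"]
features = ngram_features(tokens)
print(features)
```

In this example `no_pneumothorax` is kept as a negation feature, while `is_no` and the bare unigram `no` are not generated.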

## Getting Started:

To configure your own local instance (assumes [Anaconda is installed](https://www.anaconda.com/download/)):

```
git clone https://www.github.com/aisinai/rad-report-annotator.git
cd rad-report-annotator
conda env create -f environment.yml
source activate rad_env
python -m ipykernel install --user --name rad_env --display-name "Python (rad_env)"
```

*Note as of Oct 11, 2022: this conda environment builds on Linux and Windows, but not on Mac, as older versions of gensim for Mac are not available in conda-forge.*

To see a demo of the library on data from the [Indiana University Chest X-ray Dataset (Demner-Fushman et al.)](https://www.ncbi.nlm.nih.gov/pubmed/26133894), please open `Demo Notebook.ipynb` and run all cells.

--------------------------------------------------------------------------------
/RadReportAnnotator.py:
--------------------------------------------------------------------------------
"""
RadReportAnnotator
Authors: jrzech, eko

This is a library of methods for automatically inferring labels for a corpus of radiological documents given a set of manually-labeled data.

"""

#usual imports for data science
import numpy as np
import pandas as pd
import sys
import os
import math
from tqdm import tqdm

#sklearn
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

#NLP imports
from nltk.tokenize import RegexpTokenizer
from nltk.stem.porter import PorterStemmer
import re

#gensim for word embedding featurization
import gensim
from collections import namedtuple

#misc
import glob
import os.path
import multiprocessing
import random


def join_montage_files(data_dir,NAME_UNID_REPORTS, NAME_TEXT_REPORTS):
    """
    Joins several montage files in excel format into a single pandas dataframe
    Args:
        data_dir: a filepath pointing to a directory containing montage files in excel format
        NAME_UNID_REPORTS: column name of unique id / accession id in reports xlsx
        NAME_TEXT_REPORTS: column name of report text in reports xlsx
    Returns:
        df_data: a pandas dataframe containing texts from montage
    """
    print("building pre-corpus")
    datafiles = os.listdir(data_dir)
    df_data = pd.read_excel(os.path.join(data_dir,datafiles[0]))
    datafiles.remove(datafiles[0])
    for subcorpus in datafiles:
        # pd.concat is used in place of DataFrame.append, which was removed in pandas 2.0
        df_data = pd.concat([df_data, pd.read_excel(os.path.join(data_dir,subcorpus))])
    print('pre-corpus built')

    df_data.rename(columns={NAME_UNID_REPORTS:'Accession Number',NAME_TEXT_REPORTS:'Report Text'},inplace=True)
    return df_data

def preprocess_data(df_data, en_stop, stem_words=True):
    """
    Takes a dataframe of montage files and a list of stop words and returns a list of processed reports.
    Each processed report is a list of stemmed unigrams.
    Args:
        df_data: a dataframe of joined montage files.
        en_stop: a list of english stop_words from the stop_words library
        stem_words: argument indicating whether or not to stem words
    Returns:
        processed_reports: a list of lists of stemmed words from each text within the montage dataframe
    """
    if(stem_words==False):
        print("NOTE - NOT STEMMING")
    p_stemmer = PorterStemmer()
    processed_reports = []
    accession_index=[]

    print("preprocessing reports")
    for i in tqdm(range(0,df_data.shape[0])):

        tokenizer = RegexpTokenizer(r'\w+')
        process = df_data['Report Text'].iloc[i]

        process = str(process)
        process = process + "..." # add a period, sometimes it's missing at end
        process = process.lower()

        z = len(process)
        k = 0
        #remove line breaks and normalize separators
        process=process.replace("^M", " ")
        process=process.replace("\n", " ")
        process=process.replace("\r", " ")
        process=process.replace("_", " ")
        process=process.replace("-", " ")
        process=process.replace(",", " , ")
        #collapse runs of spaces (each pass turns a double space into a single space)
        process=process.replace("  ", " ")
        process=process.replace("  ", " ")
        process=process.replace("  ", " ")
        process=process.replace("  ", " ")
        process=process.replace("  ", " ")

        process = re.sub(r'\d+', '',process) # drop digits
        process=process.replace(".", " SENTENCEEND ") # create end characters

        process_tokenized = tokenizer.tokenize(process)
        process_stopped = [i for i in process_tokenized if not i in en_stop]

        if(stem_words==True):
            process_stemmed = [p_stemmer.stem(i) for i in process_stopped]
        else:
            process_stemmed = process_stopped

        processed_reports.append(process_stemmed)
    #include n grams in lda_input
    return processed_reports

def remove_infrequent_tokens(processed_reports,freq_threshold,labeled_indices):
    """
    Takes a list of processed_reports and removes
    infrequent tokens (defined as occurring < freq_threshold times) from them.
    Args:
        processed_reports: list of lists of stemmed words after initial processing, where each entry corresponds to a report
        freq_threshold: count threshold, remove words occurring < freq_threshold times from corpus. note - considers only unlabeled corpus, not labeled corpus, to avoid peeking into labeled data.
        labeled_indices: indices of processed_reports that are labeled reports - these are excluded from frequency calculations.
    Returns:
        process_reports_postcountfilter: list of lists of stemmed words after initial processing, where each entry corresponds to a report, after low frequency words have been removed
    """
    word_count = common_stems(processed_reports, labeled_indices)
    d = dict((k,v) for k, v in word_count.items() if v >= freq_threshold)
    process_reports_postcountfilter=[[] for x in range(0,len(processed_reports))]
    for i in range(0,len(processed_reports)):
        for token in processed_reports[i]:
            if token in d:
                process_reports_postcountfilter[i].append(token)
    return process_reports_postcountfilter

def create_ngrams(processed_reports, labeled_indices, N_GRAM_SIZES, freq_threshold):
    """
    Takes a processed_reports list, a list of n-gram sizes, and a frequency threshold. Creates n-grams,
    skipping any n-gram that would contain an end-of-sentence token, and removes n-grams that appear < freq_threshold times.
    Args:
        processed_reports: a list of text lists of stemmed unigrams ready for conversion into n-gram text lists
        labeled_indices: exclude these from calculation of n-gram cutoff
        N_GRAM_SIZES: a list of ints specifying the n-gram sizes to include in the texts of the future corpus
        freq_threshold: the frequency threshold for n-gram inclusions.
            N-grams that occur with frequency < threshold will be removed from corpus
    Returns:
        processed_outputs_clean: a list of text lists of n-grams that are ready to be processed into a corpus
    """
    processed_outputs = []
    print("creating n-grams")
    for report in tqdm(processed_reports[:]):
        new_report = []
        end=len(report)
        #CREATES 4-grams - for all n-grams, we don't allow "no" to be in middle of n-gram, don't allow sentenceend token to be in n-gram
        if 4 in N_GRAM_SIZES:
            for i in range (0,end-3):
                # .lower() added on report[i+3] so the sentenceend check is consistent for all four positions
                if (report[i+1] != "no" and report[i+2] != "no" and report[i+3] != "no" and report[i].lower()!= "sentenceend" and report[i+1].lower()!= "sentenceend" and report[i+2].lower()!= "sentenceend" and report[i+3].lower()!= "sentenceend"): #no only at beginning
                    new_report.append(report[i] +"_" +report[i+1] + "_" + report[i+2] + "_" + report[i+3])
        #CREATES 3-grams
        if 3 in N_GRAM_SIZES:
            for i in range (0,end-2):
                if (report[i+1] != "no" and report[i+2] != "no" and report[i].lower()!= "sentenceend" and report[i+1].lower()!= "sentenceend" and report[i+2].lower()!= "sentenceend"): #no only at beginning
                    new_report.append(report[i] +"_" +report[i+1] + "_" + report[i+2])
        #CREATES 2-grams
        if 2 in N_GRAM_SIZES:
            for i in range (0,end-1):
                if (report[i+1] != "no" and report[i].lower()!= "sentenceend" and report[i+1].lower()!= "sentenceend"): #no only at beginning
                    new_report.append(report[i] +"_" +report[i+1])
        #CREATES unigrams
        if 1 in N_GRAM_SIZES:
            for i in range (0,end):
                if(report[i].lower()!= "sentenceend" and report[i]!= "no"): # we take no out as a unigram in bow
                    new_report.append(report[i])
        processed_outputs.append(new_report)

    #remove low freq tokens
    word_count = common_stems(processed_outputs, labeled_indices)
    print("number of unique n-grams:", len(word_count))
    d = dict((k,v) for k, v in word_count.items() if
             v >= freq_threshold)
    print("number of unique n-grams after filtering out low frequency tokens:", len(d))

    #remove tokens that occurred infrequently from processed_outputs --> processed_outputs_clean
    processed_outputs_clean=[[] for x in range(0,len(processed_outputs))]
    for i in range(0,len(processed_outputs)):
        for token in processed_outputs[i]:
            if token in d:
                processed_outputs_clean[i].append(token)
    return processed_outputs_clean

def get_labeled_indices(df_data,validation_file,TRAIN_INDEX_OVERRIDE):
    """
    Returns a boolean mask over the reports in df_data marking those for which we have labeled data in validation_file; will set labeled reports as unlabeled if in TRAIN_INDEX_OVERRIDE
    Args:
        df_data: dataframe containing report text and accession ids
        validation_file: dataframe containing accession ids and labels
        TRAIN_INDEX_OVERRIDE: list of numerical indices to treat as unlabeled; necessary to train d2v model if all your data is labeled as it uses exclusively unlabeled data to train to avoid peeking into labeled data
    Returns:
        return_indices: list of booleans, one per report in df_data, True where the report is treated as labeled
    """
    validation = pd.read_excel(validation_file)
    validation.set_index('Accession Number') # result is not assigned, so 'Accession Number' remains a regular column
    validation_cases=validation['Accession Number'].tolist()
    all_indices = df_data['Accession Number'].tolist()
    return_indices=[]
    for i in all_indices:
        if i in validation_cases and i not in TRAIN_INDEX_OVERRIDE: # if something is manually overridden to be in train, don't put it in test
            return_indices.append(True)
        else:
            return_indices.append(False)
    return return_indices

def common_stems(ngram_list, exclude_indices):
    """
    Takes a list of n-gram lists, ngram_list, and returns a dict of ngram:ngram_count pairs,
    counting only the rows that are not flagged in exclude_indices.
    Args:
        ngram_list: list of lists of n-grams, one inner list per report
        exclude_indices: list of booleans, one per report; rows marked True (labeled data) are ignored when counting
    Returns:
        word_count: dict of ngram:ngram_count pairs
    """
    word_count={}
    for i, entry in enumerate(ngram_list):
        if exclude_indices[i]:
            continue #skip labeled rows
        for word in entry:
            #add word with count 1, or increment its existing count
            word_count[word] = word_count.get(word, 0) + 1

    return word_count


def build_train_test_corpus(df_data, ngram_list, labeled_filepath,TRAIN_INDEX_OVERRIDE):
    """
    Takes the master corpus, the ngram_list, and a filepath pointing to a labeled spreadsheet
    and builds a labeled_corpus consisting of labelled data and an unlabeled_corpus
    of non-labelled data
    Args:
        df_data: a dataframe consisting of the original set of excel files with report text and accession id
        ngram_list: list of all n-grams in corpus
        labeled_filepath: path to file containing accession ids and labels
        TRAIN_INDEX_OVERRIDE: accession numbers to treat as unlabeled data regardless of presence of labels.
    Returns:
        corpus: bag-of-words corpus for all reports, labeled and unlabeled
        unlabeled_corpus: a corpus consisting of unlabelled texts that will be used for model construction
        labeled_corpus: a corpus consisting of labelled held-out texts that will be used for model validation
        dictionary: a dictionary comprised of the LDA input n-grams
        labeled_indices: row indices in df_data of the reports we treat as labeled
    """
    dictionary = gensim.corpora.Dictionary(ngram_list)
    corpus = [dictionary.doc2bow(doc) for doc in ngram_list]
    if labeled_filepath is not None:
        outcomes = pd.read_excel(labeled_filepath)
        outcomes.set_index('Accession Number')
        labeled_cases=outcomes['Accession Number'].tolist()
    else:
        labeled_cases=[]
    labeled_indices = []
    not_labeled_indices = []
    train_data_lda = np.ones(df_data.shape[0],dtype=bool)
    num_removed=0
    for i in range(0,df_data.shape[0]):
        if df_data['Accession Number'].iloc[i] in labeled_cases and df_data['Accession Number'].iloc[i] not in TRAIN_INDEX_OVERRIDE:
            train_data_lda[i]=False
            labeled_indices.append(i)
            num_removed += 1
        else:
            not_labeled_indices.append(i)
    unlabeled_corpus = [corpus[i] for i in not_labeled_indices]
    labeled_corpus = [corpus[i] for i in labeled_indices]

    return corpus, unlabeled_corpus, labeled_corpus, dictionary, labeled_indices


def build_d2v_corpora(df_data,d2v_inputs,labeled_indices):
    """
    Build corpora in format for doc2vec gensim implementation
    Args:
        df_data: a dataframe consisting of the original set of excel files with report text and accession id
        d2v_inputs: list of lists of tokens, where each entry in d2v_inputs corresponds to a report
        labeled_indices: indices of labeled reports (and those we treat as labeled due to TRAIN_INDEX_OVERRIDE)
    Returns:
        unlabeled_corpus: a corpus consisting of unlabelled texts that will be used for feature construction
        labeled_corpus: a corpus
consisting of labelled held-out texts that will be used for Lasso regression training
        total_unlabeled_words: count of total words in unlabeled corpus
    """

    SentimentDocument = namedtuple('SentimentDocument', 'words tags')
    unlabeled_docs = []
    labeled_docs = []
    total_unlabeled_words=0
    i=0
    for line in d2v_inputs:
        words = line # [x for x in line if x != 'END']
        tags = '' + str(df_data['Accession Number'].iloc[i])
        if(i in labeled_indices):
            labeled_docs.append(SentimentDocument(words,tags))
        else:
            unlabeled_docs.append(SentimentDocument(words,tags))
            total_unlabeled_words+=len(words)
        i+=1

    print('%d unlabeled reports for featurization, %d labeled reports for modeling' % (len(unlabeled_docs), len(labeled_docs)))
    return unlabeled_docs, labeled_docs, total_unlabeled_words


def train_d2v(unlabeled_docs, labeled_docs, D2V_EPOCH, DIM_DOC2VEC, W2V_DM, W2V_WINDOW, total_unlabeled_words):
    """
    Train doc2vec/word2vec model.

    Args:
        unlabeled_docs: unlabeled corpus
        labeled_docs: labeled corpus
        D2V_EPOCH: number of epochs to train d2v model; 20 has worked well in our experiments; parameter for gensim doc2vec
        DIM_DOC2VEC: dimensionality of embedding vectors, we explored values 50-800; parameter for gensim doc2vec
        W2V_DM: 1 is PV-DM, otherwise PV-DBOW; parameter for gensim doc2vec
        W2V_WINDOW: number of words window to use in doc2vec model; parameter for gensim doc2vec
        total_unlabeled_words: total words in unlabeled corpus; argument for gensim doc2vec

    Returns:
        d2vmodel: trained doc2vec model.
329 | """ 330 | 331 | cores = multiprocessing.cpu_count() 332 | assert gensim.models.doc2vec.FAST_VERSION > -1, "speed up" 333 | print("started doc2vec training") 334 | d2vmodel = gensim.models.Doc2Vec(dm=W2V_DM, size=DIM_DOC2VEC, window=W2V_WINDOW, negative=5, hs=0, min_count=2, workers=cores) 335 | d2vmodel.build_vocab(unlabeled_docs + labeled_docs) 336 | d2vmodel.train(unlabeled_docs, total_words=total_unlabeled_words, epochs=D2V_EPOCH) 337 | print("finished doc2vec training") 338 | return d2vmodel 339 | 340 | def calc_auc(predictor_matrix,eligible_outcomes_aligned, all_outcomes_aligned,N_LABELS, pred_type, header,ASSIGNFOLD_USING_ROW=False): 341 | """ 342 | Train Lasso models using 60% of labeled data with generated features and labels; calculate AUC, accuracy, 343 | confusion matrix for each label on remaining 40% of labeled data. 344 | 345 | Args: 346 | 347 | predictor_matrix: numpy matrix of features available to use as input to Lasso logistic regression 348 | eligible_outcomes_aligned: dataframe of labels we are predicting 349 | all_outcomes_aligned: dataframe of all labels, including those we excluded due to infrequent positive/negative occurences - we use it for accession id 350 | N_LABELS: total number of labels we are predicting 351 | pred_type: label indicating what variables went into predictor_matrix 352 | results_dir: directory to which to save results 353 | header: header for predictor matrix 354 | ASSIGNFOLD_USING_ROW: normally 60/40 split done randomly, you can fix it to use first 60% of rows if you need replicability 355 | but be wary of introducing distortion into train/test set with dates, etc.: recommend randomly sorting 356 | rows in excel beforehand if you opt for this. 

    Returns:

        lasso_models: list of all trained lasso logistic regression models from sklearn, where index corresponds to relative index in columns of eligible_outcomes_aligned
    """

    if predictor_matrix.shape[1]!=len(header):
        print("predictor_matrix.shape[1]="+str(predictor_matrix.shape[1]))
        print("len(header)"+str(len(header)))
        raise ValueError("predictor_matrix shape doesn't match header, investigate")
    all_coef = pd.concat([ pd.DataFrame(header)], axis = 1)

    lasso_models={}
    model_types = ["Lasso"]

    r = list(range(eligible_outcomes_aligned.shape[0]))
    random.shuffle(r)

    if(ASSIGNFOLD_USING_ROW):
        assignfold = pd.DataFrame(data=list(range(eligible_outcomes_aligned.shape[0])), columns=['train'])
    else:
        assignfold = pd.DataFrame(data=r, columns=['train'])

    cutoff = np.floor(0.6*eligible_outcomes_aligned.shape[0])

    #boolean masks: first 60% of the fold assignment is train, remaining 40% is held out
    train=assignfold['train'] < cutoff
    test=assignfold['train'] >= cutoff

    N_TRAIN=eligible_outcomes_aligned.ix[train,:].shape[0]
    N_HELDOUT=eligible_outcomes_aligned.ix[test,:].shape[0]
    print("n_train in modeling="+str(N_TRAIN))
    print("n_test in modeling="+str(N_HELDOUT))

    confusion = pd.DataFrame(data=np.zeros(shape=(eligible_outcomes_aligned.shape[1]*len(model_types),6),dtype=np.int),columns=['Label (with calcs on held out 40 pct)','AUC','True +','False +','True -','False -'])

    resultrow=0
    for i in range(0,N_LABELS):
        PROCEED=True
        #need to make sure we don't have an invalid setting -- ie, a train[x] set of labels that is uniform, else Lasso regression fails
        if(len(set(eligible_outcomes_aligned.ix[train,i].tolist())))==1:
            PROCEED=False
            raise ValueError ("fed label to lasso regression with no variation - cannot compute - please investigate data")

        if(PROCEED):

            for model_type in model_types:
                if(model_type=="Lasso"):
                    parameters = {
                        "penalty": ['l1'],
                        "C": [64,32,16,8,4,2,1,0.5,0.25,0.1,0.05,0.025,0.01,0.005]
                    }
                    try:
                        cv = StratifiedKFold(n_splits=5)
                        grid_search = GridSearchCV(LogisticRegression(), param_grid=parameters, scoring='neg_log_loss', cv=cv)
                        grid_search.fit(predictor_matrix[train,:],np.array(eligible_outcomes_aligned.ix[train,i]))
                        best_parameters0 = grid_search.best_estimator_.get_params()
                        model0 = LogisticRegression(**best_parameters0)
                    except:
                        raise ValueError ("error in lasso regression - likely data issue, may involve rare labels - please investigate data")
                    model0.fit(predictor_matrix[np.array(train),:],eligible_outcomes_aligned.ix[train,i])
                    pred0=model0.predict_proba(predictor_matrix[np.array(test),:])[:,1]
                    coef = pd.concat([ pd.DataFrame(header),pd.DataFrame(np.transpose(model0.coef_))], axis = 1)
                    df0 = pd.DataFrame({'predict':pred0,'target':eligible_outcomes_aligned.ix[test,i], 'label':all_outcomes_aligned['Accession Number'][test]})

                    calc_auc=roc_auc_score(np.array(df0['target']),np.array(df0['predict']))
                    if(i%10==0):
                        print("i="+str(i))
                    save_name=str(list(eligible_outcomes_aligned.columns.values)[i])

                    target_predicted=''.join(e for e in save_name if e.isalnum())

                    #confusion: outcome TP TN FP FN
                    thresh = np.mean(df0['target'])
                    FP=0
                    FN=0
                    TP=0
                    TN=0
                    for j in df0.index:
                        cpred=df0.ix[j][1]
                        ctarget = df0.ix[j][2]

                        if cpred >= thresh and ctarget==1:
                            TP+=1
                        if cpred < thresh and ctarget==1:
                            FN+=1
                        if cpred >= thresh and ctarget==0:
                            FP+=1
                        if cpred < thresh and ctarget==0:
                            TN+=1

# [...] lines 444-557 of RadReportAnnotator.py are missing from this listing:
# the rest of calc_auc and the start of generate_labeled_data_features

            if(real_words > 0): temp_avg = np.divide(temp_avg,m_avg) #if vector was empty, just leave it zero

            for k in range(0,DIM_DOC2VEC):
                w2v_matrix[j,k]=temp_avg[k]

        j+=1
    return bow_matrix, pv_matrix,w2v_matrix,accid_list,orig_text,orig_input

def generate_wholeset_features(DIM_DOC2VEC,
                               processed_reports,
                               DO_PARAGRAPH_VECTOR,
                               DO_WORD2VEC,
569 | dictionary, 570 | corpus, 571 | d2vmodel, 572 | d2v_inputs): 573 | """ 574 | Generate numerical features to be used in Lasso logistic regressions using text data for all reports (labeled and unlabeled) 575 | 576 | Args: 577 | 578 | DIM_DOC2VEC: embedding dimensionality of doc2vec 579 | processed_reports: list of list of words, each entry in original list corresponding to a report 580 | DO_PARAGRAPH_VECTOR: use paragraph vector features? 581 | DO_WORD2VEC: use average word embedding features? 582 | dictionary: a dictionary compromised of the LDA input n-grams 583 | corpus: corpus with both unlabeled and labeled data, list of lists 584 | d2vmodel: trained doc2vec model object 585 | d2v_inputs: reports processed into d2v input format 586 | 587 | Returns: 588 | 589 | bow_matrix: numpy matrix with indicator bow features (1 if word present, 0 else), each row corresponds to a report 590 | pv_matrix: numpy matrix with paragraph vector embedding features, each row corresponds to a report 591 | w2v_matrix: numpy matrix with average word embedding features, each row corresponds to a report 592 | """ 593 | 594 | bow_matrix = np.zeros(shape=(len(corpus),len(dictionary)),dtype=np.int) 595 | pv_matrix = np.zeros(shape=(len(corpus),DIM_DOC2VEC),dtype=np.float64) 596 | w2v_matrix = np.zeros(shape=(len(corpus),DIM_DOC2VEC),dtype=np.float64) 597 | 598 | j=0 599 | for i in tqdm(range(0,len(corpus))): 600 | 601 | #fill feature columns - if ngram shows up in the document, mark it as 1, else leave as 0 602 | for k in range(0,len(corpus[i])): 603 | bow_matrix[j][corpus[i][k][0]]=1 604 | 605 | if(DO_PARAGRAPH_VECTOR): 606 | vect = d2vmodel.infer_vector(d2v_inputs[i],alpha=0.01, steps=50) 607 | 608 | for k in range(0,len(vect)): 609 | pv_matrix[j,k]=vect[k] 610 | 611 | if(DO_WORD2VEC): 612 | 613 | #we want to use vectors based on word average: 614 | temp_avg =np.zeros(shape=(DIM_DOC2VEC),dtype=np.float64) 615 | m_avg=0 616 | real_words=0 617 | for k in range(0,len(d2v_inputs[i])): 618 

                #ignore special end character, otherwise proceed
                if(d2v_inputs[i][k].lower()!="sentenceend"):
                    real_words+=1
                    try:
                        #if it can't find the word, skip it so it contributes nothing to the average
                        weight_avg = 1
                        temp_avg = np.add(temp_avg,weight_avg*d2vmodel[d2v_inputs[i][k]])
                        m_avg +=weight_avg
                    except KeyError:
                        pass # word not in the d2v vocabulary

            if(m_avg > 0): temp_avg = np.divide(temp_avg,m_avg) #if no word vectors were found, just leave it zero

            for k in range(0,DIM_DOC2VEC):
                w2v_matrix[j,k]=temp_avg[k]

        j+=1
    return bow_matrix,pv_matrix,w2v_matrix


def generate_outcomes(labeled_file,accid_list,N_THRESH_OUTCOMES):
    """
    Generate dataframe of labels to be used in Lasso logistic regressions

    Args:
        labeled_file: path to file with labels and accession ids
        accid_list: list of accession ids of each row in the labeled data that are also present in exported reports;
            needed to eliminate labeled reports for which we have no text (mistranscribed accession IDs, etc.)
        N_THRESH_OUTCOMES: eliminate outcomes that don't have this many positive / negative examples

    Returns:

        eligible_outcomes_aligned: dataframe of labels eligible for prediction
        all_outcomes_aligned: dataframe of all labels
        N_LABELS: total number of labels we predict
        outcome_header_list: list of headers corresponding to each label
    """

    outcomes = pd.read_excel(labeled_file)
    outcomes.set_index('Accession Number')
    outcomes_aligned2 = pd.DataFrame(data=accid_list, index=accid_list, columns=['Accession Number'])
    all_outcomes_aligned = pd.merge(outcomes_aligned2, outcomes, sort=False)

    #modify outcome matrix to only include outcomes with n_thresh_outcomes +/- observations

    outcome_remove=[]
    N_LABELS=all_outcomes_aligned.shape[1]
    print("total labels:"+str(N_LABELS))
    for i in range(0,N_LABELS):
        check=sum(all_outcomes_aligned.iloc[:,i])

        if(check < N_THRESH_OUTCOMES):
            outcome_remove.append(i)
        elif(check > ((all_outcomes_aligned.shape)[0]-N_THRESH_OUTCOMES)):
            outcome_remove.append(i)
        elif(math.isnan(check)):
            outcome_remove.append(i)

    eligible_outcomes_aligned=all_outcomes_aligned.drop(all_outcomes_aligned.columns[outcome_remove],axis=1)

    N_LABELS=eligible_outcomes_aligned.shape[1]
    print("labels eligible for inference:"+str(N_LABELS))

    outcome_header_list=list(eligible_outcomes_aligned)
    outcome_header_list=[x.replace(",",".") for x in outcome_header_list]
    outcome_header_list=",".join(outcome_header_list)

    return eligible_outcomes_aligned,all_outcomes_aligned, N_LABELS, outcome_header_list


def write_silver_standard_labels(corpus,
                                 N_LABELS,
                                 eligible_outcomes_aligned,
                                 DIM_DOC2VEC,
                                 processed_reports,
                                 DO_BOW,
                                 DO_PARAGRAPH_VECTOR,
                                 DO_WORD2VEC,
                                 dictionary,
                                 d2vmodel,
                                 d2v_inputs,
                                 lasso_models,
                                 accid_list,
                                 labeled_indices,
                                 df_data,
704 | SILVER_THRESHOLD): 705 | """ 706 | Generate inferred labels using trained Lasso regression models; override with hand-labeled data when available. 707 | 708 | Args: 709 | 710 | corpus: list of lists of tokens, each entry in original list corresponds to report 711 | N_LABELS: total labels we predict 712 | eligible_outcomes_aligned: dataframe of eligible labels for prediction 713 | DIM_DOC2VEC: embedding dimensionality of average word embedding features 714 | processed_reports: corpus of processed reports 715 | DO_BOW: include bag of words features? 716 | DO_PARAGRAPH_VECTOR: include paragraph vector features? 717 | DO_WORD2VEC: include average word embedding features? 718 | dictionary: dictionary mapping word to integer representation 719 | d2vmodel: trained doc2vec feature 720 | d2v_inputs: reports processed into doc2vec format 721 | lasso_models: list of saved Lasso logistic regression models, each index corresponds to a corresponding column in eligible_outcomes_aligned 722 | accid_list: list of accession ids of each row in the labeled data that are also present in exported reports 723 | labeled_indices: indices of labeled data (or data we treat as labeled because of TRAIN_INDEX_OVERRIDE) 724 | df_data: dataframe containing original reports and accession ids 725 | SILVER_THRESHOLD: "mean" or "fiftypct", defines threshold for converting probabilities to binary labels (mean of label vs. 50%). 
726 | note that in either case it will be overridden with true labels when available 727 | 728 | Returns: 729 | 730 | pred_outcome_df: dataframe containing accession ids and inferred labels 731 | 732 | """ 733 | 734 | pred_outcome_matrix_binary = np.zeros(shape=(len(corpus),N_LABELS),dtype=np.int) 735 | pred_outcome_matrix_proba = np.zeros(shape=(len(corpus),N_LABELS),dtype=np.float16) 736 | 737 | #we classify as true/false based on mean of predictor - note dependence on self.SILVER_THRESHOLD 738 | if(SILVER_THRESHOLD=="mean"): 739 | class_thresh = eligible_outcomes_aligned.mean(axis=0) 740 | elif(SILVER_THRESHOLD=="fiftypct"): 741 | class_thresh = [0.5]*eligible_outcomes_aligned.shape[1] 742 | 743 | for x in range(0,len(corpus),2000): 744 | #generate features for whole dataset so we can return inferred labels for deep learning on images themselves 745 | whole_bow_matrix,whole_pv_matrix,whole_w2v_matrix=generate_wholeset_features( 746 | DIM_DOC2VEC, 747 | processed_reports[x:x+2000], 748 | DO_PARAGRAPH_VECTOR, 749 | DO_WORD2VEC,dictionary,corpus[x:x+2000],d2vmodel,d2v_inputs[x:x+2000]) 750 | 751 | #use everything available for prediction - done in chunks to avoid memory issues 752 | #whole_combined_matrix=np.hstack((whole_w2v_matrix,whole_bow_matrix,whole_pv_matrix)) 753 | if(DO_BOW and DO_WORD2VEC and DO_PARAGRAPH_VECTOR): whole_combined_matrix=np.hstack((whole_bow_matrix,whole_w2v_matrix,whole_pv_matrix)) 754 | 755 | if(DO_BOW and DO_WORD2VEC and not DO_PARAGRAPH_VECTOR): whole_combined_matrix=np.hstack((whole_bow_matrix,whole_w2v_matrix)) 756 | if(DO_BOW and not DO_WORD2VEC and DO_PARAGRAPH_VECTOR): whole_combined_matrix=np.hstack((whole_bow_matrix,whole_pv_matrix)) 757 | if(not DO_BOW and DO_WORD2VEC and DO_PARAGRAPH_VECTOR): whole_combined_matrix=np.hstack((whole_w2v_matrix,whole_pv_matrix)) 758 | 759 | if(DO_BOW and not DO_WORD2VEC and not DO_PARAGRAPH_VECTOR): whole_combined_matrix=whole_bow_matrix 760 | if(not DO_BOW and DO_WORD2VEC and not 
DO_PARAGRAPH_VECTOR): whole_combined_matrix=whole_w2v_matrix 761 | if(not DO_BOW and not DO_WORD2VEC and DO_PARAGRAPH_VECTOR): whole_combined_matrix=whole_pv_matrix 762 | 763 | for i in range(0,N_LABELS): 764 | pred_proba=lasso_models[i].predict_proba(whole_combined_matrix)[:,1] 765 | pred_binary = (pred_proba > class_thresh[i]).astype(int) 766 | pred_outcome_matrix_proba[x:x+2000,i]=pred_proba 767 | pred_outcome_matrix_binary[x:x+2000,i]=pred_binary 768 | 769 | #generate list of accession #s for export 770 | accession_list = [] 771 | for i in range(0,len(corpus)): 772 | accession_list.append(df_data['Accession Number'].iloc[i]) 773 | 774 | pred_outcome_proba_df = pd.DataFrame(pred_outcome_matrix_proba, index = accession_list, columns = list(eligible_outcomes_aligned.columns.values) ) 775 | pred_outcome_binary_df = pd.DataFrame(pred_outcome_matrix_binary, index = accession_list, columns = list(eligible_outcomes_aligned.columns.values) ) 776 | 777 | #get accuracy by column 778 | 779 | outcome_lookup ={} 780 | for i in range(0,len(accid_list)): 781 | outcome_lookup[accid_list[i]]=i 782 | 783 | errors = np.zeros(shape=(N_LABELS,1),dtype=np.int) 784 | denom = np.zeros(shape=(N_LABELS,1),dtype=np.int) 785 | tp = np.zeros(shape=(N_LABELS,1),dtype=np.int) 786 | fp = np.zeros(shape=(N_LABELS,1),dtype=np.int) 787 | tn = np.zeros(shape=(N_LABELS,1),dtype=np.int) 788 | fn = np.zeros(shape=(N_LABELS,1),dtype=np.int) 789 | 790 | for i in range(0,len(corpus)): 791 | if i in labeled_indices: # need to evaluate 792 | #grab accession # 793 | accno = df_data['Accession Number'].iloc[i] 794 | 795 | for k in range(0,N_LABELS): 796 | 797 | #does our predicted value match the true one? 
if not, record discrepancy 798 | if(eligible_outcomes_aligned.ix[outcome_lookup[accno],k]!=pred_outcome_binary_df.iloc[i,k]): 799 | errors[k]+=1 800 | denom[k]+=1 801 | 802 | #set probabilistic predictions to labeled ones regardless 803 | pred_outcome_proba_df.iloc[i,k]=eligible_outcomes_aligned.ix[outcome_lookup[accno],k] 804 | 805 | #if disagreement btw pred and hand-labeled data, use hand labeled 806 | if(eligible_outcomes_aligned.ix[outcome_lookup[accno],k]!=pred_outcome_binary_df.iloc[i,k]): 807 | pred_outcome_binary_df.iloc[i,k]=eligible_outcomes_aligned.ix[outcome_lookup[accno],k] 808 | 809 | #print('classifier accuracy by label on all labeled data including that used to train it (process integrity check)') 810 | #print(str(1-(errors/denom))) 811 | pred_outcome_binary_df.set_index(df_data['Accession Number'],inplace=True) 812 | pred_outcome_proba_df.set_index(df_data['Accession Number'],inplace=True) 813 | return pred_outcome_binary_df,pred_outcome_proba_df 814 | 815 | 816 | def give_stop_words(): 817 | """ 818 | Returns list of stop words. 819 | 820 | Arguments: 821 | 822 | None 823 | 824 | Returns: 825 | 826 | stop_words: a list of stop words. note - we have removed stop words from this example; you can add them below if you have a list of stop words for your application. 827 | """ 828 | #stop_words=["word1", "word2", ...] 829 | stop_words=[] 830 | 831 | 832 | return stop_words 833 | 834 | 835 | class RadReportAnnotator(object): 836 | 837 | def __init__(self, report_dir_path, validation_file_path): 838 | """ 839 | Initialize RadReportAnnotator class 840 | 841 | Args: 842 | 843 | report_dir_path: FOLDER where reports are located in montage xls. Expects columns titled "Accession Number" and "Report Text"; can specify alternate labels in define_config() 844 | validation_file_path: FILE with human-labeled reports file. Expects column titled "Accession Number" as first column, every subsequent column will be interpreted as a label to be predicted. 

        Returns:

            Nothing

        """

        #USER MODIFIABLE SETTINGS - USE define_config() TO SET
        self.DO_BOW=None
        self.DO_WORD2VEC=None
        self.DO_PARAGRAPH_VECTOR=None
        self.DO_SILVER_STANDARD=None
        self.STEM_WORDS=None
        self.N_GRAM_SIZES = None
        self.DIM_DOC2VEC = None
        self.N_THRESH_CORPUS=None
        self.N_THRESH_OUTCOMES=None
        self.TRAIN_INDEX_OVERRIDE = None
        self.SILVER_THRESHOLD=None
        self.NAME_UNID_LABELED_FILE = None
        self.NAME_UNID_REPORTS= None
        self.NAME_TEXT_REPORTS= None


        #SETTINGS YOU WILL LIKELY WISH TO LEAVE AS IS, BUT CAN MODIFY IF NEEDED
        self.D2V_EPOCH = 20 # number of epochs to train D2V for; 20 works well
        self.W2V_DM = 1 # 1 is PV-DM, otherwise PV-DBOW
        self.W2V_WINDOW = 5 #we can try 3,5,7
        self.data_dir = report_dir_path # base directory for raw reports
        self.validation_file = validation_file_path # file containing report annotations
        self.ASSIGNFOLD_USING_ROW=False # normally in lasso regression modeling 60% train / 40% test splits are done randomly.
        # you can do them by row if you need consistency across runs


        #MENTIONING CLASS OBJECTS USED INTERNALLY LATER
        self.df_data=None
        self.processed_reports=None
        self.labeled_indices=None
        self.d2v_inputs=None
        self.ngram_reports =None
        self.corpus = None
        self.train_corpus = None
        self.test_corpus = None
        self.dictionary = None
        self.train_docs = None # w2v
        self.test_docs = None
        self.d2vmodel = None
        self.bow_matrix = None
        self.combined = None
        self.pv_matrix = None
        self.w2v_matrix = None
        self.accid_list = None
        self.orig_text = None
        self.orig_input = None
        self.eligible_outcomes_aligned = None
        self.all_outcomes_aligned = None
        self.N_LABELS = None
        self.outcome_header_list = None
        self.lasso_models = None
        self.inferred_binary_labels = None
        self.inferred_proba_labels = None
        self.headers = None
        self.accuracy = None


    def define_config(self, DO_BOW=True, DO_WORD2VEC=False, DO_PARAGRAPH_VECTOR=False,DO_SILVER_STANDARD=True,STEM_WORDS=True,N_GRAM_SIZES=[1],DIM_DOC2VEC=200,N_THRESH_CORPUS=1,N_THRESH_OUTCOMES=1,TRAIN_INDEX_OVERRIDE=[], SILVER_THRESHOLD="mean", NAME_UNID_REPORTS="Accession Number",NAME_TEXT_REPORTS="Report Text"):
        """
        Sets parameters for RadReportAnnotator.

        Args:

            DO_BOW: True to use indicator bag of words-based features (1 if word present in doc, 0 if not).
            DO_WORD2VEC: True to use word2vec-based average word embedding features.
            DO_PARAGRAPH_VECTOR: True to use word2vec-based paragraph vector embedding features.
            DO_SILVER_STANDARD: True to infer labels for unlabeled reports.
            STEM_WORDS: True to stem words for BOW analysis; words are unstemmed in doc2vec analysis
            N_GRAM_SIZES: Which set of n-grams to use in BOW analysis: [1] = 1 grams only, [3] = 3 grams only, [1,2,3] = 1, 2, and 3- grams.
            DIM_DOC2VEC: Dimensionality of doc2vec manifold; recommend value in 50 to 400
            N_THRESH_CORPUS: ignore any n-grams that appear fewer than N times in the entire corpus
            N_THRESH_OUTCOMES: do not train models for labels that don't have at least this many positive and negative examples.
            TRAIN_INDEX_OVERRIDE: list of accession numbers we force to be treated as unlabeled data even though they are labeled (ie, these will *not* be used in Lasso regressions). May be used if all of your reports are labeled, as some unlabeled reports are required for d2v training.
            SILVER_THRESHOLD: how to threshold probability predictions in infer_labels to get binary labels.
                can be "mean" or "fiftypct"
                "mean" sets any predicted probability greater than population mean to 1, else 0; e.g., prediction 0.10 in a label with average 0.05 is set to 1
                "fiftypct" sets any predicted probability >50% to 1, otherwise 0
                both settings have issues, and class imbalance is a major issue in training convolutional nets.
                we recommend using probabilities if your model can accommodate it.
            NAME_UNID_REPORTS: column name of accession number / unique report id in the read-in *reports* file. provided for convenience as there may be many report files.
            NAME_TEXT_REPORTS: column name of report text in the read-in reports file. provided for convenience as there may be many report files.
        Returns:

            Nothing

        """

        self.DO_BOW=DO_BOW #generate results for bag of words approach?
        self.DO_WORD2VEC=DO_WORD2VEC #generate results (tfidf and avg weight) for word2vec approach?
        self.DO_PARAGRAPH_VECTOR=DO_PARAGRAPH_VECTOR #generate results for paragraph vector approach?
        self.DO_SILVER_STANDARD=DO_SILVER_STANDARD #generate silver standard labels?
        self.STEM_WORDS=STEM_WORDS #should we stem words for BOW, LDA analysis? (we never stem words for doc2vec/w2v analysis, see below)
        if N_GRAM_SIZES not in ([1],[2],[3],[1,2],[1,3],[1,2,3]):
            raise ValueError('Invalid N_GRAM_SIZES argument:'+str(N_GRAM_SIZES)+", please review documentation for proper format (e.g., [1])")
        self.N_GRAM_SIZES = N_GRAM_SIZES # how many n-grams to use in BOW, LDA analyses? [1] = 1 grams only, [3] = 3 grams only, [1,2,3] = 1, 2, and 3- grams.
        self.DIM_DOC2VEC = DIM_DOC2VEC #dimensionality of doc2vec manifold
        self.N_THRESH_CORPUS=N_THRESH_CORPUS # delete any n-grams that appear fewer than N times in the entire corpus
        self.N_THRESH_OUTCOMES=N_THRESH_OUTCOMES # delete any predictors that don't have at least N-many positive and negative examples
        self.TRAIN_INDEX_OVERRIDE = TRAIN_INDEX_OVERRIDE # list of accession numbers you want to force to be included as unlabeled data even though they are labeled (ie, these will *not* be used for predictions). Some unlabeled reports are required for d2v training.
        self.SILVER_THRESHOLD=SILVER_THRESHOLD
        self.NAME_UNID_REPORTS = NAME_UNID_REPORTS
        self.NAME_TEXT_REPORTS = NAME_TEXT_REPORTS

        if(self.DO_BOW==False and self.DO_WORD2VEC==False and self.DO_PARAGRAPH_VECTOR==False): raise ValueError("DO_BOW, DO_WORD2VEC and DO_PARAGRAPH_VECTOR cannot all be False")

    def build_corpus(self):
        """
        Builds corpus of reports and generates numerical features from reports for later analysis.
        Please run define_config() beforehand.
962 | 963 | Arguments: 964 | 965 | None 966 | 967 | Returns: 968 | 969 | None 970 | """ 971 | 972 | #assemble dataframe of reports 973 | self.df_data = join_montage_files(self.data_dir, self.NAME_UNID_REPORTS, self.NAME_TEXT_REPORTS) # build dataframe with all the report text 974 | 975 | #get list of stop words 976 | en_stop = give_stop_words() 977 | 978 | # preprocess report text, get list with length (# reports) and text after first round of processing. 979 | # if curious to see how it works, look at processed_reports[0] to see first report. 980 | self.processed_reports = preprocess_data(self.df_data, en_stop, stem_words=True) 981 | 982 | #determine which indices should be used for 983 | self.labeled_indices = get_labeled_indices(self.df_data,self.validation_file,self.TRAIN_INDEX_OVERRIDE) 984 | 985 | #build n-grams of desired size, takes a list of sizes and frequency threshold as inputs 986 | self.ngram_reports = create_ngrams( 987 | self.processed_reports, 988 | self.labeled_indices, 989 | N_GRAM_SIZES=self.N_GRAM_SIZES, 990 | freq_threshold=self.N_THRESH_CORPUS) #now we create n-grams 991 | 992 | # generate inputs for doc2vec/word2vec model 993 | # can see example report - d2v_inputs[0] 994 | self.d2v_inputs= remove_infrequent_tokens(self.processed_reports,self.N_THRESH_CORPUS,self.labeled_indices) 995 | 996 | #assemble train/test corpora and a word dict. 
        self.corpus, self.train_corpus, self.test_corpus, self.dictionary, self.labeled_indices = build_train_test_corpus(
            self.df_data,
            self.ngram_reports,
            self.validation_file,
            self.TRAIN_INDEX_OVERRIDE)

        #train doc2vec/word2vec if indicated:
        if(self.DO_WORD2VEC or self.DO_PARAGRAPH_VECTOR):
            self.train_docs, self.test_docs, self.total_train_words = build_d2v_corpora(self.df_data,self.d2v_inputs,self.labeled_indices)
            self.d2vmodel=train_d2v(self.train_docs, self.test_docs, self.D2V_EPOCH, self.DIM_DOC2VEC, self.W2V_DM, self.W2V_WINDOW, self.total_train_words)

    def infer_labels(self):
        """
        Infers labels for unlabeled documents.
        Please run build_corpus() beforehand.

        Arguments:

            None

        Returns:

            self.inferred_binary_labels: dataframe containing inferred binary labels (hand labels are used where available)
            self.inferred_proba_labels: dataframe containing inferred label probabilities (hand labels are used where available)
        """

        #get the numerical features of text we need to train models for labels
        self.bow_matrix, self.pv_matrix,self.w2v_matrix,self.accid_list,self.orig_text,self.orig_input=generate_labeled_data_features(
            self.validation_file,
            self.labeled_indices,
            self.DIM_DOC2VEC,
            self.df_data,
            self.processed_reports,
            self.DO_PARAGRAPH_VECTOR,
            self.DO_WORD2VEC,
            self.dictionary,
            self.corpus,
            self.d2vmodel,
            self.d2v_inputs)

        #get and process labels for reports
        self.eligible_outcomes_aligned,self.all_outcomes_aligned, self.N_LABELS, self.outcome_header_list = generate_outcomes(
            self.validation_file,
            self.accid_list,
            self.N_THRESH_OUTCOMES)

        #to generate silver standard labels -- use whatever features are generated (word2vec average word embeddings, bow features, paragraph vector matrix)
        if(self.DO_BOW and self.DO_WORD2VEC and self.DO_PARAGRAPH_VECTOR): self.combined=np.hstack((self.bow_matrix,self.w2v_matrix,self.pv_matrix))

        if(self.DO_BOW and self.DO_WORD2VEC and not self.DO_PARAGRAPH_VECTOR): self.combined=np.hstack((self.bow_matrix,self.w2v_matrix))
        if(self.DO_BOW and not self.DO_WORD2VEC and self.DO_PARAGRAPH_VECTOR): self.combined=np.hstack((self.bow_matrix,self.pv_matrix))
        if(not self.DO_BOW and self.DO_WORD2VEC and self.DO_PARAGRAPH_VECTOR): self.combined=np.hstack((self.w2v_matrix,self.pv_matrix))

        if(self.DO_BOW and not self.DO_WORD2VEC and not self.DO_PARAGRAPH_VECTOR): self.combined=self.bow_matrix
        if(not self.DO_BOW and self.DO_WORD2VEC and not self.DO_PARAGRAPH_VECTOR): self.combined=self.w2v_matrix
        if(not self.DO_BOW and not self.DO_WORD2VEC and self.DO_PARAGRAPH_VECTOR): self.combined=self.pv_matrix

        #create header for combined predictor matrix so we can interpret coefficients
        self.headers=[]
        if(self.DO_BOW): self.headers=self.headers + [self.dictionary[i] for i in self.dictionary]
        if(self.DO_WORD2VEC): self.headers=self.headers + ["W2V"+str(i) for i in range(0,self.DIM_DOC2VEC)]
        if(self.DO_PARAGRAPH_VECTOR): self.headers=self.headers + ["PV"+str(i) for i in range(0,self.DIM_DOC2VEC)]


        pred_type = "combined" # a label for results
        print("dimensionality of predictor matrix:"+str(self.combined.shape))

        #run lasso regressions
        self.lasso_models, self.accuracy = calc_auc(self.combined,self.eligible_outcomes_aligned,self.all_outcomes_aligned, self.N_LABELS, pred_type, self.headers,self.ASSIGNFOLD_USING_ROW)

        #infer labels
        self.inferred_binary_labels, self.inferred_proba_labels = write_silver_standard_labels(self.corpus,
            self.N_LABELS,
            self.eligible_outcomes_aligned,
            self.DIM_DOC2VEC,
            self.processed_reports,
            self.DO_BOW,
            self.DO_PARAGRAPH_VECTOR,
            self.DO_WORD2VEC,
            self.dictionary,
            self.d2vmodel,
            self.d2v_inputs,
            self.lasso_models,
            self.accid_list,
            self.labeled_indices,
            self.df_data,
            self.SILVER_THRESHOLD)
        return self.inferred_binary_labels, self.inferred_proba_labels

--------------------------------------------------------------------------------
/environment.yml:
--------------------------------------------------------------------------------
name: rad_env
channels:
  - conda-forge
dependencies:
  - python=3.6.5
  - pandas=0.22.0
  - numpy=1.14.2
  - tqdm=4.23.0
  - scikit-learn=0.19.0
  - xlrd=1.1.0
  - jupyterlab=0.35.0
  - nb_conda_kernels=2.1.1
  - nltk=3.4.4
  - gensim=3.5.0

--------------------------------------------------------------------------------
/pseudodata/labels/labeled_reports.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aisinai/rad-report-annotator/61ee948866bb09d272fc75210c63dcb818b3c21d/pseudodata/labels/labeled_reports.xlsx

--------------------------------------------------------------------------------
/pseudodata/reports/words.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aisinai/rad-report-annotator/61ee948866bb09d272fc75210c63dcb818b3c21d/pseudodata/reports/words.xlsx

--------------------------------------------------------------------------------