├── Demo Notebook.ipynb
├── README.md
├── RadReportAnnotator.py
├── environment.yml
└── pseudodata
    ├── labels
    │   └── labeled_reports.xlsx
    └── reports
        └── words.xlsx
/Demo Notebook.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# RadReportAnnotator Demo\n",
8 | "\n",
9 | "We demonstrate on data from the [Indiana University Chest X-ray Dataset (Demner-Fushman et al.)](https://www.ncbi.nlm.nih.gov/pubmed/26133894)\n",
10 | "\n",
11 | "This example can be adapted to your own collection of radiology reports exported from Montage \n",
12 | "and a manually-generated set of classification labels"
13 | ]
14 | },
15 | {
16 | "cell_type": "markdown",
17 | "metadata": {},
18 | "source": [
19 | "Import library:"
20 | ]
21 | },
22 | {
23 | "cell_type": "code",
24 | "execution_count": 1,
25 | "metadata": {
26 | "collapsed": true
27 | },
28 | "outputs": [],
29 | "source": [
30 | "import RadReportAnnotator as ra\n",
31 | "import os.path"
32 | ]
33 | },
34 | {
35 | "cell_type": "markdown",
36 | "metadata": {},
37 | "source": [
38 | "Instantiate RadReportAnnotator object with paths to demo `reports` and `labels`. \n",
39 | "\n",
40 | "`Reports` contains 3,666 deidentified chest x-ray radiology reports. \n",
41 | "\n",
42 | "`Labels` contains binary labels for `Normal`, `Opacity`, `Cardiomegaly`, `Nodule`, and `Fibrosis` for 1,500 of these reports."
43 | ]
44 | },
45 | {
46 | "cell_type": "code",
47 | "execution_count": 2,
48 | "metadata": {
49 | "collapsed": true
50 | },
51 | "outputs": [],
52 | "source": [
53 | "CXRAnnotator = ra.RadReportAnnotator(report_dir_path=os.path.join(\"pseudodata\",\"reports\"), \n",
54 | " validation_file_path=os.path.join(\"pseudodata\",\"labels\",\"labeled_reports.xlsx\"))"
55 | ]
56 | },
57 | {
58 | "cell_type": "markdown",
59 | "metadata": {},
60 | "source": [
61 | "Set arguments for RadReportAnnotator here in define_config - see documentation in RadReportAnnotator for more information.\n",
62 | "\n",
63 | "Models that use only bag of words (`DO_BOW=True,DO_WORD2VEC=False`) have been competitive in our experience with those that use both bag of words and word embeddings (`DO_BOW=True, DO_WORD2VEC=True`). Word embeddings can take considerable time to train on larger datasets. \n",
64 | "\n",
65 | "In the below demo, we use bag of words features (`DO_BOW=True`) with 1, 2, and 3-grams (`N_GRAM_SIZES=[1,2,3]`)."
66 | ]
67 | },
68 | {
69 | "cell_type": "code",
70 | "execution_count": 3,
71 | "metadata": {
72 | "collapsed": true
73 | },
74 | "outputs": [],
75 | "source": [
76 | "CXRAnnotator.define_config(DO_BOW=True,\n",
77 | "\tDO_WORD2VEC=False,\n",
78 | "\tDO_PARAGRAPH_VECTOR=False,\n",
79 | "\tN_GRAM_SIZES=[1,2,3],\n",
80 | "\tSILVER_THRESHOLD=\"fiftypct\",\n",
81 | "\tNAME_UNID_REPORTS = \"ACCID\", \n",
82 | "\tNAME_TEXT_REPORTS =\"REPORT\", \n",
83 | "\tN_THRESH_CORPUS=10,\n",
84 | "\tN_THRESH_OUTCOMES=50)"
85 | ]
86 | },
87 | {
88 | "cell_type": "markdown",
89 | "metadata": {},
90 | "source": [
91 | "Build corpus from reports"
92 | ]
93 | },
94 | {
95 | "cell_type": "code",
96 | "execution_count": 4,
97 | "metadata": {},
98 | "outputs": [
99 | {
100 | "name": "stdout",
101 | "output_type": "stream",
102 | "text": [
103 | "building pre-corpus\n",
104 | "pre-corpus built\n",
105 | "preprocessing reports\n"
106 | ]
107 | },
108 | {
109 | "name": "stderr",
110 | "output_type": "stream",
111 | "text": [
112 | "100%|█████████████████████████████████████████████████████████████████████████████| 3666/3666 [00:07<00:00, 473.86it/s]\n"
113 | ]
114 | },
115 | {
116 | "name": "stdout",
117 | "output_type": "stream",
118 | "text": [
119 | "creating n-grams\n"
120 | ]
121 | },
122 | {
123 | "name": "stderr",
124 | "output_type": "stream",
125 | "text": [
126 | "100%|████████████████████████████████████████████████████████████████████████████| 3666/3666 [00:00<00:00, 6268.53it/s]\n"
127 | ]
128 | },
129 | {
130 | "name": "stdout",
131 | "output_type": "stream",
132 | "text": [
133 | "number of unique n-grams: 33865\n",
134 | "number of unique n-grams after filtering out low frequency tokens: 2425\n"
135 | ]
136 | }
137 | ],
138 | "source": [
139 | "CXRAnnotator.build_corpus()"
140 | ]
141 | },
142 | {
143 | "cell_type": "markdown",
144 | "metadata": {},
145 | "source": [
146 | "We can examine how the preprocessing works. Let's look at the original input text for report at index 500:"
147 | ]
148 | },
149 | {
150 | "cell_type": "code",
151 | "execution_count": 5,
152 | "metadata": {},
153 | "outputs": [
154 | {
155 | "data": {
156 | "text/plain": [
157 | "' Comparison: None Indication: Central line placement Findings: The heart is borderline in size. The aorta is mildly tortuous. XXXX right IJ catheter is in XXXX with tip in proximal right atrium/cavoatrial junction. There is no pneumothorax. Lungs are grossly clear. There is no large effusion. Impression: Right IJ catheter tip in proximal right atrium. No pneumothorax. '"
158 | ]
159 | },
160 | "execution_count": 5,
161 | "metadata": {},
162 | "output_type": "execute_result"
163 | }
164 | ],
165 | "source": [
166 | "CXRAnnotator.df_data['Report Text'].iloc[500]"
167 | ]
168 | },
169 | {
170 | "cell_type": "markdown",
171 | "metadata": {},
172 | "source": [
173 | "Let's look this report after preprocessing:"
174 | ]
175 | },
176 | {
177 | "cell_type": "code",
178 | "execution_count": 6,
179 | "metadata": {},
180 | "outputs": [
181 | {
182 | "name": "stdout",
183 | "output_type": "stream",
184 | "text": [
185 | "['comparison', 'none', 'indic', 'central', 'line', 'placement', 'find', 'the', 'heart', 'is', 'borderlin', 'in', 'size', 'sentenceend', 'the', 'aorta', 'is', 'mildli', 'tortuou', 'sentenceend', 'xxxx', 'right', 'ij', 'cathet', 'is', 'in', 'xxxx', 'with', 'tip', 'in', 'proxim', 'right', 'atrium', 'cavoatri', 'junction', 'sentenceend', 'there', 'is', 'no', 'pneumothorax', 'sentenceend', 'lung', 'are', 'grossli', 'clear', 'sentenceend', 'there', 'is', 'no', 'larg', 'effus', 'sentenceend', 'impress', 'right', 'ij', 'cathet', 'tip', 'in', 'proxim', 'right', 'atrium', 'sentenceend', 'no', 'pneumothorax', 'sentenceend', 'sentenceend', 'sentenceend', 'sentenceend']\n"
186 | ]
187 | }
188 | ],
189 | "source": [
190 | "print(CXRAnnotator.processed_reports[500])"
191 | ]
192 | },
193 | {
194 | "cell_type": "markdown",
195 | "metadata": {},
196 | "source": [
197 | "Words were stemmed (\"indication\"-->\"indic\"), extra punctuation was removed, and periods were replaced with the special end character. Word2vec takes input in a format like this to learn word embeddings.\n",
198 | "\n",
199 | "Let's look at the n-gram features for this report, which will be used for bag of words modeling:"
200 | ]
201 | },
202 | {
203 | "cell_type": "code",
204 | "execution_count": 7,
205 | "metadata": {},
206 | "outputs": [
207 | {
208 | "name": "stdout",
209 | "output_type": "stream",
210 | "text": [
211 | "['find_the_heart', 'the_heart_is', 'the_aorta_is', 'lung_are_grossli', 'are_grossli_clear', 'no_larg_effus', 'comparison_none', 'find_the', 'the_heart', 'heart_is', 'in_size', 'the_aorta', 'aorta_is', 'is_mildli', 'xxxx_right', 'is_in', 'in_xxxx', 'xxxx_with', 'with_tip', 'tip_in', 'right_atrium', 'there_is', 'no_pneumothorax', 'lung_are', 'are_grossli', 'grossli_clear', 'there_is', 'no_larg', 'larg_effus', 'impress_right', 'cathet_tip', 'tip_in', 'right_atrium', 'no_pneumothorax', 'comparison', 'none', 'indic', 'central', 'line', 'placement', 'find', 'the', 'heart', 'is', 'borderlin', 'in', 'size', 'the', 'aorta', 'is', 'mildli', 'tortuou', 'xxxx', 'right', 'cathet', 'is', 'in', 'xxxx', 'with', 'tip', 'in', 'right', 'atrium', 'junction', 'there', 'is', 'pneumothorax', 'lung', 'are', 'grossli', 'clear', 'there', 'is', 'larg', 'effus', 'impress', 'right', 'cathet', 'tip', 'in', 'right', 'atrium', 'pneumothorax']\n"
212 | ]
213 | }
214 | ],
215 | "source": [
216 | "print(CXRAnnotator.ngram_reports[500])"
217 | ]
218 | },
219 | {
220 | "cell_type": "markdown",
221 | "metadata": {},
222 | "source": [
223 | "Since we have `N_GRAM_SIZES=[1,2,3]` in this demo, we see individual words (1-grams), each 2 consecutive words (2-grams; e.g., 'comparison_none'), and each 3 consecutive words ('no_larg_effus') available as features. Sometimes these 2- and 3-grams are uninformative ('comparison_none'), at other times they may be useful ('no_pneumothorax'). Note that only n-grams appearing `N_THRESH_CORPUS` times in training data (10 in this demo) are included. "
224 | ]
225 | },
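  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick illustrative check (a sketch, not part of the original demo), we can count how many reports contain a given n-gram feature:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Count reports whose n-gram features include the 2-gram 'no_pneumothorax'\n",
    "sum(1 for r in CXRAnnotator.ngram_reports if 'no_pneumothorax' in r)"
   ]
  },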
226 | {
227 | "cell_type": "markdown",
228 | "metadata": {},
229 | "source": [
230 | "Train Lasso logistic regression models using features from 60% of labeled reports and infer labels for 40% of labeled reports (for performance evaluation) and unlabeled reports (for ultimate application):"
231 | ]
232 | },
233 | {
234 | "cell_type": "code",
235 | "execution_count": 8,
236 | "metadata": {},
237 | "outputs": [
238 | {
239 | "name": "stdout",
240 | "output_type": "stream",
241 | "text": [
242 | "generating features\n"
243 | ]
244 | },
245 | {
246 | "name": "stderr",
247 | "output_type": "stream",
248 | "text": [
249 | "100%|████████████████████████████████████████████████████████████████████████████| 1500/1500 [00:00<00:00, 4099.24it/s]\n"
250 | ]
251 | },
252 | {
253 | "name": "stdout",
254 | "output_type": "stream",
255 | "text": [
256 | "total labels:6\n",
257 | "labels eligible for inference:4\n",
258 | "dimensionality of predictor matrix:(1500, 2425)\n",
259 | "n_train in modeling=900\n",
260 | "n_test in modeling=600\n",
261 | "i=0\n"
262 | ]
263 | },
264 | {
265 | "name": "stderr",
266 | "output_type": "stream",
267 | "text": [
268 | "100%|███████████████████████████████████████████████████████████████████████████| 2000/2000 [00:00<00:00, 26965.13it/s]\n",
269 | "100%|███████████████████████████████████████████████████████████████████████████| 1666/1666 [00:00<00:00, 19683.19it/s]\n"
270 | ]
271 | }
272 | ],
273 | "source": [
274 | "binary_labels, proba_labels = CXRAnnotator.infer_labels()"
275 | ]
276 | },
277 | {
278 | "cell_type": "markdown",
279 | "metadata": {},
280 | "source": [
281 | "Examine quality of predictions on held out 40% of labeled data."
282 | ]
283 | },
284 | {
285 | "cell_type": "code",
286 | "execution_count": 9,
287 | "metadata": {},
288 | "outputs": [
289 | {
290 | "data": {
291 | "text/html": [
292 | "
\n",
293 | "\n",
306 | "
\n",
307 | " \n",
308 | " \n",
309 | " | \n",
310 | " AUC | \n",
311 | " True + | \n",
312 | " False + | \n",
313 | " True - | \n",
314 | " False - | \n",
315 | "
\n",
316 | " \n",
317 | " Label (with calcs on held out 40 pct) | \n",
318 | " | \n",
319 | " | \n",
320 | " | \n",
321 | " | \n",
322 | " | \n",
323 | "
\n",
324 | " \n",
325 | " \n",
326 | " \n",
327 | " Normal | \n",
328 | " 0.956679 | \n",
329 | " 208 | \n",
330 | " 53 | \n",
331 | " 324 | \n",
332 | " 15 | \n",
333 | "
\n",
334 | " \n",
335 | " Opacity | \n",
336 | " 0.981869 | \n",
337 | " 62 | \n",
338 | " 17 | \n",
339 | " 517 | \n",
340 | " 4 | \n",
341 | "
\n",
342 | " \n",
343 | " Cardiomegaly | \n",
344 | " 0.993979 | \n",
345 | " 41 | \n",
346 | " 18 | \n",
347 | " 541 | \n",
348 | " 0 | \n",
349 | "
\n",
350 | " \n",
351 | " Nodule | \n",
352 | " 0.991759 | \n",
353 | " 16 | \n",
354 | " 36 | \n",
355 | " 548 | \n",
356 | " 0 | \n",
357 | "
\n",
358 | " \n",
359 | "
\n",
360 | "
"
361 | ],
362 | "text/plain": [
363 | " AUC True + False + True - \\\n",
364 | "Label (with calcs on held out 40 pct) \n",
365 | "Normal 0.956679 208 53 324 \n",
366 | "Opacity 0.981869 62 17 517 \n",
367 | "Cardiomegaly 0.993979 41 18 541 \n",
368 | "Nodule 0.991759 16 36 548 \n",
369 | "\n",
370 | " False - \n",
371 | "Label (with calcs on held out 40 pct) \n",
372 | "Normal 15 \n",
373 | "Opacity 4 \n",
374 | "Cardiomegaly 0 \n",
375 | "Nodule 0 "
376 | ]
377 | },
378 | "execution_count": 9,
379 | "metadata": {},
380 | "output_type": "execute_result"
381 | }
382 | ],
383 | "source": [
384 | "CXRAnnotator.accuracy"
385 | ]
386 | },
387 | {
388 | "cell_type": "markdown",
389 | "metadata": {},
390 | "source": [
391 | "Notice `Fibrosis` was filtered out despite appearing in input data as we had very few positive observations. It is important to ensure that sufficient positive and negative cases for each label exist in your labeled data.\n",
392 | "\n",
393 | "Rare labels with high AUC may still have a significant number of false positives (`Nodule`). Be aware of noise introduced by your labeling process before using inferred labels to train convolutional neural networks or other algorithms, and consider the positive predictive value (PPV) of a positive label. Additional labeled examples, particularly of rare pathology, may help improve accuracy. \n",
394 | "\n",
395 | "Recent results ([Ghafoorian et al.](https://arxiv.org/abs/1801.05040) [Rajpurkar et al.](https://arxiv.org/abs/1711.05225)) demonstrate that deep learning can achieve impressive results when trained to a large noisily labeled radiological imaging dataset."
396 | ]
397 | },
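  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As an illustrative sketch (not executed in the original demo), the PPV of each label can be computed from the confusion counts in `CXRAnnotator.accuracy` shown above:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# PPV = TP / (TP + FP) on the held-out 40%.\n",
    "# For Nodule this is 16/(16+36) = ~0.31 despite an AUC of 0.99.\n",
    "CXRAnnotator.accuracy['True +'] / (CXRAnnotator.accuracy['True +'] + CXRAnnotator.accuracy['False +'])"
   ]
  },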
398 | {
399 | "cell_type": "markdown",
400 | "metadata": {},
401 | "source": [
402 | "Examine a few probabilistic predictions:"
403 | ]
404 | },
405 | {
406 | "cell_type": "code",
407 | "execution_count": 10,
408 | "metadata": {},
409 | "outputs": [
410 | {
411 | "data": {
412 | "text/html": [
413 | "\n",
414 | "\n",
427 | "
\n",
428 | " \n",
429 | " \n",
430 | " | \n",
431 | " Normal | \n",
432 | " Opacity | \n",
433 | " Cardiomegaly | \n",
434 | " Nodule | \n",
435 | "
\n",
436 | " \n",
437 | " Accession Number | \n",
438 | " | \n",
439 | " | \n",
440 | " | \n",
441 | " | \n",
442 | "
\n",
443 | " \n",
444 | " \n",
445 | " \n",
446 | " 103661 | \n",
447 | " 0.113953 | \n",
448 | " 0.007305 | \n",
449 | " 0.022156 | \n",
450 | " 0.009483 | \n",
451 | "
\n",
452 | " \n",
453 | " 103662 | \n",
454 | " 0.283203 | \n",
455 | " 0.007305 | \n",
456 | " 0.022156 | \n",
457 | " 0.009483 | \n",
458 | "
\n",
459 | " \n",
460 | " 103663 | \n",
461 | " 0.283203 | \n",
462 | " 0.007305 | \n",
463 | " 0.022156 | \n",
464 | " 0.009483 | \n",
465 | "
\n",
466 | " \n",
467 | " 103664 | \n",
468 | " 0.000129 | \n",
469 | " 0.060547 | \n",
470 | " 0.058807 | \n",
471 | " 0.037109 | \n",
472 | "
\n",
473 | " \n",
474 | " 103665 | \n",
475 | " 0.020233 | \n",
476 | " 0.999512 | \n",
477 | " 0.011406 | \n",
478 | " 0.019058 | \n",
479 | "
\n",
480 | " \n",
481 | "
\n",
482 | "
"
483 | ],
484 | "text/plain": [
485 | " Normal Opacity Cardiomegaly Nodule\n",
486 | "Accession Number \n",
487 | "103661 0.113953 0.007305 0.022156 0.009483\n",
488 | "103662 0.283203 0.007305 0.022156 0.009483\n",
489 | "103663 0.283203 0.007305 0.022156 0.009483\n",
490 | "103664 0.000129 0.060547 0.058807 0.037109\n",
491 | "103665 0.020233 0.999512 0.011406 0.019058"
492 | ]
493 | },
494 | "execution_count": 10,
495 | "metadata": {},
496 | "output_type": "execute_result"
497 | }
498 | ],
499 | "source": [
500 | "proba_labels.tail()"
501 | ]
502 | },
503 | {
504 | "cell_type": "markdown",
505 | "metadata": {},
506 | "source": [
507 | "Examine a few binary predictions - these override to manual labels when available:"
508 | ]
509 | },
510 | {
511 | "cell_type": "code",
512 | "execution_count": 11,
513 | "metadata": {},
514 | "outputs": [
515 | {
516 | "data": {
517 | "text/html": [
518 | "\n",
519 | "\n",
532 | "
\n",
533 | " \n",
534 | " \n",
535 | " | \n",
536 | " Normal | \n",
537 | " Opacity | \n",
538 | " Cardiomegaly | \n",
539 | " Nodule | \n",
540 | "
\n",
541 | " \n",
542 | " Accession Number | \n",
543 | " | \n",
544 | " | \n",
545 | " | \n",
546 | " | \n",
547 | "
\n",
548 | " \n",
549 | " \n",
550 | " \n",
551 | " 103661 | \n",
552 | " 0 | \n",
553 | " 0 | \n",
554 | " 0 | \n",
555 | " 0 | \n",
556 | "
\n",
557 | " \n",
558 | " 103662 | \n",
559 | " 0 | \n",
560 | " 0 | \n",
561 | " 0 | \n",
562 | " 0 | \n",
563 | "
\n",
564 | " \n",
565 | " 103663 | \n",
566 | " 0 | \n",
567 | " 0 | \n",
568 | " 0 | \n",
569 | " 0 | \n",
570 | "
\n",
571 | " \n",
572 | " 103664 | \n",
573 | " 0 | \n",
574 | " 0 | \n",
575 | " 0 | \n",
576 | " 0 | \n",
577 | "
\n",
578 | " \n",
579 | " 103665 | \n",
580 | " 0 | \n",
581 | " 1 | \n",
582 | " 0 | \n",
583 | " 0 | \n",
584 | "
\n",
585 | " \n",
586 | "
\n",
587 | "
"
588 | ],
589 | "text/plain": [
590 | " Normal Opacity Cardiomegaly Nodule\n",
591 | "Accession Number \n",
592 | "103661 0 0 0 0\n",
593 | "103662 0 0 0 0\n",
594 | "103663 0 0 0 0\n",
595 | "103664 0 0 0 0\n",
596 | "103665 0 1 0 0"
597 | ]
598 | },
599 | "execution_count": 11,
600 | "metadata": {},
601 | "output_type": "execute_result"
602 | }
603 | ],
604 | "source": [
605 | "binary_labels.tail()"
606 | ]
607 | },
608 | {
609 | "cell_type": "markdown",
610 | "metadata": {},
611 | "source": [
612 | "You can examine individual report predictions; here are report and predictions for a report that manual reviewers coded as `Normal`:"
613 | ]
614 | },
615 | {
616 | "cell_type": "code",
617 | "execution_count": 12,
618 | "metadata": {},
619 | "outputs": [
620 | {
621 | "name": "stdout",
622 | "output_type": "stream",
623 | "text": [
624 | " Comparison: None. Indication: XXXX, chest pain and XXXX x2 weeks. Findings: The cardiomediastinal silhouette and pulmonary vasculature are within normal limits in size. The lungs are clear of focal airspace disease, pneumothorax, or pleural effusion. There are no acute bony findings. Impression: No acute cardiopulmonary findings. \n",
625 | "\n",
626 | "\n",
627 | "Normal 0.969727\n",
628 | "Opacity 0.001776\n",
629 | "Cardiomegaly 0.000642\n",
630 | "Nodule 0.000948\n",
631 | "Name: 101700, dtype: float64\n",
632 | "\n",
633 | "\n",
634 | "Normal 1\n",
635 | "Opacity 0\n",
636 | "Cardiomegaly 0\n",
637 | "Nodule 0\n",
638 | "Name: 101700, dtype: int32\n"
639 | ]
640 | }
641 | ],
642 | "source": [
643 | "#normal report\n",
644 | "print(CXRAnnotator.df_data['Report Text'].iloc[1700])\n",
645 | "print(\"\\n\")\n",
646 | "print(proba_labels.iloc[1700])\n",
647 | "print(\"\\n\")\n",
648 | "print(binary_labels.iloc[1700])"
649 | ]
650 | },
651 | {
652 | "cell_type": "markdown",
653 | "metadata": {},
654 | "source": [
655 | "Here are report and predictions for a report that manual reviewers coded as positive for `Cardiomegaly`:"
656 | ]
657 | },
658 | {
659 | "cell_type": "code",
660 | "execution_count": 13,
661 | "metadata": {},
662 | "outputs": [
663 | {
664 | "name": "stdout",
665 | "output_type": "stream",
666 | "text": [
667 | " Comparison: PA and lateral chest x-XXXX dated XXXX. Indication: XXXX-year-old female with chest pain. Findings: The heart size is enlarged. Tortuous aorta. Otherwise the mediastinal contour is within normal limits. The lungs are free of any focal infiltrates. There are no nodules or masses. No visible pneumothorax. No visible pleural fluid. The XXXX are grossly normal. There is no visible free intraperitoneal air under the diaphragm. Impression: 1. Cardiomegaly without lung infiltrates. \n",
668 | "\n",
669 | "\n",
670 | "Normal 0.008018\n",
671 | "Opacity 0.001008\n",
672 | "Cardiomegaly 0.981445\n",
673 | "Nodule 0.056152\n",
674 | "Name: 102100, dtype: float64\n",
675 | "\n",
676 | "\n",
677 | "Normal 0\n",
678 | "Opacity 0\n",
679 | "Cardiomegaly 1\n",
680 | "Nodule 0\n",
681 | "Name: 102100, dtype: int32\n"
682 | ]
683 | }
684 | ],
685 | "source": [
686 | "print(CXRAnnotator.df_data['Report Text'].iloc[2100])\n",
687 | "print(\"\\n\")\n",
688 | "print(proba_labels.iloc[2100])\n",
689 | "print(\"\\n\")\n",
690 | "print(binary_labels.iloc[2100])"
691 | ]
692 | },
693 | {
694 | "cell_type": "markdown",
695 | "metadata": {},
696 | "source": [
697 | "Here are report and predictions for a report that manual reviewers coded as positive for `Opacity`:"
698 | ]
699 | },
700 | {
701 | "cell_type": "code",
702 | "execution_count": 14,
703 | "metadata": {},
704 | "outputs": [
705 | {
706 | "name": "stdout",
707 | "output_type": "stream",
708 | "text": [
709 | " Comparison: XXXX, XXXX Indication: XXXX-year-old XXXX with chest pain. Findings: The heart size is stable. The aorta is ectatic and atherosclerotic but stable. XXXX sternotomy XXXX are again noted. The scarring in the left lower lobe is again noted and unchanged from prior exam. There are mild bilateral prominent lung interstitial opacities consistent with emphysematous disease. The calcified granulomas are stable. Impression: 1. Changes of emphysema and left lower lobe scarring, both stable. 2. Unchanged degenerative and atherosclerotic changes of the thoracic aorta. \n",
710 | "\n",
711 | "\n",
712 | "Normal 0.000000\n",
713 | "Opacity 0.981445\n",
714 | "Cardiomegaly 0.125977\n",
715 | "Nodule 0.234497\n",
716 | "Name: 102770, dtype: float64\n",
717 | "\n",
718 | "\n",
719 | "Normal 0\n",
720 | "Opacity 1\n",
721 | "Cardiomegaly 0\n",
722 | "Nodule 0\n",
723 | "Name: 102770, dtype: int32\n"
724 | ]
725 | }
726 | ],
727 | "source": [
728 | "#opacity\n",
729 | "print(CXRAnnotator.df_data['Report Text'].iloc[2770])\n",
730 | "print(\"\\n\")\n",
731 | "print(proba_labels.iloc[2770])\n",
732 | "print(\"\\n\")\n",
733 | "print(binary_labels.iloc[2770])"
734 | ]
735 | }
736 | ],
737 | "metadata": {
738 | "kernelspec": {
739 | "display_name": "Python 3",
740 | "language": "python",
741 | "name": "python3"
742 | },
743 | "language_info": {
744 | "codemirror_mode": {
745 | "name": "ipython",
746 | "version": 3
747 | },
748 | "file_extension": ".py",
749 | "mimetype": "text/x-python",
750 | "name": "python",
751 | "nbconvert_exporter": "python",
752 | "pygments_lexer": "ipython3",
753 | "version": "3.6.2"
754 | }
755 | },
756 | "nbformat": 4,
757 | "nbformat_minor": 2
758 | }
759 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # RadReportAnnotator
2 |
3 | Authors: jrzech, eko
4 |
5 | Provides a library of methods for automatically inferring labels for a corpus of radiological reports given a set of manually-labeled data. These methods are described in our publication [Natural Language–based Machine Learning Models for the Annotation of Clinical Radiology Reports](https://doi.org/10.1148/radiol.2018171093).
6 |
7 | ## Getting Started:
8 |
9 | To configure your own local instance (assumes [Anaconda is installed](https://www.anaconda.com/download/)):
10 |
11 | ```
12 | git clone https://www.github.com/aisinai/rad-report-annotator.git
13 | cd rad-report-annotator
14 | conda env create -f environment.yml
15 | source activate rad_env
16 | python -m ipykernel install --user --name rad_env --display-name "Python (rad_env)"
17 | ```
18 |
19 | *Note as of Oct 11, 2022: this conda environment builds on Linux and Windows, but not on Mac as older versions of gensim for Mac are not available in conda-forge.*
20 |
21 | To see a demo of the library on data from the [Indiana University Chest X-ray Dataset (Demner-Fushman et al.)](https://www.ncbi.nlm.nih.gov/pubmed/26133894), please open `Demo Notebook.ipynb` and run all cells.
22 |
23 |
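Once the environment is built, the core workflow looks like this - a minimal sketch mirroring `Demo Notebook.ipynb` (the paths and column names below match the bundled `pseudodata` demo; substitute your own):

```
import RadReportAnnotator as ra

annotator = ra.RadReportAnnotator(report_dir_path="pseudodata/reports",
                                  validation_file_path="pseudodata/labels/labeled_reports.xlsx")
annotator.define_config(DO_BOW=True, N_GRAM_SIZES=[1, 2, 3],
                        NAME_UNID_REPORTS="ACCID", NAME_TEXT_REPORTS="REPORT")
annotator.build_corpus()
binary_labels, proba_labels = annotator.infer_labels()
```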
--------------------------------------------------------------------------------
/RadReportAnnotator.py:
--------------------------------------------------------------------------------
1 | """
2 | RadReportAnnotator
3 | Authors: jrzech, eko
4 |
5 | This is a library of methods for automatically inferring labels for a corpus of radiological reports given a set of manually-labeled data.
6 |
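Example usage (a minimal sketch mirroring Demo Notebook.ipynb; paths below are placeholders):

    import RadReportAnnotator as ra
    annotator = ra.RadReportAnnotator(report_dir_path="path/to/reports",
                                      validation_file_path="path/to/labeled_reports.xlsx")
    annotator.define_config(DO_BOW=True, N_GRAM_SIZES=[1, 2, 3])
    annotator.build_corpus()
    binary_labels, proba_labels = annotator.infer_labels()
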
7 | """
8 |
9 | #usual imports for data science
10 | import numpy as np
11 | import pandas as pd
12 | import sys
13 | import os
14 | import math
15 | from tqdm import tqdm
16 |
17 | #sklearn
18 | from sklearn.model_selection import StratifiedKFold
19 | from sklearn.model_selection import GridSearchCV
20 | from sklearn.linear_model import LogisticRegression
21 | from sklearn.metrics import roc_auc_score
22 |
23 | #NLP imports
24 | from nltk.tokenize import RegexpTokenizer
25 | from nltk.stem.porter import PorterStemmer
26 | import re
27 |
28 | #gensim for word embedding featurization
29 | import gensim
30 | from collections import namedtuple
31 |
32 | #misc
33 | import glob
34 | import os.path
35 | import multiprocessing
36 | import random
37 |
38 |
39 | def join_montage_files(data_dir,NAME_UNID_REPORTS, NAME_TEXT_REPORTS):
40 | """
41 | Joins several montage files in excel format into a single pandas dataframe
42 | Args:
43 | data_dir: a filepath pointing to a directory containing montage files in excel format
44 | NAME_UNID_REPORTS: column name of unique id / accession id in reports xlsx
45 | NAME_TEXT_REPORTS: column name of report text in reports xlsx
46 | Returns:
47 | df_data: a pandas dataframe containing texts from montage
48 | """
49 | print("building pre-corpus")
50 | datafiles = os.listdir(data_dir)
51 | df_data = pd.read_excel(os.path.join(data_dir,datafiles[0]))
52 | datafiles.remove(datafiles[0])
53 | for subcorpus in datafiles:
54 | df_data = df_data.append(pd.read_excel(os.path.join(data_dir,subcorpus)))
55 | print('pre-corpus built')
56 |
57 | df_data.rename(columns={NAME_UNID_REPORTS:'Accession Number',NAME_TEXT_REPORTS:'Report Text'},inplace=True)
58 | return df_data
59 |
60 | def preprocess_data(df_data, en_stop, stem_words=True):
61 | """
62 |     Takes a dataframe of montage files and a list of stop words and returns processed_reports,
63 |     a list in which each entry is a list of stemmed unigrams for one report.
64 |     Args:
65 |         df_data: a dataframe of joined montage files.
66 |         en_stop: a list of english stop words
67 |         stem_words: argument indicating whether or not to stem words
68 |     Returns:
69 |         processed_reports: a list of lists of stemmed words, one per text within the montage dataframe
70 | """
71 | if(stem_words==False):
72 | print("NOTE - NOT STEMMING")
73 | p_stemmer = PorterStemmer()
74 | processed_reports = []
75 | accession_index=[]
76 |
77 | print("preprocessing reports")
78 | for i in tqdm(range(0,df_data.shape[0])):
79 |
80 | tokenizer = RegexpTokenizer(r'\w+')
81 | process = df_data['Report Text'].iloc[i]
82 |
83 | process = str(process)
84 |         process = process + "..." # append trailing periods in case the final sentence is missing one
85 | process = process.lower()
86 |
89 |         #remove line breaks; normalize punctuation and whitespace
90 |         process=process.replace("^M", " ")
91 |         process=process.replace("\n", " ")
92 |         process=process.replace("\r", " ")
93 |         process=process.replace("_", " ")
94 |         process=process.replace("-", " ")
95 |         process=process.replace(",", " , ")
96 |         process=process.replace("  ", " ") #collapse runs of spaces (applied repeatedly)
97 |         process=process.replace("  ", " ")
98 |         process=process.replace("  ", " ")
99 |         process=process.replace("  ", " ")
100 |         process=process.replace("  ", " ")
101 |
102 | process = re.sub(r'\d+', '',process)
103 | process=process.replace(".", " SENTENCEEND ") # create end characters
104 |
105 | process_tokenized = tokenizer.tokenize(process)
106 | process_stopped = [i for i in process_tokenized if not i in en_stop]
107 |
108 | if(stem_words==True):
109 | process_stemmed = [p_stemmer.stem(i) for i in process_stopped]
110 | else:
111 | process_stemmed = process_stopped
112 |
113 | processed_reports.append(process_stemmed)
115 | return processed_reports
116 |
117 | def remove_infrequent_tokens(processed_reports,freq_threshold,labeled_indices):
118 | """
119 |     Takes a list of processed_reports and removes infrequent tokens (defined as occurring < freq_threshold times) from them.
120 | Args:
121 | processed_reports: list of lists of stemmed words after initial processing, where each entry corresponds to a report
122 |         freq_threshold: count threshold; remove words occurring < freq_threshold times from the corpus. note - considers only the unlabeled corpus, not the labeled corpus, to avoid peeking into labeled data.
123 | labeled_indices: indices of processed_reports that are labeled reports - these are excluded from frequency calculations.
124 | Returns:
125 | process_reports_postcountfilter: list of lists of stemmed words after initial processing, where each entry corresponds to a report, after low frequency words have been removed
126 | """
127 | word_count = common_stems(processed_reports, labeled_indices)
128 | d = dict((k,v) for k, v in word_count.items() if v >= freq_threshold)
129 | process_reports_postcountfilter=[[] for x in range(0,len(processed_reports))]
130 | for i in range(0,len(processed_reports)):
131 | for token in processed_reports[i]:
132 | if token in d:
133 | process_reports_postcountfilter[i].append(token)
134 | return process_reports_postcountfilter
135 |
136 | def create_ngrams(processed_reports, labeled_indices, N_GRAM_SIZES, freq_threshold):
137 | """
138 |     Takes a processed_reports list, a list of n-gram sizes, and a frequency threshold; creates n-grams,
139 |     removes n-grams occurring < freq_threshold times, and removes n-grams that span the end-of-sentence token.
140 | Args:
141 | processed_reports: a list of text lists of stemmed unigrams ready for conversion into n-gram text lists
142 |         labeled_indices: exclude these from calculation of the n-gram frequency cutoff
143 | N_GRAM_SIZES: a list of ints specifying the n-gram sizes to include in the texts of the future corpus
144 | freq_threshold: the frequency threshold for n-gram inclusions. N-grams that occur with frequency < threshold will be removed from corpus
145 | Returns:
146 | processed_outputs_clean: a list of text lists of n-grams that are ready to be processed into a corpus
147 | """
148 | processed_outputs = []
149 | print("creating n-grams")
150 | for report in tqdm(processed_reports[:]):
151 | new_report = []
152 | end=len(report)
153 | #CREATES 4-grams - for all n-grams, we don't allow "no" to be in middle of n-gram, don't allow sentenceend token to be in n-gram
154 | if 4 in N_GRAM_SIZES:
155 | for i in range (0,end-3):
156 | if (report[i+1] != "no" and report[i+2] != "no" and report[i+3] != "no" and report[i].lower()!= "sentenceend" and report[i+1].lower()!= "sentenceend" and report[i+2].lower()!= "sentenceend" and report[i+3]!= "sentenceend"): #no only at beginning
157 | new_report.append(report[i] +"_" +report[i+1] + "_" + report[i+2] + "_" + report[i+3])
158 | #CREATES 3-grams
159 | if 3 in N_GRAM_SIZES:
160 | for i in range (0,end-2):
161 | if (report[i+1] != "no" and report[i+2] != "no" and report[i].lower()!= "sentenceend" and report[i+1].lower()!= "sentenceend" and report[i+2].lower()!= "sentenceend"): #no only at beginning
162 | new_report.append(report[i] +"_" +report[i+1] + "_" + report[i+2])
163 | #CREATES 2-grams
164 | if 2 in N_GRAM_SIZES:
165 | for i in range (0,end-1):
166 | if (report[i+1] != "no" and report[i].lower()!= "sentenceend" and report[i+1].lower()!= "sentenceend"): #no only at beginning
167 | new_report.append(report[i] +"_" +report[i+1])
168 | #CREATES unigrams
169 | if 1 in N_GRAM_SIZES:
170 | for i in range (0,end):
171 | if(report[i].lower()!= "sentenceend" and report[i]!= "no"): # we take no out as a unigram in bow
172 | new_report.append(report[i])
173 | processed_outputs.append(new_report)
174 |
175 | #remove low freq tokens
176 | word_count = common_stems(processed_outputs, labeled_indices)
177 | print("number of unique n-grams:", len(word_count))
178 | d = dict((k,v) for k, v in word_count.items() if v >= freq_threshold)
179 | print("number of unique n-grams after filtering out low frequency tokens:", len(d))
180 |
181 | #remove tokens that occurred infrequently from processed_outputs --> processed_outputs_clean
182 | processed_outputs_clean=[[] for x in range(0,len(processed_outputs))]
183 | for i in range(0,len(processed_outputs)):
184 | for token in processed_outputs[i]:
185 | if token in d:
186 | processed_outputs_clean[i].append(token)
187 | return processed_outputs_clean
188 |
189 | def get_labeled_indices(df_data,validation_file,TRAIN_INDEX_OVERRIDE):
190 | """
191 | Returns numerical indices of reports in df_data for which we have labeled data in validation_file; will set labeled reports as unlabeled if in TRAIN_INDEX_OVERRIDE
192 | Args:
193 | df_data: dataframe containing report text and accession ids
194 | validation_file: dataframe containing accession ids and labels
195 | TRAIN_INDEX_OVERRIDE: list of numerical indices to treat as unlabeled; necessary to train d2v model if all your data is labeled as it uses exclusively unlabeled data to train to avoid peeking into labeled data
196 | Returns:
197 | return_indices: indices we treat as labeled
198 | """
199 | validation = pd.read_excel(validation_file)
200 | validation.set_index('Accession Number')
201 | validation_cases=validation['Accession Number'].tolist()
202 | all_indices = df_data['Accession Number'].tolist()
203 | return_indices=[]
204 | for i in all_indices:
205 |         if i in validation_cases and i not in TRAIN_INDEX_OVERRIDE: # if something is manually overridden to be in train, don't put it in test
206 | return_indices.append(True)
207 | else:
208 | return_indices.append(False)
209 | return return_indices
210 |
211 | def common_stems(ngram_list, exclude_indices):
212 | """
213 |     Takes a list of n-gram lists, ngram_list, and returns the count of each n-gram as a dict of
214 |     ngram:ngram_count pairs, ignoring reports at exclude_indices.
215 |     Args:
216 |         ngram_list: list of all n_grams
217 |         exclude_indices: rows to ignore when doing the count (labeled data)
218 | Returns:
219 | word_count: dict of ngram:ngram_count pairs
220 | """
221 | word_count={}
222 | i=0
223 | excluded=0
224 | for entry in ngram_list:
225 | if exclude_indices[i]==False:
226 | for word in entry:
227 | if word not in word_count:
228 | #add word with entry 1
229 | word_count[word] = 1
230 | else:
231 | #increment entry by 1
232 | word_count[word]=word_count[word]+1
233 | else:
234 | excluded=excluded+1
235 | i=i+1
237 |
238 | return word_count
239 |
240 |
241 | def build_train_test_corpus(df_data, ngram_list, labeled_filepath,TRAIN_INDEX_OVERRIDE):
242 | """
243 |     Takes the report dataframe, the ngram_list, and a filepath pointing to a labeled spreadsheet,
244 |     and builds a labeled_corpus consisting of labeled data and an unlabeled_corpus
245 |     of unlabeled data.
246 | Args:
247 | df_data: a dataframe consisting of the original set of excel files with report text and accession id
248 | ngram_list: list of all n-grams in corpus
249 | labeled_filepath: path to file containing accession ids and labels
250 | TRAIN_INDEX_OVERRIDE: indices to treat as unlabeled data regardless of presence of labels.
251 |     Returns:
252 |         corpus: the full corpus (labeled and unlabeled) in gensim bag-of-words format; unlabeled_corpus: its unlabeled subset, used for feature construction
253 |         labeled_corpus: the labeled subset of corpus, held out for Lasso model training and validation
254 |         dictionary: a gensim dictionary composed of the input n-grams
255 |         labeled_indices: the numerical indices of the labeled (validation) reports
256 | """
257 | dictionary = gensim.corpora.Dictionary(ngram_list)
258 | corpus = [dictionary.doc2bow(input) for input in ngram_list]
259 | if(not labeled_filepath is None):
260 | outcomes = pd.read_excel(labeled_filepath)
261 | outcomes.set_index('Accession Number')
262 | labeled_cases=outcomes['Accession Number'].tolist()
263 | else:
264 | labeled_cases=[]
265 | labeled_indices = []
266 | not_labeled_indices = []
267 | train_data_lda = np.ones(df_data.shape[0],dtype=bool)
268 | num_removed=0
269 | for i in range(0,df_data.shape[0]):
270 | if df_data['Accession Number'].iloc[i] in labeled_cases and df_data['Accession Number'].iloc[i] not in TRAIN_INDEX_OVERRIDE:
271 | train_data_lda[i]=False
272 | labeled_indices.append(i)
273 | num_removed += 1
274 | else:
275 | not_labeled_indices.append(i)
276 | unlabeled_corpus = [corpus[i] for i in not_labeled_indices]
277 | labeled_corpus = [corpus[i] for i in labeled_indices]
278 |
279 | return corpus, unlabeled_corpus, labeled_corpus, dictionary, labeled_indices
280 |
281 |
282 | def build_d2v_corpora(df_data,d2v_inputs,labeled_indices):
283 | """
284 | Build corpora in format for doc2vec gensim implementation
285 | Args:
286 | df_data: a dataframe consisting of the original set of excel files with report text and accession id
287 | d2v_inputs: list of lists of tokens, where each entry in d2v_inputs corresponds to a report
288 | labeled_indices: indices of labeled reports (and those we treat as labeled due to TRAIN_INDEX_OVERRIDE)
289 | Returns:
290 | unlabeled_corpus: a corpus consisting of unlabelled texts that will be used for feature construction
291 | labeled_corpus: a corpus consisting of labelled held-out texts that will be used for Lasso regression training
292 | total_unlabeled_words: count of total words in unlabeled corpus
293 | """
294 |
295 | SentimentDocument = namedtuple('SentimentDocument', 'words tags')
296 | unlabeled_docs = []
297 | labeled_docs = []
298 | total_unlabeled_words=0
299 | i=0
300 | for line in d2v_inputs:
301 | words = line # [x for x in line if x != 'END']
302 | tags = '' + str(df_data['Accession Number'].iloc[i])
303 | if(i in labeled_indices):
304 | labeled_docs.append(SentimentDocument(words,tags))
305 | else:
306 | unlabeled_docs.append(SentimentDocument(words,tags))
307 | total_unlabeled_words+=len(words)
308 | i+=1
309 |
310 | print('%d unlabeled reports for featurization, %d labeled reports for modeling' % (len(unlabeled_docs), len(labeled_docs)))
311 | return unlabeled_docs, labeled_docs, total_unlabeled_words
312 |
313 |
314 | def train_d2v(unlabeled_docs, labeled_docs, D2V_EPOCH, DIM_DOC2VEC, W2V_DM, W2V_WINDOW, total_unlabeled_words):
315 | """
316 | Train doc2vec/word2vec model.
317 |
318 | Args:
319 | unlabeled_docs: unlabeled corpus
320 | labeled_docs: labeled corpus
321 |         D2V_EPOCH: number of epochs to train the d2v model; 20 has worked well in our experiments; parameter for gensim doc2vec
322 | DIM_DOC2VEC: dimensionality of embedding vectors, we explored values 50-800; parameter for gensim doc2vec
323 | W2V_DM: 1 is PV-DM, otherwise PV-DBOW; parameter for gensim doc2vec
324 | W2V_WINDOW: number of words window to use in doc2vec model; parameter for gensim doc2vec
325 | total_unlabeled_words: total words in unlabeled corpus; argument for gensim doc2vec
326 |
327 | Returns:
328 | d2vmodel: trained doc2vec model.
329 | """
330 |
331 | cores = multiprocessing.cpu_count()
332 |     assert gensim.models.doc2vec.FAST_VERSION > -1, "gensim's compiled doc2vec is unavailable; training would be extremely slow"
333 | print("started doc2vec training")
334 | d2vmodel = gensim.models.Doc2Vec(dm=W2V_DM, size=DIM_DOC2VEC, window=W2V_WINDOW, negative=5, hs=0, min_count=2, workers=cores)
335 | d2vmodel.build_vocab(unlabeled_docs + labeled_docs)
336 | d2vmodel.train(unlabeled_docs, total_words=total_unlabeled_words, epochs=D2V_EPOCH)
337 | print("finished doc2vec training")
338 | return d2vmodel
339 |
340 | def calc_auc(predictor_matrix,eligible_outcomes_aligned, all_outcomes_aligned,N_LABELS, pred_type, header,ASSIGNFOLD_USING_ROW=False):
341 | """
342 | Train Lasso models using 60% of labeled data with generated features and labels; calculate AUC, accuracy,
343 | confusion matrix for each label on remaining 40% of labeled data.
344 |
345 | Args:
346 |
347 | predictor_matrix: numpy matrix of features available to use as input to Lasso logistic regression
348 | eligible_outcomes_aligned: dataframe of labels we are predicting
349 | all_outcomes_aligned: dataframe of all labels, including those we excluded due to infrequent positive/negative occurences - we use it for accession id
350 | N_LABELS: total number of labels we are predicting
351 | pred_type: label indicating what variables went into predictor_matrix
353 | header: header for predictor matrix
354 |         ASSIGNFOLD_USING_ROW: normally the 60/40 split is done randomly; you can fix it to use the first 60% of rows if you need replicability,
355 |             but be wary of introducing distortion into the train/test split with dates, etc.: we recommend randomly sorting
356 |             rows in excel beforehand if you opt for this.
357 |
358 | Returns:
359 |
360 |         lasso_models: dict of trained lasso logistic regression models from sklearn, keyed by the relative index of the corresponding column of eligible_outcomes_aligned
361 | """
362 |
363 | if predictor_matrix.shape[1]!=len(header):
364 | print("predictor_matrix.shape[1]="+str(predictor_matrix.shape[1]))
365 | print("len(header)"+str(len(header)))
366 | raise ValueError("predictor_matrix shape doesn't match header, investigate")
367 | all_coef = pd.concat([ pd.DataFrame(header)], axis = 1)
368 |
369 | lasso_models={}
370 | model_types = ["Lasso"]
371 |
372 | r = list(range(eligible_outcomes_aligned.shape[0]))
373 | random.shuffle(r)
374 |
375 | if(ASSIGNFOLD_USING_ROW):
376 | assignfold = pd.DataFrame(data=list(range(eligible_outcomes_aligned.shape[0])), columns=['train'])
377 | else:
378 | assignfold = pd.DataFrame(data=r, columns=['train'])
379 |
380 | cutoff = np.floor(0.6*eligible_outcomes_aligned.shape[0])
381 |
382 |     train=assignfold['train']<cutoff
383 |     test=assignfold['train']>=cutoff
384 |
385 | N_TRAIN=eligible_outcomes_aligned.ix[train,:].shape[0]
386 | N_HELDOUT=eligible_outcomes_aligned.ix[test,:].shape[0]
387 | print("n_train in modeling="+str(N_TRAIN))
388 | print("n_test in modeling="+str(N_HELDOUT))
389 |
390 | confusion = pd.DataFrame(data=np.zeros(shape=(eligible_outcomes_aligned.shape[1]*len(model_types),6),dtype=np.int),columns=['Label (with calcs on held out 40 pct)','AUC','True +','False +','True -','False -'])
391 |
392 | resultrow=0
393 | for i in range(0,N_LABELS):
394 | PROCEED=True;
395 | #need to make sure we don't have an invalid setting -- ie, a train[x] set of labels that is uniform, else Lasso regression fails
396 | if(len(set(eligible_outcomes_aligned.ix[train,i].tolist())))==1:
397 | PROCEED=False;
398 | raise ValueError ("fed label to lasso regression with no variation - cannot compute - please investigate data")
399 |
400 | if(PROCEED):
401 |
402 | for model_type in model_types:
403 | if(model_type=="Lasso"):
404 | parameters = { "penalty": ['l1'],
405 | "C": [64,32,16,8,4,2,1,0.5,0.25,0.1,0.05,0.025,0.01,0.005]
406 | }
407 | try:
408 | cv = StratifiedKFold(n_splits=5)
409 | grid_search = GridSearchCV(LogisticRegression(), param_grid=parameters, scoring='neg_log_loss', cv=cv)
410 | grid_search.fit(predictor_matrix[train,:],np.array(eligible_outcomes_aligned.ix[train,i]))
411 | best_parameters0 = grid_search.best_estimator_.get_params()
412 | model0 = LogisticRegression(**best_parameters0)
413 | except:
414 | raise ValueError ("error in lasso regression - likely data issue, may involve rare labels - please investigate data")
415 | model0.fit(predictor_matrix[np.array(train),:],eligible_outcomes_aligned.ix[train,i])
416 | pred0=model0.predict_proba(predictor_matrix[np.array(test),:])[:,1]
417 | coef = pd.concat([ pd.DataFrame(header),pd.DataFrame(np.transpose(model0.coef_))], axis = 1)
418 | df0 = pd.DataFrame({'predict':pred0,'target':eligible_outcomes_aligned.ix[test,i], 'label':all_outcomes_aligned['Accession Number'][test]})
419 |
420 | calc_auc=roc_auc_score(np.array(df0['target']),np.array(df0['predict']))
421 | if(i%10==0):
422 | print("i="+str(i))
423 | save_name=str(list(eligible_outcomes_aligned.columns.values)[i])
424 |
425 | target_predicted=''.join(e for e in save_name if e.isalnum())
426 |
427 | #confusion: outcome TP TN FP FN
428 | thresh = np.mean(df0['target'])
429 | FP=0
430 | FN=0
431 | TP=0
432 | TN=0
433 | for j in df0.index:
434 | cpred=df0.ix[j][1]
435 | ctarget = df0.ix[j][2]
436 |
437 | if cpred>=thresh and ctarget==1:
438 | TP+=1
439 |                     if cpred<thresh and ctarget==1:
440 |                         FN+=1
441 |                     if cpred>=thresh and ctarget==0:
442 | FP+=1
443 |                     if cpred<thresh and ctarget==0:
444 |                         TN+=1
557 |             if(real_words>0): temp_avg = np.divide(temp_avg,m_avg) #if vector was empty, just leave it zero
558 |
559 | for k in range(0,DIM_DOC2VEC):
560 | w2v_matrix[j,k]=temp_avg[k]
561 |
562 | j+=1
563 | return bow_matrix, pv_matrix,w2v_matrix,accid_list,orig_text,orig_input
564 |
565 | def generate_wholeset_features(DIM_DOC2VEC,
566 | processed_reports,
567 | DO_PARAGRAPH_VECTOR,
568 | DO_WORD2VEC,
569 | dictionary,
570 | corpus,
571 | d2vmodel,
572 | d2v_inputs):
573 | """
574 | Generate numerical features to be used in Lasso logistic regressions using text data for all reports (labeled and unlabeled)
575 |
576 | Args:
577 |
578 | DIM_DOC2VEC: embedding dimensionality of doc2vec
579 | processed_reports: list of list of words, each entry in original list corresponding to a report
580 | DO_PARAGRAPH_VECTOR: use paragraph vector features?
581 | DO_WORD2VEC: use average word embedding features?
582 |         dictionary: a gensim dictionary composed of the input n-grams
583 | corpus: corpus with both unlabeled and labeled data, list of lists
584 | d2vmodel: trained doc2vec model object
585 | d2v_inputs: reports processed into d2v input format
586 |
587 | Returns:
588 |
589 | bow_matrix: numpy matrix with indicator bow features (1 if word present, 0 else), each row corresponds to a report
590 | pv_matrix: numpy matrix with paragraph vector embedding features, each row corresponds to a report
591 | w2v_matrix: numpy matrix with average word embedding features, each row corresponds to a report
592 | """
593 |
594 | bow_matrix = np.zeros(shape=(len(corpus),len(dictionary)),dtype=np.int)
595 | pv_matrix = np.zeros(shape=(len(corpus),DIM_DOC2VEC),dtype=np.float64)
596 | w2v_matrix = np.zeros(shape=(len(corpus),DIM_DOC2VEC),dtype=np.float64)
597 |
598 | j=0
599 | for i in tqdm(range(0,len(corpus))):
600 |
601 | #fill feature columns - if ngram shows up in the document, mark it as 1, else leave as 0
602 | for k in range(0,len(corpus[i])):
603 | bow_matrix[j][corpus[i][k][0]]=1
604 |
605 | if(DO_PARAGRAPH_VECTOR):
606 | vect = d2vmodel.infer_vector(d2v_inputs[i],alpha=0.01, steps=50)
607 |
608 | for k in range(0,len(vect)):
609 | pv_matrix[j,k]=vect[k]
610 |
611 | if(DO_WORD2VEC):
612 |
613 | #we want to use vectors based on word average:
614 | temp_avg =np.zeros(shape=(DIM_DOC2VEC),dtype=np.float64)
615 | m_avg=0
616 | real_words=0
617 | for k in range(0,len(d2v_inputs[i])):
618 |
619 | #ignore special end character, otherwise proceed
620 | if(d2v_inputs[i][k].lower()!="sentenceend"):
621 |                     real_words+=1 # count real (non-sentenceend) words so the average below is actually taken
622 | try:
623 | #if it can't find the word, zero it out
624 | weight_avg = 1
625 | temp_avg = np.add(temp_avg,weight_avg*d2vmodel[d2v_inputs[i][k]])
626 | m_avg +=weight_avg
627 | except:
628 | pass # do nothing
629 |
630 | if(real_words>0): temp_avg = np.divide(temp_avg,m_avg) #if vector was empty, just leave it zero
631 |
632 | for k in range(0,DIM_DOC2VEC):
633 | w2v_matrix[j,k]=temp_avg[k]
634 |
635 | j+=1
636 | return bow_matrix,pv_matrix,w2v_matrix
637 |
638 |
639 | def generate_outcomes(labeled_file,accid_list,N_THRESH_OUTCOMES):
640 | """
641 | Generate dataframe of labels to be used in Lasso logistic regressions
642 |
643 | Args:
644 | labeled_file: path to file with labels and accession ids
645 | accid_list: list of accession ids of each row in the labeled data that are also present in exported reports;
646 | needed to eliminate labeled reports for which we have no text (mistranscribed accession IDs, etc.)
647 | N_THRESH_OUTCOMES: eliminate outcomes that don't have this many positive / negative examples
648 |
649 | Returns:
650 |
651 | eligible_outcomes_aligned: dataframe of labels eligible for prediction
652 | all_outcomes_aligned: dataframe of all labels
653 | N_LABELS: total number of labels we predict
654 | outcome_header_list: list of headers corresponding to each label
655 | """
656 |
657 | outcomes = pd.read_excel(labeled_file)
658 | outcomes.set_index('Accession Number')
659 | outcomes_aligned2 = pd.DataFrame(data=accid_list, index=accid_list, columns=['Accession Number'])
660 | all_outcomes_aligned = pd.merge(outcomes_aligned2, outcomes, sort=False)
661 |
662 | #modify outcome matrix to only include outcomes with n_thresh_outcomes +/- observations
663 |
664 | outcome_remove=[]
665 | N_LABELS=all_outcomes_aligned.shape[1]
666 | print("total labels:"+str(N_LABELS))
667 | for i in range(0,N_LABELS):
668 | check=sum(all_outcomes_aligned.iloc[:,i])
669 |
670 |         if(check<N_THRESH_OUTCOMES):
671 |             outcome_remove.append(i)
672 |         elif(check>((all_outcomes_aligned.shape)[0]-N_THRESH_OUTCOMES)):
673 | outcome_remove.append(i)
674 | elif(math.isnan(check)):
675 | outcome_remove.append(i)
676 |
677 | eligible_outcomes_aligned=all_outcomes_aligned.drop(all_outcomes_aligned.columns[outcome_remove],axis=1)
678 |
679 | N_LABELS=eligible_outcomes_aligned.shape[1]
680 | print("labels eligible for inference:"+str(N_LABELS))
681 |
682 | outcome_header_list=list(eligible_outcomes_aligned)
683 | outcome_header_list=[x.replace(",",".") for x in outcome_header_list]
684 | outcome_header_list=",".join(outcome_header_list)
685 |
686 | return eligible_outcomes_aligned,all_outcomes_aligned, N_LABELS, outcome_header_list
687 |
688 |
689 | def write_silver_standard_labels(corpus,
690 | N_LABELS,
691 | eligible_outcomes_aligned,
692 | DIM_DOC2VEC,
693 | processed_reports,
694 | DO_BOW,
695 | DO_PARAGRAPH_VECTOR,
696 | DO_WORD2VEC,
697 | dictionary,
698 | d2vmodel,
699 | d2v_inputs,
700 | lasso_models,
701 | accid_list,
702 | labeled_indices,
703 | df_data,
704 | SILVER_THRESHOLD):
705 | """
706 | Generate inferred labels using trained Lasso regression models; override with hand-labeled data when available.
707 |
708 | Args:
709 |
710 | corpus: list of lists of tokens, each entry in original list corresponds to report
711 | N_LABELS: total labels we predict
712 | eligible_outcomes_aligned: dataframe of eligible labels for prediction
713 | DIM_DOC2VEC: embedding dimensionality of average word embedding features
714 | processed_reports: corpus of processed reports
715 | DO_BOW: include bag of words features?
716 | DO_PARAGRAPH_VECTOR: include paragraph vector features?
717 | DO_WORD2VEC: include average word embedding features?
718 | dictionary: dictionary mapping word to integer representation
719 |         d2vmodel: trained doc2vec model
720 | d2v_inputs: reports processed into doc2vec format
721 | lasso_models: list of saved Lasso logistic regression models, each index corresponds to a corresponding column in eligible_outcomes_aligned
722 | accid_list: list of accession ids of each row in the labeled data that are also present in exported reports
723 | labeled_indices: indices of labeled data (or data we treat as labeled because of TRAIN_INDEX_OVERRIDE)
724 | df_data: dataframe containing original reports and accession ids
725 | SILVER_THRESHOLD: "mean" or "fiftypct", defines threshold for converting probabilities to binary labels (mean of label vs. 50%).
726 | note that in either case it will be overridden with true labels when available
727 |
728 | Returns:
729 |
730 | pred_outcome_df: dataframe containing accession ids and inferred labels
731 |
732 | """
733 |
734 | pred_outcome_matrix_binary = np.zeros(shape=(len(corpus),N_LABELS),dtype=np.int)
735 | pred_outcome_matrix_proba = np.zeros(shape=(len(corpus),N_LABELS),dtype=np.float16)
736 |
737 | #we classify as true/false based on mean of predictor - note dependence on self.SILVER_THRESHOLD
738 | if(SILVER_THRESHOLD=="mean"):
739 | class_thresh = eligible_outcomes_aligned.mean(axis=0)
740 | elif(SILVER_THRESHOLD=="fiftypct"):
741 | class_thresh = [0.5]*eligible_outcomes_aligned.shape[1]
742 |
743 | for x in range(0,len(corpus),2000):
744 | #generate features for whole dataset so we can return inferred labels for deep learning on images themselves
745 | whole_bow_matrix,whole_pv_matrix,whole_w2v_matrix=generate_wholeset_features(
746 | DIM_DOC2VEC,
747 | processed_reports[x:x+2000],
748 | DO_PARAGRAPH_VECTOR,
749 | DO_WORD2VEC,dictionary,corpus[x:x+2000],d2vmodel,d2v_inputs[x:x+2000])
750 |
751 | #use everything available for prediction - done in chunks to avoid memory issues
752 | #whole_combined_matrix=np.hstack((whole_w2v_matrix,whole_bow_matrix,whole_pv_matrix))
753 | if(DO_BOW and DO_WORD2VEC and DO_PARAGRAPH_VECTOR): whole_combined_matrix=np.hstack((whole_bow_matrix,whole_w2v_matrix,whole_pv_matrix))
754 |
755 | if(DO_BOW and DO_WORD2VEC and not DO_PARAGRAPH_VECTOR): whole_combined_matrix=np.hstack((whole_bow_matrix,whole_w2v_matrix))
756 | if(DO_BOW and not DO_WORD2VEC and DO_PARAGRAPH_VECTOR): whole_combined_matrix=np.hstack((whole_bow_matrix,whole_pv_matrix))
757 | if(not DO_BOW and DO_WORD2VEC and DO_PARAGRAPH_VECTOR): whole_combined_matrix=np.hstack((whole_w2v_matrix,whole_pv_matrix))
758 |
759 | if(DO_BOW and not DO_WORD2VEC and not DO_PARAGRAPH_VECTOR): whole_combined_matrix=whole_bow_matrix
760 | if(not DO_BOW and DO_WORD2VEC and not DO_PARAGRAPH_VECTOR): whole_combined_matrix=whole_w2v_matrix
761 | if(not DO_BOW and not DO_WORD2VEC and DO_PARAGRAPH_VECTOR): whole_combined_matrix=whole_pv_matrix
762 |
763 | for i in range(0,N_LABELS):
764 | pred_proba=lasso_models[i].predict_proba(whole_combined_matrix)[:,1]
765 | pred_binary = (pred_proba > class_thresh[i]).astype(int)
766 | pred_outcome_matrix_proba[x:x+2000,i]=pred_proba
767 | pred_outcome_matrix_binary[x:x+2000,i]=pred_binary
768 |
769 | #generate list of accession #s for export
770 | accession_list = []
771 | for i in range(0,len(corpus)):
772 | accession_list.append(df_data['Accession Number'].iloc[i])
773 |
774 | pred_outcome_proba_df = pd.DataFrame(pred_outcome_matrix_proba, index = accession_list, columns = list(eligible_outcomes_aligned.columns.values) )
775 | pred_outcome_binary_df = pd.DataFrame(pred_outcome_matrix_binary, index = accession_list, columns = list(eligible_outcomes_aligned.columns.values) )
776 |
777 | #get accuracy by column
778 |
779 | outcome_lookup ={}
780 | for i in range(0,len(accid_list)):
781 | outcome_lookup[accid_list[i]]=i
782 |
783 | errors = np.zeros(shape=(N_LABELS,1),dtype=np.int)
784 | denom = np.zeros(shape=(N_LABELS,1),dtype=np.int)
785 | tp = np.zeros(shape=(N_LABELS,1),dtype=np.int)
786 | fp = np.zeros(shape=(N_LABELS,1),dtype=np.int)
787 | tn = np.zeros(shape=(N_LABELS,1),dtype=np.int)
788 | fn = np.zeros(shape=(N_LABELS,1),dtype=np.int)
789 |
790 | for i in range(0,len(corpus)):
791 | if i in labeled_indices: # need to evaluate
792 | #grab accession #
793 | accno = df_data['Accession Number'].iloc[i]
794 |
795 | for k in range(0,N_LABELS):
796 |
797 | #does our predicted value match the true one? if not, record discrepancy
798 | if(eligible_outcomes_aligned.ix[outcome_lookup[accno],k]!=pred_outcome_binary_df.iloc[i,k]):
799 | errors[k]+=1
800 | denom[k]+=1
801 |
802 | #set probabilistic predictions to labeled ones regardless
803 | pred_outcome_proba_df.iloc[i,k]=eligible_outcomes_aligned.ix[outcome_lookup[accno],k]
804 |
805 | #if disagreement btw pred and hand-labeled data, use hand labeled
806 | if(eligible_outcomes_aligned.ix[outcome_lookup[accno],k]!=pred_outcome_binary_df.iloc[i,k]):
807 | pred_outcome_binary_df.iloc[i,k]=eligible_outcomes_aligned.ix[outcome_lookup[accno],k]
808 |
809 | #print('classifier accuracy by label on all labeled data including that used to train it (process integrity check)')
810 | #print(str(1-(errors/denom)))
811 | pred_outcome_binary_df.set_index(df_data['Accession Number'],inplace=True)
812 | pred_outcome_proba_df.set_index(df_data['Accession Number'],inplace=True)
813 | return pred_outcome_binary_df,pred_outcome_proba_df
814 |
815 |
816 | def give_stop_words():
817 | """
818 | Returns list of stop words.
819 |
820 | Arguments:
821 |
822 | None
823 |
824 | Returns:
825 |
826 | stop_words: a list of stop words. note - we have removed stop words from this example; you can add them below if you have a list of stop words for your application.
827 | """
828 | #stop_words=["word1", "word2", ...]
829 | stop_words=[]
830 |
831 |
832 | return stop_words
833 |
834 |
835 | class RadReportAnnotator(object):
836 |
837 | def __init__(self, report_dir_path, validation_file_path):
838 | """
839 | Initialize RadReportAnnotator class
840 |
841 | Args:
842 |
843 | report_dir_path: FOLDER where reports are located in montage xls. Expects columns titled "Accession Number" and "Report Text"; can specify alternate labels in define_config()
844 | validation_file_path: FILE with human-labeled reports file. Expects column titled "Accession Number" as first column, every subsequent column will be interpreted as a label to be predicted.
845 |
846 | Returns:
847 |
848 | Nothing
849 |
850 | """
851 |
852 | #USER MODIFIABLE SETTINGS - USE define_config() TO SET
853 | self.DO_BOW=None
854 | self.DO_WORD2VEC=None
855 | self.DO_PARAGRAPH_VECTOR=None
856 | self.DO_SILVER_STANDARD=None
857 | self.STEM_WORDS=None
858 | self.N_GRAM_SIZES = None
859 | self.DIM_DOC2VEC = None
860 | self.N_THRESH_CORPUS=None
861 | self.N_THRESH_OUTCOMES=None
862 | self.TRAIN_INDEX_OVERRIDE = None
863 | self.SILVER_THRESHOLD=None
864 | self.NAME_UNID_LABELED_FILE = None
865 | self.NAME_UNID_REPORTS= None
866 | self.NAME_TEXT_REPORTS= None
867 |
868 |
869 |         #SETTINGS YOU WILL LIKELY WISH TO LEAVE AS IS, BUT CAN MODIFY IF NEEDED
870 | self.D2V_EPOCH = 20 # 20 works well, # of epochs to train D2V for
871 | self.W2V_DM = 1 # 1 is PV-DM, otherwise PV-DBOW
872 | self.W2V_WINDOW = 5 #we can try 3,5,7
873 | self.data_dir = report_dir_path #"Base directory for raw reports
874 | self.validation_file = validation_file_path #"File containing report annotations")
875 | self.ASSIGNFOLD_USING_ROW=False # normally in lasso regression modeling 60% train / 40% test splits are done randomly. you can do them by row if you need consistency across runs
876 |
877 |
878 | #MENTIONING CLASS OBJECTS USED INTERNALLY LATER
879 | self.df_data=None
880 | self.processed_reports=None
881 | self.labeled_indices=None
882 | self.d2v_inputs=None
883 | self.ngram_reports =None
884 | self.corpus = None
885 | self.train_corpus = None
886 | self.test_corpus = None
887 | self.dictionary = None
888 | self.labeled_indices = None
889 | self.train_docs = None # w2v
890 | self.test_docs = None
891 | self.d2vmodel = None
892 | self.bow_matrix = None
893 | self.combined = None
894 | self.pv_matrix = None
895 | self.w2v_matrix = None
896 | self.accid_list = None
897 | self.orig_text = None
898 | self.orig_input = None
899 | self.eligible_outcomes_aligned = None
900 | self.all_outcomes_aligned = None
901 | self.N_LABELS = None
902 | self.outcome_header_list = None
903 | self.lasso_models = None
904 | self.inferred_binary_labels = None
905 | self.inferred_proba_labels = None
906 | self.headers = None
907 | self.accuracy = None
908 |
909 |
910 | def define_config(self, DO_BOW=True, DO_WORD2VEC=False, DO_PARAGRAPH_VECTOR=False,DO_SILVER_STANDARD=True,STEM_WORDS=True,N_GRAM_SIZES=[1],DIM_DOC2VEC=200,N_THRESH_CORPUS=1,N_THRESH_OUTCOMES=1,TRAIN_INDEX_OVERRIDE=[], SILVER_THRESHOLD="mean", NAME_UNID_REPORTS="Accession Number",NAME_TEXT_REPORTS="Report Text"):
911 | """
912 | Sets parameters for RadReportAnnotator.
913 |
914 | Args:
915 |
916 | DO_BOW: True to use indicator bag of words-based features (1 if word present in doc, 0 if not).
917 | DO_WORD2VEC: True to use word2vec-based average word embedding features.
918 | DO_PARAGRAPH_VECTOR: True to use doc2vec-based paragraph vector embedding features.
919 | DO_SILVER_STANDARD: True to infer labels for unlabeled reports.
920 | STEM_WORDS: True to stem words for BOW analysis; words are unstemmed in doc2vec analysis
921 | N_GRAM_SIZES: Which set of n-grams to use in BOW analysis: [1] = 1 grams only, [3] = 3 grams only, [1,2,3] = 1, 2, and 3- grams.
922 | DIM_DOC2VEC: Dimensionality of the doc2vec manifold; we recommend a value between 50 and 400.
923 | N_THRESH_CORPUS: ignore any n-grams that appear fewer than N times in the entire corpus
924 | N_THRESH_OUTCOMES: do not train models for labels that don't have at least this many positive and negative examples.
925 | TRAIN_INDEX_OVERRIDE: list of accession numbers we force to be treated as unlabeled data even though they are labeled (i.e., these will *not* be used in the lasso regressions). May be used if all of your reports are labeled, since some unlabeled reports are required for d2v training.
926 | SILVER_THRESHOLD: how to threshold probability predictions in infer_labels to get binary labels.
927 | can be ["mean","mostlikely"]
928 | mean sets any predicted probability greater than that label's population mean to 1, else 0; e.g., a prediction of 0.10 for a label with average 0.05 is set to 1
929 | mostlikely sets any predicted probability >50% to 1, otherwise 0; e.g., 0.60 becomes 1 and 0.40 becomes 0
930 | both settings have drawbacks, and the resulting class imbalance is a major issue when training convolutional nets on silver-standard labels.
931 | we recommend using the predicted probabilities directly if your downstream model can accommodate them.
932 | NAME_UNID_REPORTS: column name of accession number / unique report id in the read-in *reports* file. provided for convenience as there may be many report files.
933 | NAME_TEXT_REPORTS: column name of report text in the read-in reports file. provided for convenience as there may be many report files.
934 | Returns:
935 |
936 | Nothing
937 |
938 | """
939 |
940 | self.DO_BOW=DO_BOW #generate results for bag of words approach?
941 | self.DO_WORD2VEC=DO_WORD2VEC #generate results (tfidf and avg weight) for word2vec approach?
942 | self.DO_PARAGRAPH_VECTOR=DO_PARAGRAPH_VECTOR #generate results for paragraph vector approach?
943 | self.DO_SILVER_STANDARD=DO_SILVER_STANDARD #generate silver standard labels?
944 | self.STEM_WORDS=STEM_WORDS #should we stem words for BOW, LDA analysis? (we never stem words for doc2vec/w2v analysis, see below)
945 | if N_GRAM_SIZES not in ([1],[2],[3],[1,2],[1,3],[1,2,3]):
946 | raise ValueError('Invalid N_GRAM_SIZES argument: '+str(N_GRAM_SIZES)+", please review documentation for proper format (e.g., [1])")
947 | self.N_GRAM_SIZES = N_GRAM_SIZES # how many n-grams to use in BOW, LDA analyses? [1] = 1 grams only, [3] = 3 grams only, [1,2,3] = 1, 2, and 3- grams.
948 | self.DIM_DOC2VEC = DIM_DOC2VEC #dimensionality of doc2vec manifold
949 | self.N_THRESH_CORPUS=N_THRESH_CORPUS # delete any n-grams that appear fewer than N times in the entire corpus
950 | self.N_THRESH_OUTCOMES=N_THRESH_OUTCOMES # delete any predictors that don't have at least N-many positive and negative examples
951 | self.TRAIN_INDEX_OVERRIDE = TRAIN_INDEX_OVERRIDE # list of accession numbers to force to be treated as unlabeled data even though they are labeled (i.e., these will *not* be used for predictions); some unlabeled reports are required for d2v training
952 | self.SILVER_THRESHOLD=SILVER_THRESHOLD
953 | self.NAME_UNID_REPORTS = NAME_UNID_REPORTS
954 | self.NAME_TEXT_REPORTS = NAME_TEXT_REPORTS
955 |
956 | if(self.DO_BOW==False and self.DO_WORD2VEC==False and self.DO_PARAGRAPH_VECTOR==False): raise ValueError("DO_BOW, DO_WORD2VEC, and DO_PARAGRAPH_VECTOR cannot all be False")
957 |
958 | def build_corpus(self):
959 | """
960 | Builds the corpus of reports and generates numerical features from them for later analysis.
961 | Please run define_config() beforehand.
962 |
963 | Arguments:
964 |
965 | None
966 |
967 | Returns:
968 |
969 | None
970 | """
971 |
972 | #assemble dataframe of reports
973 | self.df_data = join_montage_files(self.data_dir, self.NAME_UNID_REPORTS, self.NAME_TEXT_REPORTS) # build dataframe with all the report text
974 |
975 | #get list of stop words
976 | en_stop = give_stop_words()
977 |
978 | # preprocess report text; returns a list (one entry per report) of the text after a first round of processing.
979 | # if curious to see how it works, look at processed_reports[0] to see the first report.
980 | self.processed_reports = preprocess_data(self.df_data, en_stop, stem_words=self.STEM_WORDS)
981 |
982 | #determine which indices correspond to labeled reports
983 | self.labeled_indices = get_labeled_indices(self.df_data,self.validation_file,self.TRAIN_INDEX_OVERRIDE)
984 |
985 | #build n-grams of desired size, takes a list of sizes and frequency threshold as inputs
986 | self.ngram_reports = create_ngrams(
987 | self.processed_reports,
988 | self.labeled_indices,
989 | N_GRAM_SIZES=self.N_GRAM_SIZES,
990 | freq_threshold=self.N_THRESH_CORPUS) #now we create n-grams
991 |
992 | # generate inputs for doc2vec/word2vec model
993 | # can see example report - d2v_inputs[0]
994 | self.d2v_inputs= remove_infrequent_tokens(self.processed_reports,self.N_THRESH_CORPUS,self.labeled_indices)
995 |
996 | #assemble train/test corpora and a word dict.
997 | self.corpus, self.train_corpus, self.test_corpus, self.dictionary, self.labeled_indices = build_train_test_corpus(
998 | self.df_data,
999 | self.ngram_reports,
1000 | self.validation_file,
1001 | self.TRAIN_INDEX_OVERRIDE)
1002 |
1003 | #train doc2vec/word2vec if indicated:
1004 | if(self.DO_WORD2VEC or self.DO_PARAGRAPH_VECTOR):
1005 | self.train_docs, self.test_docs, self.total_train_words = build_d2v_corpora(self.df_data,self.d2v_inputs,self.labeled_indices)
1006 | self.d2vmodel=train_d2v(self.train_docs, self.test_docs, self.D2V_EPOCH, self.DIM_DOC2VEC, self.W2V_DM, self.W2V_WINDOW, self.total_train_words)
1007 |
1008 | def infer_labels(self):
1009 | """
1010 | Infers labels for unlabeled documents.
1011 | Please run build_corpus() beforehand.
1012 |
1013 | Arguments:
1014 |
1015 | None
1016 |
1017 | Returns:
1018 |
1019 | self.inferred_binary_labels, self.inferred_proba_labels: dataframes containing inferred binary and probabilistic labels
1020 | """
1021 |
1022 | #get the numerical features of text we need to train models for labels
1023 | self.bow_matrix, self.pv_matrix,self.w2v_matrix,self.accid_list,self.orig_text,self.orig_input=generate_labeled_data_features(
1024 | self.validation_file,
1025 | self.labeled_indices,
1026 | self.DIM_DOC2VEC,
1027 | self.df_data,
1028 | self.processed_reports,
1029 | self.DO_PARAGRAPH_VECTOR,
1030 | self.DO_WORD2VEC,
1031 | self.dictionary,
1032 | self.corpus,
1033 | self.d2vmodel,
1034 | self.d2v_inputs)
1035 |
1036 | #get and process labels for reports
1037 | self.eligible_outcomes_aligned,self.all_outcomes_aligned, self.N_LABELS, self.outcome_header_list = generate_outcomes(
1038 | self.validation_file,
1039 | self.accid_list,
1040 | self.N_THRESH_OUTCOMES)
1041 |
1042 | #to generate silver standard labels -- use whatever features are generated (word2vec average word embeddings, bow features, paragraph vector matrix)
1043 | if(self.DO_BOW and self.DO_WORD2VEC and self.DO_PARAGRAPH_VECTOR): self.combined=np.hstack((self.bow_matrix,self.w2v_matrix,self.pv_matrix))
1044 |
1045 | if(self.DO_BOW and self.DO_WORD2VEC and not self.DO_PARAGRAPH_VECTOR): self.combined=np.hstack((self.bow_matrix,self.w2v_matrix))
1046 | if(self.DO_BOW and not self.DO_WORD2VEC and self.DO_PARAGRAPH_VECTOR): self.combined=np.hstack((self.bow_matrix,self.pv_matrix))
1047 | if(not self.DO_BOW and self.DO_WORD2VEC and self.DO_PARAGRAPH_VECTOR): self.combined=np.hstack((self.w2v_matrix,self.pv_matrix))
1048 |
1049 | if(self.DO_BOW and not self.DO_WORD2VEC and not self.DO_PARAGRAPH_VECTOR): self.combined=self.bow_matrix
1050 | if(not self.DO_BOW and self.DO_WORD2VEC and not self.DO_PARAGRAPH_VECTOR): self.combined=self.w2v_matrix
1051 | if(not self.DO_BOW and not self.DO_WORD2VEC and self.DO_PARAGRAPH_VECTOR): self.combined=self.pv_matrix
1052 |
1053 | #create header for combined predictor matrix so we can interpret coefficients
1054 | self.headers=[]
1055 | if(self.DO_BOW): self.headers=self.headers + [self.dictionary[i] for i in self.dictionary]
1056 | if(self.DO_WORD2VEC): self.headers=self.headers + ["W2V"+str(i) for i in range(0,self.DIM_DOC2VEC)]
1057 | if(self.DO_PARAGRAPH_VECTOR): self.headers=self.headers + ["PV"+str(i) for i in range(0,self.DIM_DOC2VEC)]
1058 |
1059 |
1060 | pred_type = "combined" # a label for results
1061 | print("dimensionality of predictor matrix: "+str(self.combined.shape))
1062 |
1063 | #run lasso regressions
1064 | self.lasso_models, self.accuracy = calc_auc(self.combined,self.eligible_outcomes_aligned,self.all_outcomes_aligned, self.N_LABELS, pred_type, self.headers,self.ASSIGNFOLD_USING_ROW)
1065 |
1066 | #infer labels
1067 | self.inferred_binary_labels, self.inferred_proba_labels = write_silver_standard_labels(self.corpus,
1068 | self.N_LABELS,
1069 | self.eligible_outcomes_aligned,
1070 | self.DIM_DOC2VEC,
1071 | self.processed_reports,
1072 | self.DO_BOW,
1073 | self.DO_PARAGRAPH_VECTOR,
1074 | self.DO_WORD2VEC,
1075 | self.dictionary,
1076 | self.d2vmodel,
1077 | self.d2v_inputs,
1078 | self.lasso_models,
1079 | self.accid_list,
1080 | self.labeled_indices,
1081 | self.df_data,
1082 | self.SILVER_THRESHOLD)
1083 | return self.inferred_binary_labels, self.inferred_proba_labels
1084 |
1085 |
1086 |
1087 |
1088 |
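1089 | # Minimal end-to-end usage sketch (illustrative only, not part of the library API).
1090 | # The paths below are hypothetical placeholders: point report_dir_path at a folder
1091 | # of Montage .xls report exports and validation_file_path at your hand-labeled
1092 | # spreadsheet. The call order (define_config, build_corpus, infer_labels) is the
1093 | # one documented in the methods above.
1094 | if __name__ == "__main__":
1095 |     import os.path
1096 |
1097 |     annotator = RadReportAnnotator(
1098 |         report_dir_path=os.path.join("my_data", "reports"),           # hypothetical folder
1099 |         validation_file_path=os.path.join("my_data", "labels.xlsx"))  # hypothetical file
1100 |
1101 |     # bag-of-words features with 1- and 2-grams; see the define_config() docstring
1102 |     annotator.define_config(DO_BOW=True,
1103 |         DO_WORD2VEC=False,
1104 |         DO_PARAGRAPH_VECTOR=False,
1105 |         N_GRAM_SIZES=[1, 2],
1106 |         N_THRESH_CORPUS=10,
1107 |         N_THRESH_OUTCOMES=50)
1108 |
1109 |     annotator.build_corpus()                        # tokenize reports and build features
1110 |     binary_df, proba_df = annotator.infer_labels()  # fit lasso models, label all reports
1111 |     print(binary_df.head())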
--------------------------------------------------------------------------------
/environment.yml:
--------------------------------------------------------------------------------
1 | name: rad_env
2 | channels:
3 | - conda-forge
4 | dependencies:
5 | - python=3.6.5
6 | - pandas=0.22.0
7 | - numpy=1.14.2
8 | - tqdm=4.23.0
9 | - scikit-learn=0.19.0
10 | - xlrd=1.1.0
11 | - jupyterlab=0.35.0
12 | - nb_conda_kernels=2.1.1
13 | - nltk=3.4.4
14 | - gensim=3.5.0
15 |
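16 | # To build and activate this environment with conda:
17 | #   conda env create -f environment.yml
18 | #   conda activate rad_env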
--------------------------------------------------------------------------------
/pseudodata/labels/labeled_reports.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aisinai/rad-report-annotator/61ee948866bb09d272fc75210c63dcb818b3c21d/pseudodata/labels/labeled_reports.xlsx
--------------------------------------------------------------------------------
/pseudodata/reports/words.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aisinai/rad-report-annotator/61ee948866bb09d272fc75210c63dcb818b3c21d/pseudodata/reports/words.xlsx
--------------------------------------------------------------------------------