├── Demo Notebook.ipynb
├── README.md
├── RadReportAnnotator.py
├── environment.yml
└── pseudodata
    ├── labels
    │   └── labeled_reports.xlsx
    └── reports
        └── words.xlsx


/Demo Notebook.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "# RadReportAnnotator Demo\n",
  8 |     "\n",
  9 |     "We demonstrate on data from the [Indiana University Chest X-ray Dataset (Demner-Fushman et al.)](https://www.ncbi.nlm.nih.gov/pubmed/26133894)\n",
 10 |     "\n",
 11 |     "This example can be adapted to your own collection of radiology reports exported from Montage \n",
 12 |     "and a manually generated set of classification labels."
 13 |    ]
 14 |   },
 15 |   {
 16 |    "cell_type": "markdown",
 17 |    "metadata": {},
 18 |    "source": [
 19 |     "Import library:"
 20 |    ]
 21 |   },
 22 |   {
 23 |    "cell_type": "code",
 24 |    "execution_count": 1,
 25 |    "metadata": {
 26 |     "collapsed": true
 27 |    },
 28 |    "outputs": [],
 29 |    "source": [
 30 |     "import RadReportAnnotator as ra\n",
 31 |     "import os.path"
 32 |    ]
 33 |   },
 34 |   {
 35 |    "cell_type": "markdown",
 36 |    "metadata": {},
 37 |    "source": [
 38 |     "Instantiate RadReportAnnotator object with paths to demo `reports` and `labels`. \n",
 39 |     "\n",
 40 |     "`Reports` contains 3,666 deidentified chest x-ray radiology reports. \n",
 41 |     "\n",
 42 |     "`Labels` contains binary labels for `Normal`, `Opacity`, `Cardiomegaly`, `Nodule`, and `Fibrosis` for 1,500 of these reports."
 43 |    ]
 44 |   },
 45 |   {
 46 |    "cell_type": "code",
 47 |    "execution_count": 2,
 48 |    "metadata": {
 49 |     "collapsed": true
 50 |    },
 51 |    "outputs": [],
 52 |    "source": [
 53 |     "CXRAnnotator = ra.RadReportAnnotator(report_dir_path=os.path.join(\"pseudodata\",\"reports\"), \n",
 54 |     "                                     validation_file_path=os.path.join(\"pseudodata\",\"labels\",\"labeled_reports.xlsx\"))"
 55 |    ]
 56 |   },
 57 |   {
 58 |    "cell_type": "markdown",
 59 |    "metadata": {},
 60 |    "source": [
 61 |     "Set arguments for RadReportAnnotator in `define_config`; see the documentation in `RadReportAnnotator.py` for more information.\n",
 62 |     "\n",
 63 |     "Models that use only bag of words (`DO_BOW=True,DO_WORD2VEC=False`) have been competitive in our experience with those that use both bag of words and word embeddings (`DO_BOW=True, DO_WORD2VEC=True`). Word embeddings can take considerable time to train on larger datasets. \n",
 64 |     "\n",
 65 |     "In the demo below, we use bag of words features (`DO_BOW=True`) with 1-, 2-, and 3-grams (`N_GRAM_SIZES=[1,2,3]`)."
 66 |    ]
 67 |   },
 68 |   {
 69 |    "cell_type": "code",
 70 |    "execution_count": 3,
 71 |    "metadata": {
 72 |     "collapsed": true
 73 |    },
 74 |    "outputs": [],
 75 |    "source": [
 76 |     "CXRAnnotator.define_config(DO_BOW=True,\n",
 77 |     "\tDO_WORD2VEC=False,\n",
 78 |     "\tDO_PARAGRAPH_VECTOR=False,\n",
 79 |     "\tN_GRAM_SIZES=[1,2,3],\n",
 80 |     "\tSILVER_THRESHOLD=\"fiftypct\",\n",
 81 |     "\tNAME_UNID_REPORTS = \"ACCID\", \n",
 82 |     "\tNAME_TEXT_REPORTS =\"REPORT\", \n",
 83 |     "\tN_THRESH_CORPUS=10,\n",
 84 |     "\tN_THRESH_OUTCOMES=50)"
 85 |    ]
 86 |   },
 87 |   {
 88 |    "cell_type": "markdown",
 89 |    "metadata": {},
 90 |    "source": [
 91 |     "Build corpus from reports"
 92 |    ]
 93 |   },
 94 |   {
 95 |    "cell_type": "code",
 96 |    "execution_count": 4,
 97 |    "metadata": {},
 98 |    "outputs": [
 99 |     {
100 |      "name": "stdout",
101 |      "output_type": "stream",
102 |      "text": [
103 |       "building pre-corpus\n",
104 |       "pre-corpus built\n",
105 |       "preprocessing reports\n"
106 |      ]
107 |     },
108 |     {
109 |      "name": "stderr",
110 |      "output_type": "stream",
111 |      "text": [
112 |       "100%|█████████████████████████████████████████████████████████████████████████████| 3666/3666 [00:07<00:00, 473.86it/s]\n"
113 |      ]
114 |     },
115 |     {
116 |      "name": "stdout",
117 |      "output_type": "stream",
118 |      "text": [
119 |       "creating n-grams\n"
120 |      ]
121 |     },
122 |     {
123 |      "name": "stderr",
124 |      "output_type": "stream",
125 |      "text": [
126 |       "100%|████████████████████████████████████████████████████████████████████████████| 3666/3666 [00:00<00:00, 6268.53it/s]\n"
127 |      ]
128 |     },
129 |     {
130 |      "name": "stdout",
131 |      "output_type": "stream",
132 |      "text": [
133 |       "number of unique n-grams: 33865\n",
134 |       "number of unique n-grams after filtering out low frequency tokens: 2425\n"
135 |      ]
136 |     }
137 |    ],
138 |    "source": [
139 |     "CXRAnnotator.build_corpus()"
140 |    ]
141 |   },
142 |   {
143 |    "cell_type": "markdown",
144 |    "metadata": {},
145 |    "source": [
146 |     "We can examine how the preprocessing works. Let's look at the original input text for report at index 500:"
147 |    ]
148 |   },
149 |   {
150 |    "cell_type": "code",
151 |    "execution_count": 5,
152 |    "metadata": {},
153 |    "outputs": [
154 |     {
155 |      "data": {
156 |       "text/plain": [
157 |        "'  Comparison:  None   Indication:  Central line placement   Findings:  The heart is borderline in size. The aorta is mildly tortuous. XXXX right IJ catheter is in XXXX with tip in proximal right atrium/cavoatrial junction. There is no pneumothorax. Lungs are grossly clear. There is no large effusion.   Impression:  Right IJ catheter tip in proximal right atrium. No pneumothorax. '"
158 |       ]
159 |      },
160 |      "execution_count": 5,
161 |      "metadata": {},
162 |      "output_type": "execute_result"
163 |     }
164 |    ],
165 |    "source": [
166 |     "CXRAnnotator.df_data['Report Text'].iloc[500]"
167 |    ]
168 |   },
169 |   {
170 |    "cell_type": "markdown",
171 |    "metadata": {},
172 |    "source": [
173 |     "Let's look at this report after preprocessing:"
174 |    ]
175 |   },
176 |   {
177 |    "cell_type": "code",
178 |    "execution_count": 6,
179 |    "metadata": {},
180 |    "outputs": [
181 |     {
182 |      "name": "stdout",
183 |      "output_type": "stream",
184 |      "text": [
185 |       "['comparison', 'none', 'indic', 'central', 'line', 'placement', 'find', 'the', 'heart', 'is', 'borderlin', 'in', 'size', 'sentenceend', 'the', 'aorta', 'is', 'mildli', 'tortuou', 'sentenceend', 'xxxx', 'right', 'ij', 'cathet', 'is', 'in', 'xxxx', 'with', 'tip', 'in', 'proxim', 'right', 'atrium', 'cavoatri', 'junction', 'sentenceend', 'there', 'is', 'no', 'pneumothorax', 'sentenceend', 'lung', 'are', 'grossli', 'clear', 'sentenceend', 'there', 'is', 'no', 'larg', 'effus', 'sentenceend', 'impress', 'right', 'ij', 'cathet', 'tip', 'in', 'proxim', 'right', 'atrium', 'sentenceend', 'no', 'pneumothorax', 'sentenceend', 'sentenceend', 'sentenceend', 'sentenceend']\n"
186 |      ]
187 |     }
188 |    ],
189 |    "source": [
190 |     "print(CXRAnnotator.processed_reports[500])"
191 |    ]
192 |   },
193 |   {
194 |    "cell_type": "markdown",
195 |    "metadata": {},
196 |    "source": [
197 |     "Words were lowercased and stemmed (\"indication\"-->\"indic\"), extra punctuation was removed, and periods were replaced with the special end token (\"sentenceend\"). Word2vec takes input in a format like this to learn word embeddings.\n",
198 |     "\n",
199 |     "Let's look at the n-gram features for this report, which will be used for bag of words modeling:"
200 |    ]
201 |   },
202 |   {
203 |    "cell_type": "code",
204 |    "execution_count": 7,
205 |    "metadata": {},
206 |    "outputs": [
207 |     {
208 |      "name": "stdout",
209 |      "output_type": "stream",
210 |      "text": [
211 |       "['find_the_heart', 'the_heart_is', 'the_aorta_is', 'lung_are_grossli', 'are_grossli_clear', 'no_larg_effus', 'comparison_none', 'find_the', 'the_heart', 'heart_is', 'in_size', 'the_aorta', 'aorta_is', 'is_mildli', 'xxxx_right', 'is_in', 'in_xxxx', 'xxxx_with', 'with_tip', 'tip_in', 'right_atrium', 'there_is', 'no_pneumothorax', 'lung_are', 'are_grossli', 'grossli_clear', 'there_is', 'no_larg', 'larg_effus', 'impress_right', 'cathet_tip', 'tip_in', 'right_atrium', 'no_pneumothorax', 'comparison', 'none', 'indic', 'central', 'line', 'placement', 'find', 'the', 'heart', 'is', 'borderlin', 'in', 'size', 'the', 'aorta', 'is', 'mildli', 'tortuou', 'xxxx', 'right', 'cathet', 'is', 'in', 'xxxx', 'with', 'tip', 'in', 'right', 'atrium', 'junction', 'there', 'is', 'pneumothorax', 'lung', 'are', 'grossli', 'clear', 'there', 'is', 'larg', 'effus', 'impress', 'right', 'cathet', 'tip', 'in', 'right', 'atrium', 'pneumothorax']\n"
212 |      ]
213 |     }
214 |    ],
215 |    "source": [
216 |     "print(CXRAnnotator.ngram_reports[500])"
217 |    ]
218 |   },
219 |   {
220 |    "cell_type": "markdown",
221 |    "metadata": {},
222 |    "source": [
223 |     "Since we have `N_GRAM_SIZES=[1,2,3]` in this demo, we see individual words (1-grams), each 2 consecutive words (2-grams; e.g., 'comparison_none'), and each 3 consecutive words (3-grams; e.g., 'no_larg_effus') available as features. Sometimes these 2- and 3-grams are uninformative ('comparison_none'); at other times they may be useful ('no_pneumothorax'). Note that only n-grams appearing `N_THRESH_CORPUS` times in training data (10 in this demo) are included. "
224 |    ]
225 |   },
226 |   {
227 |    "cell_type": "markdown",
228 |    "metadata": {},
229 |    "source": [
230 |     "Train Lasso logistic regression models using features from 60% of labeled reports and infer labels for 40% of labeled reports (for performance evaluation) and unlabeled reports (for ultimate application):"
231 |    ]
232 |   },
233 |   {
234 |    "cell_type": "code",
235 |    "execution_count": 8,
236 |    "metadata": {},
237 |    "outputs": [
238 |     {
239 |      "name": "stdout",
240 |      "output_type": "stream",
241 |      "text": [
242 |       "generating features\n"
243 |      ]
244 |     },
245 |     {
246 |      "name": "stderr",
247 |      "output_type": "stream",
248 |      "text": [
249 |       "100%|████████████████████████████████████████████████████████████████████████████| 1500/1500 [00:00<00:00, 4099.24it/s]\n"
250 |      ]
251 |     },
252 |     {
253 |      "name": "stdout",
254 |      "output_type": "stream",
255 |      "text": [
256 |       "total labels:6\n",
257 |       "labels eligible for inference:4\n",
258 |       "dimensionality of predictor matrix:(1500, 2425)\n",
259 |       "n_train in modeling=900\n",
260 |       "n_test in modeling=600\n",
261 |       "i=0\n"
262 |      ]
263 |     },
264 |     {
265 |      "name": "stderr",
266 |      "output_type": "stream",
267 |      "text": [
268 |       "100%|███████████████████████████████████████████████████████████████████████████| 2000/2000 [00:00<00:00, 26965.13it/s]\n",
269 |       "100%|███████████████████████████████████████████████████████████████████████████| 1666/1666 [00:00<00:00, 19683.19it/s]\n"
270 |      ]
271 |     }
272 |    ],
273 |    "source": [
274 |     "binary_labels, proba_labels = CXRAnnotator.infer_labels()"
275 |    ]
276 |   },
277 |   {
278 |    "cell_type": "markdown",
279 |    "metadata": {},
280 |    "source": [
281 |     "Examine quality of predictions on held out 40% of labeled data."
282 |    ]
283 |   },
284 |   {
285 |    "cell_type": "code",
286 |    "execution_count": 9,
287 |    "metadata": {},
288 |    "outputs": [
289 |     {
290 |      "data": {
291 |       "text/html": [
292 |        "<div>\n",
293 |        "<style>\n",
294 |        "    .dataframe thead tr:only-child th {\n",
295 |        "        text-align: right;\n",
296 |        "    }\n",
297 |        "\n",
298 |        "    .dataframe thead th {\n",
299 |        "        text-align: left;\n",
300 |        "    }\n",
301 |        "\n",
302 |        "    .dataframe tbody tr th {\n",
303 |        "        vertical-align: top;\n",
304 |        "    }\n",
305 |        "</style>\n",
306 |        "<table border=\"1\" class=\"dataframe\">\n",
307 |        "  <thead>\n",
308 |        "    <tr style=\"text-align: right;\">\n",
309 |        "      <th></th>\n",
310 |        "      <th>AUC</th>\n",
311 |        "      <th>True +</th>\n",
312 |        "      <th>False +</th>\n",
313 |        "      <th>True -</th>\n",
314 |        "      <th>False -</th>\n",
315 |        "    </tr>\n",
316 |        "    <tr>\n",
317 |        "      <th>Label (with calcs on held out 40 pct)</th>\n",
318 |        "      <th></th>\n",
319 |        "      <th></th>\n",
320 |        "      <th></th>\n",
321 |        "      <th></th>\n",
322 |        "      <th></th>\n",
323 |        "    </tr>\n",
324 |        "  </thead>\n",
325 |        "  <tbody>\n",
326 |        "    <tr>\n",
327 |        "      <th>Normal</th>\n",
328 |        "      <td>0.956679</td>\n",
329 |        "      <td>208</td>\n",
330 |        "      <td>53</td>\n",
331 |        "      <td>324</td>\n",
332 |        "      <td>15</td>\n",
333 |        "    </tr>\n",
334 |        "    <tr>\n",
335 |        "      <th>Opacity</th>\n",
336 |        "      <td>0.981869</td>\n",
337 |        "      <td>62</td>\n",
338 |        "      <td>17</td>\n",
339 |        "      <td>517</td>\n",
340 |        "      <td>4</td>\n",
341 |        "    </tr>\n",
342 |        "    <tr>\n",
343 |        "      <th>Cardiomegaly</th>\n",
344 |        "      <td>0.993979</td>\n",
345 |        "      <td>41</td>\n",
346 |        "      <td>18</td>\n",
347 |        "      <td>541</td>\n",
348 |        "      <td>0</td>\n",
349 |        "    </tr>\n",
350 |        "    <tr>\n",
351 |        "      <th>Nodule</th>\n",
352 |        "      <td>0.991759</td>\n",
353 |        "      <td>16</td>\n",
354 |        "      <td>36</td>\n",
355 |        "      <td>548</td>\n",
356 |        "      <td>0</td>\n",
357 |        "    </tr>\n",
358 |        "  </tbody>\n",
359 |        "</table>\n",
360 |        "</div>"
361 |       ],
362 |       "text/plain": [
363 |        "                                            AUC  True +  False +  True -  \\\n",
364 |        "Label (with calcs on held out 40 pct)                                      \n",
365 |        "Normal                                 0.956679     208       53     324   \n",
366 |        "Opacity                                0.981869      62       17     517   \n",
367 |        "Cardiomegaly                           0.993979      41       18     541   \n",
368 |        "Nodule                                 0.991759      16       36     548   \n",
369 |        "\n",
370 |        "                                       False -  \n",
371 |        "Label (with calcs on held out 40 pct)           \n",
372 |        "Normal                                      15  \n",
373 |        "Opacity                                      4  \n",
374 |        "Cardiomegaly                                 0  \n",
375 |        "Nodule                                       0  "
376 |       ]
377 |      },
378 |      "execution_count": 9,
379 |      "metadata": {},
380 |      "output_type": "execute_result"
381 |     }
382 |    ],
383 |    "source": [
384 |     "CXRAnnotator.accuracy"
385 |    ]
386 |   },
387 |   {
388 |    "cell_type": "markdown",
389 |    "metadata": {},
390 |    "source": [
391 |     "Notice that `Fibrosis` was filtered out, despite appearing in the input data, because it had very few positive observations. It is important to ensure that sufficient positive and negative cases for each label exist in your labeled data.\n",
392 |     "\n",
393 |     "Rare labels with high AUC may still have a significant number of false positives (`Nodule`). Be aware of noise introduced by your labeling process before using inferred labels to train convolutional neural networks or other algorithms, and consider the positive predictive value (PPV) of a positive label. Additional labeled examples, particularly of rare pathology, may help improve accuracy. \n",
394 |     "\n",
395 |     "Recent results ([Ghafoorian et al.](https://arxiv.org/abs/1801.05040), [Rajpurkar et al.](https://arxiv.org/abs/1711.05225)) demonstrate that deep learning can achieve impressive results when trained on a large, noisily labeled radiological imaging dataset."
396 |    ]
397 |   },
398 |   {
399 |    "cell_type": "markdown",
400 |    "metadata": {},
401 |    "source": [
402 |     "Examine a few probabilistic predictions:"
403 |    ]
404 |   },
405 |   {
406 |    "cell_type": "code",
407 |    "execution_count": 10,
408 |    "metadata": {},
409 |    "outputs": [
410 |     {
411 |      "data": {
412 |       "text/html": [
413 |        "<div>\n",
414 |        "<style>\n",
415 |        "    .dataframe thead tr:only-child th {\n",
416 |        "        text-align: right;\n",
417 |        "    }\n",
418 |        "\n",
419 |        "    .dataframe thead th {\n",
420 |        "        text-align: left;\n",
421 |        "    }\n",
422 |        "\n",
423 |        "    .dataframe tbody tr th {\n",
424 |        "        vertical-align: top;\n",
425 |        "    }\n",
426 |        "</style>\n",
427 |        "<table border=\"1\" class=\"dataframe\">\n",
428 |        "  <thead>\n",
429 |        "    <tr style=\"text-align: right;\">\n",
430 |        "      <th></th>\n",
431 |        "      <th>Normal</th>\n",
432 |        "      <th>Opacity</th>\n",
433 |        "      <th>Cardiomegaly</th>\n",
434 |        "      <th>Nodule</th>\n",
435 |        "    </tr>\n",
436 |        "    <tr>\n",
437 |        "      <th>Accession Number</th>\n",
438 |        "      <th></th>\n",
439 |        "      <th></th>\n",
440 |        "      <th></th>\n",
441 |        "      <th></th>\n",
442 |        "    </tr>\n",
443 |        "  </thead>\n",
444 |        "  <tbody>\n",
445 |        "    <tr>\n",
446 |        "      <th>103661</th>\n",
447 |        "      <td>0.113953</td>\n",
448 |        "      <td>0.007305</td>\n",
449 |        "      <td>0.022156</td>\n",
450 |        "      <td>0.009483</td>\n",
451 |        "    </tr>\n",
452 |        "    <tr>\n",
453 |        "      <th>103662</th>\n",
454 |        "      <td>0.283203</td>\n",
455 |        "      <td>0.007305</td>\n",
456 |        "      <td>0.022156</td>\n",
457 |        "      <td>0.009483</td>\n",
458 |        "    </tr>\n",
459 |        "    <tr>\n",
460 |        "      <th>103663</th>\n",
461 |        "      <td>0.283203</td>\n",
462 |        "      <td>0.007305</td>\n",
463 |        "      <td>0.022156</td>\n",
464 |        "      <td>0.009483</td>\n",
465 |        "    </tr>\n",
466 |        "    <tr>\n",
467 |        "      <th>103664</th>\n",
468 |        "      <td>0.000129</td>\n",
469 |        "      <td>0.060547</td>\n",
470 |        "      <td>0.058807</td>\n",
471 |        "      <td>0.037109</td>\n",
472 |        "    </tr>\n",
473 |        "    <tr>\n",
474 |        "      <th>103665</th>\n",
475 |        "      <td>0.020233</td>\n",
476 |        "      <td>0.999512</td>\n",
477 |        "      <td>0.011406</td>\n",
478 |        "      <td>0.019058</td>\n",
479 |        "    </tr>\n",
480 |        "  </tbody>\n",
481 |        "</table>\n",
482 |        "</div>"
483 |       ],
484 |       "text/plain": [
485 |        "                    Normal   Opacity  Cardiomegaly    Nodule\n",
486 |        "Accession Number                                            \n",
487 |        "103661            0.113953  0.007305      0.022156  0.009483\n",
488 |        "103662            0.283203  0.007305      0.022156  0.009483\n",
489 |        "103663            0.283203  0.007305      0.022156  0.009483\n",
490 |        "103664            0.000129  0.060547      0.058807  0.037109\n",
491 |        "103665            0.020233  0.999512      0.011406  0.019058"
492 |       ]
493 |      },
494 |      "execution_count": 10,
495 |      "metadata": {},
496 |      "output_type": "execute_result"
497 |     }
498 |    ],
499 |    "source": [
500 |     "proba_labels.tail()"
501 |    ]
502 |   },
503 |   {
504 |    "cell_type": "markdown",
505 |    "metadata": {},
506 |    "source": [
507 |     "Examine a few binary predictions; these are replaced by manual labels when available:"
508 |    ]
509 |   },
510 |   {
511 |    "cell_type": "code",
512 |    "execution_count": 11,
513 |    "metadata": {},
514 |    "outputs": [
515 |     {
516 |      "data": {
517 |       "text/html": [
518 |        "<div>\n",
519 |        "<style>\n",
520 |        "    .dataframe thead tr:only-child th {\n",
521 |        "        text-align: right;\n",
522 |        "    }\n",
523 |        "\n",
524 |        "    .dataframe thead th {\n",
525 |        "        text-align: left;\n",
526 |        "    }\n",
527 |        "\n",
528 |        "    .dataframe tbody tr th {\n",
529 |        "        vertical-align: top;\n",
530 |        "    }\n",
531 |        "</style>\n",
532 |        "<table border=\"1\" class=\"dataframe\">\n",
533 |        "  <thead>\n",
534 |        "    <tr style=\"text-align: right;\">\n",
535 |        "      <th></th>\n",
536 |        "      <th>Normal</th>\n",
537 |        "      <th>Opacity</th>\n",
538 |        "      <th>Cardiomegaly</th>\n",
539 |        "      <th>Nodule</th>\n",
540 |        "    </tr>\n",
541 |        "    <tr>\n",
542 |        "      <th>Accession Number</th>\n",
543 |        "      <th></th>\n",
544 |        "      <th></th>\n",
545 |        "      <th></th>\n",
546 |        "      <th></th>\n",
547 |        "    </tr>\n",
548 |        "  </thead>\n",
549 |        "  <tbody>\n",
550 |        "    <tr>\n",
551 |        "      <th>103661</th>\n",
552 |        "      <td>0</td>\n",
553 |        "      <td>0</td>\n",
554 |        "      <td>0</td>\n",
555 |        "      <td>0</td>\n",
556 |        "    </tr>\n",
557 |        "    <tr>\n",
558 |        "      <th>103662</th>\n",
559 |        "      <td>0</td>\n",
560 |        "      <td>0</td>\n",
561 |        "      <td>0</td>\n",
562 |        "      <td>0</td>\n",
563 |        "    </tr>\n",
564 |        "    <tr>\n",
565 |        "      <th>103663</th>\n",
566 |        "      <td>0</td>\n",
567 |        "      <td>0</td>\n",
568 |        "      <td>0</td>\n",
569 |        "      <td>0</td>\n",
570 |        "    </tr>\n",
571 |        "    <tr>\n",
572 |        "      <th>103664</th>\n",
573 |        "      <td>0</td>\n",
574 |        "      <td>0</td>\n",
575 |        "      <td>0</td>\n",
576 |        "      <td>0</td>\n",
577 |        "    </tr>\n",
578 |        "    <tr>\n",
579 |        "      <th>103665</th>\n",
580 |        "      <td>0</td>\n",
581 |        "      <td>1</td>\n",
582 |        "      <td>0</td>\n",
583 |        "      <td>0</td>\n",
584 |        "    </tr>\n",
585 |        "  </tbody>\n",
586 |        "</table>\n",
587 |        "</div>"
588 |       ],
589 |       "text/plain": [
590 |        "                  Normal  Opacity  Cardiomegaly  Nodule\n",
591 |        "Accession Number                                       \n",
592 |        "103661                 0        0             0       0\n",
593 |        "103662                 0        0             0       0\n",
594 |        "103663                 0        0             0       0\n",
595 |        "103664                 0        0             0       0\n",
596 |        "103665                 0        1             0       0"
597 |       ]
598 |      },
599 |      "execution_count": 11,
600 |      "metadata": {},
601 |      "output_type": "execute_result"
602 |     }
603 |    ],
604 |    "source": [
605 |     "binary_labels.tail()"
606 |    ]
607 |   },
608 |   {
609 |    "cell_type": "markdown",
610 |    "metadata": {},
611 |    "source": [
612 |     "You can examine individual report predictions; here are the report text and predictions for a report that manual reviewers coded as `Normal`:"
613 |    ]
614 |   },
615 |   {
616 |    "cell_type": "code",
617 |    "execution_count": 12,
618 |    "metadata": {},
619 |    "outputs": [
620 |     {
621 |      "name": "stdout",
622 |      "output_type": "stream",
623 |      "text": [
624 |       "  Comparison:  None.   Indication:  XXXX, chest pain and XXXX x2 weeks.   Findings:  The cardiomediastinal silhouette and pulmonary vasculature are within normal limits in size. The lungs are clear of focal airspace disease, pneumothorax, or pleural effusion. There are no acute bony findings.   Impression:  No acute cardiopulmonary findings. \n",
625 |       "\n",
626 |       "\n",
627 |       "Normal          0.969727\n",
628 |       "Opacity         0.001776\n",
629 |       "Cardiomegaly    0.000642\n",
630 |       "Nodule          0.000948\n",
631 |       "Name: 101700, dtype: float64\n",
632 |       "\n",
633 |       "\n",
634 |       "Normal          1\n",
635 |       "Opacity         0\n",
636 |       "Cardiomegaly    0\n",
637 |       "Nodule          0\n",
638 |       "Name: 101700, dtype: int32\n"
639 |      ]
640 |     }
641 |    ],
642 |    "source": [
643 |     "#normal report\n",
644 |     "print(CXRAnnotator.df_data['Report Text'].iloc[1700])\n",
645 |     "print(\"\\n\")\n",
646 |     "print(proba_labels.iloc[1700])\n",
647 |     "print(\"\\n\")\n",
648 |     "print(binary_labels.iloc[1700])"
649 |    ]
650 |   },
651 |   {
652 |    "cell_type": "markdown",
653 |    "metadata": {},
654 |    "source": [
655 |     "Here are the report text and predictions for a report that manual reviewers coded as positive for `Cardiomegaly`:"
656 |    ]
657 |   },
658 |   {
659 |    "cell_type": "code",
660 |    "execution_count": 13,
661 |    "metadata": {},
662 |    "outputs": [
663 |     {
664 |      "name": "stdout",
665 |      "output_type": "stream",
666 |      "text": [
667 |       "  Comparison:  PA and lateral chest x-XXXX dated XXXX.   Indication:  XXXX-year-old female with chest pain.   Findings:  The heart size is enlarged. Tortuous aorta. Otherwise the mediastinal contour is within normal limits. The lungs are free of any focal infiltrates. There are no nodules or masses. No visible pneumothorax. No visible pleural fluid. The XXXX are grossly normal. There is no visible free intraperitoneal air under the diaphragm.   Impression:  1. Cardiomegaly without lung infiltrates. \n",
668 |       "\n",
669 |       "\n",
670 |       "Normal          0.008018\n",
671 |       "Opacity         0.001008\n",
672 |       "Cardiomegaly    0.981445\n",
673 |       "Nodule          0.056152\n",
674 |       "Name: 102100, dtype: float64\n",
675 |       "\n",
676 |       "\n",
677 |       "Normal          0\n",
678 |       "Opacity         0\n",
679 |       "Cardiomegaly    1\n",
680 |       "Nodule          0\n",
681 |       "Name: 102100, dtype: int32\n"
682 |      ]
683 |     }
684 |    ],
685 |    "source": [
686 |     "print(CXRAnnotator.df_data['Report Text'].iloc[2100])\n",
687 |     "print(\"\\n\")\n",
688 |     "print(proba_labels.iloc[2100])\n",
689 |     "print(\"\\n\")\n",
690 |     "print(binary_labels.iloc[2100])"
691 |    ]
692 |   },
693 |   {
694 |    "cell_type": "markdown",
695 |    "metadata": {},
696 |    "source": [
697 |     "Here are the report text and predictions for a report that manual reviewers coded as positive for `Opacity`:"
698 |    ]
699 |   },
700 |   {
701 |    "cell_type": "code",
702 |    "execution_count": 14,
703 |    "metadata": {},
704 |    "outputs": [
705 |     {
706 |      "name": "stdout",
707 |      "output_type": "stream",
708 |      "text": [
709 |       "  Comparison:  XXXX, XXXX   Indication:  XXXX-year-old XXXX with chest pain.   Findings:  The heart size is stable. The aorta is ectatic and atherosclerotic but stable. XXXX sternotomy XXXX are again noted. The scarring in the left lower lobe is again noted and unchanged from prior exam. There are mild bilateral prominent lung interstitial opacities consistent with emphysematous disease. The calcified granulomas are stable.   Impression:  1. Changes of emphysema and left lower lobe scarring, both stable. 2. Unchanged degenerative and atherosclerotic changes of the thoracic aorta. \n",
710 |       "\n",
711 |       "\n",
712 |       "Normal          0.000000\n",
713 |       "Opacity         0.981445\n",
714 |       "Cardiomegaly    0.125977\n",
715 |       "Nodule          0.234497\n",
716 |       "Name: 102770, dtype: float64\n",
717 |       "\n",
718 |       "\n",
719 |       "Normal          0\n",
720 |       "Opacity         1\n",
721 |       "Cardiomegaly    0\n",
722 |       "Nodule          0\n",
723 |       "Name: 102770, dtype: int32\n"
724 |      ]
725 |     }
726 |    ],
727 |    "source": [
728 |     "#opacity\n",
729 |     "print(CXRAnnotator.df_data['Report Text'].iloc[2770])\n",
730 |     "print(\"\\n\")\n",
731 |     "print(proba_labels.iloc[2770])\n",
732 |     "print(\"\\n\")\n",
733 |     "print(binary_labels.iloc[2770])"
734 |    ]
735 |   }
736 |  ],
737 |  "metadata": {
738 |   "kernelspec": {
739 |    "display_name": "Python 3",
740 |    "language": "python",
741 |    "name": "python3"
742 |   },
743 |   "language_info": {
744 |    "codemirror_mode": {
745 |     "name": "ipython",
746 |     "version": 3
747 |    },
748 |    "file_extension": ".py",
749 |    "mimetype": "text/x-python",
750 |    "name": "python",
751 |    "nbconvert_exporter": "python",
752 |    "pygments_lexer": "ipython3",
753 |    "version": "3.6.2"
754 |   }
755 |  },
756 |  "nbformat": 4,
757 |  "nbformat_minor": 2
758 | }
759 | 
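
The notebook's description of n-gram features (`N_GRAM_SIZES`) and the corpus frequency filter (`N_THRESH_CORPUS`) can be illustrated with a minimal, standard-library sketch. It is not the code in `RadReportAnnotator.py`: the helper names `make_ngrams` and `filter_by_corpus_count` are hypothetical, and the sketch assumes that n-grams do not span a `sentenceend` marker, that counts are total occurrences across the corpus, and that an n-gram is kept when its count is at least the threshold.

```python
from collections import Counter
from typing import List, Sequence

SENTENCE_END = "sentenceend"  # sentence marker seen in the demo's preprocessed output


def make_ngrams(tokens: Sequence[str], sizes: Sequence[int]) -> List[str]:
    """Return underscore-joined n-grams (hypothetical helper, not the library's)."""
    # Split the token stream into sentences at the marker (assumption:
    # n-grams are built within a sentence, never across a marker).
    sentences: List[List[str]] = []
    current: List[str] = []
    for tok in tokens:
        if tok == SENTENCE_END:
            if current:
                sentences.append(current)
            current = []
        else:
            current.append(tok)
    if current:
        sentences.append(current)

    ngrams: List[str] = []
    for n in sorted(sizes, reverse=True):  # longest n-grams first
        for sent in sentences:
            for i in range(len(sent) - n + 1):
                ngrams.append("_".join(sent[i:i + n]))
    return ngrams


def filter_by_corpus_count(reports: Sequence[Sequence[str]],
                           min_count: int) -> List[List[str]]:
    """Drop n-grams whose total corpus count is below min_count (assumed rule)."""
    counts = Counter(gram for report in reports for gram in report)
    return [[g for g in report if counts[g] >= min_count] for report in reports]


if __name__ == "__main__":
    demo = ["no", "pneumothorax", SENTENCE_END, "lung", "are", "clear"]
    print(make_ngrams(demo, [1, 2]))
```

With the tokens `['no', 'pneumothorax', 'sentenceend', 'lung', 'are', 'clear']` and sizes `[1, 2]`, the sketch yields the 2-grams `no_pneumothorax`, `lung_are`, and `are_clear`, followed by the five unigrams.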


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # RadReportAnnotator
 2 | 
 3 | Authors: jrzech, eko
 4 | 
 5 | Provides a library of methods for automatically inferring labels for a corpus of radiological reports given a set of manually-labeled data. These methods are described in our publication [Natural Language–based Machine Learning Models for the Annotation of Clinical Radiology Reports](https://doi.org/10.1148/radiol.2018171093).
 6 | 
 7 | ## Getting Started:
 8 | 
 9 | To configure your own local instance (assumes [Anaconda is installed](https://www.anaconda.com/download/)):
10 | 
11 | ```
12 | git clone https://www.github.com/aisinai/rad-report-annotator.git
13 | cd rad-report-annotator
14 | conda env create -f environment.yml
15 | source activate rad_env
16 | python -m ipykernel install --user --name rad_env --display-name "Python (rad_env)"
17 | ```
18 | 
19 | *Note as of Oct 11, 2022: this conda environment builds on Linux and Windows, but not on Mac as older versions of gensim for Mac are not available in conda-forge.* 
20 | 
21 | To see a demo of the library on data from the [Indiana University Chest X-ray Dataset (Demner-Fushman et al.)](https://www.ncbi.nlm.nih.gov/pubmed/26133894), please open `Demo Notebook.ipynb` and run all cells.
22 | 
23 | 
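
The demo notebook suggests considering the positive predictive value (PPV) of inferred labels before using them downstream. As a rough worked example, the held-out counts shown in its accuracy table can be converted to PPV and sensitivity with plain Python. The `Counts` tuple and the `ppv` and `sensitivity` helpers are illustrative names, not part of the library.

```python
from typing import Dict, NamedTuple


class Counts(NamedTuple):
    """Confusion-matrix counts for one label (illustrative container)."""
    tp: int  # True +
    fp: int  # False +
    tn: int  # True -
    fn: int  # False -


# Held-out counts copied from the accuracy table printed in the demo notebook.
HELD_OUT: Dict[str, Counts] = {
    "Normal": Counts(tp=208, fp=53, tn=324, fn=15),
    "Opacity": Counts(tp=62, fp=17, tn=517, fn=4),
    "Cardiomegaly": Counts(tp=41, fp=18, tn=541, fn=0),
    "Nodule": Counts(tp=16, fp=36, tn=548, fn=0),
}


def ppv(c: Counts) -> float:
    """Positive predictive value: TP / (TP + FP); NaN if nothing was predicted positive."""
    predicted_pos = c.tp + c.fp
    return c.tp / predicted_pos if predicted_pos else float("nan")


def sensitivity(c: Counts) -> float:
    """Sensitivity (recall): TP / (TP + FN); NaN if there are no true positives at all."""
    actual_pos = c.tp + c.fn
    return c.tp / actual_pos if actual_pos else float("nan")


if __name__ == "__main__":
    for label, counts in HELD_OUT.items():
        print(f"{label:<13} PPV={ppv(counts):.3f}  sensitivity={sensitivity(counts):.3f}")
```

For `Nodule` this gives a PPV of 16 / (16 + 36) ≈ 0.31 despite an AUC of about 0.99, which is the situation the notebook warns about for rare labels.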


--------------------------------------------------------------------------------
/RadReportAnnotator.py:
--------------------------------------------------------------------------------
   1 | """
   2 | RadReportAnnotator
   3 | Authors: jrzech, eko
   4 | 
   5 | This is a library of methods for automatically inferring labels for a corpus of radiological documents given a set of manually-labeled data.
   6 | 
   7 | """
   8 | 
   9 | #usual imports for data science
  10 | import numpy as np
  11 | import pandas as pd
  12 | import sys
  13 | import os
  14 | import math
  15 | from tqdm import tqdm
  16 | 
  17 | #sklearn
  18 | from sklearn.model_selection import StratifiedKFold
  19 | from sklearn.model_selection import GridSearchCV
  20 | from sklearn.linear_model import LogisticRegression
  21 | from sklearn.metrics import roc_auc_score
  22 | 
  23 | #NLP imports
  24 | from nltk.tokenize import RegexpTokenizer
  25 | from nltk.stem.porter import PorterStemmer
  26 | import re
  27 | 
  28 | #gensim for word embedding featurization
  29 | import gensim
  30 | from collections import namedtuple
  31 | 
  32 | #misc
  33 | import glob
  34 | import os.path
  35 | import multiprocessing
  36 | import random
  37 | 
  38 | 
  39 | def join_montage_files(data_dir,NAME_UNID_REPORTS, NAME_TEXT_REPORTS):
  40 | 	"""
  41 |  	Joins several montage files in excel format into a single pandas dataframe
  42 |  	Args:
  43 |   		data_dir: a filepath pointing to a directory containing montage files in excel format
  44 |   		NAME_UNID_REPORTS: column name of unique id / accession id in reports xlsx
  45 |   		NAME_TEXT_REPORTS: column name of report text in reports xlsx
  46 |  	Returns:
  47 |   		df_data: a pandas dataframe containing texts from montage
  48 |  	"""
  49 | 	print("building pre-corpus")
  50 | 	datafiles = os.listdir(data_dir)
  51 | 	#read each excel file, then stack them with pd.concat
  52 | 	#(DataFrame.append is deprecated and was removed in pandas 2.0)
  53 | 	subcorpora = [pd.read_excel(os.path.join(data_dir,subcorpus)) for subcorpus in datafiles]
  54 | 	df_data = pd.concat(subcorpora)
  55 | 	print('pre-corpus built')
  56 | 
  57 | 	df_data.rename(columns={NAME_UNID_REPORTS:'Accession Number',NAME_TEXT_REPORTS:'Report Text'},inplace=True)
  58 | 	return df_data
  59 | 
  60 | def preprocess_data(df_data, en_stop, stem_words=True):
  61 | 	"""
  62 | 	Takes a dataframe of montage files and a list of stop words and returns a list of processed reports, where
  63 | 	each processed report is a list of (optionally stemmed) unigrams.
  64 | 	Args:
  65 | 		df_data: a dataframe of joined montage files.
  66 | 		en_stop: a list of english stop words from the stop_words library
  67 | 		stem_words: whether or not to stem words
  68 | 	Returns:
  69 | 		processed_reports: a list of lists of (optionally stemmed) words, one list per report in the montage dataframe
  70 | 	"""
  71 | 	if(stem_words==False):
  72 | 		print("NOTE - NOT STEMMING")
  73 | 	p_stemmer = PorterStemmer()
  74 | 	processed_reports = []
  75 | 	accession_index=[]
  76 | 
  77 | 	print("preprocessing reports")
  78 | 	for i in tqdm(range(0,df_data.shape[0])):
  79 | 
  80 | 		tokenizer = RegexpTokenizer(r'\w+')
  81 | 		process = df_data['Report Text'].iloc[i]
  82 | 
  83 | 		process = str(process)
  84 | 		process = process + "..." # add a period, sometimes it's missing at end
  85 | 		process = process.lower()
  86 | 
  87 | 		z = len(process)
  88 | 		k = 0
  89 | 		#remove line breaks
  90 | 		process=process.replace("^M", " ") # 
  91 | 		process=process.replace("\n", " ") # 
  92 | 		process=process.replace("\r", " ") #   
  93 | 		process=process.replace("_", " ") #   
  94 | 		process=process.replace("-", " ") #   
  95 | 		process=process.replace(",", " , ") # 
  96 | 		process=process.replace("  ", " ") # 
  97 | 		process=process.replace("  ", " ") # 
  98 | 		process=process.replace("  ", " ") # 
  99 | 		process=process.replace("  ", " ") # 
 100 | 		process=process.replace("  ", " ") # 
 101 | 
 102 | 		process = re.sub(r'\d+', '',process)
 103 | 		process=process.replace(".", " SENTENCEEND ")  # create end characters
 104 | 
 105 | 		process_tokenized = tokenizer.tokenize(process)
 106 | 		process_stopped = [i for i in process_tokenized if not i in en_stop]
 107 |         
 108 | 		if(stem_words==True):        
 109 | 			process_stemmed = [p_stemmer.stem(i) for i in process_stopped]
 110 | 		else:
 111 | 			process_stemmed = process_stopped
 112 | 
 113 | 		processed_reports.append(process_stemmed)
 114 | 		#include n grams in lda_input
 115 | 	return processed_reports
 116 | 
 117 | def remove_infrequent_tokens(processed_reports,freq_threshold,labeled_indices):
 118 | 	"""
 119 | 	Takes a list of processed_reports and removes infrequent tokens (defined as occurring < freq_threshold times) from them.
 120 | 	Args:
 121 | 		processed_reports: list of lists of stemmed words after initial processing, where each entry corresponds to a report
 122 | 		freq_threshold: count threshold; remove words occurring < freq_threshold times in the corpus. note - considers only unlabeled corpus, not labeled corpus, to avoid peeking into labeled data. 
 123 | 		labeled_indices: indices of processed_reports that are labeled reports - these are excluded from frequency calculations.
 124 | 	Returns:
 125 | 		process_reports_postcountfilter: list of lists of stemmed words after initial processing, where each entry corresponds to a report, after low frequency words have been removed
 126 | 	"""
 127 | 	word_count = common_stems(processed_reports, labeled_indices)
 128 | 	d = dict((k,v) for k, v in word_count.items() if v >= freq_threshold)
 129 | 	process_reports_postcountfilter=[[] for x in range(0,len(processed_reports))]
 130 | 	for i in range(0,len(processed_reports)):
 131 | 		for token in processed_reports[i]:
 132 | 			if token in d: 
 133 | 				process_reports_postcountfilter[i].append(token)
 134 | 	return process_reports_postcountfilter
 135 | 
 136 | def create_ngrams(processed_reports, labeled_indices, N_GRAM_SIZES, freq_threshold):
 137 | 	"""
 138 | 	Takes a processed_reports list, a list of n-gram sizes, and a frequency threshold. Creates n-grams of the requested sizes,
 139 | 	skipping n-grams that contain the end-of-sentence token, then removes n-grams that occur < freq_threshold times.
 140 | 	Args:
 141 | 		processed_reports: a list of text lists of stemmed unigrams ready for conversion into n-gram text lists
 142 | 		labeled_indices: exclude these from calculation of n-gram cutoff
 143 | 		N_GRAM_SIZES: a list of ints specifying the n-gram sizes to include in the texts of the future corpus
 144 | 		freq_threshold: the frequency threshold for n-gram inclusions. N-grams that occur with frequency < threshold will be removed from corpus
 145 | 	Returns:
 146 | 		processed_outputs_clean: a list of text lists of n-grams that are ready to be processed into a corpus	
 147 | 	"""
 148 | 	processed_outputs = []
 149 | 	print("creating n-grams")
 150 | 	for report in tqdm(processed_reports[:]):
 151 | 		new_report = []
 152 | 		end=len(report)
 153 | 		#CREATES 4-grams - for all n-grams, we don't allow "no" to be in middle of n-gram, don't allow sentenceend token to be in n-gram
 154 | 		if 4 in N_GRAM_SIZES:
 155 | 			for i in range (0,end-3):
 156 | 				if (report[i+1] != "no" and report[i+2] != "no" and report[i+3] != "no" and report[i].lower()!= "sentenceend" and report[i+1].lower()!= "sentenceend" and report[i+2].lower()!= "sentenceend" and report[i+3].lower()!= "sentenceend"): #no only at beginning
 157 | 					new_report.append(report[i] +"_" +report[i+1] + "_" + report[i+2] + "_" + report[i+3]) 
 158 | 		#CREATES 3-grams
 159 | 		if 3 in N_GRAM_SIZES:
 160 | 			for i in range (0,end-2):
 161 | 				if (report[i+1] != "no" and report[i+2] != "no" and report[i].lower()!= "sentenceend" and report[i+1].lower()!= "sentenceend" and report[i+2].lower()!= "sentenceend"): #no only at beginning
 162 | 					new_report.append(report[i] +"_" +report[i+1] + "_" + report[i+2])
 163 | 		#CREATES 2-grams
 164 | 		if 2 in N_GRAM_SIZES:
 165 | 			for i in range (0,end-1):
 166 | 				if (report[i+1] != "no" and report[i].lower()!= "sentenceend" and report[i+1].lower()!= "sentenceend"): #no only at beginning
 167 | 					new_report.append(report[i] +"_" +report[i+1])
 168 | 		#CREATES unigrams
 169 | 		if 1 in N_GRAM_SIZES:
 170 | 			for i in range (0,end):
 171 | 				if(report[i].lower()!= "sentenceend" and report[i]!= "no"): # we take no out as a unigram in bow
 172 | 					new_report.append(report[i])
 173 | 		processed_outputs.append(new_report)
 174 | 
 175 | 	#remove low freq tokens
 176 | 	word_count = common_stems(processed_outputs, labeled_indices)
 177 | 	print("number of unique n-grams:", len(word_count)) 
 178 | 	d = dict((k,v) for k, v in word_count.items() if v >= freq_threshold)
 179 | 	print("number of unique n-grams after filtering out low frequency tokens:", len(d))
 180 | 
 181 | 	#remove tokens that occurred infrequently from processed_outputs --> processed_outputs_clean
 182 | 	processed_outputs_clean=[[] for x in range(0,len(processed_outputs))]
 183 | 	for i in range(0,len(processed_outputs)):
 184 | 		for token in processed_outputs[i]:
 185 | 			if token in d: 
 186 | 				processed_outputs_clean[i].append(token)
 187 | 	return processed_outputs_clean
 188 | 
 189 | def get_labeled_indices(df_data,validation_file,TRAIN_INDEX_OVERRIDE):
 190 | 	"""
 191 | 	Returns a boolean mask over reports in df_data marking those for which we have labeled data in validation_file; will set labeled reports as unlabeled if in TRAIN_INDEX_OVERRIDE
 192 | 	Args:
 193 | 		df_data: dataframe containing report text and accession ids
 194 | 		validation_file: path to excel file containing accession ids and labels
 195 | 		TRAIN_INDEX_OVERRIDE: list of accession ids to treat as unlabeled (entries are compared against 'Accession Number'); necessary to train d2v model if all your data is labeled as it uses exclusively unlabeled data to train to avoid peeking into labeled data
 196 | 	Returns:
 197 | 		return_indices: list of booleans, one per row of df_data; True where the report is treated as labeled
 198 | 	"""
 199 | 	validation = pd.read_excel(validation_file)
 200 | 	validation.set_index('Accession Number')
 201 | 	validation_cases=validation['Accession Number'].tolist()
 202 | 	all_indices = df_data['Accession Number'].tolist()
 203 | 	return_indices=[]
 204 | 	for i in all_indices:
 205 | 		if i in validation_cases and i not in TRAIN_INDEX_OVERRIDE: # if something is manually overridden to be in train, don't put it in test
 206 | 			return_indices.append(True)
 207 | 		else:
 208 | 			return_indices.append(False)
 209 | 	return return_indices
 210 | 
 211 | def common_stems(ngram_list, exclude_indices):
 212 | 	"""
 213 | 	Takes a list of reports, each a list of n-grams, and returns a dict of ngram:ngram_count pairs,
 214 | 	skipping reports flagged in exclude_indices.
 215 | 	Args:
 216 | 		ngram_list: list of reports, where each report is a list of n-grams
 217 | 		exclude_indices: boolean list; rows marked True are ignored when counting (labeled data)
 218 | 	Returns:
 219 | 		word_count: dict of ngram:ngram_count pairs
 220 | 	"""
 221 | 	word_count={}
 222 | 	i=0
 223 | 	excluded=0	
 224 | 	for entry in ngram_list:
 225 | 		if exclude_indices[i]==False:
 226 | 			for word in entry:
 227 | 				if word not in word_count:
 228 | 					#add word with entry 1
 229 | 					word_count[word] = 1
 230 | 				else:
 231 | 					#increment entry by 1
 232 | 					word_count[word]=word_count[word]+1
 233 | 		else:
 234 | 			excluded=excluded+1  
 235 | 		i=i+1
 236 | 	d = dict((k,v) for k, v in word_count.items())
 237 | 
 238 | 	return word_count
 239 | 
 240 | 
 241 | def build_train_test_corpus(df_data, ngram_list, labeled_filepath,TRAIN_INDEX_OVERRIDE):
 242 | 	"""
 243 | 	Takes the master corpus, the ngram_list, and a filepath pointing to a labeled spreadsheet and builds
 244 | 	a labeled_corpus consisting of labelled data and an unlabeled_corpus of non-labelled data
 245 | 	Args:
 246 | 		df_data: a dataframe consisting of the original set of excel files with report text and accession id
 247 | 		ngram_list: list of all n-grams in corpus
 248 | 		labeled_filepath: path to file containing accession ids and labels
 249 | 		TRAIN_INDEX_OVERRIDE: accession ids to treat as unlabeled data regardless of presence of labels.
 250 | 	Returns:
 251 | 		corpus: bag-of-words corpus covering all reports (labelled and unlabelled)
 252 | 		unlabeled_corpus: a corpus consisting of unlabelled texts that will be used for model construction
 253 | 		labeled_corpus: a corpus consisting of labelled held-out texts that will be used for model validation
 254 | 		dictionary: a dictionary comprised of the LDA input n-grams
 255 | 		labeled_indices: the positional indices in df_data of the labelled reports
 256 | 	"""
 257 | 	dictionary = gensim.corpora.Dictionary(ngram_list)
 258 | 	corpus = [dictionary.doc2bow(input) for input in ngram_list]
 259 | 	if(not labeled_filepath is None):
 260 | 		outcomes = pd.read_excel(labeled_filepath)
 261 | 		outcomes.set_index('Accession Number')
 262 | 		labeled_cases=outcomes['Accession Number'].tolist()
 263 | 	else:
 264 | 		labeled_cases=[]
 265 | 	labeled_indices = []
 266 | 	not_labeled_indices = []
 267 | 	train_data_lda = np.ones(df_data.shape[0],dtype=bool)
 268 | 	num_removed=0
 269 | 	for i in range(0,df_data.shape[0]):
 270 | 		if df_data['Accession Number'].iloc[i] in labeled_cases and df_data['Accession Number'].iloc[i] not in TRAIN_INDEX_OVERRIDE:
 271 | 			train_data_lda[i]=False
 272 | 			labeled_indices.append(i)
 273 | 			num_removed += 1
 274 | 		else:
 275 | 			not_labeled_indices.append(i)
 276 | 	unlabeled_corpus = [corpus[i] for i in not_labeled_indices]
 277 | 	labeled_corpus = [corpus[i] for i in labeled_indices]
 278 | 
 279 | 	return corpus, unlabeled_corpus, labeled_corpus, dictionary, labeled_indices
 280 | 
 281 | 
 282 | def build_d2v_corpora(df_data,d2v_inputs,labeled_indices):
 283 | 	"""
 284 | 	Build corpora in format for doc2vec gensim implementation
 285 | 	Args:
 286 | 		df_data: a dataframe consisting of the original set of excel files with report text and accession id
 287 | 		d2v_inputs: list of lists of tokens, where each entry in d2v_inputs corresponds to a report
 288 | 		labeled_indices: indices of labeled reports (and those we treat as labeled due to TRAIN_INDEX_OVERRIDE)
 289 | 	Returns:
 290 | 		unlabeled_corpus: a corpus consisting of unlabelled texts that will be used for feature construction
 291 | 		labeled_corpus: a corpus consisting of labelled held-out texts that will be used for Lasso regression training
 292 | 		total_unlabeled_words: count of total words in unlabeled corpus
 293 | 	"""
 294 | 
 295 | 	SentimentDocument = namedtuple('SentimentDocument', 'words tags')
 296 | 	unlabeled_docs = [] 
 297 | 	labeled_docs = []  
 298 | 	total_unlabeled_words=0
 299 | 	i=0
 300 | 	for line in d2v_inputs:
 301 | 		words = line # [x for x in line if x != 'END']
 302 | 		tags = '' + str(df_data['Accession Number'].iloc[i])
 303 | 		if(i in labeled_indices):
 304 | 			labeled_docs.append(SentimentDocument(words,tags))
 305 | 		else:
 306 | 			unlabeled_docs.append(SentimentDocument(words,tags))
 307 | 			total_unlabeled_words+=len(words)
 308 | 		i+=1
 309 | 
 310 | 	print('%d unlabeled reports for featurization, %d labeled reports for modeling' % (len(unlabeled_docs), len(labeled_docs)))
 311 | 	return unlabeled_docs, labeled_docs, total_unlabeled_words
 312 | 
 313 | 
 314 | def train_d2v(unlabeled_docs, labeled_docs, D2V_EPOCH, DIM_DOC2VEC, W2V_DM, W2V_WINDOW, total_unlabeled_words):
 315 | 	"""
 316 | 	Train doc2vec/word2vec model.
 317 | 
 318 | 	Args:
 319 | 		unlabeled_docs: unlabeled corpus
 320 | 		labeled_docs: labeled corpus
 321 | 		D2V_EPOCH: number of epochs to train d2v model; 20 has worked well in our experiments; parameter for gensim doc2vec
 322 | 		DIM_DOC2VEC: dimensionality of embedding vectors, we explored values 50-800; parameter for gensim doc2vec
 323 | 		W2V_DM: 1 is PV-DM, otherwise PV-DBOW; parameter for gensim doc2vec
 324 | 		W2V_WINDOW: window size, in words, to use in doc2vec model; parameter for gensim doc2vec
 325 | 		total_unlabeled_words: total words in unlabeled corpus; argument for gensim doc2vec
 326 | 
 327 | 	Returns:
 328 | 		d2vmodel: trained doc2vec model.
 329 | 	"""
 330 | 
 331 | 	cores = multiprocessing.cpu_count()
 332 | 	assert gensim.models.doc2vec.FAST_VERSION > -1, "speed up"
 333 | 	print("started doc2vec training")
 334 | 	d2vmodel = gensim.models.Doc2Vec(dm=W2V_DM, size=DIM_DOC2VEC, window=W2V_WINDOW, negative=5, hs=0, min_count=2, workers=cores)
 335 | 	d2vmodel.build_vocab(unlabeled_docs + labeled_docs)  
 336 | 	d2vmodel.train(unlabeled_docs, total_words=total_unlabeled_words, epochs=D2V_EPOCH)
 337 | 	print("finished doc2vec training")
 338 | 	return d2vmodel
 339 | 
 340 | def calc_auc(predictor_matrix,eligible_outcomes_aligned, all_outcomes_aligned,N_LABELS, pred_type, header,ASSIGNFOLD_USING_ROW=False):
 341 | 	"""
 342 | 	Train Lasso models using 60% of labeled data with generated features and labels; calculate AUC and
 343 | 	confusion matrix counts for each label on remaining 40% of labeled data.
 344 | 
 345 | 	Args:
 346 | 		
 347 | 		predictor_matrix: numpy matrix of features available to use as input to Lasso logistic regression
 348 | 		eligible_outcomes_aligned: dataframe of labels we are predicting
 349 | 		all_outcomes_aligned: dataframe of all labels, including those we excluded due to infrequent positive/negative occurrences - we use it for accession id
 350 | 		N_LABELS: total number of labels we are predicting
 351 | 		pred_type: label indicating what variables went into predictor_matrix
 352 | 		header: header for predictor matrix
 353 | 		ASSIGNFOLD_USING_ROW: normally 60/40 split done randomly, you can fix it to use first 60% of rows if you need replicability 
 354 | 							   but be wary of introducing distortion into train/test set with dates, etc.: recommend randomly sorting
 355 | 							   rows in excel beforehand if you opt for this.
 356 | 
 357 | 	Returns:
 358 | 
 359 | 		lasso_models: dict of trained lasso logistic regression models from sklearn, keyed by the relative index of the label in columns of eligible_outcomes_aligned
 360 | 		confusion: dataframe indexed by label with AUC and true/false positive/negative counts computed on the held out 40%
 361 | 	"""
 362 | 
 363 | 	if predictor_matrix.shape[1]!=len(header):
 364 | 		print("predictor_matrix.shape[1]="+str(predictor_matrix.shape[1]))
 365 | 		print("len(header)"+str(len(header)))
 366 | 		raise ValueError("predictor_matrix shape doesn't match header, investigate")
 367 | 	all_coef = pd.concat([ pd.DataFrame(header)], axis = 1)	
 368 | 	
 369 | 	lasso_models={}
 370 | 	model_types = ["Lasso"]
 371 | 
 372 | 	r = list(range(eligible_outcomes_aligned.shape[0]))
 373 | 	random.shuffle(r)
 374 | 	
 375 | 	if(ASSIGNFOLD_USING_ROW):
 376 | 		assignfold = pd.DataFrame(data=list(range(eligible_outcomes_aligned.shape[0])), columns=['train'])
 377 | 	else:
 378 | 		assignfold = pd.DataFrame(data=r, columns=['train'])
 379 | 
 380 | 	cutoff = np.floor(0.6*eligible_outcomes_aligned.shape[0])
 381 | 	
 382 | 	train=assignfold['train']<cutoff
 383 | 	test=assignfold['train']>=cutoff
 384 | 	
 385 | 	N_TRAIN=eligible_outcomes_aligned.loc[train,:].shape[0]
 386 | 	N_HELDOUT=eligible_outcomes_aligned.loc[test,:].shape[0]
 387 | 	print("n_train in modeling="+str(N_TRAIN))
 388 | 	print("n_test in modeling="+str(N_HELDOUT))
 389 | 	
 390 | 	confusion = pd.DataFrame(data=np.zeros(shape=(eligible_outcomes_aligned.shape[1]*len(model_types),6),dtype=int),columns=['Label (with calcs on held out 40 pct)','AUC','True +','False +','True -','False -'])
 391 | 
 392 | 	resultrow=0
 393 | 	for i in range(0,N_LABELS):
 394 | 		PROCEED=True;
 395 | 		#need to make sure we don't have an invalid setting -- ie, a train[x] set of labels that is uniform, else Lasso regression fails
 396 | 		if(len(set(eligible_outcomes_aligned.iloc[train.values,i].tolist())))==1:
 397 | 			PROCEED=False;
 398 | 			raise ValueError ("fed label to lasso regression with no variation - cannot compute - please investigate data")
 399 |  
 400 | 		if(PROCEED):
 401 | 			
 402 | 			for model_type in model_types:
 403 | 				if(model_type=="Lasso"):
 404 | 					parameters = { "penalty": ['l1'], 
 405 | 								   "C": [64,32,16,8,4,2,1,0.5,0.25,0.1,0.05,0.025,0.01,0.005]
 406 | 								 }
 407 | 					try:
 408 | 						cv = StratifiedKFold(n_splits=5)
 409 | 						grid_search = GridSearchCV(LogisticRegression(solver='liblinear'), param_grid=parameters, scoring='neg_log_loss', cv=cv)
 410 | 						grid_search.fit(predictor_matrix[train,:],np.array(eligible_outcomes_aligned.iloc[train.values,i]))				
 411 | 						best_parameters0 = grid_search.best_estimator_.get_params()
 412 | 						model0 = LogisticRegression(**best_parameters0)					
 413 | 					except:
 414 | 						raise ValueError ("error in lasso regression - likely data issue, may involve rare labels - please investigate data")                        
 415 | 				model0.fit(predictor_matrix[np.array(train),:],eligible_outcomes_aligned.iloc[train.values,i])
 416 | 				pred0=model0.predict_proba(predictor_matrix[np.array(test),:])[:,1]
 417 | 				coef = pd.concat([ pd.DataFrame(header),pd.DataFrame(np.transpose(model0.coef_))], axis = 1)	
 418 | 				df0 = pd.DataFrame({'predict':pred0,'target':eligible_outcomes_aligned.iloc[test.values,i], 'label':all_outcomes_aligned['Accession Number'][test]})
 419 | 							  
 420 | 				calc_auc=roc_auc_score(np.array(df0['target']),np.array(df0['predict']))
 421 | 				if(i%10==0):
 422 | 					print("i="+str(i))
 423 | 				save_name=str(list(eligible_outcomes_aligned.columns.values)[i])
 424 | 
 425 | 				target_predicted=''.join(e for e in save_name if e.isalnum())
 426 | 
 427 | 				#confusion: outcome TP TN FP FN
 428 | 				thresh = np.mean(df0['target'])
 429 | 				FP=0
 430 | 				FN=0
 431 | 				TP=0
 432 | 				TN=0
 433 | 				for j in df0.index:
 434 | 					cpred=df0.loc[j,'predict'] #look up by column name; positional lookup depends on dict ordering
 435 | 					ctarget = df0.loc[j,'target']
 436 | 
 437 | 					if cpred>=thresh and ctarget==1:
 438 | 						TP+=1
 439 | 					if cpred<thresh and ctarget==1:
 440 | 						FN+=1
 441 | 					if cpred>=thresh and ctarget==0:
 442 | 						FP+=1
 443 | 					if cpred<thresh and ctarget==0:
 444 | 						TN+=1
 445 | 						
 446 | 				#save results		
 447 | 				confusion.iloc[resultrow,0]=list(eligible_outcomes_aligned.columns.values)[i]
 448 | 				confusion.iloc[resultrow,1]=calc_auc
 449 | 				confusion.iloc[resultrow,2]=TP
 450 | 				confusion.iloc[resultrow,3]=FP
 451 | 				confusion.iloc[resultrow,4]=TN
 452 | 				confusion.iloc[resultrow,5]=FN
 453 | 
 454 | 				#let's rebuild model using all data before we save it to use for prediction;
 455 | 				try:
 456 | 					model0 = LogisticRegression(**best_parameters0)	
 457 | 					model0.fit(predictor_matrix,eligible_outcomes_aligned.iloc[:,i])                
 458 | 					lasso_models[i]=model0
 459 | 				except:
 460 | 					raise ValueError ("error in lasso regression - likely data issue, may involve rare labels - please investigate data")                        
 461 | 
 462 | 				resultrow+=1
 463 | 		
 464 | 	confusion.set_index(confusion.columns[0],inplace=True)
 465 | 	return lasso_models, confusion
 466 | 
 467 | 
 468 | def generate_labeled_data_features(labeled_file,
 469 | 					   labeled_indices,
 470 | 					   DIM_DOC2VEC,
 471 | 					   df_data,
 472 | 					   processed_reports,
 473 | 					   DO_PARAGRAPH_VECTOR,
 474 | 					   DO_WORD2VEC,
 475 | 					   dictionary,
 476 | 					   corpus,
 477 | 					   d2vmodel,
 478 | 					   d2v_inputs):
 479 | 	"""
 480 | 	Generate numerical features to be used in Lasso logistic regressions using text data for labeled reports.
 481 | 	Note: output reorganizes indices in order to align 
 482 | 
 483 | 	Args:
 484 | 
 485 | 		labeled_file: path to file with labels and accession ids
 486 | 		labeled_indices: indices of labeled data (or data we treat as labeled because of TRAIN_INDEX_OVERRIDE)
 487 | 		DIM_DOC2VEC: embedding dimensionality of doc2vec
 488 | 		df_data: dataframe containing original reports and accession ids
 489 | 		processed_reports: list of list of words, each entry in original list corresponding to a report
 490 | 		DO_PARAGRAPH_VECTOR: use paragraph vector features?
 491 | 		DO_WORD2VEC: use average word embedding features? 
 492 | 		dictionary: a dictionary comprised of the LDA input n-grams
 493 | 		corpus: corpus with both unlabeled and labeled data, list of lists
 494 | 		d2vmodel: trained doc2vec model object 
 495 | 		d2v_inputs: reports processed into d2v input format
 496 | 
 497 | 	Returns:
 498 | 
 499 | 		bow_matrix: numpy matrix with indicator bow features (1 if word present, 0 else), each row corresponds to a report
 500 | 		pv_matrix: numpy matrix with paragraph vector embedding features, each row corresponds to a report
 501 | 		w2v_matrix: numpy matrix with average word embedding features, each row corresponds to a report
 502 | 		accid_list: give index into original corpus of each case; for quick debugging and spot-checking of cases
 503 | 		orig_text: original text; for quick debugging and spot-checking of cases
 504 | 		orig_input: original processed_report; for quick debugging and spot-checking of cases
 505 | 
 506 | 	"""
 507 | 
 508 | 	# #generate weight and feature matrix for held out labeled data
 509 | 	outcomes = pd.read_excel(labeled_file)
 510 | 	outcomes.set_index('Accession Number')
 511 | 	bow_matrix = np.zeros(shape=(len(labeled_indices),len(dictionary)),dtype=int)
 512 | 
 513 | 	pv_matrix = np.zeros(shape=(len(labeled_indices),DIM_DOC2VEC),dtype=np.float64)
 514 | 	w2v_matrix = np.zeros(shape=(len(labeled_indices),DIM_DOC2VEC),dtype=np.float64)
 515 | 	accid_list = []
 516 | 	orig_text=[]
 517 | 	orig_input=[]
 518 | 
 519 | 	j=0
 520 | 	print("generating features")
 521 | 	for i in tqdm(labeled_indices): 
 522 | 		if df_data['Accession Number'].iloc[i] not in list(outcomes['Accession Number']):
 523 | 			raise Exception(" df_data i @ " + str(i) +" = " +str(df_data['Accession Number'].iloc[i])+ " not in set of held out cases, examine" )
 524 | 		accid_list.append(df_data['Accession Number'].iloc[i])
 525 | 		orig_text.append(df_data['Report Text'].iloc[i])
 526 | 		orig_input.append(processed_reports[i])
 527 | 
 528 | 		#fill feature columns - if ngram shows up in the document, mark it as 1, else leave as 0
 529 | 		for k in range(0,len(corpus[i])):
 530 | 			bow_matrix[j][corpus[i][k][0]]=1	
 531 | 		
 532 | 		if(DO_PARAGRAPH_VECTOR):
 533 | 			vect = d2vmodel.infer_vector(d2v_inputs[i],alpha=0.01, steps=50)
 534 | 	
 535 | 			for k in range(0,len(vect)):
 536 | 				pv_matrix[j,k]=vect[k]
 537 | 		
 538 | 		if(DO_WORD2VEC):
 539 | 		
 540 | 			#we want to use vectors based on word average:
 541 | 			temp_avg =np.zeros(shape=(DIM_DOC2VEC),dtype=np.float64)
 542 | 			m_avg=0
 543 | 			real_words=0
 544 | 			for k in range(0,len(d2v_inputs[i])):
 545 | 				
 546 | 				#ignore special end character, otherwise proceed
 547 | 				if(d2v_inputs[i][k].lower()!="sentenceend"):
 548 | 					real_words+=1
 549 | 					try:
 550 | 						#if the word is not in the model vocabulary, skip it
 551 | 						weight_avg = 1					
 552 | 						temp_avg = np.add(temp_avg,weight_avg*d2vmodel[d2v_inputs[i][k]])					
 553 | 						m_avg +=weight_avg
 554 | 					except:
 555 | 						pass # do nothing
 556 | 			
 557 | 			if(m_avg>0): temp_avg = np.divide(temp_avg,m_avg) #average over words found; if none were found, leave vector as zero
 558 | 
 559 | 			for k in range(0,DIM_DOC2VEC):
 560 | 				w2v_matrix[j,k]=temp_avg[k]		
 561 | 		
 562 | 		j+=1
 563 | 	return bow_matrix, pv_matrix,w2v_matrix,accid_list,orig_text,orig_input
 564 | 
 565 | def generate_wholeset_features(DIM_DOC2VEC,
 566 | 					  processed_reports,
 567 | 					   DO_PARAGRAPH_VECTOR,
 568 | 					   DO_WORD2VEC,
 569 | 					   dictionary,
 570 | 					   corpus,
 571 | 					   d2vmodel,
 572 | 					   d2v_inputs):
 573 | 	"""
 574 | 	Generate numerical features to be used in Lasso logistic regressions using text data for all reports (labeled and unlabeled)
 575 | 
 576 | 	Args:
 577 | 
 578 | 		DIM_DOC2VEC: embedding dimensionality of doc2vec
 579 | 		processed_reports: list of list of words, each entry in original list corresponding to a report
 580 | 		DO_PARAGRAPH_VECTOR: use paragraph vector features?
 581 | 		DO_WORD2VEC: use average word embedding features? 
 582 | 		dictionary: a dictionary comprised of the LDA input n-grams
 583 | 		corpus: corpus with both unlabeled and labeled data, list of lists
 584 | 		d2vmodel: trained doc2vec model object 
 585 | 		d2v_inputs: reports processed into d2v input format
 586 | 
 587 | 	Returns:
 588 | 
 589 | 		bow_matrix: numpy matrix with indicator bow features (1 if word present, 0 else), each row corresponds to a report
 590 | 		pv_matrix: numpy matrix with paragraph vector embedding features, each row corresponds to a report
 591 | 		w2v_matrix: numpy matrix with average word embedding features, each row corresponds to a report
 592 | 	"""
 593 | 	
 594 | 	bow_matrix = np.zeros(shape=(len(corpus),len(dictionary)),dtype=int)
 595 | 	pv_matrix = np.zeros(shape=(len(corpus),DIM_DOC2VEC),dtype=np.float64)
 596 | 	w2v_matrix = np.zeros(shape=(len(corpus),DIM_DOC2VEC),dtype=np.float64)
 597 | 
 598 | 	j=0
 599 | 	for i in tqdm(range(0,len(corpus))): 
 600 | 
 601 | 		#fill feature columns - if ngram shows up in the document, mark it as 1, else leave as 0
 602 | 		for k in range(0,len(corpus[i])):
 603 | 			bow_matrix[j][corpus[i][k][0]]=1	
 604 | 		
 605 | 		if(DO_PARAGRAPH_VECTOR):
 606 | 			vect = d2vmodel.infer_vector(d2v_inputs[i],alpha=0.01, steps=50)
 607 | 	
 608 | 			for k in range(0,len(vect)):
 609 | 				pv_matrix[j,k]=vect[k]
 610 | 		
 611 | 		if(DO_WORD2VEC):
 612 | 		
 613 | 			#we want to use vectors based on word average:
 614 | 			temp_avg =np.zeros(shape=(DIM_DOC2VEC),dtype=np.float64)
 615 | 			m_avg=0
 616 | 			real_words=0
 617 | 			for k in range(0,len(d2v_inputs[i])):
 618 | 				
 619 | 				#ignore special end character, otherwise proceed
 620 | 				if(d2v_inputs[i][k].lower()!="sentenceend"):
 621 | 					real_words+=1
 622 | 					try:
 623 | 						#if the word is not in the model vocabulary, skip it
 624 | 						weight_avg = 1					
 625 | 						temp_avg = np.add(temp_avg,weight_avg*d2vmodel[d2v_inputs[i][k]])					
 626 | 						m_avg +=weight_avg
 627 | 					except:
 628 | 						pass # do nothing
 629 | 			
 630 | 			if(m_avg>0): temp_avg = np.divide(temp_avg,m_avg) #average over words found; if none were found, leave vector as zero
 631 | 
 632 | 			for k in range(0,DIM_DOC2VEC):
 633 | 				w2v_matrix[j,k]=temp_avg[k]		
 634 | 		
 635 | 		j+=1
 636 | 	return bow_matrix,pv_matrix,w2v_matrix
 637 | 
 638 | 
 639 | def generate_outcomes(labeled_file,accid_list,N_THRESH_OUTCOMES):
 640 | 	"""
 641 | 	Generate dataframe of labels to be used in Lasso logistic regressions
 642 | 
 643 | 	Args:
 644 | 		labeled_file: path to file with labels and accession ids
 645 | 		accid_list: list of accession ids of each row in the labeled data that are also present in exported reports; 
 646 | 				 	needed to eliminate labeled reports for which we have no text (mistranscribed accession IDs, etc.)
 647 | 		N_THRESH_OUTCOMES: eliminate outcomes that don't have this many positive / negative examples
 648 | 
 649 | 	Returns:
 650 | 
 651 | 		eligible_outcomes_aligned: dataframe of labels eligible for prediction
 652 | 		all_outcomes_aligned: dataframe of all labels
 653 | 		N_LABELS: total number of labels we predict
 654 | 		outcome_header_list: list of headers corresponding to each label
 655 | 	"""
 656 | 
 657 | 	outcomes = pd.read_excel(labeled_file)
 658 | 	outcomes.set_index('Accession Number')
 659 | 	outcomes_aligned2 = pd.DataFrame(data=accid_list, index=accid_list, columns=['Accession Number'])
 660 | 	all_outcomes_aligned = pd.merge(outcomes_aligned2, outcomes, sort=False)
 661 | 
 662 | 	#modify outcome matrix to only include outcomes with n_thresh_outcomes +/- observations
 663 | 
 664 | 	outcome_remove=[]
 665 | 	N_LABELS=all_outcomes_aligned.shape[1]
 666 | 	print("total labels:"+str(N_LABELS))
 667 | 	for i in range(0,N_LABELS):
 668 | 		check=sum(all_outcomes_aligned.iloc[:,i])
 669 | 
 670 | 		if(check<N_THRESH_OUTCOMES):
 671 | 			outcome_remove.append(i)
 672 | 		elif(check>((all_outcomes_aligned.shape)[0]-N_THRESH_OUTCOMES)):
 673 | 			outcome_remove.append(i)
 674 | 		elif(math.isnan(check)):
 675 | 			outcome_remove.append(i)
 676 | 
 677 | 	eligible_outcomes_aligned=all_outcomes_aligned.drop(all_outcomes_aligned.columns[outcome_remove],axis=1)
 678 | 
 679 | 	N_LABELS=eligible_outcomes_aligned.shape[1]
 680 | 	print("labels eligible for inference:"+str(N_LABELS))
 681 | 
 682 | 	outcome_header_list=list(eligible_outcomes_aligned)
 683 | 	outcome_header_list=[x.replace(",",".") for x in outcome_header_list]
 684 | 	outcome_header_list=",".join(outcome_header_list)
 685 | 	
 686 | 	return eligible_outcomes_aligned,all_outcomes_aligned, N_LABELS, outcome_header_list
 687 | 
 688 | 
 689 | def write_silver_standard_labels(corpus,
 690 | 								N_LABELS,
 691 | 								eligible_outcomes_aligned,
 692 | 								DIM_DOC2VEC,
 693 | 								processed_reports,
 694 | 								DO_BOW,
 695 | 								DO_PARAGRAPH_VECTOR,
 696 | 								DO_WORD2VEC,
 697 | 								dictionary,
 698 | 								d2vmodel,
 699 | 								d2v_inputs, 
 700 | 								lasso_models,
 701 | 								accid_list, 
 702 | 								labeled_indices,
 703 | 								df_data, 
 704 | 								SILVER_THRESHOLD):
 705 | 	"""
 706 | 	Generate inferred labels using trained Lasso regression models; override with hand-labeled data when available.
 707 | 
 708 | 	Args:
 709 | 
 710 | 		corpus: list of lists of tokens, each entry in original list corresponds to report
 711 | 		N_LABELS: total labels we predict
 712 | 		eligible_outcomes_aligned: dataframe of eligible labels for prediction
 713 | 		DIM_DOC2VEC: embedding dimensionality of average word embedding features
 714 | 		processed_reports: corpus of processed reports
 715 | 		DO_BOW: include bag of words features?
 716 | 		DO_PARAGRAPH_VECTOR: include paragraph vector features?
 717 | 		DO_WORD2VEC: include average word embedding features?
 718 | 		dictionary: dictionary mapping word to integer representation
 719 | 		d2vmodel: trained doc2vec model
 720 | 		d2v_inputs: reports processed into doc2vec format
 721 | 		lasso_models: list of saved Lasso logistic regression models; each index corresponds to a column in eligible_outcomes_aligned
 722 | 		accid_list: list of accession ids of each row in the labeled data that are also present in exported reports
 723 | 		labeled_indices: indices of labeled data (or data we treat as labeled because of TRAIN_INDEX_OVERRIDE)
 724 | 		df_data: dataframe containing original reports and accession ids
 725 | 		SILVER_THRESHOLD: "mean" or "fiftypct", defines threshold for converting probabilities to binary labels (mean of label vs. 50%).  
 726 | 		                  note that in either case it will be overridden with true labels when available
 727 |  
 728 | 	Returns:
 729 | 
 730 | 		pred_outcome_binary_df, pred_outcome_proba_df: dataframes indexed by accession id containing inferred binary labels and inferred probabilities
 731 | 
 732 | 	"""
 733 | 		
 734 | 	pred_outcome_matrix_binary = np.zeros(shape=(len(corpus),N_LABELS),dtype=int)
 735 | 	pred_outcome_matrix_proba = np.zeros(shape=(len(corpus),N_LABELS),dtype=np.float16)
 736 | 
 737 | 	#we classify as true/false based on mean of predictor - note dependence on self.SILVER_THRESHOLD
 738 | 	if(SILVER_THRESHOLD=="mean"):
 739 | 		class_thresh = eligible_outcomes_aligned.mean(axis=0)
 740 | 	elif(SILVER_THRESHOLD=="fiftypct"):
 741 | 		class_thresh = [0.5]*eligible_outcomes_aligned.shape[1]
 742 | 
 743 | 	for x in range(0,len(corpus),2000):
 744 | 		#generate features for whole dataset so we can return inferred labels for deep learning on images themselves
 745 | 		whole_bow_matrix,whole_pv_matrix,whole_w2v_matrix=generate_wholeset_features(
 746 | 					  DIM_DOC2VEC,
 747 | 					  processed_reports[x:x+2000],
 748 | 					   DO_PARAGRAPH_VECTOR,
 749 | 					   DO_WORD2VEC,dictionary,corpus[x:x+2000],d2vmodel,d2v_inputs[x:x+2000])
 750 | 		
 751 | 		#use everything available for prediction - done in chunks to avoid memory issues
 752 | 		#whole_combined_matrix=np.hstack((whole_w2v_matrix,whole_bow_matrix,whole_pv_matrix))
 753 | 		if(DO_BOW and DO_WORD2VEC and DO_PARAGRAPH_VECTOR): whole_combined_matrix=np.hstack((whole_bow_matrix,whole_w2v_matrix,whole_pv_matrix))
 754 | 
 755 | 		if(DO_BOW and DO_WORD2VEC and not DO_PARAGRAPH_VECTOR): whole_combined_matrix=np.hstack((whole_bow_matrix,whole_w2v_matrix))
 756 | 		if(DO_BOW and not DO_WORD2VEC and DO_PARAGRAPH_VECTOR): whole_combined_matrix=np.hstack((whole_bow_matrix,whole_pv_matrix))
 757 | 		if(not DO_BOW and DO_WORD2VEC and DO_PARAGRAPH_VECTOR): whole_combined_matrix=np.hstack((whole_w2v_matrix,whole_pv_matrix))
 758 | 
 759 | 		if(DO_BOW and not DO_WORD2VEC and not DO_PARAGRAPH_VECTOR): whole_combined_matrix=whole_bow_matrix
 760 | 		if(not DO_BOW and DO_WORD2VEC and not DO_PARAGRAPH_VECTOR): whole_combined_matrix=whole_w2v_matrix
 761 | 		if(not DO_BOW and not DO_WORD2VEC and DO_PARAGRAPH_VECTOR): whole_combined_matrix=whole_pv_matrix
 762 | 
 763 | 		for i in range(0,N_LABELS):
 764 | 			pred_proba=lasso_models[i].predict_proba(whole_combined_matrix)[:,1]
 765 | 			pred_binary = (pred_proba > class_thresh[i]).astype(int)
 766 | 			pred_outcome_matrix_proba[x:x+2000,i]=pred_proba
 767 | 			pred_outcome_matrix_binary[x:x+2000,i]=pred_binary
 768 | 
 769 | 	#generate list of accession #s for export
 770 | 	accession_list = []
 771 | 	for i in range(0,len(corpus)):
 772 | 		accession_list.append(df_data['Accession Number'].iloc[i])
 773 | 
 774 | 	pred_outcome_proba_df = pd.DataFrame(pred_outcome_matrix_proba, index = accession_list, columns = list(eligible_outcomes_aligned.columns.values) )
 775 | 	pred_outcome_binary_df = pd.DataFrame(pred_outcome_matrix_binary, index = accession_list, columns = list(eligible_outcomes_aligned.columns.values) )
 776 | 		
 777 | 	#get accuracy by column
 778 | 
 779 | 	outcome_lookup ={}
 780 | 	for i in range(0,len(accid_list)):
 781 | 		outcome_lookup[accid_list[i]]=i
 782 | 
 783 | 	errors = np.zeros(shape=(N_LABELS,1),dtype=int)
 784 | 	denom = np.zeros(shape=(N_LABELS,1),dtype=int)
 785 | 	tp = np.zeros(shape=(N_LABELS,1),dtype=int)
 786 | 	fp = np.zeros(shape=(N_LABELS,1),dtype=int)
 787 | 	tn = np.zeros(shape=(N_LABELS,1),dtype=int)
 788 | 	fn = np.zeros(shape=(N_LABELS,1),dtype=int)
 789 | 
 790 | 	for i in range(0,len(corpus)):
 791 | 		if i in labeled_indices: # need to evaluate
 792 | 			#grab accession #
 793 | 			accno = df_data['Accession Number'].iloc[i]			
 794 | 			
 795 | 			for k in range(0,N_LABELS):
 796 | 
 797 | 				#does our predicted value match the true one? if not, record discrepancy  
 798 | 				if(eligible_outcomes_aligned.iloc[outcome_lookup[accno],k]!=pred_outcome_binary_df.iloc[i,k]):
 799 | 					errors[k]+=1
 800 | 				denom[k]+=1
 801 | 
 802 | 				#set probabilistic predictions to labeled ones regardless
 803 | 				pred_outcome_proba_df.iloc[i,k]=eligible_outcomes_aligned.iloc[outcome_lookup[accno],k]
 804 | 
 805 | 				#if disagreement btw pred and hand-labeled data, use hand labeled
 806 | 				if(eligible_outcomes_aligned.iloc[outcome_lookup[accno],k]!=pred_outcome_binary_df.iloc[i,k]):
 807 | 					pred_outcome_binary_df.iloc[i,k]=eligible_outcomes_aligned.iloc[outcome_lookup[accno],k]
 808 | 
 809 | 	#print('classifier accuracy by label on all labeled data including that used to train it (process integrity check)')
 810 | 	#print(str(1-(errors/denom)))
 811 | 	pred_outcome_binary_df.set_index(df_data['Accession Number'],inplace=True)
 812 | 	pred_outcome_proba_df.set_index(df_data['Accession Number'],inplace=True)
 813 | 	return pred_outcome_binary_df,pred_outcome_proba_df
 814 | 		
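The thresholding and override behavior of `write_silver_standard_labels` above can be sketched for a single label in plain Python. This is an illustration under stated assumptions, not the module's code: `silver_labels`, `pred_proba`, and `hand_labels` are hypothetical names, and the hand labels are passed as a dictionary from report position to 0/1 rather than as a dataframe. The `"mean"` setting thresholds at the prevalence of the label in the hand-labeled data, `"fiftypct"` thresholds at 0.5, and in both cases a hand label replaces the prediction wherever one exists.

```python
def silver_labels(pred_proba, hand_labels, silver_threshold="mean"):
    """Convert predicted probabilities for one label into binary labels.

    pred_proba:       list of predicted probabilities, one per report.
    hand_labels:      dict mapping report position -> 0/1 for labeled reports
                      (hypothetical stand-in for eligible_outcomes_aligned).
    silver_threshold: "mean" or "fiftypct".
    """
    if silver_threshold == "mean":
        # prevalence of the label among hand-labeled reports
        thresh = sum(hand_labels.values()) / len(hand_labels)
    elif silver_threshold == "fiftypct":
        thresh = 0.5
    else:
        raise ValueError("SILVER_THRESHOLD must be 'mean' or 'fiftypct'")

    binary = [int(p > thresh) for p in pred_proba]

    # hand labels always win over model predictions
    for position, label in hand_labels.items():
        binary[position] = label
    return binary


pred_proba = [0.2, 0.1, 0.4, 0.05, 0.30, 0.60]
hand_labels = {0: 1, 1: 0, 2: 0, 3: 0}  # prevalence 0.25; reports 4 and 5 are unlabeled

print(silver_labels(pred_proba, hand_labels, "mean"))      # [1, 0, 0, 0, 1, 1]
print(silver_labels(pred_proba, hand_labels, "fiftypct"))  # [1, 0, 0, 0, 0, 1]
```

The two settings differ only on report 4, whose predicted probability of 0.30 is above the label prevalence but below 0.5. Raising an error on an unrecognized setting is a choice made in this sketch.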
 815 | 		
 816 | def give_stop_words():
 817 | 	"""
 818 | 	Returns list of stop words.
 819 | 
 820 | 	Arguments:
 821 | 
 822 | 		None
 823 | 
 824 | 	Returns:
 825 | 	
 826 | 		stop_words: a list of stop words. note - we have removed stop words from this example; you can add them below if you have a list of stop words for your application.
 827 | 	"""
 828 | 	#stop_words=["word1", "word2", ...]    
 829 | 	stop_words=[]
 830 |     
 831 | 
 832 | 	return stop_words
 833 | 
 834 | 
 835 | class RadReportAnnotator(object):
 836 | 	
 837 | 	def __init__(self, report_dir_path, validation_file_path):
 838 | 		"""
 839 | 		Initialize RadReportAnnotator class 
 840 | 
 841 | 		Args: 
 842 | 
 843 | 			report_dir_path: FOLDER where reports are located in montage xls. Expects columns titled "Accession Number" and "Report Text"; can specify alternate labels in define_config()
 844 | 			validation_file_path: FILE with human-labeled reports file. Expects column titled "Accession Number" as first column, every subsequent column will be interpreted as a label to be predicted.
 845 | 
 846 | 		Returns:
 847 | 
 848 | 			Nothing
 849 | 
 850 | 		"""
 851 | 
 852 | 		#USER MODIFIABLE SETTINGS - USE define_config() TO SET
 853 | 		self.DO_BOW=None
 854 | 		self.DO_WORD2VEC=None
 855 | 		self.DO_PARAGRAPH_VECTOR=None
 856 | 		self.DO_SILVER_STANDARD=None   
 857 | 		self.STEM_WORDS=None
 858 | 		self.N_GRAM_SIZES = None
 859 | 		self.DIM_DOC2VEC = None
 860 | 		self.N_THRESH_CORPUS=None
 861 | 		self.N_THRESH_OUTCOMES=None
 862 | 		self.TRAIN_INDEX_OVERRIDE = None
 863 | 		self.SILVER_THRESHOLD=None
 864 | 		self.NAME_UNID_LABELED_FILE = None
 865 | 		self.NAME_UNID_REPORTS= None
 866 | 		self.NAME_TEXT_REPORTS= None		
 867 | 
 868 | 
 869 | 		#SETTINGS YOU WILL LIKELY WISH TO LEAVE AS IS, BUT CAN MODIFY IF NEEDED
 870 | 		self.D2V_EPOCH = 20 # 20 works well, # of epochs to train D2V for
 871 | 		self.W2V_DM = 1 # 1 is PV-DM, otherwise PV-DBOW
 872 | 		self.W2V_WINDOW = 5 #we can try 3,5,7
 873 | 		self.data_dir = report_dir_path # base directory for raw reports
 874 | 		self.validation_file =  validation_file_path # file containing report annotations
 875 | 		self.ASSIGNFOLD_USING_ROW=False # normally in lasso regression modeling 60% train / 40% test splits are done randomly. you can do them by row if you need consistency across runs
 876 | 
 877 | 
 878 | 		#MENTIONING CLASS OBJECTS USED INTERNALLY LATER
 879 | 		self.df_data=None
 880 | 		self.processed_reports=None
 881 | 		self.labeled_indices=None
 882 | 		self.d2v_inputs=None
 883 | 		self.ngram_reports =None 
 884 | 		self.corpus = None 
 885 | 		self.train_corpus =  None  
 886 | 		self.test_corpus  = None 
 887 | 		self.dictionary = None 
 888 | 		self.labeled_indices = None
 889 | 		self.train_docs = None # w2v
 890 | 		self.test_docs = None
 891 | 		self.d2vmodel = None
 892 | 		self.bow_matrix = None
 893 | 		self.combined = None
 894 | 		self.pv_matrix = None
 895 | 		self.w2v_matrix = None
 896 | 		self.accid_list = None
 897 | 		self.orig_text = None
 898 | 		self.orig_input = None
 899 | 		self.eligible_outcomes_aligned = None
 900 | 		self.all_outcomes_aligned = None
 901 | 		self.N_LABELS = None
 902 | 		self.outcome_header_list = None
 903 | 		self.lasso_models = None
 904 | 		self.inferred_binary_labels = None
 905 | 		self.inferred_proba_labels = None
 906 | 		self.headers = None
 907 | 		self.accuracy = None
 908 | 
 909 | 
 910 | 	def define_config(self, DO_BOW=True, DO_WORD2VEC=False, DO_PARAGRAPH_VECTOR=False,DO_SILVER_STANDARD=True,STEM_WORDS=True,N_GRAM_SIZES=[1],DIM_DOC2VEC=200,N_THRESH_CORPUS=1,N_THRESH_OUTCOMES=1,TRAIN_INDEX_OVERRIDE=[], SILVER_THRESHOLD="mean", NAME_UNID_REPORTS="Accession Number",NAME_TEXT_REPORTS="Report Text"):
 911 | 		"""
 912 | 		Sets parameters for RadReportAnnotator.
 913 | 
 914 | 		Args:
 915 | 
 916 | 			DO_BOW: True to use indicator bag of words-based features (1 if word present in doc, 0 if not). 
 917 | 			DO_WORD2VEC: True to use word2vec-based average word embedding features. 
 918 | 			DO_PARAGRAPH_VECTOR: True to use word2vec-based paragraph vector embedding features. 
 919 | 			DO_SILVER_STANDARD: True to infer labels for unlabeled reports.
 920 | 			STEM_WORDS: True to stem words for BOW analysis; words are unstemmed in doc2vec analysis
 921 | 			N_GRAM_SIZES: Which set of n-grams to use in BOW analysis: [1] = 1 grams only, [3] = 3 grams only, [1,2,3] = 1, 2, and 3- grams.
 922 | 			DIM_DOC2VEC: Dimensionality of doc2vec manifold; recommend value in 50 to 400
 923 | 			N_THRESH_CORPUS: ignore any n-grams that appear fewer than N times in the entire corpus
 924 | 			N_THRESH_OUTCOMES: do not train models for labels that don't have at least this many positive and negative examples. 
 925 | 			TRAIN_INDEX_OVERRIDE: list of accession numbers we force to be treated as unlabeled data even though they are labeled (ie, these will *not* be used in Lasso regressions). May be used if all of your reports are labeled, as some unlabeled reports are required for d2v training.
 926 | 			SILVER_THRESHOLD: how to threshold probability predictions in infer_labels to get binary labels. 
 927 | 			                  can be ["mean","fiftypct"]
 928 | 			                  mean sets any predicted probability greater than population mean to 1, else 0; e.g., prediction 0.10 in a label with average 0.05 is set to 1
 929 | 			                  fiftypct sets any predicted probability >50% to 1, otherwise 0
 930 | 			                  both settings have issues, and class imbalance is a major issue in training convolutional nets.
 931 | 			                  we recommend using probabilities if your model can accommodate it. 
 932 | 			NAME_UNID_REPORTS: column name of accession number / unique report id in the read-in *reports* file. provided for convenience as there may be many report files.
 933 | 			NAME_TEXT_REPORTS: column name of report text in the read-in reports file. provided for convenience as there may be many report files.
 934 | 		Returns:
 935 | 
 936 | 			Nothing
 937 | 
 938 | 		"""
 939 | 
 940 | 		self.DO_BOW=DO_BOW #generate results for bag of words approach?
 941 | 		self.DO_WORD2VEC=DO_WORD2VEC #generate results (tfidf and avg weight) for word2vec approach?
 942 | 		self.DO_PARAGRAPH_VECTOR=DO_PARAGRAPH_VECTOR #generate results for paragraph vector approach?
 943 | 		self.DO_SILVER_STANDARD=DO_SILVER_STANDARD	#generate silver standard labels?
 944 | 		self.STEM_WORDS=STEM_WORDS #should we stem words for BOW, LDA analysis? (we never stem words for doc2vec/w2v analysis, see below)
 945 | 		if not N_GRAM_SIZES in ([1],[2],[3],[1,2],[1,3],[1,2,3]):
 946 | 			raise ValueError('Invalid N_GRAM_SIZES argument:'+str(N_GRAM_SIZES)+", please review documentation for proper format (e.g., [1])")
 947 | 		self.N_GRAM_SIZES = N_GRAM_SIZES  # how many n-grams to use in BOW, LDA analyses? [1] = 1 grams only, [3] = 3 grams only, [1,2,3] = 1, 2, and 3- grams.
 948 | 		self.DIM_DOC2VEC = DIM_DOC2VEC #dimensionality of doc2vec manifold
 949 | 		self.N_THRESH_CORPUS=N_THRESH_CORPUS # delete any n-grams that appear fewer than N times in the entire corpus
 950 | 		self.N_THRESH_OUTCOMES=N_THRESH_OUTCOMES # delete any predictors that don't have at least N-many positive and negative examples
 951 | 		self.TRAIN_INDEX_OVERRIDE = TRAIN_INDEX_OVERRIDE # define a list of indices you want to force to be included as unlabeled data even though they are labeled (i.e., these will *not* be used in Lasso regressions). Some unlabeled reports are required for d2v training.
 952 | 		self.SILVER_THRESHOLD=SILVER_THRESHOLD
 953 | 		self.NAME_UNID_REPORTS = NAME_UNID_REPORTS  
 954 | 		self.NAME_TEXT_REPORTS = NAME_TEXT_REPORTS
 955 | 
 956 | 		if(self.DO_BOW==False and self.DO_WORD2VEC==False and self.DO_PARAGRAPH_VECTOR==False): raise ValueError("DO_BOW, DO_WORD2VEC, and DO_PARAGRAPH_VECTOR cannot all be False")
 957 | 
 958 | 	def build_corpus(self):
 959 | 		"""
 960 | 		Builds corpus of reports and generates numerical features from reports for later analysis.
 961 | 		Please run define_config() beforehand.
 962 | 
 963 | 		Arguments:
 964 | 
 965 | 			None
 966 | 
 967 | 		Returns:
 968 | 
 969 | 			None
 970 | 		"""
 971 | 
 972 | 		#assemble dataframe of reports
 973 | 		self.df_data = join_montage_files(self.data_dir, self.NAME_UNID_REPORTS, self.NAME_TEXT_REPORTS) # build dataframe with all the report text
 974 | 
 975 | 		#get list of stop words
 976 | 		en_stop = give_stop_words()
 977 | 
 978 | 		# preprocess report text, get list with length (# reports) and text after first round of processing. 
 979 | 		# if curious to see how it works, look at processed_reports[0] to see first report.
 980 | 		self.processed_reports = preprocess_data(self.df_data, en_stop, stem_words=self.STEM_WORDS) 
 981 | 
 982 | 		#determine which report indices are treated as labeled (honoring TRAIN_INDEX_OVERRIDE)
 983 | 		self.labeled_indices = get_labeled_indices(self.df_data,self.validation_file,self.TRAIN_INDEX_OVERRIDE)
 984 | 
 985 | 		#build n-grams of desired size, takes a list of sizes and frequency threshold as inputs
 986 | 		self.ngram_reports = create_ngrams(
 987 | 		self.processed_reports,
 988 | 		self.labeled_indices,
 989 | 		N_GRAM_SIZES=self.N_GRAM_SIZES,
 990 | 		freq_threshold=self.N_THRESH_CORPUS) #now we create n-grams
 991 | 
 992 | 		# generate inputs for doc2vec/word2vec model 
 993 | 		# can see example report - d2v_inputs[0]		
 994 | 		self.d2v_inputs= remove_infrequent_tokens(self.processed_reports,self.N_THRESH_CORPUS,self.labeled_indices) 
 995 | 
 996 | 		#assemble train/test corpora and a word dict. 
 997 | 		self.corpus, self.train_corpus, self.test_corpus, self.dictionary, self.labeled_indices = build_train_test_corpus(
 998 | 			self.df_data,
 999 | 			self.ngram_reports,
1000 | 			self.validation_file,
1001 | 			self.TRAIN_INDEX_OVERRIDE)
1002 | 
1003 | 		#train doc2vec/word2vec if indicated:
1004 | 		if(self.DO_WORD2VEC or self.DO_PARAGRAPH_VECTOR):
1005 | 			self.train_docs, self.test_docs, self.total_train_words = build_d2v_corpora(self.df_data,self.d2v_inputs,self.labeled_indices)
1006 | 			self.d2vmodel=train_d2v(self.train_docs, self.test_docs, self.D2V_EPOCH, self.DIM_DOC2VEC, self.W2V_DM, self.W2V_WINDOW, self.total_train_words)
1007 | 
1008 | 	def infer_labels(self):
1009 | 		"""
1010 | 		Infers labels for unlabeled documents.
1011 | 		Please run build_corpus() beforehand.
1012 | 
1013 | 		Arguments:
1014 | 
1015 | 			None
1016 | 
1017 | 		Returns:
1018 | 
1019 | 			self.inferred_binary_labels, self.inferred_proba_labels: dataframes containing inferred binary labels and inferred probabilities
1020 | 		"""
1021 | 
1022 | 		#get the numerical features of text we need to train models for labels
1023 | 		self.bow_matrix, self.pv_matrix,self.w2v_matrix,self.accid_list,self.orig_text,self.orig_input=generate_labeled_data_features(
1024 | 							   self.validation_file,
1025 | 							   self.labeled_indices,
1026 | 							   self.DIM_DOC2VEC,
1027 | 							   self.df_data,
1028 | 							   self.processed_reports,
1029 | 							   self.DO_PARAGRAPH_VECTOR,
1030 | 							   self.DO_WORD2VEC, 
1031 | 							   self.dictionary,
1032 | 							   self.corpus,
1033 | 							   self.d2vmodel,
1034 | 							   self.d2v_inputs)
1035 | 
1036 | 		#get and process labels for reports
1037 | 		self.eligible_outcomes_aligned,self.all_outcomes_aligned, self.N_LABELS, self.outcome_header_list = generate_outcomes(
1038 | 			self.validation_file,
1039 | 			self.accid_list,
1040 | 			self.N_THRESH_OUTCOMES)
1041 | 
1042 | 		#to generate silver standard labels -- use whatever features are generated (word2vec average word embeddings, bow features, paragraph vector matrix)
1043 | 		if(self.DO_BOW and self.DO_WORD2VEC and self.DO_PARAGRAPH_VECTOR): self.combined=np.hstack((self.bow_matrix,self.w2v_matrix,self.pv_matrix))
1044 | 
1045 | 		if(self.DO_BOW and self.DO_WORD2VEC and not self.DO_PARAGRAPH_VECTOR): self.combined=np.hstack((self.bow_matrix,self.w2v_matrix))
1046 | 		if(self.DO_BOW and not self.DO_WORD2VEC and self.DO_PARAGRAPH_VECTOR): self.combined=np.hstack((self.bow_matrix,self.pv_matrix))
1047 | 		if(not self.DO_BOW and self.DO_WORD2VEC and self.DO_PARAGRAPH_VECTOR): self.combined=np.hstack((self.w2v_matrix,self.pv_matrix))
1048 | 
1049 | 		if(self.DO_BOW and not self.DO_WORD2VEC and not self.DO_PARAGRAPH_VECTOR): self.combined=self.bow_matrix
1050 | 		if(not self.DO_BOW and self.DO_WORD2VEC and not self.DO_PARAGRAPH_VECTOR): self.combined=self.w2v_matrix
1051 | 		if(not self.DO_BOW and not self.DO_WORD2VEC and self.DO_PARAGRAPH_VECTOR): self.combined=self.pv_matrix		
1052 | 
1053 | 		#create header for combined predictor matrix so we can interpret coefficients
1054 | 		self.headers=[]
1055 | 		if(self.DO_BOW): 				self.headers=self.headers + [self.dictionary[i] for i in self.dictionary]
1056 | 		if(self.DO_WORD2VEC): 			self.headers=self.headers + ["W2V"+str(i) for i in range(0,self.DIM_DOC2VEC)]
1057 | 		if(self.DO_PARAGRAPH_VECTOR): 	self.headers=self.headers + ["PV"+str(i) for i in range(0,self.DIM_DOC2VEC)]
1058 | 
1059 | 
1060 | 		pred_type = "combined" # a label for results
1061 | 		print("dimensionality of predictor matrix:"+str(self.combined.shape))
1062 | 
1063 | 		#run lasso regressions
1064 | 		self.lasso_models, self.accuracy = calc_auc(self.combined,self.eligible_outcomes_aligned,self.all_outcomes_aligned,  self.N_LABELS, pred_type, self.headers,self.ASSIGNFOLD_USING_ROW)
1065 | 
1066 | 		#infer labels	
1067 | 		self.inferred_binary_labels, self.inferred_proba_labels = write_silver_standard_labels(self.corpus,
1068 | 			self.N_LABELS,
1069 | 			self.eligible_outcomes_aligned,
1070 | 			self.DIM_DOC2VEC,
1071 | 			self.processed_reports,
1072 | 			self.DO_BOW,
1073 | 			self.DO_PARAGRAPH_VECTOR,
1074 | 			self.DO_WORD2VEC,
1075 | 			self.dictionary,
1076 | 			self.d2vmodel,
1077 | 			self.d2v_inputs,
1078 | 			self.lasso_models,
1079 | 			self.accid_list, 
1080 | 			self.labeled_indices,
1081 | 			self.df_data,
1082 | 			self.SILVER_THRESHOLD)
1083 | 		return self.inferred_binary_labels, self.inferred_proba_labels
1084 | 	
1085 | 
1086 | 
1087 | 
1088 | 
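Both `infer_labels` and `write_silver_standard_labels` above choose the predictor matrix with seven separate `if` statements, one per combination of `DO_BOW`, `DO_WORD2VEC`, and `DO_PARAGRAPH_VECTOR`. A more compact way to express the same selection is to collect the enabled feature blocks in a fixed order and concatenate them column-wise. The sketch below shows the idea with plain Python lists so it runs without dependencies; `combine_features` and the toy matrices are hypothetical and not part of the module. With NumPy arrays, the equivalent step would be `np.hstack` over the list of enabled matrices.

```python
def combine_features(blocks):
    """Concatenate the enabled feature blocks column-wise.

    blocks: list of (enabled, rows) pairs in a fixed order (BOW, W2V, PV);
            each `rows` is a list of per-report feature lists, and all
            blocks cover the same reports in the same order.
    """
    enabled = [rows for flag, rows in blocks if flag]
    if not enabled:
        raise ValueError("at least one feature type must be enabled")
    # zip(*enabled) walks the blocks report by report; the feature lists
    # of each report are then joined end to end into a single row
    return [[value for part in report_parts for value in part]
            for report_parts in zip(*enabled)]


# toy feature blocks for two reports
bow = [[1, 0, 1], [0, 1, 0]]
w2v = [[0.2, 0.4], [0.1, 0.3]]
pv = [[0.9], [0.8]]

combined = combine_features([(True, bow), (False, w2v), (True, pv)])
print(combined)  # [[1, 0, 1, 0.9], [0, 1, 0, 0.8]]
```

Keeping the block order fixed (BOW, then W2V, then PV) matters because the column headers built in `infer_labels` follow that same order.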


--------------------------------------------------------------------------------
/environment.yml:
--------------------------------------------------------------------------------
 1 | name: rad_env
 2 | channels:
 3 |   - conda-forge
 4 | dependencies:
 5 |   - python=3.6.5
 6 |   - pandas=0.22.0
 7 |   - numpy=1.14.2
 8 |   - tqdm=4.23.0
 9 |   - scikit-learn=0.19.0
10 |   - xlrd=1.1.0
11 |   - jupyterlab=0.35.0
12 |   - nb_conda_kernels=2.1.1
13 |   - nltk=3.4.4
14 |   - gensim=3.5.0
15 | 


--------------------------------------------------------------------------------
/pseudodata/labels/labeled_reports.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aisinai/rad-report-annotator/61ee948866bb09d272fc75210c63dcb818b3c21d/pseudodata/labels/labeled_reports.xlsx


--------------------------------------------------------------------------------
/pseudodata/reports/words.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aisinai/rad-report-annotator/61ee948866bb09d272fc75210c63dcb818b3c21d/pseudodata/reports/words.xlsx


--------------------------------------------------------------------------------