├── .gitignore ├── README.MD ├── baseline ├── .ipynb_checkpoints │ └── mimic_icd9_baseline-checkpoint.ipynb ├── README.md ├── mimic_icd9_baseline.ipynb ├── nn_model.py └── paper_ranking_loss_scores.png ├── data └── readme.md ├── icd9_cnn ├── .ipynb_checkpoints │ ├── Untitled-checkpoint.ipynb │ ├── cnn_top20_leave-checkpoint.ipynb │ └── icd9_cnn_multilabel-checkpoint.ipynb ├── CNN_for_text2.png ├── cnn_model.py ├── cnn_model.pyc ├── cnn_top20_leave.ipynb ├── mimic_CNN_text_classification.png ├── tf_saved │ ├── cnn_trained.data-00000-of-00001 │ ├── cnn_trained.index │ └── cnn_trained.meta ├── utils.py ├── utils.pyc ├── vocabulary.py └── vocabulary.pyc ├── pipeline ├── .ipynb_checkpoints │ ├── Exploration-checkpoint.ipynb │ ├── Temp Guillaume-checkpoint.ipynb │ ├── Temp Guillaume2-checkpoint.ipynb │ ├── icd9_cnn_50K_run-checkpoint.ipynb │ ├── icd9_cnn_att_workbook-checkpoint.ipynb │ ├── icd9_hatt_workbook-checkpoint.ipynb │ └── icd9_lstm_cnn_workbook-checkpoint.ipynb ├── __pycache__ │ ├── database_selection.cpython-35.pyc │ ├── helpers.cpython-35.pyc │ └── vectorization.cpython-35.pyc ├── attention_util.py ├── database_selection.py ├── hatt_model.py ├── helpers.py ├── icd9_cnn_50K_run.ipynb ├── icd9_cnn_att.py ├── icd9_cnn_att_50K_records.ipynb ├── icd9_cnn_att_workbook.ipynb ├── icd9_cnn_model.py ├── icd9_hatt_workbook.ipynb ├── icd9_lstm_att_model.py ├── icd9_lstm_cnn.py ├── icd9_lstm_cnn_workbook.ipynb ├── lstm_model.py └── vectorization.py ├── pre_processing ├── MIMICERdiagram.png ├── README.md └── psql_files │ ├── create_discharge_notes_all_icd9 │ ├── query01_top_icd9_codes.sql │ ├── query02_filter_diagnoses_by_icd9_code.sql │ ├── query03C_icd9_codes_by_admission_create_table.sql │ ├── query03_icd9_codes_by_admission.sql │ ├── query04_filtering_discharge_summary_notes.sql │ ├── query05_discharge_notes_icd9_create_table.sql │ ├── query06_export_w266_table.sql │ ├── query_all_discharge_notes │ ├── query_icd9_codes │ └── top_icd9_codes.txt └── w266FinalReport_ICD_9_Classification.pdf /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | 7 | 8 | 9 | # IPython Notebook 10 | .ipynb_checkpoints 11 | 12 | # pyenv 13 | .python-version 14 | 15 | 16 | 17 | -------------------------------------------------------------------------------- /README.MD: -------------------------------------------------------------------------------- 1 | # Classifying medical notes into standard disease codes 2 | 3 | **August 2017** 4 | 5 | This repository contains the code I implemented to automatically classify EHR patient discharge notes into standard 6 | disease labels (ICD-9 codes). I implemented deep learning models (CNN, LSTM, and hierarchical models) using embeddings and 7 | attention layers. The CNN model with attention outperformed previous algorithms used for this task. 8 | The dataset used for modeling was the [MIMIC III dataset](https://mimic.physionet.org). 9 | 10 | The code was written in August 2017, during my graduate studies in the Master of Information and Data Science (MIDS) program at UC Berkeley.
The class was W266: Natural Language Processing with Deep Learning. 11 | 12 | This is the final project report: [w266FinalReport_ICD_9_Classification.pdf](w266FinalReport_ICD_9_Classification.pdf) 13 | 14 | (note: code refactoring pending) 15 | 16 | ## Preprocessing 17 | Getting information from the database (pulling data, filtering, and joining tables): [Pre-processing](pre_processing) 18 | 19 | ## Main Notebooks 20 | 21 | ### Classification into top-level codes in the ICD-9 hierarchy with 5K records 22 | | Model | ICD 9 code level| N. Records | Epochs | Notebook | 23 | | --- | --- | --- | --- | --- | 24 | | Baseline | First-Level|5K| -|[pipeline/icd9_lstm_cnn_workbook.ipynb](pipeline/icd9_lstm_cnn_workbook.ipynb)
Section: "Super Basic Baseline with top 4" Always predict top 4 icd-9 codes, F1-score= 52.6| 25 | | CNN Replication| First-Level | 5K| 20|[pipeline/icd9_lstm_cnn_workbook.ipynb](pipeline/icd9_lstm_cnn_workbook.ipynb)
Section: "CNN running with 20 epochs". CNN model to replicate results from paper: [Comparing Rule-Based and Deep Learning Models for Patient Phenotyping](https://arxiv.org/abs/1703.08705).In order to compare F1 performance results, I took into consideration the dataset size and number of classes. F1-score= 76.2| 26 | | CNN| Firs-Level| 5K | 5|[pipeline/icd9_lstm_cnn_workbook.ipynb](pipeline/icd9_lstm_cnn_workbook.ipynb)
Section: "CNN running with 5 epochs" running with the 17 first level ICD-9 codes, using 5 epochs and Embeddings. F1-score= 69.1| 27 | | LSTM | First-Level | 5K| 5|[pipeline/icd9_lstm_cnn_workbook.ipynb](pipeline/icd9_lstm_cnn_workbook.ipynb)
Section "Basic LSTM" running with the 17 first level ICD-9 codes, using 5 epochs and Embeddings. F1-score= 64.6 | 28 | 29 | **Attention** 30 | The average length of discharge clinical notes is 1639 words. The text to classify may be too long for a LSTM or CNN to 31 | remember all relevant information. [Raffel et al. (2016)](https://arxiv.org/abs/1512.08756) displayed better performance in many NLP tasks on long text using Attention. Here, we seek to emulate his results by implementing algorithms based on the formulas presented in [Raffel et al. (2016)](https://arxiv.org/abs/1512.08756) and [Yang et al. (2016)](https://www.cs.cmu.edu/~hovy/papers/16HLT-hierarchical-attention-networks.pdf). 32 | 33 | | Model | ICD 9 code level| N. Records | Epochs | Notebook | 34 | | --- | --- | --- | --- | --- | 35 | | LSTM with Attention| First-Level | 5K|5| [pipeline/icd9_lstm_cnn_workbook.ipynb](pipeline/icd9_lstm_cnn_workbook.ipynb)
Section: "LSTM with Attention"
F1-score: 67.0| 36 | | CNN with Attention| First-Level | 5K| 5|[pipeline/icd9_cnn_att_workbook.ipynb](pipeline/icd9_cnn_att_workbook.ipynb)
F1-score: 72.8| 37 | | Hierarchical LSTM Attention | First-Level| 5K|5| [pipeline/icd9_hatt_workbook.ipynb](pipeline/icd9_hatt_workbook.ipynb)
This model was implemented based on [Yang et al. (2016)](https://www.cs.cmu.edu/~hovy/papers/16HLT-hierarchical-attention-networks.pdf), which specifically targets document classification. It has two levels of attention: the first creates a vector representing each sentence by attending over its words, and the second creates a vector representing the document by attending over its sentences. F1-score: 67.6| 38 | 39 | 40 | ### Classification into the most common ICD-9 codes at the bottom of the ICD-9 hierarchy (leaves) 41 | | Model | ICD 9 code level| N. Records | Epochs | Notebook | 42 | | --- | --- | --- | --- | --- | 43 | | Baseline | First-Level |46K and 5K| -|[baseline/mimic_icd9_baseline.ipynb](baseline/mimic_icd9_baseline.ipynb)
- Some Initial Exploration with Python and SQL
- Basic Baseline Model: a fixed prediction of the top 4 most common ICD-9 codes for every note, evaluated on 46K records
- NN Baseline Model: a feed-forward neural network (not recurrent) with one hidden layer, using ReLU activation on the hidden layer and sigmoid activation on the output layer. It uses cross-entropy loss, which is the loss function for multi-label classification (implemented in TensorFlow), trained on 5K records. F1-score: 35 | 44 | | CNN for top 20 leaf ICD-9 codes | Leaf | 46K | 7 | [icd9_cnn/cnn_top20_leave.ipynb](icd9_cnn/cnn_top20_leave.ipynb)
Classifies clinical notes into the 20 most common ICD-9 codes that are at the bottom of the ICD-9 hierarchy (leaves); this run was for comparison with previous work. F1-score: 72.4 | 45 | 46 | ### Classification into top-level codes in the ICD-9 hierarchy with 52.6K records 47 | | Model | ICD 9 code level| N. Records | Epochs | Notebook | 48 | | --- | --- | --- | --- | --- | 49 | | CNN | First-Level | 52.6K | - | [pipeline/icd9_cnn_50K_run.ipynb](/pipeline/icd9_cnn_50K_run.ipynb)
F1-score: 79.7 | 50 | | CNN with Attention | First-Level | 52.6K | - | [pipeline/icd9_cnn_att_50K_records.ipynb](pipeline/icd9_cnn_att_50K_records.ipynb)
F1-score: 78.2. At this stage, the CNN ATT model still overfits: even though it had the highest score during the experimental runs with 5K records and 5 epochs, it didn't reach the best F1-score when running with the full data set. Further work would explore hyper-parameter tuning and evaluating the number of parameters to try to reduce the overfitting.| 51 | 52 | 53 | 54 | ## Model Python modules 55 | 56 | | Model | Python module | 57 | | --- | --- | 58 | | LSTM | [pipeline/lstm_model.py](pipeline/lstm_model.py) | 59 | | CNN | [pipeline/icd9_cnn_model.py](pipeline/icd9_cnn_model.py) | 60 | | Attention Layer |[pipeline/attention_util.py](pipeline/attention_util.py) | 61 | | LSTM_ATT | [pipeline/icd9_lstm_att_model.py](pipeline/icd9_lstm_att_model.py) | 62 | | CNN_ATT | [pipeline/icd9_cnn_att.py](pipeline/icd9_cnn_att.py) | 63 | | Hierarchical LSTM Attention | [pipeline/hatt_model.py](pipeline/hatt_model.py) | 64 | 65 | ## Helper classes for Preprocessing 66 | 67 | | Helper | Python module | 68 | | --- | --- | 69 | | Filters clinical notes to keep the ones that have been assigned at least one of the top N most common ICD-9 codes (this is a multi-label task), removing any code from the label that is not in the top N | [pipeline/database_selection.py](pipeline/database_selection.py) | 70 | | Three main methods: (1) splits the input file into training, validation and test sets, (2) replaces each leaf ICD-9 code with its grandparent in the first level, (3) calculates and displays F1 scores for a set of possible thresholds| [pipeline/helpers.py](pipeline/helpers.py) | 71 | | Functions necessary to vectorize the ICD labels and text inputs (I didn't implement this module; it is listed here because it is used by the notebooks I implemented)| [pipeline/vectorization.py](pipeline/vectorization.py) | 72 | -------------------------------------------------------------------------------- /baseline/README.md: -------------------------------------------------------------------------------- 1 | 2 | ## NN Baseline Model 3 | A neural network (not recurrent) with one hidden layer, with ReLU activation on the hidden layer and sigmoid activation on the output layer. It uses cross-entropy loss, which is the loss function for multi-label classification [4] 4 | 5 | 6 | ## Evaluation 7 | Ranking loss metric to evaluate performance [3] and F1 score [2] 8 | 9 | ## References 10 | [1] Diagnosis code assignment: models and evaluation metrics. Journal of the American Medical Informatics Association 11 | [2] Applying Deep Learning to ICD-9 Multi-label Classification from Medical Records 12 | [3] ICD-9 Coding of Discharge Summaries 13 | [4] Large-scale Multi-label Text Classification - Revisiting Neural Networks 14 | -------------------------------------------------------------------------------- /baseline/nn_model.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | 3 | 4 | def with_self_graph(function): 5 | def wrapper(self, *args, **kwargs): 6 | with self.graph.as_default(): 7 | return function(self, *args, **kwargs) 8 | return wrapper 9 | 10 | 11 | class NNLM(object): 12 | def __init__(self, graph=None, *args, **kwargs): 13 | # Set TensorFlow graph. All TF code will work on this graph. 14 | self.graph = graph or tf.Graph() 15 | self.SetParams(*args, **kwargs) 16 | 17 | 18 | 19 | @with_self_graph 20 | def SetParams(self, Hidden_dims, learning_rate, vocabulary_size, y_dim): 21 | # Model structure; these need to be fixed for a given model.
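        # Added comments on the constructor arguments (inferred from how they are used below in this module):
        #   Hidden_dims: list of hidden-layer sizes consumed by fully_connected_layers
        #   learning_rate: step size for the gradient-descent optimizer built in BuildTrainGraph
        #   vocabulary_size: width of the bag-of-words input placeholder (stored as self.V)
        #   y_dim: size of the multi-label output, i.e. the number of ICD-9 codes being predicted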
22 | self.Hidden_dims = Hidden_dims 23 | self.learning_rate = learning_rate 24 | self.V =vocabulary_size 25 | self.y_dim = y_dim 26 | 27 | @with_self_graph 28 | def affine_layer(self, hidden_dim, x, seed=0): 29 | self.W = tf.get_variable("W", initializer=tf.contrib.layers.xavier_initializer(seed = seed), \ 30 | trainable=True,shape=[x.shape[1],hidden_dim]) 31 | self.b = tf.get_variable("b", initializer=tf.zeros_initializer(), \ 32 | trainable=True,shape=[hidden_dim]) 33 | return tf.matmul(x,self.W) + self.b 34 | 35 | @with_self_graph 36 | def fully_connected_layers(self,x): 37 | for i in range(len(self.Hidden_dims)): 38 | with tf.variable_scope("layer_" + str(i)): 39 | x = tf.nn.relu(self.affine_layer(self.Hidden_dims[i], x)) 40 | return x 41 | @with_self_graph 42 | def BuildCoreGraph(self): 43 | self.x = tf.placeholder(tf.float32, shape=[None, self.V]) 44 | self.target_y = tf.placeholder(tf.float32, shape=[None,None]) 45 | 46 | z = self.fully_connected_layers(self.x) 47 | 48 | self.y_logit = tf.squeeze(self.affine_layer(self.y_dim,z)) 49 | self.y_hat = tf.sigmoid(self.y_logit) 50 | 51 | self.loss = tf.reduce_mean (tf.nn.sigmoid_cross_entropy_with_logits(labels=self.target_y, logits=self.y_logit)) 52 | 53 | @with_self_graph 54 | def BuildTrainGraph(self): 55 | optimizer = tf.train.GradientDescentOptimizer(self.learning_rate) 56 | self.train = optimizer.minimize(self.loss) 57 | -------------------------------------------------------------------------------- /baseline/paper_ranking_loss_scores.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zliendo/AI_MedicalNotes/87e1e8eb574b30c801b951e87b5edccc10d2a231/baseline/paper_ranking_loss_scores.png -------------------------------------------------------------------------------- /data/readme.md: -------------------------------------------------------------------------------- 1 | We are not including the MIMIC data here because it needs authorization from https://mimic.physionet.org/gettingstarted/access/ 2 | 3 | 4 | 5 | -------------------------------------------------------------------------------- /icd9_cnn/.ipynb_checkpoints/Untitled-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [], 3 | "metadata": {}, 4 | "nbformat": 4, 5 | "nbformat_minor": 2 6 | } 7 | -------------------------------------------------------------------------------- /icd9_cnn/.ipynb_checkpoints/cnn_top20_leave-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": { 7 | "collapsed": false 8 | }, 9 | "outputs": [ 10 | { 11 | "name": "stderr", 12 | "output_type": "stream", 13 | "text": [ 14 | "Using TensorFlow backend.\n" 15 | ] 16 | } 17 | ], 18 | "source": [ 19 | "# General imports\n", 20 | "import numpy as np\n", 21 | "import pandas as pd\n", 22 | "from sklearn.metrics import f1_score\n", 23 | "\n", 24 | "# Custom functions\n", 25 | "%load_ext autoreload\n", 26 | "%autoreload 2\n", 27 | "import database_selection\n", 28 | "import vectorization\n", 29 | "import helpers\n", 30 | "\n", 31 | "#keras\n", 32 | "from keras.models import Sequential, Model\n", 33 | "from keras.layers import Dense, Dropout, Flatten, Input, MaxPooling1D, Convolution1D, Embedding\n", 34 | "from keras.layers.merge import Concatenate\n" 35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": 21, 40 | "metadata": { 
41 | "collapsed": true 42 | }, 43 | "outputs": [], 44 | "source": [ 45 | "df = pd.read_csv('../data/disch_notes_all_icd9.csv',\n", 46 | " names = ['HADM_ID', 'SUBJECT_ID', 'DATE', 'ICD9','TEXT'])" 47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": 22, 52 | "metadata": { 53 | "collapsed": false 54 | }, 55 | "outputs": [], 56 | "source": [ 57 | "N_TOP = 10 \n", 58 | "full_df, top_codes = database_selection.filter_top_codes(df, 'ICD9', N_TOP, filter_empty = True)\n", 59 | "df = full_df.head(1000)" 60 | ] 61 | }, 62 | { 63 | "cell_type": "code", 64 | "execution_count": 23, 65 | "metadata": { 66 | "collapsed": false 67 | }, 68 | "outputs": [], 69 | "source": [ 70 | "#preprocess icd9 codes\n", 71 | "labels = vectorization.vectorize_icd_column(df, 'ICD9', top_codes)\n" 72 | ] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "execution_count": 24, 77 | "metadata": { 78 | "collapsed": false 79 | }, 80 | "outputs": [ 81 | { 82 | "name": "stdout", 83 | "output_type": "stream", 84 | "text": [ 85 | "Vocabulary size: 22330\n", 86 | "Average note length: 1767.581\n", 87 | "Max note length: 5641\n", 88 | "Final Vocabulary: 22330\n", 89 | "Final Max Sequence Length: 5000\n" 90 | ] 91 | } 92 | ], 93 | "source": [ 94 | "#preprocess notes\n", 95 | "MAX_VOCAB = None # to limit original number of words (None if no limit)\n", 96 | "MAX_SEQ_LENGTH = 5000 # to limit length of word sequence (None if no limit)\n", 97 | "df.TEXT = vectorization.clean_notes(df, 'TEXT')\n", 98 | "data, dictionary, MAX_VOCAB = vectorization.vectorize_notes(df.TEXT, MAX_VOCAB, verbose = True)\n", 99 | "data, MAX_SEQ_LENGTH = vectorization.pad_notes(data, MAX_SEQ_LENGTH)\n", 100 | "print(\"Final Vocabulary: %s\" % MAX_VOCAB)\n", 101 | "print(\"Final Max Sequence Length: %s\" % MAX_SEQ_LENGTH)" 102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "execution_count": 25, 107 | "metadata": { 108 | "collapsed": false 109 | }, 110 | "outputs": [ 111 | { 112 | "name": "stdout", 113 | "output_type": "stream", 114 | "text": [ 115 | "('Train: ', (699, 5000), (699, 10))\n", 116 | "('Validation: ', (200, 5000), (200, 10))\n", 117 | "('Test: ', (101, 5000), (101, 10))\n" 118 | ] 119 | } 120 | ], 121 | "source": [ 122 | "#split sets\n", 123 | "X_train, X_val, X_test, y_train, y_val, y_test = helpers.train_val_test_split(\n", 124 | " data, labels, val_size=0.2, test_size=0.1, random_state=101)\n", 125 | "print(\"Train: \", X_train.shape, y_train.shape)\n", 126 | "print(\"Validation: \", X_val.shape, y_val.shape)\n", 127 | "print(\"Test: \", X_test.shape, y_test.shape)" 128 | ] 129 | }, 130 | { 131 | "cell_type": "code", 132 | "execution_count": 26, 133 | "metadata": { 134 | "collapsed": false 135 | }, 136 | "outputs": [], 137 | "source": [ 138 | "# Delete temporary variables to free some memory\n", 139 | "del df, data, labels" 140 | ] 141 | }, 142 | { 143 | "cell_type": "code", 144 | "execution_count": 27, 145 | "metadata": { 146 | "collapsed": false 147 | }, 148 | "outputs": [ 149 | { 150 | "name": "stdout", 151 | "output_type": "stream", 152 | "text": [ 153 | "('Vocabulary in notes:', 22330)\n", 154 | "('Vocabulary in original embedding:', 400000)\n", 155 | "('Vocabulary intersection:', 14239)\n" 156 | ] 157 | } 158 | ], 159 | "source": [ 160 | "#creating embeddings\n", 161 | "EMBEDDING_LOC = '../data/glove.6B.100d.txt' # location of embedding\n", 162 | "EMBEDDING_DIM = 100 # given the glove that we chose\n", 163 | "embedding_matrix, embedding_dict = vectorization.embedding_matrix(EMBEDDING_LOC,\n", 164 | " dictionary, 
EMBEDDING_DIM, verbose = True)\n" 165 | ] 166 | }, 167 | { 168 | "cell_type": "markdown", 169 | "metadata": {}, 170 | "source": [ 171 | "## CNN for text classification\n", 172 | "\n", 173 | "Based on the following papers and links:\n", 174 | "* \"Convolutional Neural Networks for Sentence Classification\" \n", 175 | "* \"A Sensitivity Analysis of (and Practitioners� Guide to) Convolutional Neural Networks for Sentence Classification\"\n", 176 | "* http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/\n", 177 | "* https://github.com/alexander-rakhlin/CNN-for-Sentence-Classification-in-Keras/blob/master/sentiment_cnn.py" 178 | ] 179 | }, 180 | { 181 | "cell_type": "code", 182 | "execution_count": 28, 183 | "metadata": { 184 | "collapsed": true 185 | }, 186 | "outputs": [], 187 | "source": [ 188 | "#### set parameters:\n", 189 | "num_filters = 100\n", 190 | "filter_sizes = [2,3,4,5]\n", 191 | "training_dropout_keep_prob = 0.9\n", 192 | "num_classes=N_TOP\n", 193 | "batch_size = 50\n", 194 | "epochs = 5\n", 195 | "external_embeddings = False\n", 196 | "EMBEDDING_TRAINABLE = True" 197 | ] 198 | }, 199 | { 200 | "cell_type": "code", 201 | "execution_count": 29, 202 | "metadata": { 203 | "collapsed": false 204 | }, 205 | "outputs": [ 206 | { 207 | "name": "stdout", 208 | "output_type": "stream", 209 | "text": [ 210 | "Train on 699 samples, validate on 200 samples\n", 211 | "Epoch 1/5\n", 212 | "25s - loss: 0.6631 - acc: 0.7506 - val_loss: 0.6439 - val_acc: 0.7445\n", 213 | "Epoch 2/5\n", 214 | "26s - loss: 0.6463 - acc: 0.7506 - val_loss: 0.6408 - val_acc: 0.7445\n", 215 | "Epoch 3/5\n", 216 | "26s - loss: 0.6409 - acc: 0.7509 - val_loss: 0.6390 - val_acc: 0.7445\n", 217 | "Epoch 4/5\n", 218 | "25s - loss: 0.6372 - acc: 0.7506 - val_loss: 0.6373 - val_acc: 0.7445\n", 219 | "Epoch 5/5\n", 220 | "26s - loss: 0.6311 - acc: 0.7506 - val_loss: 0.6361 - val_acc: 0.7445\n" 221 | ] 222 | }, 223 | { 224 | "data": { 225 | "text/plain": [ 226 | "" 227 | ] 228 | }, 229 | "execution_count": 29, 230 | "metadata": {}, 231 | "output_type": "execute_result" 232 | } 233 | ], 234 | "source": [ 235 | "#Embedding\n", 236 | "model_input = Input(shape=(MAX_SEQ_LENGTH, ))\n", 237 | "if external_embeddings:\n", 238 | " # use embedding_matrix plus local training\n", 239 | " z = Embedding(MAX_VOCAB + 1,\n", 240 | " EMBEDDING_DIM,\n", 241 | " weights=[embedding_matrix],\n", 242 | " input_length=MAX_SEQ_LENGTH,\n", 243 | " trainable=EMBEDDING_TRAINABLE)(model_input)\n", 244 | "else:\n", 245 | " # train embeddings \n", 246 | " z = Embedding(MAX_VOCAB + 1, \n", 247 | " EMBEDDING_DIM, \n", 248 | " input_length=MAX_SEQ_LENGTH, \n", 249 | " name=\"embedding\")(model_input)\n", 250 | "\n", 251 | "# Convolutional block\n", 252 | "conv_blocks = []\n", 253 | "for sz in filter_sizes:\n", 254 | " conv = Convolution1D(filters=num_filters,\n", 255 | " kernel_size=sz,\n", 256 | " padding=\"valid\",\n", 257 | " activation=\"relu\",\n", 258 | " strides=1)(z)\n", 259 | " window_pool_size = MAX_SEQ_LENGTH - sz + 1 \n", 260 | " conv = MaxPooling1D(pool_size=window_pool_size)(conv) \n", 261 | " conv = Flatten()(conv)\n", 262 | " conv_blocks.append(conv)\n", 263 | "\n", 264 | "#concatenate\n", 265 | "z = Concatenate()(conv_blocks) if len(conv_blocks) > 1 else conv_blocks[0]\n", 266 | "z = Dropout(training_dropout_keep_prob)(z)\n", 267 | "\n", 268 | "#score prediction\n", 269 | "#z = Dense(num_classes, activation=\"relu\")(z) I don't think this is necessary\n", 270 | "model_output = Dense(num_classes, 
activation=\"softmax\")(z)\n", 271 | "\n", 272 | "#creating model\n", 273 | "model = Model(model_input, model_output)\n", 274 | "# what to use for tf.nn.softmax_cross_entropy_with_logits?\n", 275 | "model.compile(loss=\"binary_crossentropy\", optimizer=\"adam\", metrics=[\"accuracy\"])\n", 276 | "\n", 277 | "# Train the model\n", 278 | "model.fit(X_train, y_train, batch_size=batch_size, epochs=epochs,\n", 279 | "validation_data=(X_val, y_val), verbose=2)" 280 | ] 281 | }, 282 | { 283 | "cell_type": "code", 284 | "execution_count": 30, 285 | "metadata": { 286 | "collapsed": false 287 | }, 288 | "outputs": [], 289 | "source": [ 290 | "pred_train = model.predict(X_train, batch_size=50)\n", 291 | "pred_dev = model.predict(X_val, batch_size=50)" 292 | ] 293 | }, 294 | { 295 | "cell_type": "code", 296 | "execution_count": 31, 297 | "metadata": { 298 | "collapsed": false 299 | }, 300 | "outputs": [ 301 | { 302 | "name": "stdout", 303 | "output_type": "stream", 304 | "text": [ 305 | "F1 scores\n", 306 | "threshold | training | dev \n", 307 | "0.020: 0.399 0.407\n", 308 | "0.030: 0.399 0.407\n", 309 | "0.040: 0.399 0.407\n", 310 | "0.050: 0.408 0.413\n", 311 | "0.055: 0.433 0.420\n", 312 | "0.058: 0.437 0.430\n", 313 | "0.060: 0.432 0.427\n", 314 | "0.080: 0.501 0.463\n", 315 | "0.100: 0.446 0.463\n", 316 | "0.200: 0.206 0.066\n", 317 | "0.300: 0.000 0.000\n", 318 | "0.500: 0.000 0.000\n" 319 | ] 320 | } 321 | ], 322 | "source": [ 323 | "def get_f1_score(y_true,y_hat,threshold, average):\n", 324 | " hot_y = np.where(np.array(y_hat) > threshold, 1, 0)\n", 325 | " return f1_score(np.array(y_true), hot_y, average=average)\n", 326 | "\n", 327 | "print 'F1 scores'\n", 328 | "print 'threshold | training | dev '\n", 329 | "f1_score_average = 'micro'\n", 330 | "for threshold in [ 0.02, 0.03,0.04,0.05,0.055,0.058,0.06, 0.08, 0.1,0.2,0.3, 0.5]:\n", 331 | " train_f1 = get_f1_score(y_train, pred_train,threshold,f1_score_average)\n", 332 | " dev_f1 = get_f1_score(y_val, pred_dev,threshold,f1_score_average)\n", 333 | " print '%1.3f: %1.3f %1.3f' % (threshold,train_f1, dev_f1)" 334 | ] 335 | }, 336 | { 337 | "cell_type": "markdown", 338 | "metadata": {}, 339 | "source": [ 340 | "### Results with external embeddings = True , no additional training, top 20\n", 341 | "```\n", 342 | "F1 scores\n", 343 | "threshold | training | dev \n", 344 | "0.020: 0.337 0.329\n", 345 | "0.030: 0.360 0.353\n", 346 | "0.040: 0.365 0.374\n", 347 | "0.050: 0.372 0.375\n", 348 | "0.055: 0.370 0.377\n", 349 | "0.058: 0.369 0.375\n", 350 | "0.060: 0.368 0.375\n", 351 | "0.080: 0.348 0.361\n", 352 | "0.100: 0.309 0.319\n", 353 | "0.200: 0.198 0.208\n", 354 | "0.300: 0.157 0.138\n", 355 | "0.500: 0.000 0.000\n", 356 | "```\n", 357 | "\n", 358 | "### Results with external embeddings = False, top 20\n", 359 | "```\n", 360 | "F1 scores\n", 361 | "threshold | training | dev \n", 362 | "0.020: 0.288 0.300\n", 363 | "0.030: 0.327 0.322\n", 364 | "0.040: 0.371 0.363\n", 365 | "0.050: 0.380 0.391\n", 366 | "0.055: 0.412 0.383\n", 367 | "0.058: 0.403 0.394\n", 368 | "0.060: 0.394 0.389\n", 369 | "0.080: 0.385 0.390\n", 370 | "0.100: 0.229 0.225\n", 371 | "0.200: 0.000 0.000\n", 372 | "0.300: 0.000 0.000\n", 373 | "0.500: 0.000 0.000\n", 374 | "```\n", 375 | "\n", 376 | "### Results with external embedding and training them , top 20\n", 377 | "```\n", 378 | "F1 scores\n", 379 | "threshold | training | dev \n", 380 | "0.020: 0.334 0.333\n", 381 | "0.030: 0.362 0.360\n", 382 | "0.040: 0.366 0.374\n", 383 | "0.050: 0.373 0.380\n", 384 | "0.055: 0.374 
0.382\n", 385 | "0.058: 0.376 0.376\n", 386 | "0.060: 0.376 0.378\n", 387 | "0.080: 0.387 0.371\n", 388 | "0.100: 0.366 0.350\n", 389 | "0.200: 0.179 0.171\n", 390 | "0.300: 0.020 0.020\n", 391 | "0.500: 0.000 0.000\n", 392 | "\n", 393 | "```\n", 394 | "\n", 395 | "### Results with external Embeddings = False, top 10, \n", 396 | "We can compare this setup with the LSTM published in the paper \"Applying Deep Learning to ICD-9 Multi-label Classification from Medical Records\", they got a F1-score of about 0.4168, we are getting 0.447\n", 397 | "\n", 398 | "``` \n", 399 | "F1 scores\n", 400 | "threshold | training | dev \n", 401 | "0.020: 0.399 0.407\n", 402 | "0.030: 0.399 0.407\n", 403 | "0.040: 0.399 0.407\n", 404 | "0.050: 0.408 0.413\n", 405 | "0.055: 0.433 0.420\n", 406 | "0.058: 0.437 0.430\n", 407 | "0.060: 0.432 0.427\n", 408 | "0.080: 0.501 0.463\n", 409 | "0.100: 0.446 0.463\n", 410 | "0.200: 0.206 0.066\n", 411 | "0.300: 0.000 0.000\n", 412 | "0.500: 0.000 0.000\n", 413 | "```\n", 414 | "\n" 415 | ] 416 | }, 417 | { 418 | "cell_type": "markdown", 419 | "metadata": { 420 | "collapsed": true 421 | }, 422 | "source": [ 423 | "## Notes:\n", 424 | "\n", 425 | "\n", 426 | "(1) There is a LSTM model by this paper: \"Applying Deep Learning to ICD-9 Multi-label Classification from Medical Records\" which did achieve a 42% F1-score. (https://cs224d.stanford.edu/reports/priyanka.pdf), but it only uses the top 10 icd9 codes. We are getting 46% (just running with 1000 notes so far)\n", 427 | "\n", 428 | "\n", 429 | "(2) The \"A Comparison of Rule-Based and Deep Learning Models for Patient Phenotyping\" study did get a 70% F1-score, but they don't use the icd9-labels but phenotypes labels they annotated themselved (via a group of medical professionals). (https://arxiv.org/abs/1703.08705). There were ONLY 10 phenotypes.\n", 430 | "\n", 431 | "The discharge summaries are labeled with ICD9-codes that are leaves in the ICD9-hierarchy (which has hundreds of ICD9-codes), then maybe these leave nodes are too specific and difficult to predict, one experiment would be to replaced all the ICD9-codes with their parent in the second or third level in the hierarchy and see if predictions work better that way. \n", 432 | "\n", 433 | "(3) our baseline with top 20 codes had a f1-score of 35% (assigning top 4 icd9 codes to all notes, using a CNN with no external embeddings is getting about 40% f1-score.. 
a little better than the baseline" 434 | ] 435 | }, 436 | { 437 | "cell_type": "code", 438 | "execution_count": null, 439 | "metadata": { 440 | "collapsed": true 441 | }, 442 | "outputs": [], 443 | "source": [] 444 | } 445 | ], 446 | "metadata": { 447 | "kernelspec": { 448 | "display_name": "Python 2", 449 | "language": "python", 450 | "name": "python2" 451 | }, 452 | "language_info": { 453 | "codemirror_mode": { 454 | "name": "ipython", 455 | "version": 2 456 | }, 457 | "file_extension": ".py", 458 | "mimetype": "text/x-python", 459 | "name": "python", 460 | "nbconvert_exporter": "python", 461 | "pygments_lexer": "ipython2", 462 | "version": "2.7.13" 463 | } 464 | }, 465 | "nbformat": 4, 466 | "nbformat_minor": 2 467 | } 468 | -------------------------------------------------------------------------------- /icd9_cnn/.ipynb_checkpoints/icd9_cnn_multilabel-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": { 7 | "collapsed": false 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "import csv\n", 12 | "import random\n", 13 | "import numpy as np\n", 14 | "from collections import Counter, defaultdict\n", 15 | "from sklearn.feature_extraction.text import *\n", 16 | "import re\n", 17 | "from tensorflow.contrib import learn\n", 18 | "import sys, os\n", 19 | "import tensorflow as tf\n", 20 | "import cnn_model\n", 21 | "import utils\n", 22 | "\n", 23 | "from sklearn.metrics import label_ranking_loss\n", 24 | "from sklearn.metrics import f1_score\n", 25 | "import shutil" 26 | ] 27 | }, 28 | { 29 | "cell_type": "markdown", 30 | "metadata": {}, 31 | "source": [ 32 | "General Sources:\n", 33 | "http://ruder.io/deep-learning-nlp-best-practices/index.html#classification" 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | "### Reading File" 41 | ] 42 | }, 43 | { 44 | "cell_type": "code", 45 | "execution_count": 2, 46 | "metadata": { 47 | "collapsed": false 48 | }, 49 | "outputs": [ 50 | { 51 | "name": "stdout", 52 | "output_type": "stream", 53 | "text": [ 54 | "Number of records in the dataset: 45837\n" 55 | ] 56 | } 57 | ], 58 | "source": [ 59 | "#with open('../../../psql_files/disch_notes_all_icd9.csv', 'rb') as csvfile:\n", 60 | "csv.field_size_limit(sys.maxsize)\n", 61 | "with open('../baseline/psql_files/dis_notes_icd9.csv', 'rb') as csvfile:\n", 62 | " discharge_notes_reader = csv.reader(csvfile)\n", 63 | " discharge_notes_list = list(discharge_notes_reader) \n", 64 | "random.shuffle(discharge_notes_list)\n", 65 | "\n", 66 | "print \"Number of records in the dataset: \", len (discharge_notes_list)" 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "metadata": {}, 72 | "source": [ 73 | "we will take only 10,000 records to compare with NN baseline" 74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": 3, 79 | "metadata": { 80 | "collapsed": true 81 | }, 82 | "outputs": [], 83 | "source": [ 84 | "#starting for 1,000 just for programming\n", 85 | "number_records = 1000" 86 | ] 87 | }, 88 | { 89 | "cell_type": "code", 90 | "execution_count": 4, 91 | "metadata": { 92 | "collapsed": false 93 | }, 94 | "outputs": [ 95 | { 96 | "name": "stdout", 97 | "output_type": "stream", 98 | "text": [ 99 | "Number of discharge clinical notes: 1000\n" 100 | ] 101 | } 102 | ], 103 | "source": [ 104 | "discharge_notes_icd9 = np.asarray(discharge_notes_list[0:number_records])\n", 105 | "print 'Number of discharge 
clinical notes: ', len(discharge_notes_icd9)\n", 106 | "discharge_notes= discharge_notes_icd9[:,3]\n", 107 | "discharge_labels = discharge_notes_icd9[:,4]" 108 | ] 109 | }, 110 | { 111 | "cell_type": "markdown", 112 | "metadata": {}, 113 | "source": [ 114 | "## Pre Processing" 115 | ] 116 | }, 117 | { 118 | "cell_type": "markdown", 119 | "metadata": {}, 120 | "source": [ 121 | "## Stats about Notes (TODO:)\n", 122 | "* vocabulary of size\n", 123 | "* find out notes that are too large, outliers to take out (otherwise the embeddings will pad a lot of zeroes to the other note-vectors(" 124 | ] 125 | }, 126 | { 127 | "cell_type": "markdown", 128 | "metadata": {}, 129 | "source": [ 130 | "## Converting icd9 labels to vectors" 131 | ] 132 | }, 133 | { 134 | "cell_type": "code", 135 | "execution_count": 5, 136 | "metadata": { 137 | "collapsed": true 138 | }, 139 | "outputs": [], 140 | "source": [ 141 | "#transforming list of icd_codes into a vector\n", 142 | "def get_icd9_array(icd9_codes):\n", 143 | " icd9_index_array = [0]*len(unique_icd9_codes)\n", 144 | " for icd9_code in icd9_codes.split():\n", 145 | " index = icd9_to_id [icd9_code]\n", 146 | " icd9_index_array[index] = 1\n", 147 | " return icd9_index_array" 148 | ] 149 | }, 150 | { 151 | "cell_type": "code", 152 | "execution_count": 6, 153 | "metadata": { 154 | "collapsed": false 155 | }, 156 | "outputs": [ 157 | { 158 | "name": "stdout", 159 | "output_type": "stream", 160 | "text": [ 161 | "Counter({'4019': 428, '41401': 297, '4280': 292, '42731': 272, '2724': 211, '5849': 210, '25000': 207, '51881': 172, '53081': 148, '5990': 142, '2449': 133, '2859': 118, '486': 118, '2720': 117, '2762': 112, '496': 97, '5070': 88, '2851': 87, '99592': 78, '0389': 68})\n", 162 | " \n", 163 | "List of unique icd9 codes from all labels: ['2859', '99592', '4019', '2724', '25000', '2720', '2851', '2762', '2449', '4280', '0389', '41401', '42731', '5849', '53081', '486', '5070', '496', '51881', '5990']\n" 164 | ] 165 | } 166 | ], 167 | "source": [ 168 | "#counts by icd9_codes\n", 169 | "icd9_codes = Counter()\n", 170 | "for label in discharge_labels:\n", 171 | " for icd9_code in label.split():\n", 172 | " icd9_codes[icd9_code] += 1\n", 173 | "print icd9_codes\n", 174 | "\n", 175 | "# list of unique icd9_codes and lookups for its index in the vector\n", 176 | "unique_icd9_codes = list (icd9_codes)\n", 177 | "index_to_icd9 = dict(enumerate(unique_icd9_codes))\n", 178 | "icd9_to_id = {v:k for k,v in index_to_icd9.iteritems()}\n", 179 | "print ' '\n", 180 | "print 'List of unique icd9 codes from all labels: ', unique_icd9_codes\n", 181 | "\n", 182 | "#convert icd9 codes into ids\n", 183 | "labels_vector= list(map(get_icd9_array,discharge_labels))" 184 | ] 185 | }, 186 | { 187 | "cell_type": "markdown", 188 | "metadata": {}, 189 | "source": [ 190 | "## Pre-processing notes" 191 | ] 192 | }, 193 | { 194 | "cell_type": "markdown", 195 | "metadata": {}, 196 | "source": [ 197 | "https://github.com/yoonkim/CNN_sentence/blob/master/process_data.py\n", 198 | "\n", 199 | "\n", 200 | "(1) Clean the text data using the same code as the original paper.\n", 201 | "https://github.com/yoonkim/CNN_sentence\n", 202 | "\n", 203 | "(2) Pad each note to the maximum note length, which turns out to be NN. We append special tokens to all other notes to make them NN words. 
Padding sentences to the same length is useful because it allows us to efficiently batch our data since each example in a batch must be of the same length.\n", 204 | "(3) Build a vocabulary index and map each word to an integer between 0 and 18,765 (the vocabulary size). Each sentence becomes a vector of integers" 205 | ] 206 | }, 207 | { 208 | "cell_type": "code", 209 | "execution_count": 7, 210 | "metadata": { 211 | "collapsed": false 212 | }, 213 | "outputs": [], 214 | "source": [ 215 | "def clean_str(string):\n", 216 | " \"\"\"\n", 217 | " Tokenization/string cleaning for all datasets except for SST.\n", 218 | " Original taken from https://github.com/yoonkim/CNN_sentence/blob/master/process_data.py\n", 219 | " \"\"\"\n", 220 | " string = re.sub(r\"[^A-Za-z0-9(),!?\\'\\`]\", \" \", string)\n", 221 | " string = re.sub(r\"\\'s\", \" \\'s\", string)\n", 222 | " string = re.sub(r\"\\'ve\", \" \\'ve\", string)\n", 223 | " string = re.sub(r\"n\\'t\", \" n\\'t\", string)\n", 224 | " string = re.sub(r\"\\'re\", \" \\'re\", string)\n", 225 | " string = re.sub(r\"\\'d\", \" \\'d\", string)\n", 226 | " string = re.sub(r\"\\'ll\", \" \\'ll\", string)\n", 227 | " string = re.sub(r\",\", \" , \", string)\n", 228 | " string = re.sub(r\"!\", \" ! \", string)\n", 229 | " string = re.sub(r\"\\(\", \" \\( \", string)\n", 230 | " string = re.sub(r\"\\)\", \" \\) \", string)\n", 231 | " string = re.sub(r\"\\?\", \" \\? \", string)\n", 232 | " string = re.sub(r\"\\s{2,}\", \" \", string)\n", 233 | " return string.strip().lower()\n", 234 | "\n", 235 | "def note_preprocessing(data_notes):\n", 236 | " notes_stripped = [s.strip() for s in data_notes]\n", 237 | " notes_clean = [clean_str(note) for note in notes_stripped ]\n", 238 | " notes_canonicalized = [\" \".join (utils.canonicalize_words(note.split(\" \"))) for note in notes_clean ]\n", 239 | " \n", 240 | " note_words_length = [len(x.split(\" \")) for x in notes_canonicalized]\n", 241 | " max_document_length = max( note_words_length) \n", 242 | " average_length = np.mean(note_words_length)\n", 243 | " return max_document_length, average_length, notes_canonicalized" 244 | ] 245 | }, 246 | { 247 | "cell_type": "code", 248 | "execution_count": 8, 249 | "metadata": { 250 | "collapsed": false 251 | }, 252 | "outputs": [ 253 | { 254 | "name": "stdout", 255 | "output_type": "stream", 256 | "text": [ 257 | " max document length: 7047\n", 258 | "average document length: 1908.263\n", 259 | "Vocabulary_size: 23244\n" 260 | ] 261 | } 262 | ], 263 | "source": [ 264 | "#preprocess documents\n", 265 | "max_document_length, average_document_length, notes_processed = note_preprocessing(discharge_notes)\n", 266 | "\n", 267 | "\n", 268 | "print ' max document length: ', max_document_length\n", 269 | "print 'average document length: ', average_document_length\n", 270 | "\n", 271 | "#create vocabulary processor\n", 272 | "vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length)\n", 273 | " \n", 274 | "# convert words to ids, and each document is padded\n", 275 | "notes_ids = np.array(list(vocab_processor.fit_transform(notes_processed)))\n", 276 | "\n", 277 | "# vocabulary size\n", 278 | "vocabulary_size = len(vocab_processor.vocabulary_)\n", 279 | "print 'Vocabulary_size: ', vocabulary_size" 280 | ] 281 | }, 282 | { 283 | "cell_type": "code", 284 | "execution_count": 9, 285 | "metadata": { 286 | "collapsed": false 287 | }, 288 | "outputs": [], 289 | "source": [ 290 | "#notes_processed[0]" 291 | ] 292 | }, 293 | { 294 | "cell_type": "markdown", 295 | 
"metadata": {}, 296 | "source": [ 297 | "### question?\n", 298 | "VVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV \n", 299 | "what do we do if the test data has a document with a bigger length than the max for the padding? " 300 | ] 301 | }, 302 | { 303 | "cell_type": "markdown", 304 | "metadata": {}, 305 | "source": [ 306 | "### transforming to embeddings using word2vec\n", 307 | "\n", 308 | "From: \"A Comparison of Rule-Based and Deep Learning Models for Patient Phenotyping\"\n", 309 | "\n", 310 | "\"We pre-train our embeddings with word2vec on all discharge notes available in the MIMIC-III database. \n", 311 | "The word embeddings of all words in the text to classify are concatenated and used as input to the\n", 312 | "convolutional layer. Convolutions detect a signal from a combination of adjacent inputs. We\n", 313 | "combine multiple convolutions of different lengths to evaluate phrases that are anywhere from\n", 314 | "two to five words long,\" \n", 315 | "\n", 316 | "(tf-idf is removing negations.. embedding is taking care of mispellings.. we may need further training-tuning because of medical terms)" 317 | ] 318 | }, 319 | { 320 | "cell_type": "markdown", 321 | "metadata": {}, 322 | "source": [ 323 | "https://code.google.com/archive/p/word2vec/\n", 324 | " \n", 325 | "Pre-trained word and phrase vectors\n", 326 | "\n", 327 | "\"We are publishing pre-trained vectors trained on part of Google News dataset (about 100 billion words). The model contains 300-dimensional vectors for 3 million words and phrases. The phrases were obtained using a simple data-driven approach described in [2]. The archive is available here: GoogleNews-vectors-negative300.bin.gz.\" \n", 328 | "\n", 329 | "### for now we wil train our own embeddings, but word2vec will be better" 330 | ] 331 | }, 332 | { 333 | "cell_type": "markdown", 334 | "metadata": {}, 335 | "source": [ 336 | "## Split Files" 337 | ] 338 | }, 339 | { 340 | "cell_type": "code", 341 | "execution_count": 9, 342 | "metadata": { 343 | "collapsed": false 344 | }, 345 | "outputs": [ 346 | { 347 | "name": "stdout", 348 | "output_type": "stream", 349 | "text": [ 350 | "Training set samples: 700\n", 351 | "Dev set samples: 150\n", 352 | "Test set samples: 150\n" 353 | ] 354 | } 355 | ], 356 | "source": [ 357 | "def split_file(data, train_frac = 0.7, dev_frac = 0.15): \n", 358 | " train_split_idx = int(train_frac * len(data))\n", 359 | " dev_split_idx = int ((train_frac + dev_frac)* len(data))\n", 360 | " train_data = data[:train_split_idx]\n", 361 | " dev_data = data[train_split_idx:dev_split_idx]\n", 362 | " test_data = data[dev_split_idx:]\n", 363 | " return train_data, dev_data, test_data\n", 364 | "\n", 365 | "\n", 366 | "train_notes, dev_notes, test_notes = split_file (notes_ids)\n", 367 | "train_labels, dev_labels, test_labels = split_file (labels_vector)\n", 368 | "print 'Training set samples:', len (train_notes)\n", 369 | "print 'Dev set samples:', len (dev_notes)\n", 370 | "print 'Test set samples:', len (test_notes)" 371 | ] 372 | }, 373 | { 374 | "cell_type": "markdown", 375 | "metadata": {}, 376 | "source": [] 377 | }, 378 | { 379 | "cell_type": "markdown", 380 | "metadata": {}, 381 | "source": [ 382 | "## CNN Training\n", 383 | "\n", 384 | "here is an example of a CNN to classify text.. 
our model will have different values for d (embedding-size, region sizes, etc)\n", 385 | "" 386 | ] 387 | }, 388 | { 389 | "cell_type": "markdown", 390 | "metadata": {}, 391 | "source": [ 392 | "This is the CNN used with the MIMIC discharge summaries\n", 393 | "\n", 394 | "\n", 395 | "\n", 396 | "\"For the CNN model, we used 100 filters for each of the widths 2, 3, 4, and 5. \n", 397 | "To prevent overfitting, we set the dropout probability to 0.5 and used L2-normalization to normalize word\n", 398 | "embeddings to have a max norm of 3.64 \n", 399 | "The model was trained using adadelta with an initial learning rate of 1 for 20 epochs. \n", 400 | "The CNN model was implemented using Lua and the Torch7 framework.66 \n", 401 | "All baseline models were implemented using Python with the scikit-learn library.\"" 402 | ] 403 | }, 404 | { 405 | "cell_type": "markdown", 406 | "metadata": { 407 | "collapsed": true 408 | }, 409 | "source": [ 410 | "### sources:\n", 411 | "http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/ \n", 412 | "http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/ \n", 413 | "https://github.com/dennybritz/cnn-text-classification-tf/blob/master/text_cnn.py \n", 414 | "https://www.tensorflow.org/get_started/mnist/pros \n", 415 | "https://www.tensorflow.org/api_docs/python/tf/nn/conv2d \n", 416 | " \n", 417 | " multi-label\n", 418 | " https://github.com/may-/cnn-re-tf/blob/master/cnn.py" 419 | ] 420 | }, 421 | { 422 | "cell_type": "markdown", 423 | "metadata": {}, 424 | "source": [ 425 | "From: \"A Comparison of Rule-Based and Deep Learning Models for Patient Phenotyping\"\n", 426 | "\n", 427 | "\"For the CNN model, we used 100 filters for each of the widths 2, 3, 4, and 5. 
\n", 428 | "To prevent overfitting, we set the dropout probability to 0.5 and used L2-normalization to normalize word\n", 429 | "embeddings to have a max norm of 3.64 \n", 430 | "The model was trained using adadelta with an initial learning rate of 1 for 20 epochs\"" 431 | ] 432 | }, 433 | { 434 | "cell_type": "code", 435 | "execution_count": 10, 436 | "metadata": { 437 | "collapsed": false 438 | }, 439 | "outputs": [], 440 | "source": [ 441 | "def run_epoch(lm, session, X, y, batch_size, dropout_keep_prob):\n", 442 | " for batch in xrange(0, X.shape[0], batch_size):\n", 443 | " # x SHAPE: [batch_size, sequence_length, embedding_size]\n", 444 | " X_batch = X[batch : batch + batch_size]\n", 445 | " y_batch = y[batch : batch + batch_size]\n", 446 | " feed_dict = {lm.input_x:X_batch,lm.input_y:y_batch,lm.dropout_keep_prob: dropout_keep_prob}\n", 447 | " #loss, train_op_value = session.run( [lm.loss,lm.train],feed_dict=feed_dict ) \n", 448 | " loss, _, step = session.run([lm.loss, lm.train_op, lm.global_step], feed_dict)\n", 449 | " if batch % 500: \n", 450 | " print 'batch: %d, loss: %5.5f' % (batch, loss) " 451 | ] 452 | }, 453 | { 454 | "cell_type": "code", 455 | "execution_count": 11, 456 | "metadata": { 457 | "collapsed": true 458 | }, 459 | "outputs": [], 460 | "source": [ 461 | "def predict_icd9_codes(lm, session, x_data, y_data, dropout_keep_prob=1.0):\n", 462 | " total_y_hat = []\n", 463 | " for batch in xrange(0, x_data.shape[0], batch_size):\n", 464 | " X_batch = x_data[batch : batch + batch_size]\n", 465 | " Y_batch = y_data[batch : batch + batch_size]\n", 466 | " y_hat_out = session.run(lm.y_hat, feed_dict={lm.input_x:X_batch,lm.input_y:Y_batch, lm.dropout_keep_prob: dropout_keep_prob})\n", 467 | " total_y_hat.extend(y_hat_out)\n", 468 | " return total_y_hat" 469 | ] 470 | }, 471 | { 472 | "cell_type": "code", 473 | "execution_count": 13, 474 | "metadata": { 475 | "collapsed": false 476 | }, 477 | "outputs": [], 478 | "source": [ 479 | "#build tensorflow graphs\n", 480 | "reload(cnn_model)\n", 481 | "\n", 482 | "# Model parameters\n", 483 | "\n", 484 | "model_params = dict(vocab_size= vocabulary_size, sequence_length=max_document_length, learning_rate=0.0001,\\\n", 485 | " embedding_size=128, num_classes=20, filter_sizes=[2,3,4,5], num_filters=100)\n", 486 | "\n", 487 | "# Build and Train Model\n", 488 | "cnn = cnn_model.NNLM(**model_params)\n", 489 | "cnn.BuildCoreGraph()\n", 490 | "cnn.BuildTrainGraph()" 491 | ] 492 | }, 493 | { 494 | "cell_type": "code", 495 | "execution_count": 14, 496 | "metadata": { 497 | "collapsed": false 498 | }, 499 | "outputs": [], 500 | "source": [ 501 | "TF_SAVEDIR = \"tf_saved\"\n", 502 | "trained_filename = os.path.join(TF_SAVEDIR, \"cnn_trained\")" 503 | ] 504 | }, 505 | { 506 | "cell_type": "code", 507 | "execution_count": 15, 508 | "metadata": { 509 | "collapsed": false 510 | }, 511 | "outputs": [ 512 | { 513 | "name": "stdout", 514 | "output_type": "stream", 515 | "text": [ 516 | "epoch_num: 0\n", 517 | "batch: 50, loss: 35.21418\n", 518 | "batch: 100, loss: 32.00004\n", 519 | "batch: 150, loss: 32.23874\n", 520 | "batch: 200, loss: 32.62824\n", 521 | "batch: 250, loss: 28.70796\n", 522 | "batch: 300, loss: 28.36549\n", 523 | "batch: 350, loss: 31.19997\n", 524 | "batch: 400, loss: 31.18843\n", 525 | "batch: 450, loss: 24.09070\n", 526 | "batch: 550, loss: 27.82526\n", 527 | "batch: 600, loss: 26.58137\n", 528 | "batch: 650, loss: 31.49951\n", 529 | "epoch_num: 1\n", 530 | "batch: 50, loss: 27.33090\n", 531 | "batch: 100, loss: 23.43864\n", 532 
| "batch: 150, loss: 24.28109\n", 533 | "batch: 200, loss: 28.88978\n", 534 | "batch: 250, loss: 23.29307\n", 535 | "batch: 300, loss: 23.56560\n", 536 | "batch: 350, loss: 24.92994\n", 537 | "batch: 400, loss: 26.69365\n", 538 | "batch: 450, loss: 20.86471\n", 539 | "batch: 550, loss: 21.02352\n", 540 | "batch: 600, loss: 22.59895\n", 541 | "batch: 650, loss: 26.52458\n", 542 | "epoch_num: 2\n", 543 | "batch: 50, loss: 21.96159\n", 544 | "batch: 100, loss: 20.27966\n", 545 | "batch: 150, loss: 22.11069\n", 546 | "batch: 200, loss: 23.96683\n", 547 | "batch: 250, loss: 19.88365\n", 548 | "batch: 300, loss: 19.78596\n", 549 | "batch: 350, loss: 21.74492\n", 550 | "batch: 400, loss: 23.85380\n", 551 | "batch: 450, loss: 19.42990\n", 552 | "batch: 550, loss: 19.73495\n", 553 | "batch: 600, loss: 20.80687\n", 554 | "batch: 650, loss: 22.45188\n", 555 | "epoch_num: 3\n", 556 | "batch: 50, loss: 20.61851\n", 557 | "batch: 100, loss: 18.33901\n", 558 | "batch: 150, loss: 18.87777\n", 559 | "batch: 200, loss: 22.49130\n", 560 | "batch: 250, loss: 17.76616\n", 561 | "batch: 300, loss: 19.26856\n", 562 | "batch: 350, loss: 20.12720\n", 563 | "batch: 400, loss: 20.30942\n", 564 | "batch: 450, loss: 17.82849\n", 565 | "batch: 550, loss: 19.31835\n", 566 | "batch: 600, loss: 18.83955\n", 567 | "batch: 650, loss: 22.28319\n", 568 | "epoch_num: 4\n", 569 | "batch: 50, loss: 18.48457\n", 570 | "batch: 100, loss: 19.22259\n", 571 | "batch: 150, loss: 20.59698\n", 572 | "batch: 200, loss: 20.46447\n", 573 | "batch: 250, loss: 17.04944\n", 574 | "batch: 300, loss: 17.38269\n", 575 | "batch: 350, loss: 18.84311\n", 576 | "batch: 400, loss: 20.78538\n", 577 | "batch: 450, loss: 16.71252\n", 578 | "batch: 550, loss: 17.19374\n", 579 | "batch: 600, loss: 18.95580\n", 580 | "batch: 650, loss: 22.09250\n", 581 | "predicting training now \n", 582 | "predicting dev set now\n", 583 | "done!\n" 584 | ] 585 | } 586 | ], 587 | "source": [ 588 | "batch_size = 50\n", 589 | "num_epochs = 5\n", 590 | "training_dropout_keep_prob = 0.8\n", 591 | "\n", 592 | "with cnn.graph.as_default():\n", 593 | " initializer = tf.global_variables_initializer()\n", 594 | " saver = tf.train.Saver()\n", 595 | " \n", 596 | "# Clear old log directory\n", 597 | "shutil.rmtree(TF_SAVEDIR, ignore_errors=True)\n", 598 | "if not os.path.isdir(TF_SAVEDIR):\n", 599 | " os.makedirs(TF_SAVEDIR)\n", 600 | "\n", 601 | "with tf.Session(graph=cnn.graph) as session:\n", 602 | " session.run(initializer)\n", 603 | " #training\n", 604 | " for epoch_num in xrange(num_epochs):\n", 605 | " print 'epoch_num:' , epoch_num\n", 606 | " run_epoch(cnn, session, train_notes, train_labels, batch_size,dropout_keep_prob=training_dropout_keep_prob )\n", 607 | " saver.save(session, trained_filename)\n", 608 | " print 'predicting training now '\n", 609 | " train_y_hat = predict_icd9_codes(cnn, session, train_notes, train_labels) \n", 610 | " print 'predicting dev set now'\n", 611 | " dev_y_hat = predict_icd9_codes(cnn, session, dev_notes, dev_labels)\n", 612 | " print 'done!'\n", 613 | "\n" 614 | ] 615 | }, 616 | { 617 | "cell_type": "markdown", 618 | "metadata": {}, 619 | "source": [ 620 | "## Performance Evaluation\n" 621 | ] 622 | }, 623 | { 624 | "cell_type": "code", 625 | "execution_count": 16, 626 | "metadata": { 627 | "collapsed": false 628 | }, 629 | "outputs": [ 630 | { 631 | "name": "stdout", 632 | "output_type": "stream", 633 | "text": [ 634 | "Training ranking loss: 0.366064771696\n", 635 | "Development ranking loss: 0.394886934101\n" 636 | ] 637 | } 638 | ], 639 | 
"source": [ 640 | "# ranking loss\n", 641 | "training_ranking_loss = label_ranking_loss(train_labels, train_y_hat)\n", 642 | "print \"Training ranking loss: \", training_ranking_loss\n", 643 | "dev_ranking_loss = label_ranking_loss(dev_labels, dev_y_hat)\n", 644 | "print \"Development ranking loss: \", dev_ranking_loss" 645 | ] 646 | }, 647 | { 648 | "cell_type": "markdown", 649 | "metadata": {}, 650 | "source": [ 651 | "## TODO create a model for thresholding\n", 652 | "\n", 653 | "Large-scale Multi-label Text Classification—Revisiting Neural Networks\n", 654 | "\n", 655 | "\n", 656 | "\"3.3 Thresholding\n", 657 | "Once training of the neural network is finished, its output may be interpreted as a probability\n", 658 | "distribution p (ojx) over the labels for a given document x. The probability distribution\n", 659 | "can be used to rank labels, but additional measures are needed in order to split\n", 660 | "the ranking into relevant and irrelevant labels. For transforming the ranked list of labels\n", 661 | "into a set of binary predictions, we train a multi-label threshold predictor from training\n", 662 | "data. This sort of thresholding methods are also used in [6, 31]\n", 663 | "For each document xm, labels are sorted by the probabilities in decreasing order.\n", 664 | "Ideally, if NNs successfully learn a mapping function f , all correct (positive) labels\n", 665 | "will be placed on top of the sorted list and there should be large margin between the set\n", 666 | "of positive labels and the set of negative labels. Using F1 score as a reference measure,\n", 667 | "we calculate classification performances at every pair of successive positive labels and\n", 668 | "choose a threshold value tm that produces the best performance\"" 669 | ] 670 | }, 671 | { 672 | "cell_type": "code", 673 | "execution_count": 17, 674 | "metadata": { 675 | "collapsed": true 676 | }, 677 | "outputs": [], 678 | "source": [ 679 | "def get_f1_score(y_true,y_hat,threshold, average):\n", 680 | " hot_y = np.where(np.array(y_hat) > threshold, 1, 0)\n", 681 | " return f1_score(np.array(y_true), hot_y, average=average)" 682 | ] 683 | }, 684 | { 685 | "cell_type": "code", 686 | "execution_count": 18, 687 | "metadata": { 688 | "collapsed": false 689 | }, 690 | "outputs": [ 691 | { 692 | "name": "stdout", 693 | "output_type": "stream", 694 | "text": [ 695 | "F1 scores\n", 696 | "threshold | training | dev \n", 697 | "0.005: 0.310 0.308\n", 698 | "0.010: 0.311 0.299\n", 699 | "0.020: 0.320 0.300\n", 700 | "0.030: 0.328 0.308\n", 701 | "0.040: 0.328 0.311\n", 702 | "0.050: 0.329 0.305\n", 703 | "0.055: 0.327 0.307\n", 704 | "0.058: 0.326 0.307\n", 705 | "0.060: 0.324 0.307\n", 706 | "0.070: 0.324 0.296\n", 707 | "0.080: 0.324 0.287\n", 708 | "0.100: 0.311 0.280\n", 709 | "0.500: 0.018 0.012\n" 710 | ] 711 | } 712 | ], 713 | "source": [ 714 | "print 'F1 scores'\n", 715 | "print 'threshold | training | dev '\n", 716 | "f1_score_average = 'micro'\n", 717 | "for threshold in [ 0.005, 0.01,0.02,0.03,0.04,0.05,0.055,0.058,0.06, 0.07, 0.08, 0.1, 0.5]:\n", 718 | " train_f1 = get_f1_score(train_labels, train_y_hat,threshold,f1_score_average)\n", 719 | " dev_f1 = get_f1_score(dev_labels, dev_y_hat,threshold,f1_score_average)\n", 720 | " print '%1.3f: %1.3f %1.3f' % (threshold,train_f1, dev_f1)" 721 | ] 722 | }, 723 | { 724 | "cell_type": "markdown", 725 | "metadata": {}, 726 | "source": [ 727 | "```\n", 728 | "10,000 records, 1 epoch\n", 729 | "adam optimizer\n", 730 | "F1 scores\n", 731 | "threshold | training | dev \n", 732 | 
"0.005: 0.321 0.317\n", 733 | "0.010: 0.334 0.331\n", 734 | "0.020: 0.347 0.345\n", 735 | "0.030: 0.351 0.350\n", 736 | "0.040: 0.349 0.344\n", 737 | "0.050: 0.342 0.337\n", 738 | "0.055: 0.340 0.334\n", 739 | "0.058: 0.337 0.332\n", 740 | "0.060: 0.335 0.330\n", 741 | "0.070: 0.324 0.320\n", 742 | "0.080: 0.313 0.308\n", 743 | "0.100: 0.292 0.283\n", 744 | "0.500: 0.046 0.043\n", 745 | "\n", 746 | "```" 747 | ] 748 | }, 749 | { 750 | "cell_type": "markdown", 751 | "metadata": {}, 752 | "source": [ 753 | "adam optimizer with learning rate 0.0001,dropout= 0.9\n", 754 | "\n", 755 | "```\n", 756 | "F1 scores\n", 757 | "threshold | training | dev \n", 758 | "0.005: 0.298 0.292\n", 759 | "0.010: 0.291 0.291\n", 760 | "0.020: 0.304 0.301\n", 761 | "0.030: 0.313 0.309\n", 762 | "0.040: 0.323 0.307\n", 763 | "0.050: 0.328 0.305\n", 764 | "0.055: 0.325 0.301\n", 765 | "0.058: 0.325 0.297\n", 766 | "0.060: 0.327 0.294\n", 767 | "0.070: 0.324 0.288\n", 768 | "0.080: 0.316 0.275\n", 769 | "0.100: 0.306 0.264\n", 770 | "0.500: 0.007 0.004\n", 771 | "\n", 772 | "\n", 773 | "```" 774 | ] 775 | }, 776 | { 777 | "cell_type": "markdown", 778 | "metadata": {}, 779 | "source": [ 780 | "```\n", 781 | "1000 notes, top 20 labels, adadelta optimizer (but it goes wild on epoch #13) \n", 782 | "learning rate = 0.5, training-dropout = 1.0, batch_size = 50, num_epochs = 5\n", 783 | "\n", 784 | "F1 scores\n", 785 | "threshold | training | dev \n", 786 | "0.005: 0.315 0.311\n", 787 | "0.010: 0.337 0.323\n", 788 | "0.020: 0.367 0.342\n", 789 | "0.030: 0.391 0.337\n", 790 | "0.040: 0.406 0.346\n", 791 | "0.050: 0.417 0.353\n", 792 | "0.055: 0.420 0.343\n", 793 | "0.058: 0.420 0.343\n", 794 | "0.060: 0.421 0.343\n", 795 | "0.070: 0.414 0.340\n", 796 | "0.080: 0.411 0.332\n", 797 | "0.100: 0.393 0.312\n", 798 | "0.500: 0.040 0.034\n", 799 | "```" 800 | ] 801 | }, 802 | { 803 | "cell_type": "markdown", 804 | "metadata": {}, 805 | "source": [ 806 | "```\n", 807 | "1000 notes, top 20 labels, adadelta optimizer (goes wild on #epoch 13)\n", 808 | "learning rate = 0.5, training-dropout = 0.5, batch_size = 50, num_epochs = 5\n", 809 | "F1 scores\n", 810 | "threshold | training | dev \n", 811 | "0.005: 0.375 0.362\n", 812 | "0.010: 0.382 0.364\n", 813 | "0.020: 0.378 0.356\n", 814 | "0.030: 0.352 0.342\n", 815 | "0.040: 0.331 0.324\n", 816 | "0.050: 0.319 0.324\n", 817 | "0.060: 0.306 0.312\n", 818 | "0.100: 0.278 0.294\n", 819 | "0.500: 0.200 0.20\n", 820 | "```" 821 | ] 822 | }, 823 | { 824 | "cell_type": "markdown", 825 | "metadata": {}, 826 | "source": [] 827 | }, 828 | { 829 | "cell_type": "markdown", 830 | "metadata": { 831 | "collapsed": true 832 | }, 833 | "source": [ 834 | "## Thoughts so far\n", 835 | "\n", 836 | "The CNN loss gets stuck with dropout_keep = 0.5.. 
\n", 837 | "I change it to 0.9, no overfitting, but the dev F1 score of 36%,which is just 1% hihter than the baseline model that always predict the top 4 most common icd-9 code and to the NN Baseline.\n", 838 | "\n", 839 | "\n", 840 | "\n", 841 | "### Lessons learned: \n", 842 | "* Adadelta optimizer has problems when running more than 10 epochs, the training loss stops going down and instead goes upd wildly " 843 | ] 844 | }, 845 | { 846 | "cell_type": "markdown", 847 | "metadata": {}, 848 | "source": [ 849 | "## using Keras\n", 850 | "base on example: https://github.com/alexander-rakhlin/CNN-for-Sentence-Classification-in-Keras/blob/master/sentiment_cnn.py" 851 | ] 852 | }, 853 | { 854 | "cell_type": "code", 855 | "execution_count": 19, 856 | "metadata": { 857 | "collapsed": false 858 | }, 859 | "outputs": [ 860 | { 861 | "name": "stderr", 862 | "output_type": "stream", 863 | "text": [ 864 | "Using TensorFlow backend.\n" 865 | ] 866 | } 867 | ], 868 | "source": [ 869 | "from keras.models import Sequential, Model\n", 870 | "from keras.layers import Dense, Dropout, Flatten, Input, MaxPooling1D, Convolution1D, Embedding\n", 871 | "from keras.layers.merge import Concatenate" 872 | ] 873 | }, 874 | { 875 | "cell_type": "code", 876 | "execution_count": 28, 877 | "metadata": { 878 | "collapsed": false 879 | }, 880 | "outputs": [], 881 | "source": [ 882 | "#### set parameters:\n", 883 | "input_shape= (max_document_length,)\n", 884 | "embedding_dims = 128\n", 885 | "num_filters = 100\n", 886 | "filter_sizes = [2,3,4,5]\n", 887 | "training_dropout_keep_prob = 0.9\n", 888 | "num_classes=20\n", 889 | "batch_size = 50\n", 890 | "epochs = 5" 891 | ] 892 | }, 893 | { 894 | "cell_type": "code", 895 | "execution_count": 29, 896 | "metadata": { 897 | "collapsed": false 898 | }, 899 | "outputs": [ 900 | { 901 | "name": "stdout", 902 | "output_type": "stream", 903 | "text": [ 904 | "Train on 700 samples, validate on 150 samples\n", 905 | "Epoch 1/5\n", 906 | "56s - loss: 1.5223 - acc: 0.8247 - val_loss: 0.8964 - val_acc: 0.8297\n", 907 | "Epoch 2/5\n", 908 | "56s - loss: 0.6087 - acc: 0.8310 - val_loss: 0.5500 - val_acc: 0.8297\n", 909 | "Epoch 3/5\n", 910 | "56s - loss: 0.5453 - acc: 0.8310 - val_loss: 0.5508 - val_acc: 0.8297\n", 911 | "Epoch 4/5\n", 912 | "57s - loss: 0.5438 - acc: 0.8310 - val_loss: 0.5494 - val_acc: 0.8297\n", 913 | "Epoch 5/5\n", 914 | "57s - loss: 0.5394 - acc: 0.8310 - val_loss: 0.5466 - val_acc: 0.8297\n" 915 | ] 916 | }, 917 | { 918 | "data": { 919 | "text/plain": [ 920 | "" 921 | ] 922 | }, 923 | "execution_count": 29, 924 | "metadata": {}, 925 | "output_type": "execute_result" 926 | } 927 | ], 928 | "source": [ 929 | "model_input = Input(shape=input_shape)\n", 930 | "z = Embedding(vocabulary_size, embedding_dims, input_length=max_document_length , name=\"embedding\")(model_input)\n", 931 | "\n", 932 | "# Convolutional block\n", 933 | "conv_blocks = []\n", 934 | "for sz in filter_sizes:\n", 935 | " conv = Convolution1D(filters=num_filters,\n", 936 | " kernel_size=sz,\n", 937 | " padding=\"valid\",\n", 938 | " activation=\"relu\",\n", 939 | " strides=1)(z)\n", 940 | " window_pool_size = max_document_length - sz + 1 \n", 941 | " conv = MaxPooling1D(pool_size=2)(conv) #pool_size?\n", 942 | " conv = Flatten()(conv)\n", 943 | " conv_blocks.append(conv)\n", 944 | "\n", 945 | "#concatenate\n", 946 | "z = Concatenate()(conv_blocks) if len(conv_blocks) > 1 else conv_blocks[0]\n", 947 | "z = Dropout(training_dropout_keep_prob)(z)\n", 948 | "\n", 949 | "#score prediction\n", 950 | "#z = 
Dense(num_classes, activation=\"relu\")(z) I don't think this is necessary\n", 951 | "model_output = Dense(num_classes, activation=\"softmax\")(z)\n", 952 | "\n", 953 | "#creating model\n", 954 | "model = Model(model_input, model_output)\n", 955 | "# what to use for tf.nn.softmax_cross_entropy_with_logits?\n", 956 | "model.compile(loss=\"binary_crossentropy\", optimizer=\"adam\", metrics=[\"accuracy\"])\n", 957 | "\n", 958 | "# Train the model\n", 959 | "model.fit(train_notes, train_labels, batch_size=batch_size, epochs=epochs,\n", 960 | "validation_data=(dev_notes, dev_labels), verbose=2)" 961 | ] 962 | }, 963 | { 964 | "cell_type": "code", 965 | "execution_count": 32, 966 | "metadata": { 967 | "collapsed": false 968 | }, 969 | "outputs": [], 970 | "source": [ 971 | "pred_train = model.predict(train_notes, batch_size=50)\n", 972 | "pred_dev = model.predict(dev_notes, batch_size=50)\n" 973 | ] 974 | }, 975 | { 976 | "cell_type": "code", 977 | "execution_count": 34, 978 | "metadata": { 979 | "collapsed": false 980 | }, 981 | "outputs": [ 982 | { 983 | "name": "stdout", 984 | "output_type": "stream", 985 | "text": [ 986 | "F1 scores\n", 987 | "threshold | training | dev \n", 988 | "0.010: 0.289 0.291\n", 989 | "0.020: 0.289 0.291\n", 990 | "0.030: 0.290 0.291\n", 991 | "0.040: 0.294 0.291\n", 992 | "0.050: 0.418 0.356\n", 993 | "0.055: 0.402 0.303\n", 994 | "0.058: 0.290 0.134\n", 995 | "0.060: 0.192 0.074\n", 996 | "0.080: 0.016 0.000\n", 997 | "0.100: 0.006 0.000\n", 998 | "0.500: 0.000 0.000\n" 999 | ] 1000 | } 1001 | ], 1002 | "source": [ 1003 | "\n", 1004 | "print 'F1 scores'\n", 1005 | "print 'threshold | training | dev '\n", 1006 | "f1_score_average = 'micro'\n", 1007 | "for threshold in [ 0.01,0.02,0.03,0.04,0.05,0.055,0.058,0.06, 0.08, 0.1, 0.5]:\n", 1008 | " train_f1 = get_f1_score(train_labels, pred_train,threshold,f1_score_average)\n", 1009 | " dev_f1 = get_f1_score(dev_labels, pred_dev,threshold,f1_score_average)\n", 1010 | " print '%1.3f: %1.3f %1.3f' % (threshold,train_f1, dev_f1)" 1011 | ] 1012 | }, 1013 | { 1014 | "cell_type": "markdown", 1015 | "metadata": {}, 1016 | "source": [] 1017 | } 1018 | ], 1019 | "metadata": { 1020 | "kernelspec": { 1021 | "display_name": "Python 2", 1022 | "language": "python", 1023 | "name": "python2" 1024 | }, 1025 | "language_info": { 1026 | "codemirror_mode": { 1027 | "name": "ipython", 1028 | "version": 2 1029 | }, 1030 | "file_extension": ".py", 1031 | "mimetype": "text/x-python", 1032 | "name": "python", 1033 | "nbconvert_exporter": "python", 1034 | "pygments_lexer": "ipython2", 1035 | "version": "2.7.13" 1036 | } 1037 | }, 1038 | "nbformat": 4, 1039 | "nbformat_minor": 2 1040 | } 1041 | -------------------------------------------------------------------------------- /icd9_cnn/CNN_for_text2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zliendo/AI_MedicalNotes/87e1e8eb574b30c801b951e87b5edccc10d2a231/icd9_cnn/CNN_for_text2.png -------------------------------------------------------------------------------- /icd9_cnn/cnn_model.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | 3 | # core logic based on http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/ 4 | 5 | def with_self_graph(function): 6 | def wrapper(self, *args, **kwargs): 7 | with self.graph.as_default(): 8 | return function(self, *args, **kwargs) 9 | return wrapper 10 | 11 | class NNLM(object): 12 | def 
__init__(self, graph=None, *args, **kwargs): 13 | # Set TensorFlow graph. All TF code will work on this graph. 14 | self.graph = graph or tf.Graph() 15 | self.SetParams(*args, **kwargs) 16 | 17 | 18 | @with_self_graph 19 | def SetParams(self, vocab_size, sequence_length, embedding_size, num_classes, learning_rate, filter_sizes,num_filters,l2_reg_lambda=0.0): 20 | self.vocab_size = vocab_size 21 | self.embedding_size =embedding_size 22 | self.num_classes =num_classes 23 | self.filter_sizes = filter_sizes 24 | self.num_filters = num_filters 25 | self.l2_reg_lambda = l2_reg_lambda 26 | # sequence_length: The length of our sentences. In this example all our sentences 27 | #have the same length (59) 28 | self.sequence_length = sequence_length 29 | 30 | self.learning_rate = learning_rate 31 | 32 | 33 | # Training hyperparameters; these can be changed with feed_dict, 34 | with tf.name_scope("Training_Parameters"): 35 | self.dropout_keep_prob = tf.placeholder(tf.float32, name="dropout_keep_prob") 36 | 37 | # Keeping track of l2 regularization loss (optional) 38 | self.l2_loss = tf.constant(0.0) 39 | 40 | 41 | 42 | @with_self_graph 43 | def BuildCoreGraph(self): 44 | 45 | self.input_x = tf.placeholder(tf.int32, [None, self.sequence_length], name="input_x") 46 | #self.x = tf.placeholder(tf.float32,shape=[None,self.sequence_length,self.embedding_size],name="input_x") embedded already 47 | self.input_y = tf.placeholder(tf.float32, shape=[None,self.num_classes], name="input_y") 48 | 49 | # Embedding 50 | # ----------------------------------------------------------------------------- 51 | # Embedding layer 52 | with tf.device('/cpu:0'), tf.name_scope("embedding"): 53 | self.W = tf.Variable(tf.random_uniform([self.vocab_size, self.embedding_size], -1.0, 1.0), name="W") 54 | self.embedded_chars = tf.nn.embedding_lookup(self.W, self.input_x) 55 | 56 | # x embedded SHAPE: [batch_size, sequence_length, embedding_size] 57 | 58 | # TensorFlow convolutional conv2d operation expects a 4-dimensional tensor 59 | # with dimensions corresponding to batch, width, height and channel. 60 | # The result of our embedding does not contain the channel dimension, so we add it manually, 61 | self.x_expanded = tf.expand_dims(self.embedded_chars, -1) 62 | #self.x_expanded .SHAPE: [batch_size, sequence_length, embedding_size, 1] 63 | 64 | # Create a convolution + maxpool layer for each filter size 65 | pooled_outputs = [] 66 | for i, filter_size in enumerate(self.filter_sizes): 67 | with tf.name_scope("conv-maxpool-%s" % filter_size): 68 | # Convolution Layer 69 | # ---------------------------------------------------------------------- 70 | # filter shape: [window_region_height, window_region_width, 71 | # number of input channels, number of filters for each region) 72 | filter_shape = [filter_size, self.embedding_size, 1, self.num_filters] 73 | 74 | # Here, W is our filter matrix. Each filter slides over the whole embedding matrix, 75 | # but varies in how many words it covers. 
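    # Shape reference for the convolution defined below (derived from the code above):
    #   self.x_expanded : [batch_size, sequence_length, embedding_size, 1]
    #   filter W        : [filter_size, embedding_size, 1, num_filters]
    #   conv output     : [batch_size, sequence_length - filter_size + 1, 1, num_filters]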
76 | # "VALID" padding means that we slide the filter over our sentence without padding the edges, 77 | # performing a narrow convolution that gives us an output of shape 78 | W = tf.Variable(tf.truncated_normal(filter_shape, stddev=0.1), name="W") 79 | b = tf.Variable(tf.constant(0.1, shape=[self.num_filters]), name="b") 80 | conv = tf.nn.bias_add(tf.nn.conv2d( self.x_expanded, W, 81 | strides=[1, 1, 1, 1], padding="VALID", name="conv"), b) 82 | 83 | # Apply nonlinearity 84 | h = tf.nn.relu(conv, name="relu") 85 | # h.SHAPE: [1, sequence_length - filter_size + 1, 1, 1] 86 | 87 | # Maxpooling over the outputs 88 | # ------------------------------------------------------------------ 89 | conv_vector_length = self.sequence_length - filter_size + 1 90 | # The pooling ops sweep a rectangular window over the input tensor, computing a reduction operation for each window 91 | # in this case max. Each pooling op uses rectangular windows of size ksize separated by offset strides 92 | k_size = [1, conv_vector_length, 1, 1] # shape of output vector from conv 93 | pooled = tf.nn.max_pool( h, ksize=k_size, 94 | strides=[1, 1, 1, 1], padding='VALID', name="pool") 95 | # pooled. SHAPE: [batch_size, 1, 1, num_filters] 96 | # This is essentially a feature vector, where the last dimension corresponds to our features. 97 | 98 | pooled_outputs.append(pooled) 99 | 100 | # Combine all the pooled features 101 | # ----------------------------------------------------------------- 102 | # Once we have all the pooled output tensors from each filter size we combine them into one long feature vector 103 | num_filters_total = self.num_filters * len(self.filter_sizes) 104 | self.h_pool = tf.concat(pooled_outputs, 3) 105 | self.h_pool_flat = tf.reshape(self.h_pool, [-1, num_filters_total]) 106 | # self.h_pool_flat SHAPE: batch_size, num_filters_total] 107 | 108 | # Add dropout 109 | with tf.name_scope("dropout"): 110 | self.h_drop = tf.nn.dropout(self.h_pool_flat, self.dropout_keep_prob) 111 | 112 | # Final (unnormalized) scores and predictions 113 | with tf.name_scope("output"): 114 | W = tf.get_variable( 115 | "W", 116 | shape=[num_filters_total, self.num_classes], 117 | initializer=tf.contrib.layers.xavier_initializer()) 118 | b = tf.Variable(tf.constant(0.1, shape=[self.num_classes]), name="b") 119 | self.l2_loss += tf.nn.l2_loss(W) 120 | self.l2_loss += tf.nn.l2_loss(b) 121 | self.scores = tf.nn.xw_plus_b(self.h_drop, W, b, name="scores") 122 | #self.predictions = tf.argmax(self.scores, 1, name="predictions") 123 | 124 | #self.y_hat = tf.sigmoid(self.scores) 125 | self.y_hat = tf.nn.softmax(self.scores) 126 | 127 | # CalculateMean cross-entropy loss 128 | with tf.name_scope("loss"): 129 | losses = tf.nn.softmax_cross_entropy_with_logits(logits=self.scores, labels=self.input_y) 130 | self.loss = tf.reduce_mean(losses) + self.l2_reg_lambda * self.l2_loss 131 | 132 | @with_self_graph 133 | def BuildTrainGraph(self): 134 | self.global_step = tf.Variable(0, name="global_step", trainable=False) 135 | optimizer = tf.train.AdamOptimizer(self.learning_rate) 136 | #optimizer = tf.train.AdadeltaOptimizer (self.learning_rate) 137 | self.train_op = optimizer.minimize(self.loss, global_step=self.global_step) -------------------------------------------------------------------------------- /icd9_cnn/cnn_model.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zliendo/AI_MedicalNotes/87e1e8eb574b30c801b951e87b5edccc10d2a231/icd9_cnn/cnn_model.pyc 
-------------------------------------------------------------------------------- /icd9_cnn/cnn_top20_leave.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Clasifying Top 20 leaf icd-9 codes\n", 8 | "\n", 9 | "Running with the full file" 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": 1, 15 | "metadata": { 16 | "collapsed": false 17 | }, 18 | "outputs": [ 19 | { 20 | "name": "stderr", 21 | "output_type": "stream", 22 | "text": [ 23 | "Using TensorFlow backend.\n" 24 | ] 25 | } 26 | ], 27 | "source": [ 28 | "%load_ext autoreload\n", 29 | "%autoreload 2\n", 30 | "# General imports\n", 31 | "import numpy as np\n", 32 | "import pandas as pd\n", 33 | "from sklearn.metrics import f1_score\n", 34 | "import sys \n", 35 | "\n", 36 | "#keras\n", 37 | "from keras.models import Sequential, Model\n", 38 | "from keras.layers import Dense, Dropout, Flatten, Input, MaxPooling1D, Convolution1D, Embedding\n", 39 | "from keras.layers.merge import Concatenate\n", 40 | "\n", 41 | "# Custom functions\n", 42 | "sys.path.append(\"../pipeline\")\n", 43 | "import icd9_cnn_model\n", 44 | "import database_selection\n", 45 | "import vectorization\n", 46 | "import helpers" 47 | ] 48 | }, 49 | { 50 | "cell_type": "markdown", 51 | "metadata": {}, 52 | "source": [ 53 | "## Read Input File" 54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "execution_count": 2, 59 | "metadata": { 60 | "collapsed": true 61 | }, 62 | "outputs": [], 63 | "source": [ 64 | "df = pd.read_csv('../data/disch_notes_all_icd9.csv',\n", 65 | " names = ['HADM_ID', 'SUBJECT_ID', 'DATE', 'ICD9','TEXT'])" 66 | ] 67 | }, 68 | { 69 | "cell_type": "code", 70 | "execution_count": 3, 71 | "metadata": { 72 | "collapsed": false 73 | }, 74 | "outputs": [], 75 | "source": [ 76 | "N_TOP = 20 \n", 77 | "full_df, top_codes = database_selection.filter_top_codes(df, 'ICD9', N_TOP, filter_empty = True)\n", 78 | "#df = full_df.head(1000)\n", 79 | "df = full_df" 80 | ] 81 | }, 82 | { 83 | "cell_type": "markdown", 84 | "metadata": {}, 85 | "source": [ 86 | "## Vectorize Labels" 87 | ] 88 | }, 89 | { 90 | "cell_type": "code", 91 | "execution_count": 4, 92 | "metadata": { 93 | "collapsed": false 94 | }, 95 | "outputs": [], 96 | "source": [ 97 | "#preprocess icd9 codes\n", 98 | "labels = vectorization.vectorize_icd_column(df, 'ICD9', top_codes)\n" 99 | ] 100 | }, 101 | { 102 | "cell_type": "markdown", 103 | "metadata": {}, 104 | "source": [ 105 | "## Vectorize Notes" 106 | ] 107 | }, 108 | { 109 | "cell_type": "code", 110 | "execution_count": 5, 111 | "metadata": { 112 | "collapsed": false 113 | }, 114 | "outputs": [ 115 | { 116 | "name": "stdout", 117 | "output_type": "stream", 118 | "text": [ 119 | "Vocabulary size: 130488\n", 120 | "Average note length: 1728.09244863\n", 121 | "Max note length: 10924\n", 122 | "Final Vocabulary: 130488\n", 123 | "Final Max Sequence Length: 5000\n" 124 | ] 125 | } 126 | ], 127 | "source": [ 128 | "#preprocess notes\n", 129 | "MAX_VOCAB = None # to limit original number of words (None if no limit)\n", 130 | "MAX_SEQ_LENGTH = 5000 # to limit length of word sequence (None if no limit)\n", 131 | "df.TEXT = vectorization.clean_notes(df, 'TEXT')\n", 132 | "data, dictionary, MAX_VOCAB = vectorization.vectorize_notes(df.TEXT, MAX_VOCAB, verbose = True)\n", 133 | "data, MAX_SEQ_LENGTH = vectorization.pad_notes(data, MAX_SEQ_LENGTH)\n", 134 | "print(\"Final Vocabulary: %s\" % 
MAX_VOCAB)\n", 135 | "print(\"Final Max Sequence Length: %s\" % MAX_SEQ_LENGTH)" 136 | ] 137 | }, 138 | { 139 | "cell_type": "code", 140 | "execution_count": 8, 141 | "metadata": { 142 | "collapsed": false 143 | }, 144 | "outputs": [ 145 | { 146 | "name": "stdout", 147 | "output_type": "stream", 148 | "text": [ 149 | "('Vocabulary in notes:', 130488)\n", 150 | "('Vocabulary in original embedding:', 21056)\n", 151 | "('Vocabulary intersection:', 20620)\n" 152 | ] 153 | } 154 | ], 155 | "source": [ 156 | "#creating embeddings\n", 157 | "#EMBEDDING_LOC = '../data/glove.6B.100d.txt' # location of embedding\n", 158 | "# embedding pre-trained will all MIMIC notes\n", 159 | "EMBEDDING_LOC = '../data/notes.100.txt' # location of embedding\n", 160 | "EMBEDDING_DIM = 100 # given the glove that we chose\n", 161 | "embedding_matrix, embedding_dict = vectorization.embedding_matrix(EMBEDDING_LOC,\n", 162 | " dictionary, EMBEDDING_DIM, verbose = True)\n" 163 | ] 164 | }, 165 | { 166 | "cell_type": "markdown", 167 | "metadata": {}, 168 | "source": [ 169 | "## Split Files" 170 | ] 171 | }, 172 | { 173 | "cell_type": "code", 174 | "execution_count": 6, 175 | "metadata": { 176 | "collapsed": false 177 | }, 178 | "outputs": [ 179 | { 180 | "name": "stdout", 181 | "output_type": "stream", 182 | "text": [ 183 | "('Train: ', (30794, 5000), (30794, 20))\n", 184 | "('Validation: ', (8798, 5000), (8798, 20))\n", 185 | "('Test: ', (4400, 5000), (4400, 20))\n" 186 | ] 187 | } 188 | ], 189 | "source": [ 190 | "#split sets\n", 191 | "X_train, X_val, X_test, y_train, y_val, y_test = helpers.train_val_test_split(\n", 192 | " data, labels, val_size=0.2, test_size=0.1, random_state=101)\n", 193 | "print(\"Train: \", X_train.shape, y_train.shape)\n", 194 | "print(\"Validation: \", X_val.shape, y_val.shape)\n", 195 | "print(\"Test: \", X_test.shape, y_test.shape)" 196 | ] 197 | }, 198 | { 199 | "cell_type": "code", 200 | "execution_count": 7, 201 | "metadata": { 202 | "collapsed": false 203 | }, 204 | "outputs": [], 205 | "source": [ 206 | "# Delete temporary variables to free some memory\n", 207 | "del df, data, labels" 208 | ] 209 | }, 210 | { 211 | "cell_type": "markdown", 212 | "metadata": {}, 213 | "source": [ 214 | "## CNN for text classification\n", 215 | "\n", 216 | "Based on the following papers and links:\n", 217 | "* \"Convolutional Neural Networks for Sentence Classification\" \n", 218 | "* \"A Sensitivity Analysis of (and Practitioners� Guide to) Convolutional Neural Networks for Sentence Classification\"\n", 219 | "* http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/\n", 220 | "* https://github.com/alexander-rakhlin/CNN-for-Sentence-Classification-in-Keras/blob/master/sentiment_cnn.py" 221 | ] 222 | }, 223 | { 224 | "cell_type": "code", 225 | "execution_count": 9, 226 | "metadata": { 227 | "collapsed": false 228 | }, 229 | "outputs": [ 230 | { 231 | "name": "stdout", 232 | "output_type": "stream", 233 | "text": [ 234 | "____________________________________________________________________________________________________\n", 235 | "Layer (type) Output Shape Param # Connected to \n", 236 | "====================================================================================================\n", 237 | "input_1 (InputLayer) (None, 5000) 0 \n", 238 | "____________________________________________________________________________________________________\n", 239 | "embedding_1 (Embedding) (None, 5000, 100) 13048900 input_1[0][0] \n", 240 | 
"____________________________________________________________________________________________________\n", 241 | "conv1d_1 (Conv1D) (None, 4999, 100) 20100 embedding_1[0][0] \n", 242 | "____________________________________________________________________________________________________\n", 243 | "conv1d_2 (Conv1D) (None, 4998, 100) 30100 embedding_1[0][0] \n", 244 | "____________________________________________________________________________________________________\n", 245 | "conv1d_3 (Conv1D) (None, 4997, 100) 40100 embedding_1[0][0] \n", 246 | "____________________________________________________________________________________________________\n", 247 | "conv1d_4 (Conv1D) (None, 4996, 100) 50100 embedding_1[0][0] \n", 248 | "____________________________________________________________________________________________________\n", 249 | "max_pooling1d_1 (MaxPooling1D) (None, 1, 100) 0 conv1d_1[0][0] \n", 250 | "____________________________________________________________________________________________________\n", 251 | "max_pooling1d_2 (MaxPooling1D) (None, 1, 100) 0 conv1d_2[0][0] \n", 252 | "____________________________________________________________________________________________________\n", 253 | "max_pooling1d_3 (MaxPooling1D) (None, 1, 100) 0 conv1d_3[0][0] \n", 254 | "____________________________________________________________________________________________________\n", 255 | "max_pooling1d_4 (MaxPooling1D) (None, 1, 100) 0 conv1d_4[0][0] \n", 256 | "____________________________________________________________________________________________________\n", 257 | "flatten_1 (Flatten) (None, 100) 0 max_pooling1d_1[0][0] \n", 258 | "____________________________________________________________________________________________________\n", 259 | "flatten_2 (Flatten) (None, 100) 0 max_pooling1d_2[0][0] \n", 260 | "____________________________________________________________________________________________________\n", 261 | "flatten_3 (Flatten) (None, 100) 0 max_pooling1d_3[0][0] \n", 262 | "____________________________________________________________________________________________________\n", 263 | "flatten_4 (Flatten) (None, 100) 0 max_pooling1d_4[0][0] \n", 264 | "____________________________________________________________________________________________________\n", 265 | "concatenate_1 (Concatenate) (None, 400) 0 flatten_1[0][0] \n", 266 | " flatten_2[0][0] \n", 267 | " flatten_3[0][0] \n", 268 | " flatten_4[0][0] \n", 269 | "____________________________________________________________________________________________________\n", 270 | "dropout_1 (Dropout) (None, 400) 0 concatenate_1[0][0] \n", 271 | "____________________________________________________________________________________________________\n", 272 | "dense_1 (Dense) (None, 20) 8020 dropout_1[0][0] \n", 273 | "====================================================================================================\n", 274 | "Total params: 13,197,320\n", 275 | "Trainable params: 13,197,320\n", 276 | "Non-trainable params: 0\n", 277 | "____________________________________________________________________________________________________\n", 278 | "None\n" 279 | ] 280 | } 281 | ], 282 | "source": [ 283 | "reload(icd9_cnn_model)\n", 284 | "#### build model\n", 285 | "model = icd9_cnn_model.build_icd9_cnn_model (input_seq_length=MAX_SEQ_LENGTH, max_vocab = MAX_VOCAB,\n", 286 | " external_embeddings = True,\n", 287 | " embedding_dim=EMBEDDING_DIM,embedding_matrix=embedding_matrix,\n", 288 | " num_filters = 100, filter_sizes=[2,3,4,5],\n", 
289 | " training_dropout_keep_prob=0.5,\n", 290 | " num_classes=N_TOP )" 291 | ] 292 | }, 293 | { 294 | "cell_type": "code", 295 | "execution_count": 10, 296 | "metadata": { 297 | "collapsed": false 298 | }, 299 | "outputs": [ 300 | { 301 | "name": "stdout", 302 | "output_type": "stream", 303 | "text": [ 304 | "Train on 30794 samples, validate on 8798 samples\n", 305 | "Epoch 1/5\n", 306 | "1008s - loss: 0.4447 - acc: 0.8289 - val_loss: 0.3207 - val_acc: 0.8677\n", 307 | "Epoch 2/5\n", 308 | "984s - loss: 0.3245 - acc: 0.8698 - val_loss: 0.2738 - val_acc: 0.8868\n", 309 | "Epoch 3/5\n", 310 | "981s - loss: 0.2889 - acc: 0.8835 - val_loss: 0.2522 - val_acc: 0.8978\n", 311 | "Epoch 4/5\n", 312 | "980s - loss: 0.2708 - acc: 0.8915 - val_loss: 0.2422 - val_acc: 0.9047\n", 313 | "Epoch 5/5\n", 314 | "977s - loss: 0.2605 - acc: 0.8965 - val_loss: 0.2391 - val_acc: 0.9050\n" 315 | ] 316 | }, 317 | { 318 | "data": { 319 | "text/plain": [ 320 | "" 321 | ] 322 | }, 323 | "execution_count": 10, 324 | "metadata": {}, 325 | "output_type": "execute_result" 326 | } 327 | ], 328 | "source": [ 329 | "#first 5 epochs\n", 330 | "model.fit(X_train, y_train, batch_size=50, epochs=5, validation_data=(X_val, y_val), verbose=2)" 331 | ] 332 | }, 333 | { 334 | "cell_type": "code", 335 | "execution_count": 11, 336 | "metadata": { 337 | "collapsed": true 338 | }, 339 | "outputs": [], 340 | "source": [ 341 | "model.save('models/cnn_5_epochs_allr.h5')" 342 | ] 343 | }, 344 | { 345 | "cell_type": "code", 346 | "execution_count": 12, 347 | "metadata": { 348 | "collapsed": false 349 | }, 350 | "outputs": [ 351 | { 352 | "name": "stdout", 353 | "output_type": "stream", 354 | "text": [ 355 | "F1 scores\n", 356 | "threshold | training | dev \n", 357 | "0.020: 0.358 0.353\n", 358 | "0.030: 0.407 0.400\n", 359 | "0.040: 0.456 0.446\n", 360 | "0.050: 0.502 0.488\n", 361 | "0.055: 0.523 0.508\n", 362 | "0.058: 0.534 0.518\n", 363 | "0.060: 0.541 0.525\n", 364 | "0.080: 0.602 0.582\n", 365 | "0.100: 0.645 0.621\n", 366 | "0.200: 0.732 0.704\n", 367 | "0.300: 0.747 0.717\n", 368 | "0.400: 0.738 0.707\n", 369 | "0.500: 0.712 0.679\n", 370 | "0.600: 0.668 0.631\n", 371 | "0.700: 0.594 0.558\n" 372 | ] 373 | } 374 | ], 375 | "source": [ 376 | "pred_train = model.predict(X_train, batch_size=200)\n", 377 | "pred_dev = model.predict(X_val, batch_size=200)\n", 378 | "helpers.show_f1_score(y_train, pred_train, y_val, pred_dev)" 379 | ] 380 | }, 381 | { 382 | "cell_type": "code", 383 | "execution_count": 13, 384 | "metadata": { 385 | "collapsed": false 386 | }, 387 | "outputs": [ 388 | { 389 | "name": "stdout", 390 | "output_type": "stream", 391 | "text": [ 392 | "Train on 30794 samples, validate on 8798 samples\n", 393 | "Epoch 1/2\n", 394 | "834s - loss: 0.2518 - acc: 0.9010 - val_loss: 0.2371 - val_acc: 0.9076\n", 395 | "Epoch 2/2\n", 396 | "837s - loss: 0.2437 - acc: 0.9041 - val_loss: 0.2367 - val_acc: 0.9075\n" 397 | ] 398 | }, 399 | { 400 | "data": { 401 | "text/plain": [ 402 | "" 403 | ] 404 | }, 405 | "execution_count": 13, 406 | "metadata": {}, 407 | "output_type": "execute_result" 408 | } 409 | ], 410 | "source": [ 411 | "# 2 more epochs\n", 412 | "model.fit(X_train, y_train, batch_size=50, epochs=2, validation_data=(X_val, y_val), verbose=2)" 413 | ] 414 | }, 415 | { 416 | "cell_type": "code", 417 | "execution_count": 14, 418 | "metadata": { 419 | "collapsed": true 420 | }, 421 | "outputs": [], 422 | "source": [ 423 | "model.save('models/cnn_7_epochs_allr.h5')" 424 | ] 425 | }, 426 | { 427 | "cell_type": "code", 428 | 
"execution_count": 15, 429 | "metadata": { 430 | "collapsed": false 431 | }, 432 | "outputs": [ 433 | { 434 | "name": "stdout", 435 | "output_type": "stream", 436 | "text": [ 437 | "F1 scores\n", 438 | "threshold | training | dev \n", 439 | "0.020: 0.373 0.366\n", 440 | "0.030: 0.424 0.412\n", 441 | "0.040: 0.472 0.455\n", 442 | "0.050: 0.515 0.494\n", 443 | "0.055: 0.536 0.512\n", 444 | "0.058: 0.548 0.522\n", 445 | "0.060: 0.555 0.528\n", 446 | "0.080: 0.622 0.587\n", 447 | "0.100: 0.669 0.629\n", 448 | "0.200: 0.759 0.713\n", 449 | "0.300: 0.774 0.724\n", 450 | "0.400: 0.767 0.714\n", 451 | "0.500: 0.746 0.691\n", 452 | "0.600: 0.708 0.651\n", 453 | "0.700: 0.644 0.584\n" 454 | ] 455 | } 456 | ], 457 | "source": [ 458 | "pred_train = model.predict(X_train, batch_size=200)\n", 459 | "pred_dev = model.predict(X_val, batch_size=200)\n", 460 | "helpers.show_f1_score(y_train, pred_train, y_val, pred_dev)" 461 | ] 462 | } 463 | ], 464 | "metadata": { 465 | "kernelspec": { 466 | "display_name": "Python [Root]", 467 | "language": "python", 468 | "name": "Python [Root]" 469 | }, 470 | "language_info": { 471 | "codemirror_mode": { 472 | "name": "ipython", 473 | "version": 2 474 | }, 475 | "file_extension": ".py", 476 | "mimetype": "text/x-python", 477 | "name": "python", 478 | "nbconvert_exporter": "python", 479 | "pygments_lexer": "ipython2", 480 | "version": "2.7.12" 481 | } 482 | }, 483 | "nbformat": 4, 484 | "nbformat_minor": 2 485 | } 486 | -------------------------------------------------------------------------------- /icd9_cnn/mimic_CNN_text_classification.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zliendo/AI_MedicalNotes/87e1e8eb574b30c801b951e87b5edccc10d2a231/icd9_cnn/mimic_CNN_text_classification.png -------------------------------------------------------------------------------- /icd9_cnn/tf_saved/cnn_trained.data-00000-of-00001: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zliendo/AI_MedicalNotes/87e1e8eb574b30c801b951e87b5edccc10d2a231/icd9_cnn/tf_saved/cnn_trained.data-00000-of-00001 -------------------------------------------------------------------------------- /icd9_cnn/tf_saved/cnn_trained.index: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zliendo/AI_MedicalNotes/87e1e8eb574b30c801b951e87b5edccc10d2a231/icd9_cnn/tf_saved/cnn_trained.index -------------------------------------------------------------------------------- /icd9_cnn/tf_saved/cnn_trained.meta: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zliendo/AI_MedicalNotes/87e1e8eb574b30c801b951e87b5edccc10d2a231/icd9_cnn/tf_saved/cnn_trained.meta -------------------------------------------------------------------------------- /icd9_cnn/utils.py: -------------------------------------------------------------------------------- 1 | ## Code from w266 materials 2 | 3 | from IPython.display import display 4 | import itertools 5 | import numpy as np 6 | import pandas as pd 7 | import re 8 | import time 9 | 10 | def flatten(list_of_lists): 11 | """Flatten a list-of-lists into a single list.""" 12 | return list(itertools.chain.from_iterable(list_of_lists)) 13 | 14 | def pretty_print_matrix(M, rows=None, cols=None, dtype=float): 15 | """Pretty-print a matrix using Pandas. 
16 | 17 | Args: 18 | M : 2D numpy array 19 | rows : list of row labels 20 | cols : list of column labels 21 | dtype : data type (float or int) 22 | """ 23 | display(pd.DataFrame(M, index=rows, columns=cols, dtype=dtype)) 24 | 25 | def pretty_timedelta(fmt="%d:%02d:%02d", since=None, until=None): 26 | """Pretty-print a timedelta, using the given format string.""" 27 | since = since or time.time() 28 | until = until or time.time() 29 | delta_s = until - since 30 | hours, remainder = divmod(delta_s, 3600) 31 | minutes, seconds = divmod(remainder, 60) 32 | return fmt % (hours, minutes, seconds) 33 | 34 | 35 | ## 36 | # Word processing functions 37 | def canonicalize_digits(word): 38 | if any([c.isalpha() for c in word]): return word 39 | word = re.sub("\d", "DG", word) 40 | if word.startswith("DG"): 41 | word = word.replace(",", "") # remove thousands separator 42 | return word 43 | 44 | def canonicalize_word(word, wordset=None, digits=True): 45 | word = word.lower() 46 | if digits: 47 | if (wordset != None) and (word in wordset): return word 48 | word = canonicalize_digits(word) # try to canonicalize numbers 49 | if (wordset == None) or (word in wordset): return word 50 | else: return "" # unknown token 51 | 52 | def canonicalize_words(words, **kw): 53 | return [canonicalize_word(word, **kw) for word in words] 54 | 55 | ## 56 | # Data loading functions 57 | import nltk 58 | import vocabulary 59 | 60 | def get_corpus(name="brown"): 61 | return nltk.corpus.__getattr__(name) 62 | 63 | def build_vocab(corpus, V=10000): 64 | token_feed = (canonicalize_word(w) for w in corpus.words()) 65 | vocab = vocabulary.Vocabulary(token_feed, size=V) 66 | return vocab 67 | 68 | def get_train_test_sents(corpus, split=0.8, shuffle=True): 69 | """Get train and test sentences. 70 | 71 | Args: 72 | corpus: nltk.corpus that supports sents() function 73 | split (double): fraction to use as training set 74 | shuffle (int or bool): seed for shuffle of input data, or False to just 75 | take the training data as the first xx% contiguously. 76 | 77 | Returns: 78 | train_sentences, test_sentences ( list(list(string)) ): the train and test 79 | splits 80 | """ 81 | sentences = np.array(corpus.sents(), dtype=object) 82 | fmt = (len(sentences), sum(map(len, sentences))) 83 | print "Loaded %d sentences (%g tokens)" % fmt 84 | 85 | if shuffle: 86 | rng = np.random.RandomState(shuffle) 87 | rng.shuffle(sentences) # in-place 88 | train_frac = 0.8 89 | split_idx = int(train_frac * len(sentences)) 90 | train_sentences = sentences[:split_idx] 91 | test_sentences = sentences[split_idx:] 92 | 93 | fmt = (len(train_sentences), sum(map(len, train_sentences))) 94 | print "Training set: %d sentences (%d tokens)" % fmt 95 | fmt = (len(test_sentences), sum(map(len, test_sentences))) 96 | print "Test set: %d sentences (%d tokens)" % fmt 97 | 98 | return train_sentences, test_sentences 99 | 100 | def preprocess_sentences(sentences, vocab): 101 | """Preprocess sentences by canonicalizing and mapping to ids. 102 | 103 | Args: 104 | sentences ( list(list(string)) ): input sentences 105 | vocab: Vocabulary object, already initialized 106 | 107 | Returns: 108 | ids ( array(int) ): flattened array of sentences, including boundary 109 | tokens. 
110 | """ 111 | # Add sentence boundaries, canonicalize, and handle unknowns 112 | words = [""] + flatten(s + [""] for s in sentences) 113 | words = [canonicalize_word(w, wordset=vocab.word_to_id) 114 | for w in words] 115 | return np.array(vocab.words_to_ids(words)) 116 | 117 | ## 118 | # Use this function 119 | def load_corpus(name, split=0.8, V=10000, shuffle=0): 120 | """Load a named corpus and split train/test along sentences.""" 121 | corpus = get_corpus(name) 122 | vocab = build_vocab(corpus, V) 123 | train_sentences, test_sentences = get_train_test_sents(corpus, split, shuffle) 124 | train_ids = preprocess_sentences(train_sentences, vocab) 125 | test_ids = preprocess_sentences(test_sentences, vocab) 126 | return vocab, train_ids, test_ids 127 | 128 | ## 129 | # Use this function 130 | def batch_generator(ids, batch_size, max_time): 131 | """Convert ids to data-matrix form.""" 132 | # Clip to multiple of max_time for convenience 133 | clip_len = ((len(ids)-1) / batch_size) * batch_size 134 | input_w = ids[:clip_len] # current word 135 | target_y = ids[1:clip_len+1] # next word 136 | # Reshape so we can select columns 137 | input_w = input_w.reshape([batch_size,-1]) 138 | target_y = target_y.reshape([batch_size,-1]) 139 | 140 | # Yield batches 141 | for i in xrange(0, input_w.shape[1], max_time): 142 | yield input_w[:,i:i+max_time], target_y[:,i:i+max_time] 143 | 144 | 145 | -------------------------------------------------------------------------------- /icd9_cnn/utils.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zliendo/AI_MedicalNotes/87e1e8eb574b30c801b951e87b5edccc10d2a231/icd9_cnn/utils.pyc -------------------------------------------------------------------------------- /icd9_cnn/vocabulary.py: -------------------------------------------------------------------------------- 1 | ## Code from w266 class material 2 | import collections 3 | 4 | class Vocabulary(object): 5 | 6 | START_TOKEN = "" 7 | END_TOKEN = "" 8 | UNK_TOKEN = "" 9 | 10 | def __init__(self, tokens, size=None): 11 | self.unigram_counts = collections.Counter(tokens) 12 | self.num_unigrams = sum(self.unigram_counts.itervalues()) 13 | # leave space for "", "", and "" 14 | top_counts = self.unigram_counts.most_common(None if size is None else (size - 3)) 15 | vocab = ([self.START_TOKEN, self.END_TOKEN, self.UNK_TOKEN] + 16 | [w for w,c in top_counts]) 17 | 18 | # Assign an id to each word, by frequency 19 | self.id_to_word = dict(enumerate(vocab)) 20 | self.word_to_id = {v:k for k,v in self.id_to_word.iteritems()} 21 | self.size = len(self.id_to_word) 22 | if size is not None: 23 | assert(self.size <= size) 24 | 25 | # For convenience 26 | self.wordset = set(self.word_to_id.iterkeys()) 27 | 28 | # Store special IDs 29 | self.START_ID = self.word_to_id[self.START_TOKEN] 30 | self.END_ID = self.word_to_id[self.END_TOKEN] 31 | self.UNK_ID = self.word_to_id[self.UNK_TOKEN] 32 | 33 | def words_to_ids(self, words): 34 | return [self.word_to_id.get(w, self.UNK_ID) for w in words] 35 | 36 | def ids_to_words(self, ids): 37 | return [self.id_to_word[i] for i in ids] 38 | 39 | def sentence_to_ids(self, words): 40 | return [self.START_ID] + self.words_to_ids(words) + [self.END_ID] 41 | 42 | def ordered_words(self): 43 | """Return a list of words, ordered by id.""" 44 | return self.ids_to_words(range(self.size)) 45 | -------------------------------------------------------------------------------- /icd9_cnn/vocabulary.pyc: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/zliendo/AI_MedicalNotes/87e1e8eb574b30c801b951e87b5edccc10d2a231/icd9_cnn/vocabulary.pyc -------------------------------------------------------------------------------- /pipeline/.ipynb_checkpoints/Temp Guillaume-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "deletable": true, 7 | "editable": true 8 | }, 9 | "source": [ 10 | "# Clean(er) Pipeline" 11 | ] 12 | }, 13 | { 14 | "cell_type": "markdown", 15 | "metadata": { 16 | "deletable": true, 17 | "editable": true 18 | }, 19 | "source": [ 20 | "This is an attempt to merge the pipelines from Zenobia and Guillaume" 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "metadata": { 26 | "deletable": true, 27 | "editable": true 28 | }, 29 | "source": [ 30 | "## Importing Modules" 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": 64, 36 | "metadata": { 37 | "collapsed": false, 38 | "deletable": true, 39 | "editable": true 40 | }, 41 | "outputs": [], 42 | "source": [ 43 | "# General imports\n", 44 | "import numpy as np\n", 45 | "import pandas as pd\n", 46 | "import re, nltk, string, os\n", 47 | "from sklearn.model_selection import train_test_split\n", 48 | "from sklearn.metrics import f1_score\n", 49 | "import datetime, time\n", 50 | "import matplotlib\n", 51 | "from collections import Counter\n", 52 | "from matplotlib import pyplot as plt\n", 53 | "matplotlib.style.use('ggplot')\n", 54 | "%matplotlib inline" 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": 65, 60 | "metadata": { 61 | "collapsed": false, 62 | "deletable": true, 63 | "editable": true 64 | }, 65 | "outputs": [], 66 | "source": [ 67 | "# NN imports\n", 68 | "# Upgrade the package called dask\n", 69 | "import keras\n", 70 | "from keras.preprocessing.text import Tokenizer\n", 71 | "from keras.preprocessing.sequence import pad_sequences\n", 72 | "from keras.layers import Input, Conv1D, MaxPooling1D, Flatten, Dense, Embedding\n", 73 | "from keras.models import Model" 74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": 66, 79 | "metadata": { 80 | "collapsed": false, 81 | "deletable": true, 82 | "editable": true 83 | }, 84 | "outputs": [ 85 | { 86 | "name": "stdout", 87 | "output_type": "stream", 88 | "text": [ 89 | "The autoreload extension is already loaded. To reload it, use:\n", 90 | " %reload_ext autoreload\n" 91 | ] 92 | } 93 | ], 94 | "source": [ 95 | "# Custom functions\n", 96 | "%load_ext autoreload\n", 97 | "%autoreload 2\n", 98 | "import database_selection\n", 99 | "import vectorization\n", 100 | "import helpers" 101 | ] 102 | }, 103 | { 104 | "cell_type": "markdown", 105 | "metadata": { 106 | "deletable": true, 107 | "editable": true 108 | }, 109 | "source": [ 110 | "## Select data corresponding to the top ICD codes" 111 | ] 112 | }, 113 | { 114 | "cell_type": "markdown", 115 | "metadata": { 116 | "deletable": true, 117 | "editable": true 118 | }, 119 | "source": [ 120 | "Here, we filter for only the top `n_top` ICD codes \n", 121 | "\n", 122 | "Note: We offer the option to exclude notes that do not contain any of the top codes. However, it may actually be more rigorous to keep them, no?" 
123 | ] 124 | }, 125 | { 126 | "cell_type": "code", 127 | "execution_count": 67, 128 | "metadata": { 129 | "collapsed": false, 130 | "deletable": true, 131 | "editable": true 132 | }, 133 | "outputs": [], 134 | "source": [ 135 | "# Inputs\n", 136 | "N_TOP = 20\n", 137 | "df = pd.read_csv('../data/disch_notes_all_icd9.csv',\n", 138 | " names = ['HADM_ID', 'SUBJECT_ID', 'DATE', 'ICD9','TEXT'])\n", 139 | "df = df.head(10000) # Speeding up" 140 | ] 141 | }, 142 | { 143 | "cell_type": "code", 144 | "execution_count": 68, 145 | "metadata": { 146 | "collapsed": false, 147 | "deletable": true, 148 | "editable": true 149 | }, 150 | "outputs": [], 151 | "source": [ 152 | "df, top_codes = database_selection.filter_top_codes(df, 'ICD9', N_TOP, filter_empty = False)" 153 | ] 154 | }, 155 | { 156 | "cell_type": "code", 157 | "execution_count": 69, 158 | "metadata": { 159 | "collapsed": false, 160 | "deletable": true, 161 | "editable": true 162 | }, 163 | "outputs": [ 164 | { 165 | "data": { 166 | "text/plain": [ 167 | "(10000, 5)" 168 | ] 169 | }, 170 | "execution_count": 69, 171 | "metadata": {}, 172 | "output_type": "execute_result" 173 | } 174 | ], 175 | "source": [ 176 | "df.shape" 177 | ] 178 | }, 179 | { 180 | "cell_type": "code", 181 | "execution_count": 70, 182 | "metadata": { 183 | "collapsed": false, 184 | "deletable": true, 185 | "editable": true 186 | }, 187 | "outputs": [ 188 | { 189 | "data": { 190 | "text/plain": [ 191 | "['4019', '4280', '42731', '41401', '5849']" 192 | ] 193 | }, 194 | "execution_count": 70, 195 | "metadata": {}, 196 | "output_type": "execute_result" 197 | } 198 | ], 199 | "source": [ 200 | "top_codes[0:5]" 201 | ] 202 | }, 203 | { 204 | "cell_type": "code", 205 | "execution_count": 71, 206 | "metadata": { 207 | "collapsed": false, 208 | "deletable": true, 209 | "editable": true 210 | }, 211 | "outputs": [ 212 | { 213 | "data": { 214 | "text/html": [ 215 | "
\n", 216 | "\n", 229 | "\n", 230 | " \n", 231 | " \n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | " \n", 237 | " \n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " \n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | "
HADM_IDSUBJECT_IDDATEICD9TEXT
0100001585262117-09-17 00:00:005849Admission Date: [**2117-9-11**] ...
1100003546102150-04-21 00:00:004019 2851Admission Date: [**2150-4-17**] ...
210000698952108-04-17 00:00:0051881 486Admission Date: [**2108-4-6**] Discharg...
3100007230182145-04-07 00:00:004019 486Admission Date: [**2145-3-31**] ...
41000095332162-05-21 00:00:004019 41401 25000 2720 2859Admission Date: [**2162-5-16**] ...
\n", 283 | "
" 284 | ], 285 | "text/plain": [ 286 | " HADM_ID SUBJECT_ID DATE ICD9 \\\n", 287 | "0 100001 58526 2117-09-17 00:00:00 5849 \n", 288 | "1 100003 54610 2150-04-21 00:00:00 4019 2851 \n", 289 | "2 100006 9895 2108-04-17 00:00:00 51881 486 \n", 290 | "3 100007 23018 2145-04-07 00:00:00 4019 486 \n", 291 | "4 100009 533 2162-05-21 00:00:00 4019 41401 25000 2720 2859 \n", 292 | "\n", 293 | " TEXT \n", 294 | "0 Admission Date: [**2117-9-11**] ... \n", 295 | "1 Admission Date: [**2150-4-17**] ... \n", 296 | "2 Admission Date: [**2108-4-6**] Discharg... \n", 297 | "3 Admission Date: [**2145-3-31**] ... \n", 298 | "4 Admission Date: [**2162-5-16**] ... " 299 | ] 300 | }, 301 | "execution_count": 71, 302 | "metadata": {}, 303 | "output_type": "execute_result" 304 | } 305 | ], 306 | "source": [ 307 | "df.head()" 308 | ] 309 | }, 310 | { 311 | "cell_type": "markdown", 312 | "metadata": { 313 | "deletable": true, 314 | "editable": true 315 | }, 316 | "source": [ 317 | "## Vectorize ICD9 codes" 318 | ] 319 | }, 320 | { 321 | "cell_type": "markdown", 322 | "metadata": { 323 | "deletable": true, 324 | "editable": true 325 | }, 326 | "source": [ 327 | "Here we vectorize and move it to an `np.array` because that is what TensorFlow prefers" 328 | ] 329 | }, 330 | { 331 | "cell_type": "code", 332 | "execution_count": 72, 333 | "metadata": { 334 | "collapsed": false, 335 | "deletable": true, 336 | "editable": true 337 | }, 338 | "outputs": [], 339 | "source": [ 340 | "labels = vectorization.vectorize_icd_column(df, 'ICD9', top_codes)" 341 | ] 342 | }, 343 | { 344 | "cell_type": "code", 345 | "execution_count": 73, 346 | "metadata": { 347 | "collapsed": false, 348 | "deletable": true, 349 | "editable": true 350 | }, 351 | "outputs": [ 352 | { 353 | "data": { 354 | "text/plain": [ 355 | "array([[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],\n", 356 | " [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],\n", 357 | " [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],\n", 358 | " [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],\n", 359 | " [1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])" 360 | ] 361 | }, 362 | "execution_count": 73, 363 | "metadata": {}, 364 | "output_type": "execute_result" 365 | } 366 | ], 367 | "source": [ 368 | "labels[0:5]" 369 | ] 370 | }, 371 | { 372 | "cell_type": "markdown", 373 | "metadata": { 374 | "deletable": true, 375 | "editable": true 376 | }, 377 | "source": [ 378 | "## Clean, and write text for embedding" 379 | ] 380 | }, 381 | { 382 | "cell_type": "code", 383 | "execution_count": 74, 384 | "metadata": { 385 | "collapsed": true, 386 | "deletable": true, 387 | "editable": true 388 | }, 389 | "outputs": [], 390 | "source": [ 391 | "# Clean\n", 392 | "df.TEXT = vectorization.clean_notes(df, 'TEXT')" 393 | ] 394 | }, 395 | { 396 | "cell_type": "code", 397 | "execution_count": 75, 398 | "metadata": { 399 | "collapsed": false, 400 | "deletable": true, 401 | "editable": true 402 | }, 403 | "outputs": [], 404 | "source": [ 405 | "helpers.write_col(df, 'TEXT', '../data/only_text.csv')" 406 | ] 407 | }, 408 | { 409 | "cell_type": "markdown", 410 | "metadata": { 411 | "deletable": true, 412 | "editable": true 413 | }, 414 | "source": [ 415 | "## Vectorize text and pad sequence" 416 | ] 417 | }, 418 | { 419 | "cell_type": "markdown", 420 | "metadata": { 421 | "deletable": true, 422 | "editable": true 423 | }, 424 | "source": [ 425 | "Here, we vectorize the text and pad with 0s so that notes appear of the same length" 426 | ] 427 
| }, 428 | { 429 | "cell_type": "code", 430 | "execution_count": 76, 431 | "metadata": { 432 | "collapsed": true, 433 | "deletable": true, 434 | "editable": true 435 | }, 436 | "outputs": [], 437 | "source": [ 438 | "# Inputs for tokenization\n", 439 | "MAX_VOCAB = None # to limit original vocabulary to most common words (None if no limit)\n", 440 | "MAX_SEQ_LENGTH = 5000 # to limit length of word sequence (None if no limit)" 441 | ] 442 | }, 443 | { 444 | "cell_type": "code", 445 | "execution_count": 77, 446 | "metadata": { 447 | "collapsed": false, 448 | "deletable": true, 449 | "editable": true 450 | }, 451 | "outputs": [ 452 | { 453 | "name": "stdout", 454 | "output_type": "stream", 455 | "text": [ 456 | "Vocabulary size: 60619\n", 457 | "Average note length: 1623.5809\n", 458 | "Max note length: 8725\n" 459 | ] 460 | } 461 | ], 462 | "source": [ 463 | "# Vectorize\n", 464 | "data, dictionary, MAX_VOCAB = vectorization.vectorize_notes(df.TEXT, MAX_VOCAB, verbose = True)" 465 | ] 466 | }, 467 | { 468 | "cell_type": "code", 469 | "execution_count": 78, 470 | "metadata": { 471 | "collapsed": true, 472 | "deletable": true, 473 | "editable": true 474 | }, 475 | "outputs": [], 476 | "source": [ 477 | "# Pad and turn into a matrix\n", 478 | "data, MAX_SEQ_LENGTH = vectorization.pad_notes(data, MAX_SEQ_LENGTH)" 479 | ] 480 | }, 481 | { 482 | "cell_type": "code", 483 | "execution_count": 79, 484 | "metadata": { 485 | "collapsed": false, 486 | "deletable": true, 487 | "editable": true 488 | }, 489 | "outputs": [ 490 | { 491 | "name": "stdout", 492 | "output_type": "stream", 493 | "text": [ 494 | "Final Vocabulary: 60619\n", 495 | "Final Max Sequence Length: 5000\n" 496 | ] 497 | } 498 | ], 499 | "source": [ 500 | "print(\"Final Vocabulary: %s\" % MAX_VOCAB)\n", 501 | "print(\"Final Max Sequence Length: %s\" % MAX_SEQ_LENGTH)" 502 | ] 503 | }, 504 | { 505 | "cell_type": "code", 506 | "execution_count": 80, 507 | "metadata": { 508 | "collapsed": false, 509 | "deletable": true, 510 | "editable": true 511 | }, 512 | "outputs": [ 513 | { 514 | "data": { 515 | "text/plain": [ 516 | "array([[ 0, 0, 0, ..., 2998, 24, 88],\n", 517 | " [ 0, 0, 0, ..., 1, 374, 35],\n", 518 | " [ 0, 0, 0, ..., 1, 1, 621],\n", 519 | " [ 0, 0, 0, ..., 32, 374, 35],\n", 520 | " [ 0, 0, 0, ..., 67, 374, 35]], dtype=int32)" 521 | ] 522 | }, 523 | "execution_count": 80, 524 | "metadata": {}, 525 | "output_type": "execute_result" 526 | } 527 | ], 528 | "source": [ 529 | "data[0:5] " 530 | ] 531 | }, 532 | { 533 | "cell_type": "markdown", 534 | "metadata": { 535 | "deletable": true, 536 | "editable": true 537 | }, 538 | "source": [ 539 | "## Split into Sets" 540 | ] 541 | }, 542 | { 543 | "cell_type": "markdown", 544 | "metadata": { 545 | "deletable": true, 546 | "editable": true 547 | }, 548 | "source": [ 549 | "Here we split into sets and free up some memory" 550 | ] 551 | }, 552 | { 553 | "cell_type": "code", 554 | "execution_count": 81, 555 | "metadata": { 556 | "collapsed": false, 557 | "deletable": true, 558 | "editable": true 559 | }, 560 | "outputs": [], 561 | "source": [ 562 | "X_train, X_val, X_test, y_train, y_val, y_test = helpers.train_val_test_split(\n", 563 | " data, labels, val_size=0.2, test_size=0.1, random_state=101)" 564 | ] 565 | }, 566 | { 567 | "cell_type": "code", 568 | "execution_count": 82, 569 | "metadata": { 570 | "collapsed": false, 571 | "deletable": true, 572 | "editable": true 573 | }, 574 | "outputs": [ 575 | { 576 | "name": "stdout", 577 | "output_type": "stream", 578 | "text": [ 579 | "Train: 
(6999, 5000) (6999, 20)\n", 580 | "Validation: (2000, 5000) (2000, 20)\n", 581 | "Test: (1001, 5000) (1001, 20)\n" 582 | ] 583 | } 584 | ], 585 | "source": [ 586 | "print(\"Train: \", X_train.shape, y_train.shape)\n", 587 | "print(\"Validation: \", X_val.shape, y_val.shape)\n", 588 | "print(\"Test: \", X_test.shape, y_test.shape)" 589 | ] 590 | }, 591 | { 592 | "cell_type": "code", 593 | "execution_count": 83, 594 | "metadata": { 595 | "collapsed": false, 596 | "deletable": true, 597 | "editable": true 598 | }, 599 | "outputs": [], 600 | "source": [ 601 | "# Delete temporary variables to free some memory\n", 602 | "del df, data, labels" 603 | ] 604 | }, 605 | { 606 | "cell_type": "markdown", 607 | "metadata": { 608 | "deletable": true, 609 | "editable": true 610 | }, 611 | "source": [ 612 | "## Reload Embedding Matrix" 613 | ] 614 | }, 615 | { 616 | "cell_type": "markdown", 617 | "metadata": { 618 | "deletable": true, 619 | "editable": true 620 | }, 621 | "source": [ 622 | "Creates an embedding matrix based on a pretrained vector" 623 | ] 624 | }, 625 | { 626 | "cell_type": "markdown", 627 | "metadata": { 628 | "deletable": true, 629 | "editable": true 630 | }, 631 | "source": [ 632 | "List of pretrained vectors http://ahogrammer.com/2017/01/20/the-list-of-pretrained-word-embeddings/ for embedding Google cannot be downloaded, so I used Glove: \n", 633 | "- Go to https://nlp.stanford.edu/projects/glove/\n", 634 | "- Download a pretrained model, e.g. `glove.6B.zip`, and put the unzipped files in `/data`" 635 | ] 636 | }, 637 | { 638 | "cell_type": "code", 639 | "execution_count": 97, 640 | "metadata": { 641 | "collapsed": false, 642 | "deletable": true, 643 | "editable": true 644 | }, 645 | "outputs": [], 646 | "source": [ 647 | "EMBEDDING_LOC = '../data/notes.100.txt' # location of embedding\n", 648 | "#EMBEDDING_LOC = '../data/glove.6B.100d.txt' # location of embedding\n", 649 | "EMBEDDING_DIM = 100 # given the glove that we chose" 650 | ] 651 | }, 652 | { 653 | "cell_type": "code", 654 | "execution_count": 98, 655 | "metadata": { 656 | "collapsed": false, 657 | "deletable": true, 658 | "editable": true 659 | }, 660 | "outputs": [ 661 | { 662 | "name": "stdout", 663 | "output_type": "stream", 664 | "text": [ 665 | "Vocabulary in notes: 60619\n", 666 | "Vocabulary in original embedding: 21056\n", 667 | "Vocabulary intersection: 20640\n" 668 | ] 669 | } 670 | ], 671 | "source": [ 672 | "embedding_matrix, embedding_dict = vectorization.embedding_matrix(EMBEDDING_LOC,\n", 673 | " dictionary, EMBEDDING_DIM,\n", 674 | " verbose = True, sigma = None)" 675 | ] 676 | }, 677 | { 678 | "cell_type": "code", 679 | "execution_count": 99, 680 | "metadata": { 681 | "collapsed": false, 682 | "deletable": true, 683 | "editable": true 684 | }, 685 | "outputs": [ 686 | { 687 | "data": { 688 | "text/plain": [ 689 | "(60620, 100)" 690 | ] 691 | }, 692 | "execution_count": 99, 693 | "metadata": {}, 694 | "output_type": "execute_result" 695 | } 696 | ], 697 | "source": [ 698 | "embedding_matrix.shape" 699 | ] 700 | }, 701 | { 702 | "cell_type": "code", 703 | "execution_count": 100, 704 | "metadata": { 705 | "collapsed": true, 706 | "deletable": true, 707 | "editable": true 708 | }, 709 | "outputs": [], 710 | "source": [ 711 | "del embedding_dict" 712 | ] 713 | }, 714 | { 715 | "cell_type": "markdown", 716 | "metadata": { 717 | "deletable": true, 718 | "editable": true 719 | }, 720 | "source": [ 721 | "## Simple Neural Network" 722 | ] 723 | }, 724 | { 725 | "cell_type": "markdown", 726 | "metadata": { 727 | 
"deletable": true, 728 | "editable": true 729 | }, 730 | "source": [ 731 | "Simple Neural to show that it works\n", 732 | "- softmax with categorical cross entropy and adam gave f1 = 0.1696042216358839\n", 733 | "- sigmoid with glove original embedding gave 0.27048167970358172" 734 | ] 735 | }, 736 | { 737 | "cell_type": "code", 738 | "execution_count": 101, 739 | "metadata": { 740 | "collapsed": true, 741 | "deletable": true, 742 | "editable": true 743 | }, 744 | "outputs": [], 745 | "source": [ 746 | "EMBEDDING_TRAINABLE = True" 747 | ] 748 | }, 749 | { 750 | "cell_type": "code", 751 | "execution_count": 102, 752 | "metadata": { 753 | "collapsed": false, 754 | "deletable": true, 755 | "editable": true 756 | }, 757 | "outputs": [], 758 | "source": [ 759 | "# We build the embedding layer separately because it's a little more complex than the others\n", 760 | "embedding_layer = Embedding(len(dictionary) + 1,\n", 761 | " EMBEDDING_DIM,\n", 762 | " weights=[embedding_matrix],\n", 763 | " input_length=MAX_SEQ_LENGTH,\n", 764 | " trainable=EMBEDDING_TRAINABLE)" 765 | ] 766 | }, 767 | { 768 | "cell_type": "code", 769 | "execution_count": 103, 770 | "metadata": { 771 | "collapsed": false, 772 | "deletable": true, 773 | "editable": true 774 | }, 775 | "outputs": [], 776 | "source": [ 777 | "sequence_input = Input(shape=(MAX_SEQ_LENGTH,), dtype='int32')\n", 778 | "embedded_sequences = embedding_layer(sequence_input)\n", 779 | "embedded_sequences = Flatten()(embedded_sequences)\n", 780 | "preds = Dense(len(top_codes), activation='sigmoid')(embedded_sequences)\n", 781 | "\n", 782 | "model = Model(sequence_input, preds)\n", 783 | "model.compile(loss='binary_crossentropy',\n", 784 | " optimizer='rmsprop',\n", 785 | " metrics=['acc'])" 786 | ] 787 | }, 788 | { 789 | "cell_type": "code", 790 | "execution_count": 104, 791 | "metadata": { 792 | "collapsed": false, 793 | "deletable": true, 794 | "editable": true 795 | }, 796 | "outputs": [ 797 | { 798 | "name": "stdout", 799 | "output_type": "stream", 800 | "text": [ 801 | "Train on 6999 samples, validate on 2000 samples\n", 802 | "Epoch 1/2\n", 803 | "6999/6999 [==============================] - 81s - loss: 2.1248 - acc: 0.8452 - val_loss: 2.1141 - val_acc: 
0.8571\n", 804 | "Epoch 2/2\n", 805 | "6999/6999 [==============================] - 61s - loss: 1.8115 - acc: 0.8536 - val_loss: 1.8044 - val_acc: 
0.8483\n" 806 | ] 807 | }, 808 | { 809 | "data": { 810 | "text/plain": [ 811 | "" 812 | ] 813 | }, 814 | "execution_count": 104, 815 | "metadata": {}, 816 | "output_type": "execute_result" 817 | } 818 | ], 819 | "source": [ 820 | "model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=2, batch_size=128)" 821 | ] 822 | }, 823 | { 824 | "cell_type": "code", 825 | "execution_count": 105, 826 | "metadata": { 827 | "collapsed": true, 828 | "deletable": true, 829 | "editable": true 830 | }, 831 | "outputs": [], 832 | "source": [ 833 | "pred_val = model.predict(X_val, batch_size=128)" 834 | ] 835 | }, 836 | { 837 | "cell_type": "code", 838 | "execution_count": 106, 839 | "metadata": { 840 | "collapsed": false, 841 | "deletable": true, 842 | "editable": true 843 | }, 844 | "outputs": [ 845 | { 846 | "data": { 847 | "text/plain": [ 848 | "1.0" 849 | ] 850 | }, 851 | "execution_count": 106, 852 | "metadata": {}, 853 | "output_type": "execute_result" 854 | } 855 | ], 856 | "source": [ 857 | "np.max(pred_val)" 858 | ] 859 | }, 860 | { 861 | "cell_type": "code", 862 | "execution_count": 107, 863 | "metadata": { 864 | "collapsed": false, 865 | "deletable": true, 866 | "editable": true 867 | }, 868 | "outputs": [ 869 | { 870 | "data": { 871 | "text/plain": [ 872 | "0.2786826774462014" 873 | ] 874 | }, 875 | "execution_count": 107, 876 | "metadata": {}, 877 | "output_type": "execute_result" 878 | } 879 | ], 880 | "source": [ 881 | "f1_score(y_val, np.where(pred_val>0.5, 1, 0), average = 'micro')" 882 | ] 883 | }, 884 | { 885 | "cell_type": "code", 886 | "execution_count": null, 887 | "metadata": { 888 | "collapsed": true 889 | }, 890 | "outputs": [], 891 | "source": [] 892 | } 893 | ], 894 | "metadata": { 895 | "kernelspec": { 896 | "display_name": "Python 2", 897 | "language": "python", 898 | "name": "python2" 899 | }, 900 | "language_info": { 901 | 
"codemirror_mode": { 902 | "name": "ipython", 903 | "version": 2 904 | }, 905 | "file_extension": ".py", 906 | "mimetype": "text/x-python", 907 | "name": "python", 908 | "nbconvert_exporter": "python", 909 | "pygments_lexer": "ipython2", 910 | "version": "2.7.13" 911 | } 912 | }, 913 | "nbformat": 4, 914 | "nbformat_minor": 2 915 | } 916 | -------------------------------------------------------------------------------- /pipeline/__pycache__/database_selection.cpython-35.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zliendo/AI_MedicalNotes/87e1e8eb574b30c801b951e87b5edccc10d2a231/pipeline/__pycache__/database_selection.cpython-35.pyc -------------------------------------------------------------------------------- /pipeline/__pycache__/helpers.cpython-35.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zliendo/AI_MedicalNotes/87e1e8eb574b30c801b951e87b5edccc10d2a231/pipeline/__pycache__/helpers.cpython-35.pyc -------------------------------------------------------------------------------- /pipeline/__pycache__/vectorization.cpython-35.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zliendo/AI_MedicalNotes/87e1e8eb574b30c801b951e87b5edccc10d2a231/pipeline/__pycache__/vectorization.cpython-35.pyc -------------------------------------------------------------------------------- /pipeline/attention_util.py: -------------------------------------------------------------------------------- 1 | from keras.layers import Dense, Input 2 | from keras.layers import Merge,TimeDistributed 3 | from keras.layers.merge import Concatenate 4 | from keras.layers.core import * 5 | from keras.layers import merge, dot, add 6 | from keras import backend as K 7 | # based on paper: Hierarchical Attention networks for document classification 8 | # starting code from: 9 | # https://groups.google.com/forum/#!msg/keras-users/IWK9opMFavQ/AITppppfAgAJ 10 | 11 | # note: there is a lot of sample codes in the internet that do not work, and their authors do mention that, 12 | # they don't see a difference when applying the attention mechanism 13 | # 14 | # I did have to review closely the formulas presented on the papers about Attention to figure it out what type of 15 | # code will actually work 16 | # Author: Zenobia Liendo 17 | 18 | def attention_layer(inputs, TIME_STEPS,lstm_units, i='1'): 19 | 20 | # inputs.shape = (batch_size, time_steps, input_dim) 21 | #(3) u_it: we first feed the word annotation through a one-layer MLP to get the hidden representation u_it 22 | inputs= Dropout(0.5)(inputs) 23 | u_it = TimeDistributed(Dense(lstm_units, activation='tanh', 24 | kernel_regularizer=regularizers.l2(0.0001), 25 | name='u_it'+i))(inputs) 26 | 27 | u_it= Dropout(0.5)(u_it) 28 | # (4) alpha_it: then we measure the importance of x as the similarity of u_it with a x level 29 | # context vector u_w and get a normalized importance weight alpha_it through a softmax function 30 | # The word context vector uw is randomly initialized and jointly learned during the training process. 
31 | #alpha_it = TimeDistributed(Dense(TIME_STEPS, activation='softmax',use_bias=False))(u_it) 32 | att = TimeDistributed(Dense(1, 33 | kernel_regularizer=regularizers.l2(0.0001), 34 | bias=False))(u_it) 35 | att = Reshape((TIME_STEPS,))(att) 36 | att = Activation('softmax', name='alpha_it_softmax'+i)(att) 37 | 38 | 39 | # (5) s_i: After that, we compute the sentence vector s_i 40 | # as a weighted sum of the word annotations based on the weights alpha_it. 41 | s_i =merge([att, inputs], mode='dot', dot_axes=(1,1), name='s_i_dot'+i) 42 | 43 | 44 | return s_i -------------------------------------------------------------------------------- /pipeline/database_selection.py: -------------------------------------------------------------------------------- 1 | ###This file contains the functions reextracting the database only for the top ICD codes 2 | # Author: Zenobia Liendo 3 | 4 | import numpy as np 5 | import pandas as pd 6 | from collections import Counter 7 | 8 | 9 | def find_top_codes(df, col_name, n): 10 | """ Find the top codes from a columns of strings 11 | Returns a list of strings to make sure codes are treated as classes down the line """ 12 | string_total = df[col_name].str.cat(sep=' ') 13 | counter_total = Counter(string_total.split(' ')) 14 | return [word for word, word_count in counter_total.most_common(n)] 15 | 16 | def select_codes_in_string(string, top_codes): 17 | """ Creates a sring of the codes which are both in the original string 18 | and in the top codes list """ 19 | r = '' 20 | for code in top_codes: 21 | if code in string: 22 | r += ' ' + code 23 | return r.strip() 24 | 25 | def filter_top_codes(df, col_name, n, filter_empty = True): 26 | """ Creates a dataframe with the codes column containing only the top codes 27 | and filters out the lines without any of the top codes if True 28 | 29 | Note: we may actually want to keep even the empty lines """ 30 | r = df.copy() 31 | top_codes = find_top_codes(r, col_name, n) 32 | r[col_name] = r[col_name].apply(lambda x: select_codes_in_string(x, top_codes)) 33 | if filter_empty: 34 | r = r.loc[r[col_name] != ''] 35 | return r, top_codes 36 | -------------------------------------------------------------------------------- /pipeline/hatt_model.py: -------------------------------------------------------------------------------- 1 | from keras.models import Sequential, Model 2 | from keras.layers import Dense, Flatten, Input, Convolution1D 3 | from keras.layers import Conv1D, MaxPooling1D, Embedding, Merge, Dropout, LSTM, GRU, Bidirectional, TimeDistributed 4 | from keras.layers.merge import Concatenate 5 | from keras.layers.core import * 6 | from keras.layers import merge, dot, add 7 | from keras import backend as K 8 | from keras import optimizers 9 | 10 | import attention_util 11 | 12 | # based on paper: Hierarchical Attention networks for document classification 13 | # starting code from: 14 | # * https://github.com/richliao/textClassifier/blob/master/textClassifierHATT.py 15 | # but the github sources above had misteakes in the attention layer (IMO) that had been corrected here 16 | # Author: Zenobia Liendo 17 | 18 | def build_hierarhical_att_model(MAX_SENTS, MAX_SENT_LENGTH, embedding_matrix, 19 | max_vocab, embedding_dim, 20 | num_classes,training_dropout): 21 | 22 | # WORDS in one SENTENCE LAYER 23 | #----------------------------------------- 24 | #Embedding 25 | # note_input [sentences, words_in_a_sentence] 26 | sentence_input = Input(shape=(MAX_SENT_LENGTH,), dtype='int32') 27 | # use embedding_matrix 28 | # (1) embed the 
words to vectors through an embedding matrix 29 | embedded_sequences = Embedding(max_vocab + 1, 30 | embedding_dim, 31 | weights=[embedding_matrix], 32 | input_length=MAX_SENT_LENGTH, embeddings_regularizer=regularizers.l2(0.0001), 33 | trainable=True)(sentence_input) 34 | #embedded_sequences = Embedding(max_vocab + 1, 35 | # embedding_dim, 36 | # input_length=MAX_SENT_LENGTH, embeddings_regularizer=regularizers.l2(0.0001), 37 | # name="embedding")(sentence_input) 38 | 39 | # (2) GRU to get annotations of words by summarizing information 40 | # h_it: We obtain an annotation for a given word by concatenating the forward hidden state and 41 | # backward hidden state 42 | gru_dim = 50 43 | #h_it_sentence_vector = Bidirectional(GRU(gru_dim, return_sequences=True))(embedded_sequences) 44 | h_it_sentence_vector = Bidirectional(LSTM(gru_dim, return_sequences=True))(embedded_sequences) 45 | 46 | # Attention layer 47 | # Not all words contribute equally to the representation of the sentence meaning. 48 | # Hence, we introduce attention mechanism to extract such words that are important to the meaning of the 49 | # sentence and aggregate the representation of those informative words to form a sentence vector 50 | words_attention_vector = attention_util.attention_layer(h_it_sentence_vector,MAX_SENT_LENGTH,gru_dim) 51 | 52 | # Keras model that process words in one sentence 53 | sentEncoder = Model(sentence_input, words_attention_vector) 54 | 55 | print sentEncoder.summary() 56 | 57 | # SENTENCE LAYER 58 | #--------------------------------------------------------------------------------------------------------------------- 59 | note_input = Input(shape=(MAX_SENTS,MAX_SENT_LENGTH), dtype='int32') 60 | # TimeDistributes wrapper applies a layer to every temporal slice of an input. 
61 | # The input should be at least 3D, and the dimension of index one will be considered to be the temporal dimension 62 | # Here the sentEncoder is applied to each input record (a note) 63 | note_encoder = TimeDistributed(sentEncoder)(note_input) 64 | #document_vector = Bidirectional(GRU(gru_dim, return_sequences=True))(note_encoder) 65 | document_vector = Bidirectional(LSTM(gru_dim, return_sequences=True))(note_encoder) 66 | 67 | #attention layer 68 | sentences_attention_vector = attention_util.attention_layer(document_vector,MAX_SENTS,gru_dim) 69 | 70 | # output layer 71 | z = Dropout(training_dropout)(sentences_attention_vector) 72 | preds = Dense(num_classes, activation='sigmoid', name='preds')(z) 73 | 74 | #model 75 | model = Model(note_input, preds) 76 | 77 | model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"]) 78 | #sgd = optimizers.SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=False) 79 | #model.compile(loss="binary_crossentropy", optimizer="sgd", metrics=["accuracy"]) 80 | 81 | print("model fitting - Hierachical Attention GRU") 82 | print model.summary() 83 | 84 | return model -------------------------------------------------------------------------------- /pipeline/helpers.py: -------------------------------------------------------------------------------- 1 | ### This contains helper functions 2 | # Author: Zenobia Liendo 3 | 4 | import numpy as np 5 | import pandas as pd 6 | from sklearn.model_selection import train_test_split 7 | from sklearn.metrics import f1_score 8 | 9 | 10 | 11 | def train_val_test_split(X, y, val_size=0.2, test_size=0.2, random_state=101): 12 | """Splits the input and labels into 3 sets""" 13 | X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=(val_size+test_size), random_state=random_state) 14 | X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=test_size/(val_size+test_size), random_state=random_state) 15 | return X_train, X_val, X_test, y_train, y_val, y_test 16 | 17 | 18 | def replace_with_grandparent_codes(string_codes, ICD9_FIRST_LEVEL): 19 | """replace_with_grandparent_codes takes a list of ICD9 codes and 20 | returns the list of their grandparents ICD9 code in the first level of the ICD9 hierarchy""" 21 | ICD9_RANGES = [x.split('-') for x in ICD9_FIRST_LEVEL] 22 | resulting_codes = [] 23 | for code in string_codes.split(' '): 24 | for i,gparent_range in enumerate(ICD9_RANGES): 25 | range = gparent_range[1] if len(gparent_range) == 2 else gparent_range[0] 26 | if code[0:3] <= range: 27 | resulting_codes.append(ICD9_FIRST_LEVEL[i]) 28 | break 29 | return ' '.join (set(resulting_codes)) 30 | 31 | 32 | def write_col(df, col_name, fname): 33 | df[col_name].to_csv(fname) 34 | 35 | 36 | def get_f1_score(y_true,y_hat,threshold, average): 37 | hot_y = np.where(np.array(y_hat) > threshold, 1, 0) 38 | return f1_score(np.array(y_true), hot_y, average=average) 39 | 40 | def show_f1_score(y_train, pred_train, y_val, pred_dev): 41 | print('F1 scores') 42 | print('threshold | training | dev ') 43 | f1_score_average = 'micro' 44 | for threshold in [ 0.02, 0.03,0.04,0.05,0.055,0.058,0.06, 0.08, 0.1,0.2,0.3, 0.4, 0.5, 0.6,0.7]: 45 | train_f1 = get_f1_score(y_train, pred_train,threshold,f1_score_average) 46 | dev_f1 = get_f1_score(y_val, pred_dev,threshold,f1_score_average) 47 | print('%1.3f: %1.3f %1.3f' % (threshold,train_f1, dev_f1)) -------------------------------------------------------------------------------- /pipeline/icd9_cnn_att.py: 
-------------------------------------------------------------------------------- 1 | from keras.models import Sequential, Model 2 | from keras.layers import Dense, Dropout, Flatten, Input, MaxPooling1D, Convolution1D, Embedding 3 | from keras.layers.merge import Concatenate 4 | from keras import regularizers 5 | import attention_util 6 | from keras import optimizers 7 | 8 | # Author: Zenobia Liendo 9 | 10 | ''' code based on: 11 | https://github.com/alexander-rakhlin/CNN-for-Sentence-Classification-in-Keras/blob/master/sentiment_cnn.py 12 | http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/ 13 | https://github.com/dennybritz/cnn-text-classification-tf/blob/master/text_cnn.py 14 | ''' 15 | 16 | def build_icd9_cnn_model(input_seq_length, 17 | max_vocab, external_embeddings, embedding_dim, embedding_matrix, 18 | num_filters, filter_sizes, 19 | training_dropout, 20 | num_classes,trainable_Embeddings=True): 21 | #Embedding 22 | model_input = Input(shape=(input_seq_length, )) 23 | if external_embeddings: 24 | # use embedding_matrix 25 | z = Embedding(max_vocab + 1, 26 | embedding_dim, 27 | weights=[embedding_matrix], 28 | input_length=input_seq_length, 29 | trainable=trainable_Embeddings,embeddings_regularizer=regularizers.l2(0.0001))(model_input) 30 | else: 31 | # train embeddings 32 | z = Embedding(max_vocab + 1, 33 | embedding_dim, 34 | input_length=input_seq_length, embeddings_regularizer=regularizers.l2(0.0001), 35 | name="embedding")(model_input) 36 | 37 | #z = Dropout(0.1)(z) 38 | # Convolutional block 39 | conv_blocks = [] 40 | for i,sz in enumerate(filter_sizes): 41 | conv = Convolution1D(filters=num_filters, 42 | kernel_size=sz, 43 | padding="valid",kernel_regularizer=regularizers.l2(0.001), 44 | activation="relu", 45 | strides=1)(z) 46 | window_pool_size = input_seq_length - sz + 1 47 | #conv = MaxPooling1D(pool_size=window_pool_size)(conv) 48 | words_attention_vector = attention_util.attention_layer(conv, window_pool_size,50,str(i)) 49 | #conv = Flatten()(words_attention_vector) 50 | conv_blocks.append(words_attention_vector) 51 | 52 | #concatenate 53 | z = Concatenate()(conv_blocks) if len(conv_blocks) > 1 else conv_blocks[0] 54 | 55 | #score prediction 56 | #z = Dense(200, activation="relu")(z) #to avoid overfitting? I don't think this is necessary 57 | z = Dropout(training_dropout)(z) 58 | #model_output = Dense(num_classes, activation="softmax")(z) 59 | model_output = Dense(num_classes, kernel_regularizer=regularizers.l2(0.0001), 60 | activation="sigmoid")(z) 61 | 62 | #creating model 63 | model = Model(model_input, model_output) 64 | # what to use for tf.nn.softmax_cross_entropy_with_logits? 
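# With a sigmoid output layer and multi-hot ICD9 targets, binary_crossentropy is the
# multi-label counterpart of tf.nn.sigmoid_cross_entropy_with_logits (computed on
# probabilities rather than logits); softmax_cross_entropy_with_logits would instead
# pair with a softmax output and categorical_crossentropy, which is not suitable here
# because each note can carry several ICD9 codes at once.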
65 | adam_op = optimizers.Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0) 66 | model.compile(loss="binary_crossentropy", optimizer=adam_op, metrics=["accuracy"]) 67 | #model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"]) 68 | #model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"]) 69 | 70 | print model.summary() 71 | 72 | return model -------------------------------------------------------------------------------- /pipeline/icd9_cnn_model.py: -------------------------------------------------------------------------------- 1 | from keras.models import Sequential, Model 2 | from keras.layers import Dense, Dropout, Flatten, Input, MaxPooling1D, Convolution1D, Embedding 3 | from keras.layers.merge import Concatenate 4 | from keras import regularizers 5 | 6 | # Author: Zenobia Liendo 7 | 8 | ''' code based on: 9 | https://github.com/alexander-rakhlin/CNN-for-Sentence-Classification-in-Keras/blob/master/sentiment_cnn.py 10 | http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/ 11 | https://github.com/dennybritz/cnn-text-classification-tf/blob/master/text_cnn.py 12 | ''' 13 | 14 | def build_icd9_cnn_model(input_seq_length, 15 | max_vocab, external_embeddings, embedding_dim, embedding_matrix, 16 | num_filters, filter_sizes, 17 | training_dropout_keep_prob, 18 | num_classes): 19 | #Embedding 20 | model_input = Input(shape=(input_seq_length, )) 21 | if external_embeddings: 22 | # use embedding_matrix 23 | z = Embedding(max_vocab + 1, 24 | embedding_dim, 25 | weights=[embedding_matrix], 26 | input_length=input_seq_length, 27 | trainable=True)(model_input) 28 | else: 29 | # train embeddings 30 | z = Embedding(max_vocab + 1, 31 | embedding_dim, 32 | input_length=input_seq_length, embeddings_regularizer=regularizers.l2(0.0001), 33 | name="embedding")(model_input) 34 | 35 | # Convolutional block 36 | conv_blocks = [] 37 | for sz in filter_sizes: 38 | conv = Convolution1D(filters=num_filters, 39 | kernel_size=sz, 40 | padding="valid", 41 | activation="relu", 42 | strides=1)(z) 43 | window_pool_size = input_seq_length - sz + 1 44 | conv = MaxPooling1D(pool_size=window_pool_size)(conv) 45 | conv = Flatten()(conv) 46 | conv_blocks.append(conv) 47 | 48 | #concatenate 49 | z = Concatenate()(conv_blocks) if len(conv_blocks) > 1 else conv_blocks[0] 50 | z = Dropout(training_dropout_keep_prob)(z) 51 | 52 | #score prediction 53 | #z = Dense(num_classes, activation="relu")(z) I don't think this is necessary 54 | #model_output = Dense(num_classes, activation="softmax")(z) 55 | model_output = Dense(num_classes, activation="sigmoid")(z) 56 | 57 | #creating model 58 | model = Model(model_input, model_output) 59 | # what to use for tf.nn.softmax_cross_entropy_with_logits? 
60 | model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"]) 61 | #model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"]) 62 | 63 | print model.summary() 64 | 65 | return model -------------------------------------------------------------------------------- /pipeline/icd9_lstm_att_model.py: -------------------------------------------------------------------------------- 1 | from keras.models import Model 2 | from keras.layers import Dense, Dropout, Flatten, Input, Embedding,Bidirectional 3 | from keras.layers.merge import Concatenate 4 | from keras.layers import LSTM 5 | from keras.layers import MaxPooling1D, Embedding, Merge, Dropout, LSTM, Bidirectional 6 | from keras.layers.merge import Concatenate 7 | from keras.layers.core import * 8 | from keras.layers import merge, dot, add 9 | from keras import backend as K 10 | import attention_util 11 | 12 | # Author: Zenobia Liendo 13 | 14 | def build_lstm_att_model(input_seq_length, 15 | max_vocab, external_embeddings, embedding_trainable, embedding_dim, embedding_matrix, 16 | training_dropout_keep_prob,num_classes): 17 | #Embedding 18 | model_input = Input(shape=(input_seq_length, )) 19 | if external_embeddings: 20 | # use embedding_matrix 21 | z = Embedding(max_vocab + 1, 22 | embedding_dim, 23 | weights=[embedding_matrix], 24 | input_length=input_seq_length, 25 | trainable=embedding_trainable,name = "Embeddng")(model_input) 26 | else: 27 | # train embeddings 28 | z = Embedding(max_vocab + 1, 29 | embedding_dim, 30 | input_length=input_seq_length, 31 | name="Embedding")(model_input) 32 | 33 | # LSTM 34 | lstm_units= 50 35 | l_lstm = LSTM(lstm_units,return_sequences=True)(z) 36 | 37 | #attention 38 | words_attention_vector = attention_util.attention_layer(l_lstm,input_seq_length,lstm_units) 39 | 40 | #score prediction 41 | z = Dropout(training_dropout_keep_prob)(words_attention_vector) 42 | model_output = Dense(num_classes, activation="sigmoid", name="Output_Layer")(z) 43 | 44 | #creating model 45 | model = Model(model_input, model_output) 46 | # what to use for tf.nn.softmax_cross_entropy_with_logits? 
47 | model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"]) 48 | #model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"]) 49 | 50 | print model.summary() 51 | 52 | return model 53 | -------------------------------------------------------------------------------- /pipeline/icd9_lstm_cnn.py: -------------------------------------------------------------------------------- 1 | #%matplotlib inline 2 | # General imports 3 | import numpy as np 4 | import pandas as pd 5 | from sklearn.metrics import f1_score 6 | import random 7 | from collections import Counter, defaultdict 8 | from operator import itemgetter 9 | import matplotlib.pyplot as plt 10 | 11 | 12 | #keras 13 | from keras.models import Sequential, Model 14 | from keras.layers import Dense, Dropout, Flatten, Input, MaxPooling1D, Convolution1D, Embedding 15 | from keras.layers.merge import Concatenate 16 | from keras.models import load_model 17 | from IPython.display import SVG 18 | from keras.utils.vis_utils import model_to_dot 19 | 20 | # Custom functions 21 | #%load_ext autoreload 22 | #%autoreload 2 23 | import database_selection 24 | import vectorization 25 | import helpers 26 | import icd9_cnn_model 27 | import lstm_model 28 | import icd9_lstm_att_model 29 | 30 | 31 | # Author: Zenobia Liendo 32 | 33 | df = pd.read_csv('../data/disch_notes_all_icd9.csv', 34 | names = ['HADM_ID', 'SUBJECT_ID', 'DATE', 'ICD9','TEXT']) 35 | 36 | ICD9_FIRST_LEVEL = [ 37 | '001-139','140-239','240-279','290-319', '320-389', '390-459','460-519', '520-579', '580-629', 38 | '630-679', '680-709','710-739', '760-779', '780-789', '790-796', '797', '798', '799', '800-999' ] 39 | N_TOP = len(ICD9_FIRST_LEVEL) 40 | # replacing leave ICD9 codes with the grandparents 41 | df['ICD9'] = df['ICD9'].apply(lambda x: helpers.replace_with_grandparent_codes(x,ICD9_FIRST_LEVEL)) 42 | 43 | #counts by icd9_codes 44 | icd9_codes = Counter() 45 | for label in df['ICD9']: 46 | for icd9_code in label.split(): 47 | icd9_codes[icd9_code] += 1 48 | number_icd9_first_level = len (icd9_codes) 49 | 50 | top_codes = ICD9_FIRST_LEVEL 51 | labels = vectorization.vectorize_icd_column(df, 'ICD9', top_codes) 52 | 53 | #preprocess notes 54 | MAX_VOCAB = None # to limit original number of words (None if no limit) 55 | MAX_SEQ_LENGTH = 5000 # to limit length of word sequence (None if no limit) 56 | df.TEXT = vectorization.clean_notes(df, 'TEXT') 57 | data_vectorized, dictionary, MAX_VOCAB = vectorization.vectorize_notes(df.TEXT, MAX_VOCAB, verbose = True) 58 | data, MAX_SEQ_LENGTH = vectorization.pad_notes(data_vectorized, MAX_SEQ_LENGTH) 59 | 60 | EMBEDDING_DIM = 100 # given the glove that we chose 61 | EMBEDDING_MATRIX= [] 62 | 63 | #creating glove embeddings 64 | EMBEDDING_LOC = '../data/glove.6B.100d.txt' # location of embedding 65 | EMBEDDING_MATRIX, embedding_dict = vectorization.embedding_matrix(EMBEDDING_LOC, 66 | dictionary, EMBEDDING_DIM, verbose = True, sigma=True) 67 | 68 | #split sets 69 | X_train, X_val, X_test, y_train, y_val, y_test = helpers.train_val_test_split( 70 | data, labels, val_size=0.2, test_size=0.1, random_state=101) 71 | print("Train: ", X_train.shape, y_train.shape) 72 | print("Validation: ", X_val.shape, y_val.shape) 73 | print("Test: ", X_test.shape, y_test.shape) 74 | 75 | # Delete temporary variables to free some memory 76 | del df, data, labels 77 | 78 | # finding out the top icd9 codes 79 | top_4_icd9 = icd9_codes.most_common(4) 80 | print "most common 4 icd9_codes: ", top_4_icd9 81 | top_4_icd9_label = 
' '.join(code for code,count in top_4_icd9 ) 82 | print 'label for the top 4 icd9 codes: ', top_4_icd9_label 83 | 84 | #converting ICD9 prediction to a vector 85 | top4_icd9_vector = vectorization.vectorize_icd_string(top_4_icd9_label, ICD9_FIRST_LEVEL) 86 | 87 | ## assign icd9_prediction_vector to every discharge 88 | train_y_hat_baseline = [top4_icd9_vector]* len (y_train) 89 | dev_y_hat_baseline = [top4_icd9_vector]* len (y_val) 90 | 91 | reload(lstm_model) 92 | ##### build model 93 | l_model = lstm_model.build_lstm_model (input_seq_length=MAX_SEQ_LENGTH, max_vocab = MAX_VOCAB, 94 | external_embeddings = True, embedding_trainable =True, 95 | embedding_dim=EMBEDDING_DIM,embedding_matrix=EMBEDDING_MATRIX, 96 | num_classes=N_TOP ) 97 | 98 | l_model.fit(X_train, y_train, batch_size=50, epochs=10, validation_data=(X_val, y_val), verbose=1) 99 | pred_train = l_model.predict(X_train, batch_size=100) 100 | pred_dev = l_model.predict(X_val, batch_size=100) 101 | helpers.show_f1_score(y_train, pred_train, y_val, pred_dev) 102 | 103 | reload(icd9_lstm_att_model) 104 | #### build model 105 | latt_model = icd9_lstm_att_model.build_lstm_att_model (input_seq_length=MAX_SEQ_LENGTH, max_vocab = MAX_VOCAB, 106 | external_embeddings = True, embedding_trainable =True, 107 | embedding_dim=EMBEDDING_DIM,embedding_matrix=EMBEDDING_MATRIX, 108 | num_classes=N_TOP ) 109 | 110 | #model_lst_att_fit = latt_model.fit(X_train, y_train, batch_size=50, epochs=1, validation_data=(X_val, y_val), verbose=1) 111 | 112 | model_lst_att_fit = latt_model.fit(X_train, y_train, batch_size=50, epochs=10, validation_data=(X_val, y_val), verbose=1) 113 | pred_train = latt_model.predict(X_train, batch_size=100) 114 | pred_dev = latt_model.predict(X_val, batch_size=100) 115 | helpers.show_f1_score(y_train, pred_train, y_val, pred_dev) 116 | latt_model.save('models/latt_model_5_epochs_5k.h5') 117 | 118 | 119 | reload(icd9_cnn_model) 120 | #### build model 121 | model = icd9_cnn_model.build_icd9_cnn_model (input_seq_length=MAX_SEQ_LENGTH, max_vocab = MAX_VOCAB, 122 | external_embeddings = False, 123 | embedding_dim=EMBEDDING_DIM,embedding_matrix=EMBEDDING_MATRIX, 124 | num_filters = 100, filter_sizes=[2,3,4,5], 125 | training_dropout_keep_prob=0.5, 126 | num_classes=N_TOP ) 127 | 128 | 129 | 130 | model.fit(X_train, y_train, batch_size=50, epochs=20, validation_data=(X_val, y_val), verbose=2) 131 | 132 | pred_train = model.predict(X_train, batch_size=50) 133 | pred_dev = model.predict(X_val, batch_size=50) 134 | # perform evaluation 135 | helpers.show_f1_score(y_train, pred_train, y_val, pred_dev) 136 | 137 | model.save('models/cnn_20_epochs.h5') 138 | 139 | 140 | 141 | -------------------------------------------------------------------------------- /pipeline/lstm_model.py: -------------------------------------------------------------------------------- 1 | from keras.models import Model 2 | from keras.layers import Dense, Dropout, Flatten, Input, Embedding,Bidirectional 3 | from keras.layers.merge import Concatenate 4 | from keras.layers import LSTM 5 | 6 | # Author: Zenobia Liendo 7 | def build_lstm_model(input_seq_length, 8 | max_vocab, external_embeddings, embedding_trainable, embedding_dim, embedding_matrix, 9 | training_dropout_keep_prob, num_classes): 10 | #Embedding 11 | model_input = Input(shape=(input_seq_length, )) 12 | if external_embeddings: 13 | # use embedding_matrix 14 | z = Embedding(max_vocab + 1, 15 | embedding_dim, 16 | weights=[embedding_matrix], 17 | input_length=input_seq_length, 18 | 
trainable=embedding_trainable)(model_input) 19 | else: 20 | # train embeddings 21 | z = Embedding(max_vocab + 1, 22 | embedding_dim, 23 | input_length=input_seq_length, 24 | name="embedding")(model_input) 25 | 26 | # LSTM 27 | l_lstm = LSTM(50)(z) 28 | 29 | z = Dropout(training_dropout_keep_prob)(l_lstm) 30 | 31 | #score prediction 32 | model_output = Dense(num_classes, activation="sigmoid")(z) 33 | 34 | #creating model 35 | model = Model(model_input, model_output) 36 | # what to use for tf.nn.softmax_cross_entropy_with_logits? 37 | model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"]) 38 | #model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"]) 39 | 40 | print model.summary() 41 | 42 | return model -------------------------------------------------------------------------------- /pipeline/vectorization.py: -------------------------------------------------------------------------------- 1 | ### This file contains the functions necessary to vectorize the ICD labels and text inputs 2 | # Author: Guillaume De Roo 3 | import numpy as np 4 | import pandas as pd 5 | import re 6 | import keras 7 | from keras.preprocessing.text import Tokenizer 8 | from keras.preprocessing.sequence import pad_sequences 9 | 10 | 11 | # Vectorize ICD codes 12 | 13 | def vectorize_icd_string(x, code_list): 14 | """Takes a string with ICD codes and returns an array of the right of 0/1""" 15 | r = [] 16 | for code in code_list: 17 | if code in x: r.append(1) 18 | else: r.append(0) 19 | return np.asarray(r) 20 | 21 | def vectorize_icd_column(df, col_name, code_list): 22 | """Takes a column and applies the """ 23 | r = df[col_name].apply(lambda x: vectorize_icd_string(x, code_list)) 24 | r = np.transpose(np.column_stack(r)) 25 | return r 26 | 27 | 28 | # Clean Text 29 | 30 | def clean_str(string): 31 | """Cleaning of notes""" 32 | 33 | """ Cleaning from Guillaume """ 34 | string = string.lower() 35 | string = string.replace("\n", " ") # remove the lines 36 | string = re.sub("\[\*\*.*?\*\*\]", "", string) # remove the things inside the [** **] 37 | string = re.sub("[^a-zA-Z0-9\ \']+", " ", string) 38 | 39 | """ Tokenization/string cleaning for all datasets except for SST. 40 | Original taken from https://github.com/yoonkim/CNN_sentence/blob/master/process_data.py 41 | """ 42 | #string = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", string) 43 | string = re.sub(r"\'s", " \'s", string) 44 | string = re.sub(r"\'ve", " \'ve", string) 45 | string = re.sub(r"n\'t", " n\'t", string) 46 | string = re.sub(r"\'re", " \'re", string) 47 | string = re.sub(r"\'d", " \'d", string) 48 | string = re.sub(r"\'ll", " \'ll", string) 49 | #string = re.sub(r",", " , ", string) 50 | #string = re.sub(r"!", " ! ", string) 51 | #string = re.sub(r"\(", " \( ", string) 52 | #string = re.sub(r"\)", " \) ", string) 53 | #string = re.sub(r"\?", " \? 
", string) 54 | string = re.sub(r"\s{2,}", " ", string) 55 | 56 | """ Canonize numbers""" 57 | string = re.sub(r"(\d+)", "DG", string) 58 | 59 | return string.strip() 60 | 61 | def clean_notes(df, col_name): 62 | r = df[col_name].apply(lambda x: clean_str(x)) 63 | return r 64 | 65 | 66 | # Vectorize and Pad notes Text 67 | 68 | def vectorize_notes(col, MAX_NB_WORDS, verbose = True): 69 | """Takes a note column and encodes it into a series of integer 70 | Also returns the dictionnary mapping the word to the integer""" 71 | tokenizer = Tokenizer(num_words = MAX_NB_WORDS) 72 | tokenizer.fit_on_texts(col) 73 | data = tokenizer.texts_to_sequences(col) 74 | note_length = [len(x) for x in data] 75 | vocab = tokenizer.word_index 76 | MAX_VOCAB = len(vocab) 77 | if verbose: 78 | print('Vocabulary size: %s' % MAX_VOCAB) 79 | print('Average note length: %s' % np.mean(note_length)) 80 | print('Max note length: %s' % np.max(note_length)) 81 | return data, vocab, MAX_VOCAB 82 | 83 | def pad_notes(data, MAX_SEQ_LENGTH): 84 | data = pad_sequences(data, maxlen = MAX_SEQ_LENGTH) 85 | return data, data.shape[1] 86 | 87 | 88 | # Creates an embedding Matrix 89 | # Based on https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html 90 | 91 | def embedding_matrix(f_name, dictionary, EMBEDDING_DIM, verbose = True, sigma = None): 92 | """Takes a pre-trained embedding and adapts it to the dictionary at hand 93 | Words not found will be all-zeros in the matrix""" 94 | 95 | # Dictionary of words from the pre trained embedding 96 | pretrained_dict = {} 97 | with open(f_name, 'r') as f: 98 | for line in f: 99 | values = line.split() 100 | word = values[0] 101 | coefs = np.asarray(values[1:], dtype='float32') 102 | pretrained_dict[word] = coefs 103 | 104 | # Default values for absent words 105 | if sigma: 106 | pretrained_matrix = sigma * np.random.rand(len(dictionary) + 1, EMBEDDING_DIM) 107 | else: 108 | pretrained_matrix = np.zeros((len(dictionary) + 1, EMBEDDING_DIM)) 109 | 110 | # Substitution of default values by pretrained values when applicable 111 | for word, i in dictionary.items(): 112 | vector = pretrained_dict.get(word) 113 | if vector is not None: 114 | pretrained_matrix[i] = vector 115 | 116 | if verbose: 117 | print('Vocabulary in notes:', len(dictionary)) 118 | print('Vocabulary in original embedding:', len(pretrained_dict)) 119 | inter = list( set(dictionary.keys()) & set(pretrained_dict.keys()) ) 120 | print('Vocabulary intersection:', len(inter)) 121 | 122 | return pretrained_matrix, pretrained_dict 123 | -------------------------------------------------------------------------------- /pre_processing/MIMICERdiagram.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zliendo/AI_MedicalNotes/87e1e8eb574b30c801b951e87b5edccc10d2a231/pre_processing/MIMICERdiagram.png -------------------------------------------------------------------------------- /pre_processing/README.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | ## ICD9-codes 4 | 5 | 6 | ER database diagram of tables used in this project 7 | 8 | 9 | ![database diagram of tables used in this project](MIMICERdiagram.png) 10 | 11 | The DIAGNOSES_ICD table contains the ICD9-codes assigned to a hospital admission. There could be many ICD9-codes assigned to one admission. 12 | There are 57,786 admissions in this table, with a total of 651,047 ICD9-codes assigned. 
13 | 
14 | Related previous research did not work with all ICD9-codes, but only with the ones used most often in diagnosis reports [2][3]. We will not consider ICD9-codes that start with "E" (additional information indicating the cause of injury or adverse event) or "V" (codes used when the visit is due to circumstances other than disease or injury, e.g. newborn codes indicating birth status).
15 | 
16 | * We identify the top 20 labels based on the number of patients with that label.
17 | * We then remove all patients who don’t have at least one of these labels,
18 | * and then filter the set of labels for each patient to only include these labels.
19 | 
20 | As a result, we get 45,293 admissions with 152,299 icd9-codes (counting only the ones in the top 20).
21 | 
22 | Here is the list of the top 20 ICD9-codes that will be used in the baseline:
23 | 
24 | ```
25 | select icd9_code, COUNT(DISTINCT SUBJECT_ID) subjects_qty
26 | from diagnoses_icd where SUBSTRING(icd9_code from 1 for 1) != 'V'
27 | group by icd9_code order by subjects_qty
28 | desc limit 20;
29 | 
30 | icd9_code | subjects_qty
31 | -----------+--------
32 | 4019 | 17613
33 | 41401 | 10775
34 | 42731 | 10271
35 | 4280 | 9843
36 | 5849 | 7687
37 | 2724 | 7465
38 | 25000 | 7370
39 | 51881 | 6719
40 | 5990 | 5779
41 | 2720 | 5335
42 | 53081 | 5272
43 | 2859 | 4993
44 | 486 | 4423
45 | 2851 | 4241
46 | 2762 | 4177
47 | 2449 | 3819
48 | 496 | 3592
49 | 99592 | 3560
50 | 0389 | 3433
51 | 5070 | 3396
52 | (20 rows)
53 | 
54 | 
55 | ```
56 | 
57 | The list of admissions resulting from the filtering above is in this file: baseline\psql_files\diagnoses_icd_codes.csv
58 | (note: we removed all files containing MIMIC data because accessing them requires authorization from MIMIC)
59 | 
60 | Here is the SQL that created that file:
61 | 
62 | ```
63 | select hadm_id, max(subject_id) subject_id, string_agg(icd9_code, ' ') icd9_codes
64 | from diagnoses_icd
65 | where icd9_code IN ( select icd9_code from (select icd9_code, COUNT(DISTINCT SUBJECT_ID) subjects_qty
66 | from diagnoses_icd where SUBSTRING(icd9_code from 1 for 1) != 'V'
67 | group by icd9_code order by subjects_qty desc limit 20) as icd9_subject_list)
68 | group by hadm_id;
69 | ```
70 | 
71 | (note: for the final model, do we consider the code's description and/or hierarchy?)
72 | 
73 | ## Clinical Notes
74 | 
75 | The clinical notes related to an admission are located in the NOTEEVENTS and ADMISSION tables.
76 | The ADMISSION table has only one type of clinical note, the one related to the preliminary diagnoses made during admission.
77 | The NOTEEVENTS table contains all the other types of clinical notes; here is the list of these types:
78 | ```
79 | mimic=# select category from noteevents group by category;
80 | category
81 | -------------------
82 | ECG
83 | Respiratory
84 | Discharge summary
85 | Radiology
86 | Rehab Services
87 | Nursing/other
88 | Nutrition
89 | Pharmacy
90 | Social Work
91 | Case Management
92 | Physician
93 | General
94 | Nursing
95 | Echo
96 | Consult
97 | (15 rows)
98 | 
99 | ```
100 | The baseline will ONLY use the 'Discharge Summary' clinical notes. 
(note: we may use the other clinical notes for the final project)
101 | An example of one discharge summary note can be found at: baseline/psql_files/discharge_note_sample.out
102 | (we removed all data-related files since they require authorization granted by MIMIC)
103 | 
104 | 
105 | It looks like a discharge summary can have addendums; we will not include addendums in the baseline.
106 | ```
107 | mimic=# select category, description from noteevents where category = 'Discharge summary' group by category, description;
108 | category | description
109 | -------------------+-------------
110 | Discharge summary | Report
111 | Discharge summary | Addendum
112 | (2 rows)
113 | ```
114 | 
115 | It looks like there are duplicate entries for some admissions; for example:
116 | ```
117 | mimic=# select HADM_ID, SUBJECT_ID, CHARTDATE, CGID, ISERROR, substring(TEXT from 1 for 20)
118 | mimic-# from noteevents
119 | mimic-# where HADM_ID = '178053' and noteevents.category = 'Discharge summary' and noteevents.DESCRIPTION = 'Report';
120 | 
121 | hadm_id | subject_id | chartdate | cgid | iserror | substring
122 | ---------+------------+---------------------+------+---------+----------------------
123 | 178053 | 18976 | 2120-11-28 00:00:00 | | | Admission Date: [**
124 | 178053 | 18976 | 2120-11-28 00:00:00 | | | Admission Date: [**
125 | 178053 | 18976 | 2120-12-16 00:00:00 | | | Admission Date: [**
126 | 178053 | 18976 | 2120-12-16 00:00:00 | | | Admission Date: [**
127 | 178053 | 18976 | 2120-11-26 00:00:00 | | | Admission Date: [**
128 | 
129 | (5 rows)
130 | ```
131 | In this case the clinical notes are different; it seems multiple entries were made for the same discharge, and the discharge happened on two different dates under the same admission_id (that looks like a mistake, since a returning patient should get a new admission_id).
132 | 
133 | We will handle this situation for the final model.
134 | 
135 | ## Joining information from the clinical notes and the corresponding ICD9 codes
136 | 
137 | This is the SQL statement that created a table joining the discharge summary notes with their ICD9 codes that are in the top 20.
138 | ```
139 | DROP TABLE IF EXISTS W266_DISCHARGE_NOTE_ICD9_CODES;
140 | CREATE TABLE W266_DISCHARGE_NOTE_ICD9_CODES AS
141 | 
142 | select noteevents.HADM_ID, dianoses_top20_icd.SUBJECT_ID, noteevents.CHARTDATE, noteevents.TEXT, dianoses_top20_icd.ICD9_CODES
143 | from noteevents
144 | JOIN
145 | (select HADM_ID, max(SUBJECT_ID) SUBJECT_ID, string_agg(ICD9_CODE, ' ') ICD9_CODES
146 | from diagnoses_icd
147 | where ICD9_CODE IN ( select ICD9_CODE from (select ICD9_CODE, COUNT(DISTINCT SUBJECT_ID) subjects_qty
148 | from diagnoses_icd
149 | where SUBSTRING(ICD9_CODE from 1 for 1) != 'V'
150 | group by ICD9_CODE order by subjects_qty
151 | desc limit 20) as icd9_subject_list)
152 | group by HADM_ID ) as dianoses_top20_icd
153 | ON (noteevents.HADM_ID = dianoses_top20_icd.HADM_ID)
154 | where noteevents.category = 'Discharge summary' and noteevents.DESCRIPTION = 'Report';
155 | 
156 | 
157 | CREATE INDEX W266_DISCHARGE_NOTE_ICD9_CODES_index
158 | ON W266_DISCHARGE_NOTE_ICD9_CODES(HADM_ID) ;
159 | 
160 | 
161 | ```
162 | 
163 | The file is about 474 MB.
164 | 
165 | ## Representing clinical notes for the baseline
166 | 
167 | Previous research represents these documents as bag-of-words vectors [1]. In particular, it takes the 10,000 tokens with the largest tf-idf scores from the training set. 
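As a rough illustration only (not code from this repo), such a representation could be built with scikit-learn; here `train_notes` and `val_notes` are assumed lists of discharge-summary strings, and `max_features=10000` only approximates "the 10,000 tokens with the largest tf-idf scores", since scikit-learn selects the vocabulary by corpus term frequency:

```
# Sketch of a tf-idf bag-of-words baseline representation (assumptions noted above)
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=10000)   # keep roughly the top 10,000 tokens
X_train = vectorizer.fit_transform(train_notes)    # learn the vocabulary on training notes only
X_val = vectorizer.transform(val_notes)            # reuse the same vocabulary for validation
```

(The deep-learning pipeline in this repo does not use this representation; it tokenizes and pads the notes with the Keras Tokenizer in pipeline/vectorization.py.)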
168 | 169 | (note for the final model: we could use POS tagging, parsing, and entity recognition here) 170 | 171 | 172 | ## References 173 | [1] Diagnosis code assignment: models and evaluation metrics. Journal of the American Medical Informatics Association. 174 | [2] Applying Deep Learning to ICD-9 Multi-label Classification from Medical Records. 175 | [3] ICD-9 Coding of Discharge Summaries. 176 | [4] Large-scale Multi-label Text Classification - Revisiting Neural Networks. 177 | -------------------------------------------------------------------------------- /pre_processing/psql_files/create_discharge_notes_all_icd9: -------------------------------------------------------------------------------- 1 | DROP TABLE IF EXISTS W266_DISCHARGE_NOTE_ICD9_CODES_B; 2 | CREATE TABLE W266_DISCHARGE_NOTE_ICD9_CODES_B AS 3 | 4 | select notes.HADM_ID, diagnoses_icd9.SUBJECT_ID, notes.CHARTDATE, diagnoses_icd9.ICD9_CODES, notes.NOTE_TEXT 5 | from 6 | (select HADM_ID, max(SUBJECT_ID), MIN(CHARTDATE) CHARTDATE, string_agg(TEXT, ' ' ORDER BY CHARTDATE) NOTE_TEXT 7 | from noteevents 8 | where category = 'Discharge summary' 9 | group by HADM_ID ) as notes 10 | JOIN 11 | ( select HADM_ID, max(SUBJECT_ID) SUBJECT_ID, string_agg(ICD9_CODE, ' ') ICD9_CODES 12 | from diagnoses_icd 13 | where SUBSTRING(ICD9_CODE from 1 for 1) != 'V' and SUBSTRING(ICD9_CODE from 1 for 1) != 'E' 14 | group by HADM_ID ) as diagnoses_icd9 15 | ON (notes.HADM_ID = diagnoses_icd9.HADM_ID); 16 | 17 | 18 | CREATE INDEX W266_DISCHARGE_NOTE_ICD9_B_CODES_index 19 | ON W266_DISCHARGE_NOTE_ICD9_CODES_B(HADM_ID) ; 20 | -------------------------------------------------------------------------------- /pre_processing/psql_files/query01_top_icd9_codes.sql: -------------------------------------------------------------------------------- 1 | -- We identify the top 20 labels based on the number of patients with that label.
2 | 3 | select icd9_code, COUNT(DISTINCT SUBJECT_ID) subjects_qty 4 | from diagnoses_icd 5 | where SUBSTRING(icd9_code from 1 for 1) != 'V' 6 | group by icd9_code 7 | order by subjects_qty 8 | desc limit 20; 9 | 10 | 11 | -------------------------------------------------------------------------------- /pre_processing/psql_files/query02_filter_diagnoses_by_icd9_code.sql: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zliendo/AI_MedicalNotes/87e1e8eb574b30c801b951e87b5edccc10d2a231/pre_processing/psql_files/query02_filter_diagnoses_by_icd9_code.sql -------------------------------------------------------------------------------- /pre_processing/psql_files/query03C_icd9_codes_by_admission_create_table.sql: -------------------------------------------------------------------------------- 1 | DROP TABLE IF EXISTS W266_DIAGNOSES_TOP_ICD9_CODES; 2 | CREATE TABLE W266_DIAGNOSES_TOP_ICD9_CODES AS 3 | 4 | select hadm_id, max(subject_id) subject_id, string_agg(icd9_code, ' ') icd9_codes 5 | from diagnoses_icd 6 | where icd9_code IN ( select icd9_code from (select icd9_code, COUNT(DISTINCT SUBJECT_ID) subjects_qty 7 | from diagnoses_icd 8 | where SUBSTRING(icd9_code from 1 for 1) != 'V' 9 | group by icd9_code 10 | order by subjects_qty 11 | desc limit 20) as icd9_subject_list) 12 | group by hadm_id; 13 | 14 | CREATE INDEX W266_DIAGNOSES_TOP_ICD9_CODES_index 15 | ON W266_DIAGNOSES_TOP_ICD9_CODES (HADM_ID) ; -------------------------------------------------------------------------------- /pre_processing/psql_files/query03_icd9_codes_by_admission.sql: -------------------------------------------------------------------------------- 1 | --select hadm_id, max(subject_id), string_agg(icd9_code, ',') from diagnoses_icd where hadm_id = '145834' group by hadm_id 2 | 3 | -- aggregates icd9-codes in one row 4 | -- generates diagnoses_icd_codes.csv 5 | 6 | select hadm_id, max(subject_id) subject_id, string_agg(icd9_code, ' ') icd9_codes 7 | from diagnoses_icd 8 | where icd9_code IN ( select icd9_code from (select icd9_code, COUNT(DISTINCT SUBJECT_ID) subjects_qty 9 | from diagnoses_icd 10 | where SUBSTRING(icd9_code from 1 for 1) != 'V' 11 | group by icd9_code 12 | order by subjects_qty 13 | desc limit 20) as icd9_subject_list) 14 | group by hadm_id; -------------------------------------------------------------------------------- /pre_processing/psql_files/query04_filtering_discharge_summary_notes.sql: -------------------------------------------------------------------------------- 1 | select noteevents.HADM_ID, dianoses_top20_icd.SUBJECT_ID, noteevents.CHARTDATE, noteevents.TEXT, dianoses_top20_icd.ICD9_CODES 2 | from noteevents 3 | JOIN 4 | (select HADM_ID, max(SUBJECT_ID) SUBJECT_ID, string_agg(ICD9_CODE, ' ') ICD9_CODES 5 | from diagnoses_icd 6 | where ICD9_CODE IN ( select ICD9_CODE from (select ICD9_CODE, COUNT(DISTINCT SUBJECT_ID) subjects_qty 7 | from diagnoses_icd 8 | where SUBSTRING(ICD9_CODE from 1 for 1) != 'V' 9 | group by ICD9_CODE order by subjects_qty 10 | desc limit 20) as icd9_subject_list) 11 | group by HADM_ID ) as dianoses_top20_icd 12 | ON (noteevents.HADM_ID = dianoses_top20_icd.HADM_ID) 13 | where noteevents.category = 'Discharge summary' and noteevents.DESCRIPTION = 'Report'; 14 | 15 | 16 | -------------------------------------------------------------------------------- /pre_processing/psql_files/query05_discharge_notes_icd9_create_table.sql: -------------------------------------------------------------------------------- 1 
| DROP TABLE IF EXISTS W266_DISCHARGE_NOTE_ICD9_CODES; 2 | CREATE TABLE W266_DISCHARGE_NOTE_ICD9_CODES AS 3 | 4 | select noteevents.HADM_ID, dianoses_top20_icd.SUBJECT_ID, noteevents.CHARTDATE, noteevents.TEXT, dianoses_top20_icd.ICD9_CODES 5 | from noteevents 6 | JOIN 7 | (select HADM_ID, max(SUBJECT_ID) SUBJECT_ID, string_agg(ICD9_CODE, ' ') ICD9_CODES 8 | from diagnoses_icd 9 | where ICD9_CODE IN ( select ICD9_CODE from (select ICD9_CODE, COUNT(DISTINCT SUBJECT_ID) subjects_qty 10 | from diagnoses_icd 11 | where SUBSTRING(ICD9_CODE from 1 for 1) != 'V' 12 | group by ICD9_CODE order by subjects_qty 13 | desc limit 20) as icd9_subject_list) 14 | group by HADM_ID ) as dianoses_top20_icd 15 | ON (noteevents.HADM_ID = dianoses_top20_icd.HADM_ID) 16 | where noteevents.category = 'Discharge summary' and noteevents.DESCRIPTION = 'Report'; 17 | 18 | 19 | CREATE INDEX W266_DISCHARGE_NOTE_ICD9_CODES_index 20 | ON W266_DISCHARGE_NOTE_ICD9_CODES(HADM_ID) ; 21 | 22 | 23 | 24 | -------------------------------------------------------------------------------- /pre_processing/psql_files/query06_export_w266_table.sql: -------------------------------------------------------------------------------- 1 | select HADM_ID, SUBJECT_ID, CHARTDATE, regexp_replace(TEXT, E'[\\n\\r]+', ' ', 'g' ), ICD9_CODES 2 | from W266_DISCHARGE_NOTE_ICD9_CODES; 3 | -------------------------------------------------------------------------------- /pre_processing/psql_files/query_all_discharge_notes: -------------------------------------------------------------------------------- 1 | select HADM_ID, SUBJECT_ID , CHARTDATE , DESCRIPTION, regexp_replace(TEXT, E'[\\n\\r]+', ' ', 'g' ) 2 | from noteevents 3 | where category = 'Discharge summary' 4 | -------------------------------------------------------------------------------- /pre_processing/psql_files/query_icd9_codes: -------------------------------------------------------------------------------- 1 | select HADM_ID, max(SUBJECT_ID) SUBJECT_ID, string_agg(ICD9_CODE, ' ') ICD9_CODES 2 | from diagnoses_icd 3 | where SUBSTRING(ICD9_CODE from 1 for 1) != 'V' and SUBSTRING(ICD9_CODE from 1 for 1) != 'E' 4 | group by HADM_ID 5 | -------------------------------------------------------------------------------- /pre_processing/psql_files/top_icd9_codes.txt: -------------------------------------------------------------------------------- 1 | Top 20 labels based on the number of patients with that label 2 | (not considering ICD9 codes that start with "V" or "E", because they refer to information that is not a diagnosis) 3 | 4 | 5 | icd9_code | subjects_qty 6 | -----------+-------- 7 | 4019 | 17613 8 | 41401 | 10775 9 | 42731 | 10271 10 | 4280 | 9843 11 | 5849 | 7687 12 | 2724 | 7465 13 | 25000 | 7370 14 | 51881 | 6719 15 | 5990 | 5779 16 | 2720 | 5335 17 | 53081 | 5272 18 | 2859 | 4993 19 | 486 | 4423 20 | 2851 | 4241 21 | 2762 | 4177 22 | 2449 | 3819 23 | 496 | 3592 24 | 99592 | 3560 25 | 0389 | 3433 26 | 5070 | 3396 27 | (20 rows) -------------------------------------------------------------------------------- /w266FinalReport_ICD_9_Classification.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zliendo/AI_MedicalNotes/87e1e8eb574b30c801b951e87b5edccc10d2a231/w266FinalReport_ICD_9_Classification.pdf --------------------------------------------------------------------------------