├── .gitignore ├── README.MD ├── baseline ├── .ipynb_checkpoints │ └── mimic_icd9_baseline-checkpoint.ipynb ├── README.md ├── mimic_icd9_baseline.ipynb ├── nn_model.py └── paper_ranking_loss_scores.png ├── data └── readme.md ├── icd9_cnn ├── .ipynb_checkpoints │ ├── Untitled-checkpoint.ipynb │ ├── cnn_top20_leave-checkpoint.ipynb │ └── icd9_cnn_multilabel-checkpoint.ipynb ├── CNN_for_text2.png ├── cnn_model.py ├── cnn_model.pyc ├── cnn_top20_leave.ipynb ├── mimic_CNN_text_classification.png ├── tf_saved │ ├── cnn_trained.data-00000-of-00001 │ ├── cnn_trained.index │ └── cnn_trained.meta ├── utils.py ├── utils.pyc ├── vocabulary.py └── vocabulary.pyc ├── pipeline ├── .ipynb_checkpoints │ ├── Exploration-checkpoint.ipynb │ ├── Temp Guillaume-checkpoint.ipynb │ ├── Temp Guillaume2-checkpoint.ipynb │ ├── icd9_cnn_50K_run-checkpoint.ipynb │ ├── icd9_cnn_att_workbook-checkpoint.ipynb │ ├── icd9_hatt_workbook-checkpoint.ipynb │ └── icd9_lstm_cnn_workbook-checkpoint.ipynb ├── __pycache__ │ ├── database_selection.cpython-35.pyc │ ├── helpers.cpython-35.pyc │ └── vectorization.cpython-35.pyc ├── attention_util.py ├── database_selection.py ├── hatt_model.py ├── helpers.py ├── icd9_cnn_50K_run.ipynb ├── icd9_cnn_att.py ├── icd9_cnn_att_50K_records.ipynb ├── icd9_cnn_att_workbook.ipynb ├── icd9_cnn_model.py ├── icd9_hatt_workbook.ipynb ├── icd9_lstm_att_model.py ├── icd9_lstm_cnn.py ├── icd9_lstm_cnn_workbook.ipynb ├── lstm_model.py └── vectorization.py ├── pre_processing ├── MIMICERdiagram.png ├── README.md └── psql_files │ ├── create_discharge_notes_all_icd9 │ ├── query01_top_icd9_codes.sql │ ├── query02_filter_diagnoses_by_icd9_code.sql │ ├── query03C_icd9_codes_by_admission_create_table.sql │ ├── query03_icd9_codes_by_admission.sql │ ├── query04_filtering_discharge_summary_notes.sql │ ├── query05_discharge_notes_icd9_create_table.sql │ ├── query06_export_w266_table.sql │ ├── query_all_discharge_notes │ ├── query_icd9_codes │ └── top_icd9_codes.txt └── w266FinalReport_ICD_9_Classification.pdf /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | 7 | 8 | 9 | # IPython Notebook 10 | .ipynb_checkpoints 11 | 12 | # pyenv 13 | .python-version 14 | 15 | 16 | 17 | -------------------------------------------------------------------------------- /README.MD: -------------------------------------------------------------------------------- 1 | # Classifying medical notes into standard disease codes 2 | 3 | **August 2017** 4 | 5 | This repository contains the code I implemented to automatically classify EHR patient discharge notes into standard 6 | disease labels (ICD-9 codes). I implemented deep learning models (CNN, LSTM, and hierarchical models) using embeddings and 7 | attention layers. The CNN model with attention outperformed previous algorithms used for this task. 8 | The dataset used for modeling was the [MIMIC III dataset](https://mimic.physionet.org). 9 | 10 | The code was written in August 2017, during my graduate studies in the Master of Information and Data Science (MIDS) program at UC Berkeley.
The class was W266: Natural Language Processing with Deep Learning. 11 | 12 | This is the final project report: [w266FinalReport_ICD_9_Classification.pdf](w266FinalReport_ICD_9_Classification.pdf) 13 | 14 | (note: code refactoring pending) 15 | 16 | ## Preprocessing 17 | Getting information from the database (pulling data, filtering, and joining tables): [Pre-processing](pre_processing) 18 | 19 | ## Main Notebooks 20 | 21 | ### Classification into top-level codes in the ICD-9 hierarchy with 5K records 22 | | Model | ICD 9 code level| N. Records | Epochs | Notebook | 23 | | --- | --- | --- | --- | --- | 24 | | Baseline | First-Level|5K| -|[pipeline/icd9_lstm_cnn_workbook.ipynb](pipeline/icd9_lstm_cnn_workbook.ipynb)
Section: "Super Basic Baseline with top 4" Always predict top 4 icd-9 codes, F1-score= 52.6| 25 | | CNN Replication| First-Level | 5K| 20|[pipeline/icd9_lstm_cnn_workbook.ipynb](pipeline/icd9_lstm_cnn_workbook.ipynb)
Section: "CNN running with 20 epochs". CNN model to replicate results from paper: [Comparing Rule-Based and Deep Learning Models for Patient Phenotyping](https://arxiv.org/abs/1703.08705).In order to compare F1 performance results, I took into consideration the dataset size and number of classes. F1-score= 76.2| 26 | | CNN| Firs-Level| 5K | 5|[pipeline/icd9_lstm_cnn_workbook.ipynb](pipeline/icd9_lstm_cnn_workbook.ipynb)
Section: "CNN running with 5 epochs" running with the 17 first level ICD-9 codes, using 5 epochs and Embeddings. F1-score= 69.1| 27 | | LSTM | First-Level | 5K| 5|[pipeline/icd9_lstm_cnn_workbook.ipynb](pipeline/icd9_lstm_cnn_workbook.ipynb)
Section "Basic LSTM" running with the 17 first level ICD-9 codes, using 5 epochs and Embeddings. F1-score= 64.6 | 28 | 29 | **Attention** 30 | The average length of discharge clinical notes is 1639 words. The text to classify may be too long for a LSTM or CNN to 31 | remember all relevant information. [Raffel et al. (2016)](https://arxiv.org/abs/1512.08756) displayed better performance in many NLP tasks on long text using Attention. Here, we seek to emulate his results by implementing algorithms based on the formulas presented in [Raffel et al. (2016)](https://arxiv.org/abs/1512.08756) and [Yang et al. (2016)](https://www.cs.cmu.edu/~hovy/papers/16HLT-hierarchical-attention-networks.pdf). 32 | 33 | | Model | ICD 9 code level| N. Records | Epochs | Notebook | 34 | | --- | --- | --- | --- | --- | 35 | | LSTM with Attention| First-Level | 5K|5| [pipeline/icd9_lstm_cnn_workbook.ipynb](pipeline/icd9_lstm_cnn_workbook.ipynb)
Section: "LSTM with Attention"
F1-score: 67.0| 36 | | CNN with Attention| First-Level | 5K| 5|[pipeline/icd9_cnn_att_workbook.ipynb](pipeline/icd9_cnn_att_workbook.ipynb)
F1-score: 72.8| 37 | | Hierarchical LSTM Attention | First-Level| 5K|5| [pipeline/icd9_hatt_workbook.ipynb](pipeline/icd9_hatt_workbook.ipynb)
This model was implemented based on [Yang et al. (2016)](https://www.cs.cmu.edu/~hovy/papers/16HLT-hierarchical-attention-networks.pdf), which specifically targets document classification. It has two levels of attention: the first creates a vector representing each sentence by attending over its words, and the second creates a vector representing the document by attending over its sentences. F1-score: 67.6| 38 | 39 | 40 | ### Classification into the most common ICD-9 codes at the bottom of the ICD-9 hierarchy (leaves) 41 | | Model | ICD 9 code level| N. Records | Epochs | Notebook | 42 | | --- | --- | --- | --- | --- | 43 | | Baseline | First-Level |46K and 5K| -|[baseline/mimic_icd9_baseline.ipynb](baseline/mimic_icd9_baseline.ipynb)
- Some Initial Exploration with Python and SQL
- Basic Baseline Model: a fixed prediction of the top 4 most common ICD-9 codes for every note, evaluated on 46K records
- NN Baseline Model: a feed-forward neural network (not recurrent) with one hidden layer, using ReLU activation on the hidden layer and sigmoid activation on the output layer. It uses cross-entropy loss, which is the loss function for multi-label classification (implemented in TensorFlow), trained on 5K records. F1-score: 35 | 44 | | CNN for top 20 leaf ICD-9 codes | Leaf | 46K | 7 | [icd9_cnn/cnn_top20_leave.ipynb](icd9_cnn/cnn_top20_leave.ipynb)
Classifies clinical notes into the 20 most common ICD-9 codes that are at the bottom of the ICD-9 hierarchy (leaves); this run was for comparison with previous work. F1-score: 72.4 | 45 | 46 | ### Classification into top-level codes in the ICD-9 hierarchy with 52.6K records 47 | | Model | ICD 9 code level| N. Records | Epochs | Notebook | 48 | | --- | --- | --- | --- | --- | 49 | | CNN | First-Level | 52.6K | - | [pipeline/icd9_cnn_50K_run.ipynb](/pipeline/icd9_cnn_50K_run.ipynb)
F1-score: 79.7 | 50 | | CNN with Attention | First-Level | 52.6K | - | [pipeline/icd9_cnn_att_50K_records.ipynb](pipeline/icd9_cnn_att_50K_records.ipynb)
F1-score: 78.2. At this stage, the CNN ATT model still overfits: even though it had the highest score during the experimental runs with 5K records and 5 epochs, it didn't reach the best F1-score when running with the full data set. Further work would explore hyper-parameter tuning and evaluating the number of parameters to try to reduce the overfitting.| 51 | 52 | 53 | 54 | ## Model Python modules 55 | 56 | | Model | Python module | 57 | | --- | --- | 58 | | LSTM | [pipeline/lstm_model.py](pipeline/lstm_model.py) | 59 | | CNN | [pipeline/icd9_cnn_model.py](pipeline/icd9_cnn_model.py) | 60 | | Attention Layer |[pipeline/attention_util.py](pipeline/attention_util.py) | 61 | | LSTM_ATT | [pipeline/icd9_lstm_att_model.py](pipeline/icd9_lstm_att_model.py) | 62 | | CNN_ATT | [pipeline/icd9_cnn_att.py](pipeline/icd9_cnn_att.py) | 63 | | Hierarchical LSTM Attention | [pipeline/hatt_model.py](pipeline/hatt_model.py) | 64 | 65 | ## Helper classes for Preprocessing 66 | 67 | | Helper | Python module | 68 | | --- | --- | 69 | | Filters clinical notes to keep the ones that have been assigned at least one of the top N most common ICD-9 codes (this is a multi-label task), removing any code from the label that is not in the top N | [pipeline/database_selection.py](pipeline/database_selection.py) | 70 | | Three main methods: (1) splits the input file into training, validation and test sets, (2) replaces each leaf ICD-9 code with its grandparent in the first level, (3) calculates and displays F1 scores for a set of possible thresholds| [pipeline/helpers.py](pipeline/helpers.py) | 71 | | Functions necessary to vectorize the ICD labels and text inputs (I didn't implement this module; it is listed here because it is used by the notebooks I implemented)| [pipeline/vectorization.py](pipeline/vectorization.py) | 72 | -------------------------------------------------------------------------------- /baseline/README.md: -------------------------------------------------------------------------------- 1 | 2 | ## NN Baseline Model 3 | A neural network (not recurrent) with one hidden layer, with ReLU activation on the hidden layer and sigmoid activation on the output layer. It uses cross-entropy loss, which is the loss function for multi-label classification [4] 4 | 5 | 6 | ## Evaluation 7 | Ranking loss metric to evaluate performance [3] and F1 score [2] 8 | 9 | ## References 10 | [1] Diagnosis code assignment: models and evaluation metrics. Journal of the American Medical Informatics Association 11 | [2] Applying Deep Learning to ICD-9 Multi-label Classification from Medical Records 12 | [3] ICD-9 Coding of Discharge Summaries 13 | [4] Large-scale Multi-label Text Classification - Revisiting Neural Networks 14 | -------------------------------------------------------------------------------- /baseline/nn_model.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | 3 | 4 | def with_self_graph(function): 5 | def wrapper(self, *args, **kwargs): 6 | with self.graph.as_default(): 7 | return function(self, *args, **kwargs) 8 | return wrapper 9 | 10 | 11 | class NNLM(object): 12 | def __init__(self, graph=None, *args, **kwargs): 13 | # Set TensorFlow graph. All TF code will work on this graph. 14 | self.graph = graph or tf.Graph() 15 | self.SetParams(*args, **kwargs) 16 | 17 | 18 | 19 | @with_self_graph 20 | def SetParams(self, Hidden_dims, learning_rate, vocabulary_size, y_dim): 21 | # Model structure; these need to be fixed for a given model.
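        # Added comments on the constructor arguments (inferred from how they are used below in this module):
        #   Hidden_dims: list of hidden-layer sizes consumed by fully_connected_layers
        #   learning_rate: step size for the gradient-descent optimizer built in BuildTrainGraph
        #   vocabulary_size: width of the bag-of-words input placeholder (stored as self.V)
        #   y_dim: size of the multi-label output, i.e. the number of ICD-9 codes being predicted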
22 | self.Hidden_dims = Hidden_dims 23 | self.learning_rate = learning_rate 24 | self.V =vocabulary_size 25 | self.y_dim = y_dim 26 | 27 | @with_self_graph 28 | def affine_layer(self, hidden_dim, x, seed=0): 29 | self.W = tf.get_variable("W", initializer=tf.contrib.layers.xavier_initializer(seed = seed), \ 30 | trainable=True,shape=[x.shape[1],hidden_dim]) 31 | self.b = tf.get_variable("b", initializer=tf.zeros_initializer(), \ 32 | trainable=True,shape=[hidden_dim]) 33 | return tf.matmul(x,self.W) + self.b 34 | 35 | @with_self_graph 36 | def fully_connected_layers(self,x): 37 | for i in range(len(self.Hidden_dims)): 38 | with tf.variable_scope("layer_" + str(i)): 39 | x = tf.nn.relu(self.affine_layer(self.Hidden_dims[i], x)) 40 | return x 41 | @with_self_graph 42 | def BuildCoreGraph(self): 43 | self.x = tf.placeholder(tf.float32, shape=[None, self.V]) 44 | self.target_y = tf.placeholder(tf.float32, shape=[None,None]) 45 | 46 | z = self.fully_connected_layers(self.x) 47 | 48 | self.y_logit = tf.squeeze(self.affine_layer(self.y_dim,z)) 49 | self.y_hat = tf.sigmoid(self.y_logit) 50 | 51 | self.loss = tf.reduce_mean (tf.nn.sigmoid_cross_entropy_with_logits(labels=self.target_y, logits=self.y_logit)) 52 | 53 | @with_self_graph 54 | def BuildTrainGraph(self): 55 | optimizer = tf.train.GradientDescentOptimizer(self.learning_rate) 56 | self.train = optimizer.minimize(self.loss) 57 | -------------------------------------------------------------------------------- /baseline/paper_ranking_loss_scores.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zliendo/AI_MedicalNotes/87e1e8eb574b30c801b951e87b5edccc10d2a231/baseline/paper_ranking_loss_scores.png -------------------------------------------------------------------------------- /data/readme.md: -------------------------------------------------------------------------------- 1 | We are not including the MIMIC data here because it needs authorization from https://mimic.physionet.org/gettingstarted/access/ 2 | 3 | 4 | 5 | -------------------------------------------------------------------------------- /icd9_cnn/.ipynb_checkpoints/Untitled-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [], 3 | "metadata": {}, 4 | "nbformat": 4, 5 | "nbformat_minor": 2 6 | } 7 | -------------------------------------------------------------------------------- /icd9_cnn/.ipynb_checkpoints/cnn_top20_leave-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": { 7 | "collapsed": false 8 | }, 9 | "outputs": [ 10 | { 11 | "name": "stderr", 12 | "output_type": "stream", 13 | "text": [ 14 | "Using TensorFlow backend.\n" 15 | ] 16 | } 17 | ], 18 | "source": [ 19 | "# General imports\n", 20 | "import numpy as np\n", 21 | "import pandas as pd\n", 22 | "from sklearn.metrics import f1_score\n", 23 | "\n", 24 | "# Custom functions\n", 25 | "%load_ext autoreload\n", 26 | "%autoreload 2\n", 27 | "import database_selection\n", 28 | "import vectorization\n", 29 | "import helpers\n", 30 | "\n", 31 | "#keras\n", 32 | "from keras.models import Sequential, Model\n", 33 | "from keras.layers import Dense, Dropout, Flatten, Input, MaxPooling1D, Convolution1D, Embedding\n", 34 | "from keras.layers.merge import Concatenate\n" 35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": 21, 40 | "metadata": { 
41 | "collapsed": true 42 | }, 43 | "outputs": [], 44 | "source": [ 45 | "df = pd.read_csv('../data/disch_notes_all_icd9.csv',\n", 46 | " names = ['HADM_ID', 'SUBJECT_ID', 'DATE', 'ICD9','TEXT'])" 47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": 22, 52 | "metadata": { 53 | "collapsed": false 54 | }, 55 | "outputs": [], 56 | "source": [ 57 | "N_TOP = 10 \n", 58 | "full_df, top_codes = database_selection.filter_top_codes(df, 'ICD9', N_TOP, filter_empty = True)\n", 59 | "df = full_df.head(1000)" 60 | ] 61 | }, 62 | { 63 | "cell_type": "code", 64 | "execution_count": 23, 65 | "metadata": { 66 | "collapsed": false 67 | }, 68 | "outputs": [], 69 | "source": [ 70 | "#preprocess icd9 codes\n", 71 | "labels = vectorization.vectorize_icd_column(df, 'ICD9', top_codes)\n" 72 | ] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "execution_count": 24, 77 | "metadata": { 78 | "collapsed": false 79 | }, 80 | "outputs": [ 81 | { 82 | "name": "stdout", 83 | "output_type": "stream", 84 | "text": [ 85 | "Vocabulary size: 22330\n", 86 | "Average note length: 1767.581\n", 87 | "Max note length: 5641\n", 88 | "Final Vocabulary: 22330\n", 89 | "Final Max Sequence Length: 5000\n" 90 | ] 91 | } 92 | ], 93 | "source": [ 94 | "#preprocess notes\n", 95 | "MAX_VOCAB = None # to limit original number of words (None if no limit)\n", 96 | "MAX_SEQ_LENGTH = 5000 # to limit length of word sequence (None if no limit)\n", 97 | "df.TEXT = vectorization.clean_notes(df, 'TEXT')\n", 98 | "data, dictionary, MAX_VOCAB = vectorization.vectorize_notes(df.TEXT, MAX_VOCAB, verbose = True)\n", 99 | "data, MAX_SEQ_LENGTH = vectorization.pad_notes(data, MAX_SEQ_LENGTH)\n", 100 | "print(\"Final Vocabulary: %s\" % MAX_VOCAB)\n", 101 | "print(\"Final Max Sequence Length: %s\" % MAX_SEQ_LENGTH)" 102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "execution_count": 25, 107 | "metadata": { 108 | "collapsed": false 109 | }, 110 | "outputs": [ 111 | { 112 | "name": "stdout", 113 | "output_type": "stream", 114 | "text": [ 115 | "('Train: ', (699, 5000), (699, 10))\n", 116 | "('Validation: ', (200, 5000), (200, 10))\n", 117 | "('Test: ', (101, 5000), (101, 10))\n" 118 | ] 119 | } 120 | ], 121 | "source": [ 122 | "#split sets\n", 123 | "X_train, X_val, X_test, y_train, y_val, y_test = helpers.train_val_test_split(\n", 124 | " data, labels, val_size=0.2, test_size=0.1, random_state=101)\n", 125 | "print(\"Train: \", X_train.shape, y_train.shape)\n", 126 | "print(\"Validation: \", X_val.shape, y_val.shape)\n", 127 | "print(\"Test: \", X_test.shape, y_test.shape)" 128 | ] 129 | }, 130 | { 131 | "cell_type": "code", 132 | "execution_count": 26, 133 | "metadata": { 134 | "collapsed": false 135 | }, 136 | "outputs": [], 137 | "source": [ 138 | "# Delete temporary variables to free some memory\n", 139 | "del df, data, labels" 140 | ] 141 | }, 142 | { 143 | "cell_type": "code", 144 | "execution_count": 27, 145 | "metadata": { 146 | "collapsed": false 147 | }, 148 | "outputs": [ 149 | { 150 | "name": "stdout", 151 | "output_type": "stream", 152 | "text": [ 153 | "('Vocabulary in notes:', 22330)\n", 154 | "('Vocabulary in original embedding:', 400000)\n", 155 | "('Vocabulary intersection:', 14239)\n" 156 | ] 157 | } 158 | ], 159 | "source": [ 160 | "#creating embeddings\n", 161 | "EMBEDDING_LOC = '../data/glove.6B.100d.txt' # location of embedding\n", 162 | "EMBEDDING_DIM = 100 # given the glove that we chose\n", 163 | "embedding_matrix, embedding_dict = vectorization.embedding_matrix(EMBEDDING_LOC,\n", 164 | " dictionary, 
EMBEDDING_DIM, verbose = True)\n" 165 | ] 166 | }, 167 | { 168 | "cell_type": "markdown", 169 | "metadata": {}, 170 | "source": [ 171 | "## CNN for text classification\n", 172 | "\n", 173 | "Based on the following papers and links:\n", 174 | "* \"Convolutional Neural Networks for Sentence Classification\" \n", 175 | "* \"A Sensitivity Analysis of (and Practitioners� Guide to) Convolutional Neural Networks for Sentence Classification\"\n", 176 | "* http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/\n", 177 | "* https://github.com/alexander-rakhlin/CNN-for-Sentence-Classification-in-Keras/blob/master/sentiment_cnn.py" 178 | ] 179 | }, 180 | { 181 | "cell_type": "code", 182 | "execution_count": 28, 183 | "metadata": { 184 | "collapsed": true 185 | }, 186 | "outputs": [], 187 | "source": [ 188 | "#### set parameters:\n", 189 | "num_filters = 100\n", 190 | "filter_sizes = [2,3,4,5]\n", 191 | "training_dropout_keep_prob = 0.9\n", 192 | "num_classes=N_TOP\n", 193 | "batch_size = 50\n", 194 | "epochs = 5\n", 195 | "external_embeddings = False\n", 196 | "EMBEDDING_TRAINABLE = True" 197 | ] 198 | }, 199 | { 200 | "cell_type": "code", 201 | "execution_count": 29, 202 | "metadata": { 203 | "collapsed": false 204 | }, 205 | "outputs": [ 206 | { 207 | "name": "stdout", 208 | "output_type": "stream", 209 | "text": [ 210 | "Train on 699 samples, validate on 200 samples\n", 211 | "Epoch 1/5\n", 212 | "25s - loss: 0.6631 - acc: 0.7506 - val_loss: 0.6439 - val_acc: 0.7445\n", 213 | "Epoch 2/5\n", 214 | "26s - loss: 0.6463 - acc: 0.7506 - val_loss: 0.6408 - val_acc: 0.7445\n", 215 | "Epoch 3/5\n", 216 | "26s - loss: 0.6409 - acc: 0.7509 - val_loss: 0.6390 - val_acc: 0.7445\n", 217 | "Epoch 4/5\n", 218 | "25s - loss: 0.6372 - acc: 0.7506 - val_loss: 0.6373 - val_acc: 0.7445\n", 219 | "Epoch 5/5\n", 220 | "26s - loss: 0.6311 - acc: 0.7506 - val_loss: 0.6361 - val_acc: 0.7445\n" 221 | ] 222 | }, 223 | { 224 | "data": { 225 | "text/plain": [ 226 | "" 227 | ] 228 | }, 229 | "execution_count": 29, 230 | "metadata": {}, 231 | "output_type": "execute_result" 232 | } 233 | ], 234 | "source": [ 235 | "#Embedding\n", 236 | "model_input = Input(shape=(MAX_SEQ_LENGTH, ))\n", 237 | "if external_embeddings:\n", 238 | " # use embedding_matrix plus local training\n", 239 | " z = Embedding(MAX_VOCAB + 1,\n", 240 | " EMBEDDING_DIM,\n", 241 | " weights=[embedding_matrix],\n", 242 | " input_length=MAX_SEQ_LENGTH,\n", 243 | " trainable=EMBEDDING_TRAINABLE)(model_input)\n", 244 | "else:\n", 245 | " # train embeddings \n", 246 | " z = Embedding(MAX_VOCAB + 1, \n", 247 | " EMBEDDING_DIM, \n", 248 | " input_length=MAX_SEQ_LENGTH, \n", 249 | " name=\"embedding\")(model_input)\n", 250 | "\n", 251 | "# Convolutional block\n", 252 | "conv_blocks = []\n", 253 | "for sz in filter_sizes:\n", 254 | " conv = Convolution1D(filters=num_filters,\n", 255 | " kernel_size=sz,\n", 256 | " padding=\"valid\",\n", 257 | " activation=\"relu\",\n", 258 | " strides=1)(z)\n", 259 | " window_pool_size = MAX_SEQ_LENGTH - sz + 1 \n", 260 | " conv = MaxPooling1D(pool_size=window_pool_size)(conv) \n", 261 | " conv = Flatten()(conv)\n", 262 | " conv_blocks.append(conv)\n", 263 | "\n", 264 | "#concatenate\n", 265 | "z = Concatenate()(conv_blocks) if len(conv_blocks) > 1 else conv_blocks[0]\n", 266 | "z = Dropout(training_dropout_keep_prob)(z)\n", 267 | "\n", 268 | "#score prediction\n", 269 | "#z = Dense(num_classes, activation=\"relu\")(z) I don't think this is necessary\n", 270 | "model_output = Dense(num_classes, 
activation=\"softmax\")(z)\n", 271 | "\n", 272 | "#creating model\n", 273 | "model = Model(model_input, model_output)\n", 274 | "# what to use for tf.nn.softmax_cross_entropy_with_logits?\n", 275 | "model.compile(loss=\"binary_crossentropy\", optimizer=\"adam\", metrics=[\"accuracy\"])\n", 276 | "\n", 277 | "# Train the model\n", 278 | "model.fit(X_train, y_train, batch_size=batch_size, epochs=epochs,\n", 279 | "validation_data=(X_val, y_val), verbose=2)" 280 | ] 281 | }, 282 | { 283 | "cell_type": "code", 284 | "execution_count": 30, 285 | "metadata": { 286 | "collapsed": false 287 | }, 288 | "outputs": [], 289 | "source": [ 290 | "pred_train = model.predict(X_train, batch_size=50)\n", 291 | "pred_dev = model.predict(X_val, batch_size=50)" 292 | ] 293 | }, 294 | { 295 | "cell_type": "code", 296 | "execution_count": 31, 297 | "metadata": { 298 | "collapsed": false 299 | }, 300 | "outputs": [ 301 | { 302 | "name": "stdout", 303 | "output_type": "stream", 304 | "text": [ 305 | "F1 scores\n", 306 | "threshold | training | dev \n", 307 | "0.020: 0.399 0.407\n", 308 | "0.030: 0.399 0.407\n", 309 | "0.040: 0.399 0.407\n", 310 | "0.050: 0.408 0.413\n", 311 | "0.055: 0.433 0.420\n", 312 | "0.058: 0.437 0.430\n", 313 | "0.060: 0.432 0.427\n", 314 | "0.080: 0.501 0.463\n", 315 | "0.100: 0.446 0.463\n", 316 | "0.200: 0.206 0.066\n", 317 | "0.300: 0.000 0.000\n", 318 | "0.500: 0.000 0.000\n" 319 | ] 320 | } 321 | ], 322 | "source": [ 323 | "def get_f1_score(y_true,y_hat,threshold, average):\n", 324 | " hot_y = np.where(np.array(y_hat) > threshold, 1, 0)\n", 325 | " return f1_score(np.array(y_true), hot_y, average=average)\n", 326 | "\n", 327 | "print 'F1 scores'\n", 328 | "print 'threshold | training | dev '\n", 329 | "f1_score_average = 'micro'\n", 330 | "for threshold in [ 0.02, 0.03,0.04,0.05,0.055,0.058,0.06, 0.08, 0.1,0.2,0.3, 0.5]:\n", 331 | " train_f1 = get_f1_score(y_train, pred_train,threshold,f1_score_average)\n", 332 | " dev_f1 = get_f1_score(y_val, pred_dev,threshold,f1_score_average)\n", 333 | " print '%1.3f: %1.3f %1.3f' % (threshold,train_f1, dev_f1)" 334 | ] 335 | }, 336 | { 337 | "cell_type": "markdown", 338 | "metadata": {}, 339 | "source": [ 340 | "### Results with external embeddings = True , no additional training, top 20\n", 341 | "```\n", 342 | "F1 scores\n", 343 | "threshold | training | dev \n", 344 | "0.020: 0.337 0.329\n", 345 | "0.030: 0.360 0.353\n", 346 | "0.040: 0.365 0.374\n", 347 | "0.050: 0.372 0.375\n", 348 | "0.055: 0.370 0.377\n", 349 | "0.058: 0.369 0.375\n", 350 | "0.060: 0.368 0.375\n", 351 | "0.080: 0.348 0.361\n", 352 | "0.100: 0.309 0.319\n", 353 | "0.200: 0.198 0.208\n", 354 | "0.300: 0.157 0.138\n", 355 | "0.500: 0.000 0.000\n", 356 | "```\n", 357 | "\n", 358 | "### Results with external embeddings = False, top 20\n", 359 | "```\n", 360 | "F1 scores\n", 361 | "threshold | training | dev \n", 362 | "0.020: 0.288 0.300\n", 363 | "0.030: 0.327 0.322\n", 364 | "0.040: 0.371 0.363\n", 365 | "0.050: 0.380 0.391\n", 366 | "0.055: 0.412 0.383\n", 367 | "0.058: 0.403 0.394\n", 368 | "0.060: 0.394 0.389\n", 369 | "0.080: 0.385 0.390\n", 370 | "0.100: 0.229 0.225\n", 371 | "0.200: 0.000 0.000\n", 372 | "0.300: 0.000 0.000\n", 373 | "0.500: 0.000 0.000\n", 374 | "```\n", 375 | "\n", 376 | "### Results with external embedding and training them , top 20\n", 377 | "```\n", 378 | "F1 scores\n", 379 | "threshold | training | dev \n", 380 | "0.020: 0.334 0.333\n", 381 | "0.030: 0.362 0.360\n", 382 | "0.040: 0.366 0.374\n", 383 | "0.050: 0.373 0.380\n", 384 | "0.055: 0.374 
0.382\n", 385 | "0.058: 0.376 0.376\n", 386 | "0.060: 0.376 0.378\n", 387 | "0.080: 0.387 0.371\n", 388 | "0.100: 0.366 0.350\n", 389 | "0.200: 0.179 0.171\n", 390 | "0.300: 0.020 0.020\n", 391 | "0.500: 0.000 0.000\n", 392 | "\n", 393 | "```\n", 394 | "\n", 395 | "### Results with external Embeddings = False, top 10, \n", 396 | "We can compare this setup with the LSTM published in the paper \"Applying Deep Learning to ICD-9 Multi-label Classification from Medical Records\", they got a F1-score of about 0.4168, we are getting 0.447\n", 397 | "\n", 398 | "``` \n", 399 | "F1 scores\n", 400 | "threshold | training | dev \n", 401 | "0.020: 0.399 0.407\n", 402 | "0.030: 0.399 0.407\n", 403 | "0.040: 0.399 0.407\n", 404 | "0.050: 0.408 0.413\n", 405 | "0.055: 0.433 0.420\n", 406 | "0.058: 0.437 0.430\n", 407 | "0.060: 0.432 0.427\n", 408 | "0.080: 0.501 0.463\n", 409 | "0.100: 0.446 0.463\n", 410 | "0.200: 0.206 0.066\n", 411 | "0.300: 0.000 0.000\n", 412 | "0.500: 0.000 0.000\n", 413 | "```\n", 414 | "\n" 415 | ] 416 | }, 417 | { 418 | "cell_type": "markdown", 419 | "metadata": { 420 | "collapsed": true 421 | }, 422 | "source": [ 423 | "## Notes:\n", 424 | "\n", 425 | "\n", 426 | "(1) There is a LSTM model by this paper: \"Applying Deep Learning to ICD-9 Multi-label Classification from Medical Records\" which did achieve a 42% F1-score. (https://cs224d.stanford.edu/reports/priyanka.pdf), but it only uses the top 10 icd9 codes. We are getting 46% (just running with 1000 notes so far)\n", 427 | "\n", 428 | "\n", 429 | "(2) The \"A Comparison of Rule-Based and Deep Learning Models for Patient Phenotyping\" study did get a 70% F1-score, but they don't use the icd9-labels but phenotypes labels they annotated themselved (via a group of medical professionals). (https://arxiv.org/abs/1703.08705). There were ONLY 10 phenotypes.\n", 430 | "\n", 431 | "The discharge summaries are labeled with ICD9-codes that are leaves in the ICD9-hierarchy (which has hundreds of ICD9-codes), then maybe these leave nodes are too specific and difficult to predict, one experiment would be to replaced all the ICD9-codes with their parent in the second or third level in the hierarchy and see if predictions work better that way. \n", 432 | "\n", 433 | "(3) our baseline with top 20 codes had a f1-score of 35% (assigning top 4 icd9 codes to all notes, using a CNN with no external embeddings is getting about 40% f1-score.. 
a little better than the baseline" 434 | ] 435 | }, 436 | { 437 | "cell_type": "code", 438 | "execution_count": null, 439 | "metadata": { 440 | "collapsed": true 441 | }, 442 | "outputs": [], 443 | "source": [] 444 | } 445 | ], 446 | "metadata": { 447 | "kernelspec": { 448 | "display_name": "Python 2", 449 | "language": "python", 450 | "name": "python2" 451 | }, 452 | "language_info": { 453 | "codemirror_mode": { 454 | "name": "ipython", 455 | "version": 2 456 | }, 457 | "file_extension": ".py", 458 | "mimetype": "text/x-python", 459 | "name": "python", 460 | "nbconvert_exporter": "python", 461 | "pygments_lexer": "ipython2", 462 | "version": "2.7.13" 463 | } 464 | }, 465 | "nbformat": 4, 466 | "nbformat_minor": 2 467 | } 468 | -------------------------------------------------------------------------------- /icd9_cnn/.ipynb_checkpoints/icd9_cnn_multilabel-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": { 7 | "collapsed": false 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "import csv\n", 12 | "import random\n", 13 | "import numpy as np\n", 14 | "from collections import Counter, defaultdict\n", 15 | "from sklearn.feature_extraction.text import *\n", 16 | "import re\n", 17 | "from tensorflow.contrib import learn\n", 18 | "import sys, os\n", 19 | "import tensorflow as tf\n", 20 | "import cnn_model\n", 21 | "import utils\n", 22 | "\n", 23 | "from sklearn.metrics import label_ranking_loss\n", 24 | "from sklearn.metrics import f1_score\n", 25 | "import shutil" 26 | ] 27 | }, 28 | { 29 | "cell_type": "markdown", 30 | "metadata": {}, 31 | "source": [ 32 | "General Sources:\n", 33 | "http://ruder.io/deep-learning-nlp-best-practices/index.html#classification" 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | "### Reading File" 41 | ] 42 | }, 43 | { 44 | "cell_type": "code", 45 | "execution_count": 2, 46 | "metadata": { 47 | "collapsed": false 48 | }, 49 | "outputs": [ 50 | { 51 | "name": "stdout", 52 | "output_type": "stream", 53 | "text": [ 54 | "Number of records in the dataset: 45837\n" 55 | ] 56 | } 57 | ], 58 | "source": [ 59 | "#with open('../../../psql_files/disch_notes_all_icd9.csv', 'rb') as csvfile:\n", 60 | "csv.field_size_limit(sys.maxsize)\n", 61 | "with open('../baseline/psql_files/dis_notes_icd9.csv', 'rb') as csvfile:\n", 62 | " discharge_notes_reader = csv.reader(csvfile)\n", 63 | " discharge_notes_list = list(discharge_notes_reader) \n", 64 | "random.shuffle(discharge_notes_list)\n", 65 | "\n", 66 | "print \"Number of records in the dataset: \", len (discharge_notes_list)" 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "metadata": {}, 72 | "source": [ 73 | "we will take only 10,000 records to compare with NN baseline" 74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": 3, 79 | "metadata": { 80 | "collapsed": true 81 | }, 82 | "outputs": [], 83 | "source": [ 84 | "#starting for 1,000 just for programming\n", 85 | "number_records = 1000" 86 | ] 87 | }, 88 | { 89 | "cell_type": "code", 90 | "execution_count": 4, 91 | "metadata": { 92 | "collapsed": false 93 | }, 94 | "outputs": [ 95 | { 96 | "name": "stdout", 97 | "output_type": "stream", 98 | "text": [ 99 | "Number of discharge clinical notes: 1000\n" 100 | ] 101 | } 102 | ], 103 | "source": [ 104 | "discharge_notes_icd9 = np.asarray(discharge_notes_list[0:number_records])\n", 105 | "print 'Number of discharge 
clinical notes: ', len(discharge_notes_icd9)\n", 106 | "discharge_notes= discharge_notes_icd9[:,3]\n", 107 | "discharge_labels = discharge_notes_icd9[:,4]" 108 | ] 109 | }, 110 | { 111 | "cell_type": "markdown", 112 | "metadata": {}, 113 | "source": [ 114 | "## Pre Processing" 115 | ] 116 | }, 117 | { 118 | "cell_type": "markdown", 119 | "metadata": {}, 120 | "source": [ 121 | "## Stats about Notes (TODO:)\n", 122 | "* vocabulary of size\n", 123 | "* find out notes that are too large, outliers to take out (otherwise the embeddings will pad a lot of zeroes to the other note-vectors(" 124 | ] 125 | }, 126 | { 127 | "cell_type": "markdown", 128 | "metadata": {}, 129 | "source": [ 130 | "## Converting icd9 labels to vectors" 131 | ] 132 | }, 133 | { 134 | "cell_type": "code", 135 | "execution_count": 5, 136 | "metadata": { 137 | "collapsed": true 138 | }, 139 | "outputs": [], 140 | "source": [ 141 | "#transforming list of icd_codes into a vector\n", 142 | "def get_icd9_array(icd9_codes):\n", 143 | " icd9_index_array = [0]*len(unique_icd9_codes)\n", 144 | " for icd9_code in icd9_codes.split():\n", 145 | " index = icd9_to_id [icd9_code]\n", 146 | " icd9_index_array[index] = 1\n", 147 | " return icd9_index_array" 148 | ] 149 | }, 150 | { 151 | "cell_type": "code", 152 | "execution_count": 6, 153 | "metadata": { 154 | "collapsed": false 155 | }, 156 | "outputs": [ 157 | { 158 | "name": "stdout", 159 | "output_type": "stream", 160 | "text": [ 161 | "Counter({'4019': 428, '41401': 297, '4280': 292, '42731': 272, '2724': 211, '5849': 210, '25000': 207, '51881': 172, '53081': 148, '5990': 142, '2449': 133, '2859': 118, '486': 118, '2720': 117, '2762': 112, '496': 97, '5070': 88, '2851': 87, '99592': 78, '0389': 68})\n", 162 | " \n", 163 | "List of unique icd9 codes from all labels: ['2859', '99592', '4019', '2724', '25000', '2720', '2851', '2762', '2449', '4280', '0389', '41401', '42731', '5849', '53081', '486', '5070', '496', '51881', '5990']\n" 164 | ] 165 | } 166 | ], 167 | "source": [ 168 | "#counts by icd9_codes\n", 169 | "icd9_codes = Counter()\n", 170 | "for label in discharge_labels:\n", 171 | " for icd9_code in label.split():\n", 172 | " icd9_codes[icd9_code] += 1\n", 173 | "print icd9_codes\n", 174 | "\n", 175 | "# list of unique icd9_codes and lookups for its index in the vector\n", 176 | "unique_icd9_codes = list (icd9_codes)\n", 177 | "index_to_icd9 = dict(enumerate(unique_icd9_codes))\n", 178 | "icd9_to_id = {v:k for k,v in index_to_icd9.iteritems()}\n", 179 | "print ' '\n", 180 | "print 'List of unique icd9 codes from all labels: ', unique_icd9_codes\n", 181 | "\n", 182 | "#convert icd9 codes into ids\n", 183 | "labels_vector= list(map(get_icd9_array,discharge_labels))" 184 | ] 185 | }, 186 | { 187 | "cell_type": "markdown", 188 | "metadata": {}, 189 | "source": [ 190 | "## Pre-processing notes" 191 | ] 192 | }, 193 | { 194 | "cell_type": "markdown", 195 | "metadata": {}, 196 | "source": [ 197 | "https://github.com/yoonkim/CNN_sentence/blob/master/process_data.py\n", 198 | "\n", 199 | "\n", 200 | "(1) Clean the text data using the same code as the original paper.\n", 201 | "https://github.com/yoonkim/CNN_sentence\n", 202 | "\n", 203 | "(2) Pad each note to the maximum note length, which turns out to be NN. We append special tokens to all other notes to make them NN words. 
Padding sentences to the same length is useful because it allows us to efficiently batch our data since each example in a batch must be of the same length.\n", 204 | "(3) Build a vocabulary index and map each word to an integer between 0 and 18,765 (the vocabulary size). Each sentence becomes a vector of integers" 205 | ] 206 | }, 207 | { 208 | "cell_type": "code", 209 | "execution_count": 7, 210 | "metadata": { 211 | "collapsed": false 212 | }, 213 | "outputs": [], 214 | "source": [ 215 | "def clean_str(string):\n", 216 | " \"\"\"\n", 217 | " Tokenization/string cleaning for all datasets except for SST.\n", 218 | " Original taken from https://github.com/yoonkim/CNN_sentence/blob/master/process_data.py\n", 219 | " \"\"\"\n", 220 | " string = re.sub(r\"[^A-Za-z0-9(),!?\\'\\`]\", \" \", string)\n", 221 | " string = re.sub(r\"\\'s\", \" \\'s\", string)\n", 222 | " string = re.sub(r\"\\'ve\", \" \\'ve\", string)\n", 223 | " string = re.sub(r\"n\\'t\", \" n\\'t\", string)\n", 224 | " string = re.sub(r\"\\'re\", \" \\'re\", string)\n", 225 | " string = re.sub(r\"\\'d\", \" \\'d\", string)\n", 226 | " string = re.sub(r\"\\'ll\", \" \\'ll\", string)\n", 227 | " string = re.sub(r\",\", \" , \", string)\n", 228 | " string = re.sub(r\"!\", \" ! \", string)\n", 229 | " string = re.sub(r\"\\(\", \" \\( \", string)\n", 230 | " string = re.sub(r\"\\)\", \" \\) \", string)\n", 231 | " string = re.sub(r\"\\?\", \" \\? \", string)\n", 232 | " string = re.sub(r\"\\s{2,}\", \" \", string)\n", 233 | " return string.strip().lower()\n", 234 | "\n", 235 | "def note_preprocessing(data_notes):\n", 236 | " notes_stripped = [s.strip() for s in data_notes]\n", 237 | " notes_clean = [clean_str(note) for note in notes_stripped ]\n", 238 | " notes_canonicalized = [\" \".join (utils.canonicalize_words(note.split(\" \"))) for note in notes_clean ]\n", 239 | " \n", 240 | " note_words_length = [len(x.split(\" \")) for x in notes_canonicalized]\n", 241 | " max_document_length = max( note_words_length) \n", 242 | " average_length = np.mean(note_words_length)\n", 243 | " return max_document_length, average_length, notes_canonicalized" 244 | ] 245 | }, 246 | { 247 | "cell_type": "code", 248 | "execution_count": 8, 249 | "metadata": { 250 | "collapsed": false 251 | }, 252 | "outputs": [ 253 | { 254 | "name": "stdout", 255 | "output_type": "stream", 256 | "text": [ 257 | " max document length: 7047\n", 258 | "average document length: 1908.263\n", 259 | "Vocabulary_size: 23244\n" 260 | ] 261 | } 262 | ], 263 | "source": [ 264 | "#preprocess documents\n", 265 | "max_document_length, average_document_length, notes_processed = note_preprocessing(discharge_notes)\n", 266 | "\n", 267 | "\n", 268 | "print ' max document length: ', max_document_length\n", 269 | "print 'average document length: ', average_document_length\n", 270 | "\n", 271 | "#create vocabulary processor\n", 272 | "vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length)\n", 273 | " \n", 274 | "# convert words to ids, and each document is padded\n", 275 | "notes_ids = np.array(list(vocab_processor.fit_transform(notes_processed)))\n", 276 | "\n", 277 | "# vocabulary size\n", 278 | "vocabulary_size = len(vocab_processor.vocabulary_)\n", 279 | "print 'Vocabulary_size: ', vocabulary_size" 280 | ] 281 | }, 282 | { 283 | "cell_type": "code", 284 | "execution_count": 9, 285 | "metadata": { 286 | "collapsed": false 287 | }, 288 | "outputs": [], 289 | "source": [ 290 | "#notes_processed[0]" 291 | ] 292 | }, 293 | { 294 | "cell_type": "markdown", 295 | 
"metadata": {}, 296 | "source": [ 297 | "### question?\n", 298 | "VVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV \n", 299 | "what do we do if the test data has a document with a bigger length than the max for the padding? " 300 | ] 301 | }, 302 | { 303 | "cell_type": "markdown", 304 | "metadata": {}, 305 | "source": [ 306 | "### transforming to embeddings using word2vec\n", 307 | "\n", 308 | "From: \"A Comparison of Rule-Based and Deep Learning Models for Patient Phenotyping\"\n", 309 | "\n", 310 | "\"We pre-train our embeddings with word2vec on all discharge notes available in the MIMIC-III database. \n", 311 | "The word embeddings of all words in the text to classify are concatenated and used as input to the\n", 312 | "convolutional layer. Convolutions detect a signal from a combination of adjacent inputs. We\n", 313 | "combine multiple convolutions of different lengths to evaluate phrases that are anywhere from\n", 314 | "two to five words long,\" \n", 315 | "\n", 316 | "(tf-idf is removing negations.. embedding is taking care of mispellings.. we may need further training-tuning because of medical terms)" 317 | ] 318 | }, 319 | { 320 | "cell_type": "markdown", 321 | "metadata": {}, 322 | "source": [ 323 | "https://code.google.com/archive/p/word2vec/\n", 324 | " \n", 325 | "Pre-trained word and phrase vectors\n", 326 | "\n", 327 | "\"We are publishing pre-trained vectors trained on part of Google News dataset (about 100 billion words). The model contains 300-dimensional vectors for 3 million words and phrases. The phrases were obtained using a simple data-driven approach described in [2]. The archive is available here: GoogleNews-vectors-negative300.bin.gz.\" \n", 328 | "\n", 329 | "### for now we wil train our own embeddings, but word2vec will be better" 330 | ] 331 | }, 332 | { 333 | "cell_type": "markdown", 334 | "metadata": {}, 335 | "source": [ 336 | "## Split Files" 337 | ] 338 | }, 339 | { 340 | "cell_type": "code", 341 | "execution_count": 9, 342 | "metadata": { 343 | "collapsed": false 344 | }, 345 | "outputs": [ 346 | { 347 | "name": "stdout", 348 | "output_type": "stream", 349 | "text": [ 350 | "Training set samples: 700\n", 351 | "Dev set samples: 150\n", 352 | "Test set samples: 150\n" 353 | ] 354 | } 355 | ], 356 | "source": [ 357 | "def split_file(data, train_frac = 0.7, dev_frac = 0.15): \n", 358 | " train_split_idx = int(train_frac * len(data))\n", 359 | " dev_split_idx = int ((train_frac + dev_frac)* len(data))\n", 360 | " train_data = data[:train_split_idx]\n", 361 | " dev_data = data[train_split_idx:dev_split_idx]\n", 362 | " test_data = data[dev_split_idx:]\n", 363 | " return train_data, dev_data, test_data\n", 364 | "\n", 365 | "\n", 366 | "train_notes, dev_notes, test_notes = split_file (notes_ids)\n", 367 | "train_labels, dev_labels, test_labels = split_file (labels_vector)\n", 368 | "print 'Training set samples:', len (train_notes)\n", 369 | "print 'Dev set samples:', len (dev_notes)\n", 370 | "print 'Test set samples:', len (test_notes)" 371 | ] 372 | }, 373 | { 374 | "cell_type": "markdown", 375 | "metadata": {}, 376 | "source": [] 377 | }, 378 | { 379 | "cell_type": "markdown", 380 | "metadata": {}, 381 | "source": [ 382 | "## CNN Training\n", 383 | "\n", 384 | "here is an example of a CNN to classify text.. 
our model will have different values for d (embedding-size, region sizes, etc)\n", 385 | "" 386 | ] 387 | }, 388 | { 389 | "cell_type": "markdown", 390 | "metadata": {}, 391 | "source": [ 392 | "This is the CNN used with the MIMIC discharge summaries\n", 393 | "\n", 394 | "\n", 395 | "\n", 396 | "\"For the CNN model, we used 100 filters for each of the widths 2, 3, 4, and 5. \n", 397 | "To prevent overfitting, we set the dropout probability to 0.5 and used L2-normalization to normalize word\n", 398 | "embeddings to have a max norm of 3.64 \n", 399 | "The model was trained using adadelta with an initial learning rate of 1 for 20 epochs. \n", 400 | "The CNN model was implemented using Lua and the Torch7 framework.66 \n", 401 | "All baseline models were implemented using Python with the scikit-learn library.\"" 402 | ] 403 | }, 404 | { 405 | "cell_type": "markdown", 406 | "metadata": { 407 | "collapsed": true 408 | }, 409 | "source": [ 410 | "### sources:\n", 411 | "http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/ \n", 412 | "http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/ \n", 413 | "https://github.com/dennybritz/cnn-text-classification-tf/blob/master/text_cnn.py \n", 414 | "https://www.tensorflow.org/get_started/mnist/pros \n", 415 | "https://www.tensorflow.org/api_docs/python/tf/nn/conv2d \n", 416 | " \n", 417 | " multi-label\n", 418 | " https://github.com/may-/cnn-re-tf/blob/master/cnn.py" 419 | ] 420 | }, 421 | { 422 | "cell_type": "markdown", 423 | "metadata": {}, 424 | "source": [ 425 | "From: \"A Comparison of Rule-Based and Deep Learning Models for Patient Phenotyping\"\n", 426 | "\n", 427 | "\"For the CNN model, we used 100 filters for each of the widths 2, 3, 4, and 5. 
\n", 428 | "To prevent overfitting, we set the dropout probability to 0.5 and used L2-normalization to normalize word\n", 429 | "embeddings to have a max norm of 3.64 \n", 430 | "The model was trained using adadelta with an initial learning rate of 1 for 20 epochs\"" 431 | ] 432 | }, 433 | { 434 | "cell_type": "code", 435 | "execution_count": 10, 436 | "metadata": { 437 | "collapsed": false 438 | }, 439 | "outputs": [], 440 | "source": [ 441 | "def run_epoch(lm, session, X, y, batch_size, dropout_keep_prob):\n", 442 | " for batch in xrange(0, X.shape[0], batch_size):\n", 443 | " # x SHAPE: [batch_size, sequence_length, embedding_size]\n", 444 | " X_batch = X[batch : batch + batch_size]\n", 445 | " y_batch = y[batch : batch + batch_size]\n", 446 | " feed_dict = {lm.input_x:X_batch,lm.input_y:y_batch,lm.dropout_keep_prob: dropout_keep_prob}\n", 447 | " #loss, train_op_value = session.run( [lm.loss,lm.train],feed_dict=feed_dict ) \n", 448 | " loss, _, step = session.run([lm.loss, lm.train_op, lm.global_step], feed_dict)\n", 449 | " if batch % 500: \n", 450 | " print 'batch: %d, loss: %5.5f' % (batch, loss) " 451 | ] 452 | }, 453 | { 454 | "cell_type": "code", 455 | "execution_count": 11, 456 | "metadata": { 457 | "collapsed": true 458 | }, 459 | "outputs": [], 460 | "source": [ 461 | "def predict_icd9_codes(lm, session, x_data, y_data, dropout_keep_prob=1.0):\n", 462 | " total_y_hat = []\n", 463 | " for batch in xrange(0, x_data.shape[0], batch_size):\n", 464 | " X_batch = x_data[batch : batch + batch_size]\n", 465 | " Y_batch = y_data[batch : batch + batch_size]\n", 466 | " y_hat_out = session.run(lm.y_hat, feed_dict={lm.input_x:X_batch,lm.input_y:Y_batch, lm.dropout_keep_prob: dropout_keep_prob})\n", 467 | " total_y_hat.extend(y_hat_out)\n", 468 | " return total_y_hat" 469 | ] 470 | }, 471 | { 472 | "cell_type": "code", 473 | "execution_count": 13, 474 | "metadata": { 475 | "collapsed": false 476 | }, 477 | "outputs": [], 478 | "source": [ 479 | "#build tensorflow graphs\n", 480 | "reload(cnn_model)\n", 481 | "\n", 482 | "# Model parameters\n", 483 | "\n", 484 | "model_params = dict(vocab_size= vocabulary_size, sequence_length=max_document_length, learning_rate=0.0001,\\\n", 485 | " embedding_size=128, num_classes=20, filter_sizes=[2,3,4,5], num_filters=100)\n", 486 | "\n", 487 | "# Build and Train Model\n", 488 | "cnn = cnn_model.NNLM(**model_params)\n", 489 | "cnn.BuildCoreGraph()\n", 490 | "cnn.BuildTrainGraph()" 491 | ] 492 | }, 493 | { 494 | "cell_type": "code", 495 | "execution_count": 14, 496 | "metadata": { 497 | "collapsed": false 498 | }, 499 | "outputs": [], 500 | "source": [ 501 | "TF_SAVEDIR = \"tf_saved\"\n", 502 | "trained_filename = os.path.join(TF_SAVEDIR, \"cnn_trained\")" 503 | ] 504 | }, 505 | { 506 | "cell_type": "code", 507 | "execution_count": 15, 508 | "metadata": { 509 | "collapsed": false 510 | }, 511 | "outputs": [ 512 | { 513 | "name": "stdout", 514 | "output_type": "stream", 515 | "text": [ 516 | "epoch_num: 0\n", 517 | "batch: 50, loss: 35.21418\n", 518 | "batch: 100, loss: 32.00004\n", 519 | "batch: 150, loss: 32.23874\n", 520 | "batch: 200, loss: 32.62824\n", 521 | "batch: 250, loss: 28.70796\n", 522 | "batch: 300, loss: 28.36549\n", 523 | "batch: 350, loss: 31.19997\n", 524 | "batch: 400, loss: 31.18843\n", 525 | "batch: 450, loss: 24.09070\n", 526 | "batch: 550, loss: 27.82526\n", 527 | "batch: 600, loss: 26.58137\n", 528 | "batch: 650, loss: 31.49951\n", 529 | "epoch_num: 1\n", 530 | "batch: 50, loss: 27.33090\n", 531 | "batch: 100, loss: 23.43864\n", 532 
| "batch: 150, loss: 24.28109\n", 533 | "batch: 200, loss: 28.88978\n", 534 | "batch: 250, loss: 23.29307\n", 535 | "batch: 300, loss: 23.56560\n", 536 | "batch: 350, loss: 24.92994\n", 537 | "batch: 400, loss: 26.69365\n", 538 | "batch: 450, loss: 20.86471\n", 539 | "batch: 550, loss: 21.02352\n", 540 | "batch: 600, loss: 22.59895\n", 541 | "batch: 650, loss: 26.52458\n", 542 | "epoch_num: 2\n", 543 | "batch: 50, loss: 21.96159\n", 544 | "batch: 100, loss: 20.27966\n", 545 | "batch: 150, loss: 22.11069\n", 546 | "batch: 200, loss: 23.96683\n", 547 | "batch: 250, loss: 19.88365\n", 548 | "batch: 300, loss: 19.78596\n", 549 | "batch: 350, loss: 21.74492\n", 550 | "batch: 400, loss: 23.85380\n", 551 | "batch: 450, loss: 19.42990\n", 552 | "batch: 550, loss: 19.73495\n", 553 | "batch: 600, loss: 20.80687\n", 554 | "batch: 650, loss: 22.45188\n", 555 | "epoch_num: 3\n", 556 | "batch: 50, loss: 20.61851\n", 557 | "batch: 100, loss: 18.33901\n", 558 | "batch: 150, loss: 18.87777\n", 559 | "batch: 200, loss: 22.49130\n", 560 | "batch: 250, loss: 17.76616\n", 561 | "batch: 300, loss: 19.26856\n", 562 | "batch: 350, loss: 20.12720\n", 563 | "batch: 400, loss: 20.30942\n", 564 | "batch: 450, loss: 17.82849\n", 565 | "batch: 550, loss: 19.31835\n", 566 | "batch: 600, loss: 18.83955\n", 567 | "batch: 650, loss: 22.28319\n", 568 | "epoch_num: 4\n", 569 | "batch: 50, loss: 18.48457\n", 570 | "batch: 100, loss: 19.22259\n", 571 | "batch: 150, loss: 20.59698\n", 572 | "batch: 200, loss: 20.46447\n", 573 | "batch: 250, loss: 17.04944\n", 574 | "batch: 300, loss: 17.38269\n", 575 | "batch: 350, loss: 18.84311\n", 576 | "batch: 400, loss: 20.78538\n", 577 | "batch: 450, loss: 16.71252\n", 578 | "batch: 550, loss: 17.19374\n", 579 | "batch: 600, loss: 18.95580\n", 580 | "batch: 650, loss: 22.09250\n", 581 | "predicting training now \n", 582 | "predicting dev set now\n", 583 | "done!\n" 584 | ] 585 | } 586 | ], 587 | "source": [ 588 | "batch_size = 50\n", 589 | "num_epochs = 5\n", 590 | "training_dropout_keep_prob = 0.8\n", 591 | "\n", 592 | "with cnn.graph.as_default():\n", 593 | " initializer = tf.global_variables_initializer()\n", 594 | " saver = tf.train.Saver()\n", 595 | " \n", 596 | "# Clear old log directory\n", 597 | "shutil.rmtree(TF_SAVEDIR, ignore_errors=True)\n", 598 | "if not os.path.isdir(TF_SAVEDIR):\n", 599 | " os.makedirs(TF_SAVEDIR)\n", 600 | "\n", 601 | "with tf.Session(graph=cnn.graph) as session:\n", 602 | " session.run(initializer)\n", 603 | " #training\n", 604 | " for epoch_num in xrange(num_epochs):\n", 605 | " print 'epoch_num:' , epoch_num\n", 606 | " run_epoch(cnn, session, train_notes, train_labels, batch_size,dropout_keep_prob=training_dropout_keep_prob )\n", 607 | " saver.save(session, trained_filename)\n", 608 | " print 'predicting training now '\n", 609 | " train_y_hat = predict_icd9_codes(cnn, session, train_notes, train_labels) \n", 610 | " print 'predicting dev set now'\n", 611 | " dev_y_hat = predict_icd9_codes(cnn, session, dev_notes, dev_labels)\n", 612 | " print 'done!'\n", 613 | "\n" 614 | ] 615 | }, 616 | { 617 | "cell_type": "markdown", 618 | "metadata": {}, 619 | "source": [ 620 | "## Performance Evaluation\n" 621 | ] 622 | }, 623 | { 624 | "cell_type": "code", 625 | "execution_count": 16, 626 | "metadata": { 627 | "collapsed": false 628 | }, 629 | "outputs": [ 630 | { 631 | "name": "stdout", 632 | "output_type": "stream", 633 | "text": [ 634 | "Training ranking loss: 0.366064771696\n", 635 | "Development ranking loss: 0.394886934101\n" 636 | ] 637 | } 638 | ], 639 | 
"source": [ 640 | "# ranking loss\n", 641 | "training_ranking_loss = label_ranking_loss(train_labels, train_y_hat)\n", 642 | "print \"Training ranking loss: \", training_ranking_loss\n", 643 | "dev_ranking_loss = label_ranking_loss(dev_labels, dev_y_hat)\n", 644 | "print \"Development ranking loss: \", dev_ranking_loss" 645 | ] 646 | }, 647 | { 648 | "cell_type": "markdown", 649 | "metadata": {}, 650 | "source": [ 651 | "## TODO create a model for thresholding\n", 652 | "\n", 653 | "Large-scale Multi-label Text Classification—Revisiting Neural Networks\n", 654 | "\n", 655 | "\n", 656 | "\"3.3 Thresholding\n", 657 | "Once training of the neural network is finished, its output may be interpreted as a probability\n", 658 | "distribution p (ojx) over the labels for a given document x. The probability distribution\n", 659 | "can be used to rank labels, but additional measures are needed in order to split\n", 660 | "the ranking into relevant and irrelevant labels. For transforming the ranked list of labels\n", 661 | "into a set of binary predictions, we train a multi-label threshold predictor from training\n", 662 | "data. This sort of thresholding methods are also used in [6, 31]\n", 663 | "For each document xm, labels are sorted by the probabilities in decreasing order.\n", 664 | "Ideally, if NNs successfully learn a mapping function f , all correct (positive) labels\n", 665 | "will be placed on top of the sorted list and there should be large margin between the set\n", 666 | "of positive labels and the set of negative labels. Using F1 score as a reference measure,\n", 667 | "we calculate classification performances at every pair of successive positive labels and\n", 668 | "choose a threshold value tm that produces the best performance\"" 669 | ] 670 | }, 671 | { 672 | "cell_type": "code", 673 | "execution_count": 17, 674 | "metadata": { 675 | "collapsed": true 676 | }, 677 | "outputs": [], 678 | "source": [ 679 | "def get_f1_score(y_true,y_hat,threshold, average):\n", 680 | " hot_y = np.where(np.array(y_hat) > threshold, 1, 0)\n", 681 | " return f1_score(np.array(y_true), hot_y, average=average)" 682 | ] 683 | }, 684 | { 685 | "cell_type": "code", 686 | "execution_count": 18, 687 | "metadata": { 688 | "collapsed": false 689 | }, 690 | "outputs": [ 691 | { 692 | "name": "stdout", 693 | "output_type": "stream", 694 | "text": [ 695 | "F1 scores\n", 696 | "threshold | training | dev \n", 697 | "0.005: 0.310 0.308\n", 698 | "0.010: 0.311 0.299\n", 699 | "0.020: 0.320 0.300\n", 700 | "0.030: 0.328 0.308\n", 701 | "0.040: 0.328 0.311\n", 702 | "0.050: 0.329 0.305\n", 703 | "0.055: 0.327 0.307\n", 704 | "0.058: 0.326 0.307\n", 705 | "0.060: 0.324 0.307\n", 706 | "0.070: 0.324 0.296\n", 707 | "0.080: 0.324 0.287\n", 708 | "0.100: 0.311 0.280\n", 709 | "0.500: 0.018 0.012\n" 710 | ] 711 | } 712 | ], 713 | "source": [ 714 | "print 'F1 scores'\n", 715 | "print 'threshold | training | dev '\n", 716 | "f1_score_average = 'micro'\n", 717 | "for threshold in [ 0.005, 0.01,0.02,0.03,0.04,0.05,0.055,0.058,0.06, 0.07, 0.08, 0.1, 0.5]:\n", 718 | " train_f1 = get_f1_score(train_labels, train_y_hat,threshold,f1_score_average)\n", 719 | " dev_f1 = get_f1_score(dev_labels, dev_y_hat,threshold,f1_score_average)\n", 720 | " print '%1.3f: %1.3f %1.3f' % (threshold,train_f1, dev_f1)" 721 | ] 722 | }, 723 | { 724 | "cell_type": "markdown", 725 | "metadata": {}, 726 | "source": [ 727 | "```\n", 728 | "10,000 records, 1 epoch\n", 729 | "adam optimizer\n", 730 | "F1 scores\n", 731 | "threshold | training | dev \n", 732 | 
"0.005: 0.321 0.317\n", 733 | "0.010: 0.334 0.331\n", 734 | "0.020: 0.347 0.345\n", 735 | "0.030: 0.351 0.350\n", 736 | "0.040: 0.349 0.344\n", 737 | "0.050: 0.342 0.337\n", 738 | "0.055: 0.340 0.334\n", 739 | "0.058: 0.337 0.332\n", 740 | "0.060: 0.335 0.330\n", 741 | "0.070: 0.324 0.320\n", 742 | "0.080: 0.313 0.308\n", 743 | "0.100: 0.292 0.283\n", 744 | "0.500: 0.046 0.043\n", 745 | "\n", 746 | "```" 747 | ] 748 | }, 749 | { 750 | "cell_type": "markdown", 751 | "metadata": {}, 752 | "source": [ 753 | "adam optimizer with learning rate 0.0001,dropout= 0.9\n", 754 | "\n", 755 | "```\n", 756 | "F1 scores\n", 757 | "threshold | training | dev \n", 758 | "0.005: 0.298 0.292\n", 759 | "0.010: 0.291 0.291\n", 760 | "0.020: 0.304 0.301\n", 761 | "0.030: 0.313 0.309\n", 762 | "0.040: 0.323 0.307\n", 763 | "0.050: 0.328 0.305\n", 764 | "0.055: 0.325 0.301\n", 765 | "0.058: 0.325 0.297\n", 766 | "0.060: 0.327 0.294\n", 767 | "0.070: 0.324 0.288\n", 768 | "0.080: 0.316 0.275\n", 769 | "0.100: 0.306 0.264\n", 770 | "0.500: 0.007 0.004\n", 771 | "\n", 772 | "\n", 773 | "```" 774 | ] 775 | }, 776 | { 777 | "cell_type": "markdown", 778 | "metadata": {}, 779 | "source": [ 780 | "```\n", 781 | "1000 notes, top 20 labels, adadelta optimizer (but it goes wild on epoch #13) \n", 782 | "learning rate = 0.5, training-dropout = 1.0, batch_size = 50, num_epochs = 5\n", 783 | "\n", 784 | "F1 scores\n", 785 | "threshold | training | dev \n", 786 | "0.005: 0.315 0.311\n", 787 | "0.010: 0.337 0.323\n", 788 | "0.020: 0.367 0.342\n", 789 | "0.030: 0.391 0.337\n", 790 | "0.040: 0.406 0.346\n", 791 | "0.050: 0.417 0.353\n", 792 | "0.055: 0.420 0.343\n", 793 | "0.058: 0.420 0.343\n", 794 | "0.060: 0.421 0.343\n", 795 | "0.070: 0.414 0.340\n", 796 | "0.080: 0.411 0.332\n", 797 | "0.100: 0.393 0.312\n", 798 | "0.500: 0.040 0.034\n", 799 | "```" 800 | ] 801 | }, 802 | { 803 | "cell_type": "markdown", 804 | "metadata": {}, 805 | "source": [ 806 | "```\n", 807 | "1000 notes, top 20 labels, adadelta optimizer (goes wild on #epoch 13)\n", 808 | "learning rate = 0.5, training-dropout = 0.5, batch_size = 50, num_epochs = 5\n", 809 | "F1 scores\n", 810 | "threshold | training | dev \n", 811 | "0.005: 0.375 0.362\n", 812 | "0.010: 0.382 0.364\n", 813 | "0.020: 0.378 0.356\n", 814 | "0.030: 0.352 0.342\n", 815 | "0.040: 0.331 0.324\n", 816 | "0.050: 0.319 0.324\n", 817 | "0.060: 0.306 0.312\n", 818 | "0.100: 0.278 0.294\n", 819 | "0.500: 0.200 0.20\n", 820 | "```" 821 | ] 822 | }, 823 | { 824 | "cell_type": "markdown", 825 | "metadata": {}, 826 | "source": [] 827 | }, 828 | { 829 | "cell_type": "markdown", 830 | "metadata": { 831 | "collapsed": true 832 | }, 833 | "source": [ 834 | "## Thoughts so far\n", 835 | "\n", 836 | "The CNN loss gets stuck with dropout_keep = 0.5.. 
\n", 837 | "I change it to 0.9, no overfitting, but the dev F1 score of 36%,which is just 1% hihter than the baseline model that always predict the top 4 most common icd-9 code and to the NN Baseline.\n", 838 | "\n", 839 | "\n", 840 | "\n", 841 | "### Lessons learned: \n", 842 | "* Adadelta optimizer has problems when running more than 10 epochs, the training loss stops going down and instead goes upd wildly " 843 | ] 844 | }, 845 | { 846 | "cell_type": "markdown", 847 | "metadata": {}, 848 | "source": [ 849 | "## using Keras\n", 850 | "base on example: https://github.com/alexander-rakhlin/CNN-for-Sentence-Classification-in-Keras/blob/master/sentiment_cnn.py" 851 | ] 852 | }, 853 | { 854 | "cell_type": "code", 855 | "execution_count": 19, 856 | "metadata": { 857 | "collapsed": false 858 | }, 859 | "outputs": [ 860 | { 861 | "name": "stderr", 862 | "output_type": "stream", 863 | "text": [ 864 | "Using TensorFlow backend.\n" 865 | ] 866 | } 867 | ], 868 | "source": [ 869 | "from keras.models import Sequential, Model\n", 870 | "from keras.layers import Dense, Dropout, Flatten, Input, MaxPooling1D, Convolution1D, Embedding\n", 871 | "from keras.layers.merge import Concatenate" 872 | ] 873 | }, 874 | { 875 | "cell_type": "code", 876 | "execution_count": 28, 877 | "metadata": { 878 | "collapsed": false 879 | }, 880 | "outputs": [], 881 | "source": [ 882 | "#### set parameters:\n", 883 | "input_shape= (max_document_length,)\n", 884 | "embedding_dims = 128\n", 885 | "num_filters = 100\n", 886 | "filter_sizes = [2,3,4,5]\n", 887 | "training_dropout_keep_prob = 0.9\n", 888 | "num_classes=20\n", 889 | "batch_size = 50\n", 890 | "epochs = 5" 891 | ] 892 | }, 893 | { 894 | "cell_type": "code", 895 | "execution_count": 29, 896 | "metadata": { 897 | "collapsed": false 898 | }, 899 | "outputs": [ 900 | { 901 | "name": "stdout", 902 | "output_type": "stream", 903 | "text": [ 904 | "Train on 700 samples, validate on 150 samples\n", 905 | "Epoch 1/5\n", 906 | "56s - loss: 1.5223 - acc: 0.8247 - val_loss: 0.8964 - val_acc: 0.8297\n", 907 | "Epoch 2/5\n", 908 | "56s - loss: 0.6087 - acc: 0.8310 - val_loss: 0.5500 - val_acc: 0.8297\n", 909 | "Epoch 3/5\n", 910 | "56s - loss: 0.5453 - acc: 0.8310 - val_loss: 0.5508 - val_acc: 0.8297\n", 911 | "Epoch 4/5\n", 912 | "57s - loss: 0.5438 - acc: 0.8310 - val_loss: 0.5494 - val_acc: 0.8297\n", 913 | "Epoch 5/5\n", 914 | "57s - loss: 0.5394 - acc: 0.8310 - val_loss: 0.5466 - val_acc: 0.8297\n" 915 | ] 916 | }, 917 | { 918 | "data": { 919 | "text/plain": [ 920 | "" 921 | ] 922 | }, 923 | "execution_count": 29, 924 | "metadata": {}, 925 | "output_type": "execute_result" 926 | } 927 | ], 928 | "source": [ 929 | "model_input = Input(shape=input_shape)\n", 930 | "z = Embedding(vocabulary_size, embedding_dims, input_length=max_document_length , name=\"embedding\")(model_input)\n", 931 | "\n", 932 | "# Convolutional block\n", 933 | "conv_blocks = []\n", 934 | "for sz in filter_sizes:\n", 935 | " conv = Convolution1D(filters=num_filters,\n", 936 | " kernel_size=sz,\n", 937 | " padding=\"valid\",\n", 938 | " activation=\"relu\",\n", 939 | " strides=1)(z)\n", 940 | " window_pool_size = max_document_length - sz + 1 \n", 941 | " conv = MaxPooling1D(pool_size=2)(conv) #pool_size?\n", 942 | " conv = Flatten()(conv)\n", 943 | " conv_blocks.append(conv)\n", 944 | "\n", 945 | "#concatenate\n", 946 | "z = Concatenate()(conv_blocks) if len(conv_blocks) > 1 else conv_blocks[0]\n", 947 | "z = Dropout(training_dropout_keep_prob)(z)\n", 948 | "\n", 949 | "#score prediction\n", 950 | "#z = 
Dense(num_classes, activation=\"relu\")(z) I don't think this is necessary\n", 951 | "model_output = Dense(num_classes, activation=\"softmax\")(z)\n", 952 | "\n", 953 | "#creating model\n", 954 | "model = Model(model_input, model_output)\n", 955 | "# what to use for tf.nn.softmax_cross_entropy_with_logits?\n", 956 | "model.compile(loss=\"binary_crossentropy\", optimizer=\"adam\", metrics=[\"accuracy\"])\n", 957 | "\n", 958 | "# Train the model\n", 959 | "model.fit(train_notes, train_labels, batch_size=batch_size, epochs=epochs,\n", 960 | "validation_data=(dev_notes, dev_labels), verbose=2)" 961 | ] 962 | }, 963 | { 964 | "cell_type": "code", 965 | "execution_count": 32, 966 | "metadata": { 967 | "collapsed": false 968 | }, 969 | "outputs": [], 970 | "source": [ 971 | "pred_train = model.predict(train_notes, batch_size=50)\n", 972 | "pred_dev = model.predict(dev_notes, batch_size=50)\n" 973 | ] 974 | }, 975 | { 976 | "cell_type": "code", 977 | "execution_count": 34, 978 | "metadata": { 979 | "collapsed": false 980 | }, 981 | "outputs": [ 982 | { 983 | "name": "stdout", 984 | "output_type": "stream", 985 | "text": [ 986 | "F1 scores\n", 987 | "threshold | training | dev \n", 988 | "0.010: 0.289 0.291\n", 989 | "0.020: 0.289 0.291\n", 990 | "0.030: 0.290 0.291\n", 991 | "0.040: 0.294 0.291\n", 992 | "0.050: 0.418 0.356\n", 993 | "0.055: 0.402 0.303\n", 994 | "0.058: 0.290 0.134\n", 995 | "0.060: 0.192 0.074\n", 996 | "0.080: 0.016 0.000\n", 997 | "0.100: 0.006 0.000\n", 998 | "0.500: 0.000 0.000\n" 999 | ] 1000 | } 1001 | ], 1002 | "source": [ 1003 | "\n", 1004 | "print 'F1 scores'\n", 1005 | "print 'threshold | training | dev '\n", 1006 | "f1_score_average = 'micro'\n", 1007 | "for threshold in [ 0.01,0.02,0.03,0.04,0.05,0.055,0.058,0.06, 0.08, 0.1, 0.5]:\n", 1008 | " train_f1 = get_f1_score(train_labels, pred_train,threshold,f1_score_average)\n", 1009 | " dev_f1 = get_f1_score(dev_labels, pred_dev,threshold,f1_score_average)\n", 1010 | " print '%1.3f: %1.3f %1.3f' % (threshold,train_f1, dev_f1)" 1011 | ] 1012 | }, 1013 | { 1014 | "cell_type": "markdown", 1015 | "metadata": {}, 1016 | "source": [] 1017 | } 1018 | ], 1019 | "metadata": { 1020 | "kernelspec": { 1021 | "display_name": "Python 2", 1022 | "language": "python", 1023 | "name": "python2" 1024 | }, 1025 | "language_info": { 1026 | "codemirror_mode": { 1027 | "name": "ipython", 1028 | "version": 2 1029 | }, 1030 | "file_extension": ".py", 1031 | "mimetype": "text/x-python", 1032 | "name": "python", 1033 | "nbconvert_exporter": "python", 1034 | "pygments_lexer": "ipython2", 1035 | "version": "2.7.13" 1036 | } 1037 | }, 1038 | "nbformat": 4, 1039 | "nbformat_minor": 2 1040 | } 1041 | -------------------------------------------------------------------------------- /icd9_cnn/CNN_for_text2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zliendo/AI_MedicalNotes/87e1e8eb574b30c801b951e87b5edccc10d2a231/icd9_cnn/CNN_for_text2.png -------------------------------------------------------------------------------- /icd9_cnn/cnn_model.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | 3 | # core logic based on http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/ 4 | 5 | def with_self_graph(function): 6 | def wrapper(self, *args, **kwargs): 7 | with self.graph.as_default(): 8 | return function(self, *args, **kwargs) 9 | return wrapper 10 | 11 | class NNLM(object): 12 | def 
__init__(self, graph=None, *args, **kwargs): 13 | # Set TensorFlow graph. All TF code will work on this graph. 14 | self.graph = graph or tf.Graph() 15 | self.SetParams(*args, **kwargs) 16 | 17 | 18 | @with_self_graph 19 | def SetParams(self, vocab_size, sequence_length, embedding_size, num_classes, learning_rate, filter_sizes,num_filters,l2_reg_lambda=0.0): 20 | self.vocab_size = vocab_size 21 | self.embedding_size =embedding_size 22 | self.num_classes =num_classes 23 | self.filter_sizes = filter_sizes 24 | self.num_filters = num_filters 25 | self.l2_reg_lambda = l2_reg_lambda 26 | # sequence_length: The length of our sentences. In this example all our sentences 27 | #have the same length (59) 28 | self.sequence_length = sequence_length 29 | 30 | self.learning_rate = learning_rate 31 | 32 | 33 | # Training hyperparameters; these can be changed with feed_dict, 34 | with tf.name_scope("Training_Parameters"): 35 | self.dropout_keep_prob = tf.placeholder(tf.float32, name="dropout_keep_prob") 36 | 37 | # Keeping track of l2 regularization loss (optional) 38 | self.l2_loss = tf.constant(0.0) 39 | 40 | 41 | 42 | @with_self_graph 43 | def BuildCoreGraph(self): 44 | 45 | self.input_x = tf.placeholder(tf.int32, [None, self.sequence_length], name="input_x") 46 | #self.x = tf.placeholder(tf.float32,shape=[None,self.sequence_length,self.embedding_size],name="input_x") embedded already 47 | self.input_y = tf.placeholder(tf.float32, shape=[None,self.num_classes], name="input_y") 48 | 49 | # Embedding 50 | # ----------------------------------------------------------------------------- 51 | # Embedding layer 52 | with tf.device('/cpu:0'), tf.name_scope("embedding"): 53 | self.W = tf.Variable(tf.random_uniform([self.vocab_size, self.embedding_size], -1.0, 1.0), name="W") 54 | self.embedded_chars = tf.nn.embedding_lookup(self.W, self.input_x) 55 | 56 | # x embedded SHAPE: [batch_size, sequence_length, embedding_size] 57 | 58 | # TensorFlow convolutional conv2d operation expects a 4-dimensional tensor 59 | # with dimensions corresponding to batch, width, height and channel. 60 | # The result of our embedding does not contain the channel dimension, so we add it manually, 61 | self.x_expanded = tf.expand_dims(self.embedded_chars, -1) 62 | #self.x_expanded .SHAPE: [batch_size, sequence_length, embedding_size, 1] 63 | 64 | # Create a convolution + maxpool layer for each filter size 65 | pooled_outputs = [] 66 | for i, filter_size in enumerate(self.filter_sizes): 67 | with tf.name_scope("conv-maxpool-%s" % filter_size): 68 | # Convolution Layer 69 | # ---------------------------------------------------------------------- 70 | # filter shape: [window_region_height, window_region_width, 71 | # number of input channels, number of filters for each region) 72 | filter_shape = [filter_size, self.embedding_size, 1, self.num_filters] 73 | 74 | # Here, W is our filter matrix. Each filter slides over the whole embedding matrix, 75 | # but varies in how many words it covers. 
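    # Shape reference for the convolution defined below (derived from the code above):
    #   self.x_expanded : [batch_size, sequence_length, embedding_size, 1]
    #   filter W        : [filter_size, embedding_size, 1, num_filters]
    #   conv output     : [batch_size, sequence_length - filter_size + 1, 1, num_filters]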
76 | # "VALID" padding means that we slide the filter over our sentence without padding the edges, 77 | # performing a narrow convolution that gives us an output of shape 78 | W = tf.Variable(tf.truncated_normal(filter_shape, stddev=0.1), name="W") 79 | b = tf.Variable(tf.constant(0.1, shape=[self.num_filters]), name="b") 80 | conv = tf.nn.bias_add(tf.nn.conv2d( self.x_expanded, W, 81 | strides=[1, 1, 1, 1], padding="VALID", name="conv"), b) 82 | 83 | # Apply nonlinearity 84 | h = tf.nn.relu(conv, name="relu") 85 | # h.SHAPE: [1, sequence_length - filter_size + 1, 1, 1] 86 | 87 | # Maxpooling over the outputs 88 | # ------------------------------------------------------------------ 89 | conv_vector_length = self.sequence_length - filter_size + 1 90 | # The pooling ops sweep a rectangular window over the input tensor, computing a reduction operation for each window 91 | # in this case max. Each pooling op uses rectangular windows of size ksize separated by offset strides 92 | k_size = [1, conv_vector_length, 1, 1] # shape of output vector from conv 93 | pooled = tf.nn.max_pool( h, ksize=k_size, 94 | strides=[1, 1, 1, 1], padding='VALID', name="pool") 95 | # pooled. SHAPE: [batch_size, 1, 1, num_filters] 96 | # This is essentially a feature vector, where the last dimension corresponds to our features. 97 | 98 | pooled_outputs.append(pooled) 99 | 100 | # Combine all the pooled features 101 | # ----------------------------------------------------------------- 102 | # Once we have all the pooled output tensors from each filter size we combine them into one long feature vector 103 | num_filters_total = self.num_filters * len(self.filter_sizes) 104 | self.h_pool = tf.concat(pooled_outputs, 3) 105 | self.h_pool_flat = tf.reshape(self.h_pool, [-1, num_filters_total]) 106 | # self.h_pool_flat SHAPE: batch_size, num_filters_total] 107 | 108 | # Add dropout 109 | with tf.name_scope("dropout"): 110 | self.h_drop = tf.nn.dropout(self.h_pool_flat, self.dropout_keep_prob) 111 | 112 | # Final (unnormalized) scores and predictions 113 | with tf.name_scope("output"): 114 | W = tf.get_variable( 115 | "W", 116 | shape=[num_filters_total, self.num_classes], 117 | initializer=tf.contrib.layers.xavier_initializer()) 118 | b = tf.Variable(tf.constant(0.1, shape=[self.num_classes]), name="b") 119 | self.l2_loss += tf.nn.l2_loss(W) 120 | self.l2_loss += tf.nn.l2_loss(b) 121 | self.scores = tf.nn.xw_plus_b(self.h_drop, W, b, name="scores") 122 | #self.predictions = tf.argmax(self.scores, 1, name="predictions") 123 | 124 | #self.y_hat = tf.sigmoid(self.scores) 125 | self.y_hat = tf.nn.softmax(self.scores) 126 | 127 | # CalculateMean cross-entropy loss 128 | with tf.name_scope("loss"): 129 | losses = tf.nn.softmax_cross_entropy_with_logits(logits=self.scores, labels=self.input_y) 130 | self.loss = tf.reduce_mean(losses) + self.l2_reg_lambda * self.l2_loss 131 | 132 | @with_self_graph 133 | def BuildTrainGraph(self): 134 | self.global_step = tf.Variable(0, name="global_step", trainable=False) 135 | optimizer = tf.train.AdamOptimizer(self.learning_rate) 136 | #optimizer = tf.train.AdadeltaOptimizer (self.learning_rate) 137 | self.train_op = optimizer.minimize(self.loss, global_step=self.global_step) -------------------------------------------------------------------------------- /icd9_cnn/cnn_model.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zliendo/AI_MedicalNotes/87e1e8eb574b30c801b951e87b5edccc10d2a231/icd9_cnn/cnn_model.pyc 
-------------------------------------------------------------------------------- /icd9_cnn/cnn_top20_leave.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Clasifying Top 20 leaf icd-9 codes\n", 8 | "\n", 9 | "Running with the full file" 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": 1, 15 | "metadata": { 16 | "collapsed": false 17 | }, 18 | "outputs": [ 19 | { 20 | "name": "stderr", 21 | "output_type": "stream", 22 | "text": [ 23 | "Using TensorFlow backend.\n" 24 | ] 25 | } 26 | ], 27 | "source": [ 28 | "%load_ext autoreload\n", 29 | "%autoreload 2\n", 30 | "# General imports\n", 31 | "import numpy as np\n", 32 | "import pandas as pd\n", 33 | "from sklearn.metrics import f1_score\n", 34 | "import sys \n", 35 | "\n", 36 | "#keras\n", 37 | "from keras.models import Sequential, Model\n", 38 | "from keras.layers import Dense, Dropout, Flatten, Input, MaxPooling1D, Convolution1D, Embedding\n", 39 | "from keras.layers.merge import Concatenate\n", 40 | "\n", 41 | "# Custom functions\n", 42 | "sys.path.append(\"../pipeline\")\n", 43 | "import icd9_cnn_model\n", 44 | "import database_selection\n", 45 | "import vectorization\n", 46 | "import helpers" 47 | ] 48 | }, 49 | { 50 | "cell_type": "markdown", 51 | "metadata": {}, 52 | "source": [ 53 | "## Read Input File" 54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "execution_count": 2, 59 | "metadata": { 60 | "collapsed": true 61 | }, 62 | "outputs": [], 63 | "source": [ 64 | "df = pd.read_csv('../data/disch_notes_all_icd9.csv',\n", 65 | " names = ['HADM_ID', 'SUBJECT_ID', 'DATE', 'ICD9','TEXT'])" 66 | ] 67 | }, 68 | { 69 | "cell_type": "code", 70 | "execution_count": 3, 71 | "metadata": { 72 | "collapsed": false 73 | }, 74 | "outputs": [], 75 | "source": [ 76 | "N_TOP = 20 \n", 77 | "full_df, top_codes = database_selection.filter_top_codes(df, 'ICD9', N_TOP, filter_empty = True)\n", 78 | "#df = full_df.head(1000)\n", 79 | "df = full_df" 80 | ] 81 | }, 82 | { 83 | "cell_type": "markdown", 84 | "metadata": {}, 85 | "source": [ 86 | "## Vectorize Labels" 87 | ] 88 | }, 89 | { 90 | "cell_type": "code", 91 | "execution_count": 4, 92 | "metadata": { 93 | "collapsed": false 94 | }, 95 | "outputs": [], 96 | "source": [ 97 | "#preprocess icd9 codes\n", 98 | "labels = vectorization.vectorize_icd_column(df, 'ICD9', top_codes)\n" 99 | ] 100 | }, 101 | { 102 | "cell_type": "markdown", 103 | "metadata": {}, 104 | "source": [ 105 | "## Vectorize Notes" 106 | ] 107 | }, 108 | { 109 | "cell_type": "code", 110 | "execution_count": 5, 111 | "metadata": { 112 | "collapsed": false 113 | }, 114 | "outputs": [ 115 | { 116 | "name": "stdout", 117 | "output_type": "stream", 118 | "text": [ 119 | "Vocabulary size: 130488\n", 120 | "Average note length: 1728.09244863\n", 121 | "Max note length: 10924\n", 122 | "Final Vocabulary: 130488\n", 123 | "Final Max Sequence Length: 5000\n" 124 | ] 125 | } 126 | ], 127 | "source": [ 128 | "#preprocess notes\n", 129 | "MAX_VOCAB = None # to limit original number of words (None if no limit)\n", 130 | "MAX_SEQ_LENGTH = 5000 # to limit length of word sequence (None if no limit)\n", 131 | "df.TEXT = vectorization.clean_notes(df, 'TEXT')\n", 132 | "data, dictionary, MAX_VOCAB = vectorization.vectorize_notes(df.TEXT, MAX_VOCAB, verbose = True)\n", 133 | "data, MAX_SEQ_LENGTH = vectorization.pad_notes(data, MAX_SEQ_LENGTH)\n", 134 | "print(\"Final Vocabulary: %s\" % 
MAX_VOCAB)\n", 135 | "print(\"Final Max Sequence Length: %s\" % MAX_SEQ_LENGTH)" 136 | ] 137 | }, 138 | { 139 | "cell_type": "code", 140 | "execution_count": 8, 141 | "metadata": { 142 | "collapsed": false 143 | }, 144 | "outputs": [ 145 | { 146 | "name": "stdout", 147 | "output_type": "stream", 148 | "text": [ 149 | "('Vocabulary in notes:', 130488)\n", 150 | "('Vocabulary in original embedding:', 21056)\n", 151 | "('Vocabulary intersection:', 20620)\n" 152 | ] 153 | } 154 | ], 155 | "source": [ 156 | "#creating embeddings\n", 157 | "#EMBEDDING_LOC = '../data/glove.6B.100d.txt' # location of embedding\n", 158 | "# embedding pre-trained will all MIMIC notes\n", 159 | "EMBEDDING_LOC = '../data/notes.100.txt' # location of embedding\n", 160 | "EMBEDDING_DIM = 100 # given the glove that we chose\n", 161 | "embedding_matrix, embedding_dict = vectorization.embedding_matrix(EMBEDDING_LOC,\n", 162 | " dictionary, EMBEDDING_DIM, verbose = True)\n" 163 | ] 164 | }, 165 | { 166 | "cell_type": "markdown", 167 | "metadata": {}, 168 | "source": [ 169 | "## Split Files" 170 | ] 171 | }, 172 | { 173 | "cell_type": "code", 174 | "execution_count": 6, 175 | "metadata": { 176 | "collapsed": false 177 | }, 178 | "outputs": [ 179 | { 180 | "name": "stdout", 181 | "output_type": "stream", 182 | "text": [ 183 | "('Train: ', (30794, 5000), (30794, 20))\n", 184 | "('Validation: ', (8798, 5000), (8798, 20))\n", 185 | "('Test: ', (4400, 5000), (4400, 20))\n" 186 | ] 187 | } 188 | ], 189 | "source": [ 190 | "#split sets\n", 191 | "X_train, X_val, X_test, y_train, y_val, y_test = helpers.train_val_test_split(\n", 192 | " data, labels, val_size=0.2, test_size=0.1, random_state=101)\n", 193 | "print(\"Train: \", X_train.shape, y_train.shape)\n", 194 | "print(\"Validation: \", X_val.shape, y_val.shape)\n", 195 | "print(\"Test: \", X_test.shape, y_test.shape)" 196 | ] 197 | }, 198 | { 199 | "cell_type": "code", 200 | "execution_count": 7, 201 | "metadata": { 202 | "collapsed": false 203 | }, 204 | "outputs": [], 205 | "source": [ 206 | "# Delete temporary variables to free some memory\n", 207 | "del df, data, labels" 208 | ] 209 | }, 210 | { 211 | "cell_type": "markdown", 212 | "metadata": {}, 213 | "source": [ 214 | "## CNN for text classification\n", 215 | "\n", 216 | "Based on the following papers and links:\n", 217 | "* \"Convolutional Neural Networks for Sentence Classification\" \n", 218 | "* \"A Sensitivity Analysis of (and Practitioners� Guide to) Convolutional Neural Networks for Sentence Classification\"\n", 219 | "* http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/\n", 220 | "* https://github.com/alexander-rakhlin/CNN-for-Sentence-Classification-in-Keras/blob/master/sentiment_cnn.py" 221 | ] 222 | }, 223 | { 224 | "cell_type": "code", 225 | "execution_count": 9, 226 | "metadata": { 227 | "collapsed": false 228 | }, 229 | "outputs": [ 230 | { 231 | "name": "stdout", 232 | "output_type": "stream", 233 | "text": [ 234 | "____________________________________________________________________________________________________\n", 235 | "Layer (type) Output Shape Param # Connected to \n", 236 | "====================================================================================================\n", 237 | "input_1 (InputLayer) (None, 5000) 0 \n", 238 | "____________________________________________________________________________________________________\n", 239 | "embedding_1 (Embedding) (None, 5000, 100) 13048900 input_1[0][0] \n", 240 | 
"____________________________________________________________________________________________________\n", 241 | "conv1d_1 (Conv1D) (None, 4999, 100) 20100 embedding_1[0][0] \n", 242 | "____________________________________________________________________________________________________\n", 243 | "conv1d_2 (Conv1D) (None, 4998, 100) 30100 embedding_1[0][0] \n", 244 | "____________________________________________________________________________________________________\n", 245 | "conv1d_3 (Conv1D) (None, 4997, 100) 40100 embedding_1[0][0] \n", 246 | "____________________________________________________________________________________________________\n", 247 | "conv1d_4 (Conv1D) (None, 4996, 100) 50100 embedding_1[0][0] \n", 248 | "____________________________________________________________________________________________________\n", 249 | "max_pooling1d_1 (MaxPooling1D) (None, 1, 100) 0 conv1d_1[0][0] \n", 250 | "____________________________________________________________________________________________________\n", 251 | "max_pooling1d_2 (MaxPooling1D) (None, 1, 100) 0 conv1d_2[0][0] \n", 252 | "____________________________________________________________________________________________________\n", 253 | "max_pooling1d_3 (MaxPooling1D) (None, 1, 100) 0 conv1d_3[0][0] \n", 254 | "____________________________________________________________________________________________________\n", 255 | "max_pooling1d_4 (MaxPooling1D) (None, 1, 100) 0 conv1d_4[0][0] \n", 256 | "____________________________________________________________________________________________________\n", 257 | "flatten_1 (Flatten) (None, 100) 0 max_pooling1d_1[0][0] \n", 258 | "____________________________________________________________________________________________________\n", 259 | "flatten_2 (Flatten) (None, 100) 0 max_pooling1d_2[0][0] \n", 260 | "____________________________________________________________________________________________________\n", 261 | "flatten_3 (Flatten) (None, 100) 0 max_pooling1d_3[0][0] \n", 262 | "____________________________________________________________________________________________________\n", 263 | "flatten_4 (Flatten) (None, 100) 0 max_pooling1d_4[0][0] \n", 264 | "____________________________________________________________________________________________________\n", 265 | "concatenate_1 (Concatenate) (None, 400) 0 flatten_1[0][0] \n", 266 | " flatten_2[0][0] \n", 267 | " flatten_3[0][0] \n", 268 | " flatten_4[0][0] \n", 269 | "____________________________________________________________________________________________________\n", 270 | "dropout_1 (Dropout) (None, 400) 0 concatenate_1[0][0] \n", 271 | "____________________________________________________________________________________________________\n", 272 | "dense_1 (Dense) (None, 20) 8020 dropout_1[0][0] \n", 273 | "====================================================================================================\n", 274 | "Total params: 13,197,320\n", 275 | "Trainable params: 13,197,320\n", 276 | "Non-trainable params: 0\n", 277 | "____________________________________________________________________________________________________\n", 278 | "None\n" 279 | ] 280 | } 281 | ], 282 | "source": [ 283 | "reload(icd9_cnn_model)\n", 284 | "#### build model\n", 285 | "model = icd9_cnn_model.build_icd9_cnn_model (input_seq_length=MAX_SEQ_LENGTH, max_vocab = MAX_VOCAB,\n", 286 | " external_embeddings = True,\n", 287 | " embedding_dim=EMBEDDING_DIM,embedding_matrix=embedding_matrix,\n", 288 | " num_filters = 100, filter_sizes=[2,3,4,5],\n", 
289 | " training_dropout_keep_prob=0.5,\n", 290 | " num_classes=N_TOP )" 291 | ] 292 | }, 293 | { 294 | "cell_type": "code", 295 | "execution_count": 10, 296 | "metadata": { 297 | "collapsed": false 298 | }, 299 | "outputs": [ 300 | { 301 | "name": "stdout", 302 | "output_type": "stream", 303 | "text": [ 304 | "Train on 30794 samples, validate on 8798 samples\n", 305 | "Epoch 1/5\n", 306 | "1008s - loss: 0.4447 - acc: 0.8289 - val_loss: 0.3207 - val_acc: 0.8677\n", 307 | "Epoch 2/5\n", 308 | "984s - loss: 0.3245 - acc: 0.8698 - val_loss: 0.2738 - val_acc: 0.8868\n", 309 | "Epoch 3/5\n", 310 | "981s - loss: 0.2889 - acc: 0.8835 - val_loss: 0.2522 - val_acc: 0.8978\n", 311 | "Epoch 4/5\n", 312 | "980s - loss: 0.2708 - acc: 0.8915 - val_loss: 0.2422 - val_acc: 0.9047\n", 313 | "Epoch 5/5\n", 314 | "977s - loss: 0.2605 - acc: 0.8965 - val_loss: 0.2391 - val_acc: 0.9050\n" 315 | ] 316 | }, 317 | { 318 | "data": { 319 | "text/plain": [ 320 | "" 321 | ] 322 | }, 323 | "execution_count": 10, 324 | "metadata": {}, 325 | "output_type": "execute_result" 326 | } 327 | ], 328 | "source": [ 329 | "#first 5 epochs\n", 330 | "model.fit(X_train, y_train, batch_size=50, epochs=5, validation_data=(X_val, y_val), verbose=2)" 331 | ] 332 | }, 333 | { 334 | "cell_type": "code", 335 | "execution_count": 11, 336 | "metadata": { 337 | "collapsed": true 338 | }, 339 | "outputs": [], 340 | "source": [ 341 | "model.save('models/cnn_5_epochs_allr.h5')" 342 | ] 343 | }, 344 | { 345 | "cell_type": "code", 346 | "execution_count": 12, 347 | "metadata": { 348 | "collapsed": false 349 | }, 350 | "outputs": [ 351 | { 352 | "name": "stdout", 353 | "output_type": "stream", 354 | "text": [ 355 | "F1 scores\n", 356 | "threshold | training | dev \n", 357 | "0.020: 0.358 0.353\n", 358 | "0.030: 0.407 0.400\n", 359 | "0.040: 0.456 0.446\n", 360 | "0.050: 0.502 0.488\n", 361 | "0.055: 0.523 0.508\n", 362 | "0.058: 0.534 0.518\n", 363 | "0.060: 0.541 0.525\n", 364 | "0.080: 0.602 0.582\n", 365 | "0.100: 0.645 0.621\n", 366 | "0.200: 0.732 0.704\n", 367 | "0.300: 0.747 0.717\n", 368 | "0.400: 0.738 0.707\n", 369 | "0.500: 0.712 0.679\n", 370 | "0.600: 0.668 0.631\n", 371 | "0.700: 0.594 0.558\n" 372 | ] 373 | } 374 | ], 375 | "source": [ 376 | "pred_train = model.predict(X_train, batch_size=200)\n", 377 | "pred_dev = model.predict(X_val, batch_size=200)\n", 378 | "helpers.show_f1_score(y_train, pred_train, y_val, pred_dev)" 379 | ] 380 | }, 381 | { 382 | "cell_type": "code", 383 | "execution_count": 13, 384 | "metadata": { 385 | "collapsed": false 386 | }, 387 | "outputs": [ 388 | { 389 | "name": "stdout", 390 | "output_type": "stream", 391 | "text": [ 392 | "Train on 30794 samples, validate on 8798 samples\n", 393 | "Epoch 1/2\n", 394 | "834s - loss: 0.2518 - acc: 0.9010 - val_loss: 0.2371 - val_acc: 0.9076\n", 395 | "Epoch 2/2\n", 396 | "837s - loss: 0.2437 - acc: 0.9041 - val_loss: 0.2367 - val_acc: 0.9075\n" 397 | ] 398 | }, 399 | { 400 | "data": { 401 | "text/plain": [ 402 | "" 403 | ] 404 | }, 405 | "execution_count": 13, 406 | "metadata": {}, 407 | "output_type": "execute_result" 408 | } 409 | ], 410 | "source": [ 411 | "# 2 more epochs\n", 412 | "model.fit(X_train, y_train, batch_size=50, epochs=2, validation_data=(X_val, y_val), verbose=2)" 413 | ] 414 | }, 415 | { 416 | "cell_type": "code", 417 | "execution_count": 14, 418 | "metadata": { 419 | "collapsed": true 420 | }, 421 | "outputs": [], 422 | "source": [ 423 | "model.save('models/cnn_7_epochs_allr.h5')" 424 | ] 425 | }, 426 | { 427 | "cell_type": "code", 428 | 
"execution_count": 15, 429 | "metadata": { 430 | "collapsed": false 431 | }, 432 | "outputs": [ 433 | { 434 | "name": "stdout", 435 | "output_type": "stream", 436 | "text": [ 437 | "F1 scores\n", 438 | "threshold | training | dev \n", 439 | "0.020: 0.373 0.366\n", 440 | "0.030: 0.424 0.412\n", 441 | "0.040: 0.472 0.455\n", 442 | "0.050: 0.515 0.494\n", 443 | "0.055: 0.536 0.512\n", 444 | "0.058: 0.548 0.522\n", 445 | "0.060: 0.555 0.528\n", 446 | "0.080: 0.622 0.587\n", 447 | "0.100: 0.669 0.629\n", 448 | "0.200: 0.759 0.713\n", 449 | "0.300: 0.774 0.724\n", 450 | "0.400: 0.767 0.714\n", 451 | "0.500: 0.746 0.691\n", 452 | "0.600: 0.708 0.651\n", 453 | "0.700: 0.644 0.584\n" 454 | ] 455 | } 456 | ], 457 | "source": [ 458 | "pred_train = model.predict(X_train, batch_size=200)\n", 459 | "pred_dev = model.predict(X_val, batch_size=200)\n", 460 | "helpers.show_f1_score(y_train, pred_train, y_val, pred_dev)" 461 | ] 462 | } 463 | ], 464 | "metadata": { 465 | "kernelspec": { 466 | "display_name": "Python [Root]", 467 | "language": "python", 468 | "name": "Python [Root]" 469 | }, 470 | "language_info": { 471 | "codemirror_mode": { 472 | "name": "ipython", 473 | "version": 2 474 | }, 475 | "file_extension": ".py", 476 | "mimetype": "text/x-python", 477 | "name": "python", 478 | "nbconvert_exporter": "python", 479 | "pygments_lexer": "ipython2", 480 | "version": "2.7.12" 481 | } 482 | }, 483 | "nbformat": 4, 484 | "nbformat_minor": 2 485 | } 486 | -------------------------------------------------------------------------------- /icd9_cnn/mimic_CNN_text_classification.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zliendo/AI_MedicalNotes/87e1e8eb574b30c801b951e87b5edccc10d2a231/icd9_cnn/mimic_CNN_text_classification.png -------------------------------------------------------------------------------- /icd9_cnn/tf_saved/cnn_trained.data-00000-of-00001: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zliendo/AI_MedicalNotes/87e1e8eb574b30c801b951e87b5edccc10d2a231/icd9_cnn/tf_saved/cnn_trained.data-00000-of-00001 -------------------------------------------------------------------------------- /icd9_cnn/tf_saved/cnn_trained.index: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zliendo/AI_MedicalNotes/87e1e8eb574b30c801b951e87b5edccc10d2a231/icd9_cnn/tf_saved/cnn_trained.index -------------------------------------------------------------------------------- /icd9_cnn/tf_saved/cnn_trained.meta: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zliendo/AI_MedicalNotes/87e1e8eb574b30c801b951e87b5edccc10d2a231/icd9_cnn/tf_saved/cnn_trained.meta -------------------------------------------------------------------------------- /icd9_cnn/utils.py: -------------------------------------------------------------------------------- 1 | ## Code from w266 materials 2 | 3 | from IPython.display import display 4 | import itertools 5 | import numpy as np 6 | import pandas as pd 7 | import re 8 | import time 9 | 10 | def flatten(list_of_lists): 11 | """Flatten a list-of-lists into a single list.""" 12 | return list(itertools.chain.from_iterable(list_of_lists)) 13 | 14 | def pretty_print_matrix(M, rows=None, cols=None, dtype=float): 15 | """Pretty-print a matrix using Pandas. 
16 | 17 | Args: 18 | M : 2D numpy array 19 | rows : list of row labels 20 | cols : list of column labels 21 | dtype : data type (float or int) 22 | """ 23 | display(pd.DataFrame(M, index=rows, columns=cols, dtype=dtype)) 24 | 25 | def pretty_timedelta(fmt="%d:%02d:%02d", since=None, until=None): 26 | """Pretty-print a timedelta, using the given format string.""" 27 | since = since or time.time() 28 | until = until or time.time() 29 | delta_s = until - since 30 | hours, remainder = divmod(delta_s, 3600) 31 | minutes, seconds = divmod(remainder, 60) 32 | return fmt % (hours, minutes, seconds) 33 | 34 | 35 | ## 36 | # Word processing functions 37 | def canonicalize_digits(word): 38 | if any([c.isalpha() for c in word]): return word 39 | word = re.sub("\d", "DG", word) 40 | if word.startswith("DG"): 41 | word = word.replace(",", "") # remove thousands separator 42 | return word 43 | 44 | def canonicalize_word(word, wordset=None, digits=True): 45 | word = word.lower() 46 | if digits: 47 | if (wordset != None) and (word in wordset): return word 48 | word = canonicalize_digits(word) # try to canonicalize numbers 49 | if (wordset == None) or (word in wordset): return word 50 | else: return "" # unknown token 51 | 52 | def canonicalize_words(words, **kw): 53 | return [canonicalize_word(word, **kw) for word in words] 54 | 55 | ## 56 | # Data loading functions 57 | import nltk 58 | import vocabulary 59 | 60 | def get_corpus(name="brown"): 61 | return nltk.corpus.__getattr__(name) 62 | 63 | def build_vocab(corpus, V=10000): 64 | token_feed = (canonicalize_word(w) for w in corpus.words()) 65 | vocab = vocabulary.Vocabulary(token_feed, size=V) 66 | return vocab 67 | 68 | def get_train_test_sents(corpus, split=0.8, shuffle=True): 69 | """Get train and test sentences. 70 | 71 | Args: 72 | corpus: nltk.corpus that supports sents() function 73 | split (double): fraction to use as training set 74 | shuffle (int or bool): seed for shuffle of input data, or False to just 75 | take the training data as the first xx% contiguously. 76 | 77 | Returns: 78 | train_sentences, test_sentences ( list(list(string)) ): the train and test 79 | splits 80 | """ 81 | sentences = np.array(corpus.sents(), dtype=object) 82 | fmt = (len(sentences), sum(map(len, sentences))) 83 | print "Loaded %d sentences (%g tokens)" % fmt 84 | 85 | if shuffle: 86 | rng = np.random.RandomState(shuffle) 87 | rng.shuffle(sentences) # in-place 88 | train_frac = 0.8 89 | split_idx = int(train_frac * len(sentences)) 90 | train_sentences = sentences[:split_idx] 91 | test_sentences = sentences[split_idx:] 92 | 93 | fmt = (len(train_sentences), sum(map(len, train_sentences))) 94 | print "Training set: %d sentences (%d tokens)" % fmt 95 | fmt = (len(test_sentences), sum(map(len, test_sentences))) 96 | print "Test set: %d sentences (%d tokens)" % fmt 97 | 98 | return train_sentences, test_sentences 99 | 100 | def preprocess_sentences(sentences, vocab): 101 | """Preprocess sentences by canonicalizing and mapping to ids. 102 | 103 | Args: 104 | sentences ( list(list(string)) ): input sentences 105 | vocab: Vocabulary object, already initialized 106 | 107 | Returns: 108 | ids ( array(int) ): flattened array of sentences, including boundary 109 | tokens. 
110 | """ 111 | # Add sentence boundaries, canonicalize, and handle unknowns 112 | words = [""] + flatten(s + [""] for s in sentences) 113 | words = [canonicalize_word(w, wordset=vocab.word_to_id) 114 | for w in words] 115 | return np.array(vocab.words_to_ids(words)) 116 | 117 | ## 118 | # Use this function 119 | def load_corpus(name, split=0.8, V=10000, shuffle=0): 120 | """Load a named corpus and split train/test along sentences.""" 121 | corpus = get_corpus(name) 122 | vocab = build_vocab(corpus, V) 123 | train_sentences, test_sentences = get_train_test_sents(corpus, split, shuffle) 124 | train_ids = preprocess_sentences(train_sentences, vocab) 125 | test_ids = preprocess_sentences(test_sentences, vocab) 126 | return vocab, train_ids, test_ids 127 | 128 | ## 129 | # Use this function 130 | def batch_generator(ids, batch_size, max_time): 131 | """Convert ids to data-matrix form.""" 132 | # Clip to multiple of max_time for convenience 133 | clip_len = ((len(ids)-1) / batch_size) * batch_size 134 | input_w = ids[:clip_len] # current word 135 | target_y = ids[1:clip_len+1] # next word 136 | # Reshape so we can select columns 137 | input_w = input_w.reshape([batch_size,-1]) 138 | target_y = target_y.reshape([batch_size,-1]) 139 | 140 | # Yield batches 141 | for i in xrange(0, input_w.shape[1], max_time): 142 | yield input_w[:,i:i+max_time], target_y[:,i:i+max_time] 143 | 144 | 145 | -------------------------------------------------------------------------------- /icd9_cnn/utils.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zliendo/AI_MedicalNotes/87e1e8eb574b30c801b951e87b5edccc10d2a231/icd9_cnn/utils.pyc -------------------------------------------------------------------------------- /icd9_cnn/vocabulary.py: -------------------------------------------------------------------------------- 1 | ## Code from w266 class material 2 | import collections 3 | 4 | class Vocabulary(object): 5 | 6 | START_TOKEN = "" 7 | END_TOKEN = "" 8 | UNK_TOKEN = "" 9 | 10 | def __init__(self, tokens, size=None): 11 | self.unigram_counts = collections.Counter(tokens) 12 | self.num_unigrams = sum(self.unigram_counts.itervalues()) 13 | # leave space for "", "", and "" 14 | top_counts = self.unigram_counts.most_common(None if size is None else (size - 3)) 15 | vocab = ([self.START_TOKEN, self.END_TOKEN, self.UNK_TOKEN] + 16 | [w for w,c in top_counts]) 17 | 18 | # Assign an id to each word, by frequency 19 | self.id_to_word = dict(enumerate(vocab)) 20 | self.word_to_id = {v:k for k,v in self.id_to_word.iteritems()} 21 | self.size = len(self.id_to_word) 22 | if size is not None: 23 | assert(self.size <= size) 24 | 25 | # For convenience 26 | self.wordset = set(self.word_to_id.iterkeys()) 27 | 28 | # Store special IDs 29 | self.START_ID = self.word_to_id[self.START_TOKEN] 30 | self.END_ID = self.word_to_id[self.END_TOKEN] 31 | self.UNK_ID = self.word_to_id[self.UNK_TOKEN] 32 | 33 | def words_to_ids(self, words): 34 | return [self.word_to_id.get(w, self.UNK_ID) for w in words] 35 | 36 | def ids_to_words(self, ids): 37 | return [self.id_to_word[i] for i in ids] 38 | 39 | def sentence_to_ids(self, words): 40 | return [self.START_ID] + self.words_to_ids(words) + [self.END_ID] 41 | 42 | def ordered_words(self): 43 | """Return a list of words, ordered by id.""" 44 | return self.ids_to_words(range(self.size)) 45 | -------------------------------------------------------------------------------- /icd9_cnn/vocabulary.pyc: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/zliendo/AI_MedicalNotes/87e1e8eb574b30c801b951e87b5edccc10d2a231/icd9_cnn/vocabulary.pyc -------------------------------------------------------------------------------- /pipeline/.ipynb_checkpoints/Temp Guillaume-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "deletable": true, 7 | "editable": true 8 | }, 9 | "source": [ 10 | "# Clean(er) Pipeline" 11 | ] 12 | }, 13 | { 14 | "cell_type": "markdown", 15 | "metadata": { 16 | "deletable": true, 17 | "editable": true 18 | }, 19 | "source": [ 20 | "This is an attempt to merge the pipelines from Zenobia and Guillaume" 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "metadata": { 26 | "deletable": true, 27 | "editable": true 28 | }, 29 | "source": [ 30 | "## Importing Modules" 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": 64, 36 | "metadata": { 37 | "collapsed": false, 38 | "deletable": true, 39 | "editable": true 40 | }, 41 | "outputs": [], 42 | "source": [ 43 | "# General imports\n", 44 | "import numpy as np\n", 45 | "import pandas as pd\n", 46 | "import re, nltk, string, os\n", 47 | "from sklearn.model_selection import train_test_split\n", 48 | "from sklearn.metrics import f1_score\n", 49 | "import datetime, time\n", 50 | "import matplotlib\n", 51 | "from collections import Counter\n", 52 | "from matplotlib import pyplot as plt\n", 53 | "matplotlib.style.use('ggplot')\n", 54 | "%matplotlib inline" 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": 65, 60 | "metadata": { 61 | "collapsed": false, 62 | "deletable": true, 63 | "editable": true 64 | }, 65 | "outputs": [], 66 | "source": [ 67 | "# NN imports\n", 68 | "# Upgrade the package called dask\n", 69 | "import keras\n", 70 | "from keras.preprocessing.text import Tokenizer\n", 71 | "from keras.preprocessing.sequence import pad_sequences\n", 72 | "from keras.layers import Input, Conv1D, MaxPooling1D, Flatten, Dense, Embedding\n", 73 | "from keras.models import Model" 74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": 66, 79 | "metadata": { 80 | "collapsed": false, 81 | "deletable": true, 82 | "editable": true 83 | }, 84 | "outputs": [ 85 | { 86 | "name": "stdout", 87 | "output_type": "stream", 88 | "text": [ 89 | "The autoreload extension is already loaded. To reload it, use:\n", 90 | " %reload_ext autoreload\n" 91 | ] 92 | } 93 | ], 94 | "source": [ 95 | "# Custom functions\n", 96 | "%load_ext autoreload\n", 97 | "%autoreload 2\n", 98 | "import database_selection\n", 99 | "import vectorization\n", 100 | "import helpers" 101 | ] 102 | }, 103 | { 104 | "cell_type": "markdown", 105 | "metadata": { 106 | "deletable": true, 107 | "editable": true 108 | }, 109 | "source": [ 110 | "## Select data corresponding to the top ICD codes" 111 | ] 112 | }, 113 | { 114 | "cell_type": "markdown", 115 | "metadata": { 116 | "deletable": true, 117 | "editable": true 118 | }, 119 | "source": [ 120 | "Here, we filter for only the top `n_top` ICD codes \n", 121 | "\n", 122 | "Note: We offer the option to exclude notes that do not contain any of the top codes. However, it may actually be more rigorous to keep them, no?" 
123 | ] 124 | }, 125 | { 126 | "cell_type": "code", 127 | "execution_count": 67, 128 | "metadata": { 129 | "collapsed": false, 130 | "deletable": true, 131 | "editable": true 132 | }, 133 | "outputs": [], 134 | "source": [ 135 | "# Inputs\n", 136 | "N_TOP = 20\n", 137 | "df = pd.read_csv('../data/disch_notes_all_icd9.csv',\n", 138 | " names = ['HADM_ID', 'SUBJECT_ID', 'DATE', 'ICD9','TEXT'])\n", 139 | "df = df.head(10000) # Speeding up" 140 | ] 141 | }, 142 | { 143 | "cell_type": "code", 144 | "execution_count": 68, 145 | "metadata": { 146 | "collapsed": false, 147 | "deletable": true, 148 | "editable": true 149 | }, 150 | "outputs": [], 151 | "source": [ 152 | "df, top_codes = database_selection.filter_top_codes(df, 'ICD9', N_TOP, filter_empty = False)" 153 | ] 154 | }, 155 | { 156 | "cell_type": "code", 157 | "execution_count": 69, 158 | "metadata": { 159 | "collapsed": false, 160 | "deletable": true, 161 | "editable": true 162 | }, 163 | "outputs": [ 164 | { 165 | "data": { 166 | "text/plain": [ 167 | "(10000, 5)" 168 | ] 169 | }, 170 | "execution_count": 69, 171 | "metadata": {}, 172 | "output_type": "execute_result" 173 | } 174 | ], 175 | "source": [ 176 | "df.shape" 177 | ] 178 | }, 179 | { 180 | "cell_type": "code", 181 | "execution_count": 70, 182 | "metadata": { 183 | "collapsed": false, 184 | "deletable": true, 185 | "editable": true 186 | }, 187 | "outputs": [ 188 | { 189 | "data": { 190 | "text/plain": [ 191 | "['4019', '4280', '42731', '41401', '5849']" 192 | ] 193 | }, 194 | "execution_count": 70, 195 | "metadata": {}, 196 | "output_type": "execute_result" 197 | } 198 | ], 199 | "source": [ 200 | "top_codes[0:5]" 201 | ] 202 | }, 203 | { 204 | "cell_type": "code", 205 | "execution_count": 71, 206 | "metadata": { 207 | "collapsed": false, 208 | "deletable": true, 209 | "editable": true 210 | }, 211 | "outputs": [ 212 | { 213 | "data": { 214 | "text/html": [ 215 | "
\n", 216 | "\n", 229 | "\n", 230 | " \n", 231 | " \n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | " \n", 237 | " \n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " \n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | "
HADM_IDSUBJECT_IDDATEICD9TEXT
0100001585262117-09-17 00:00:005849Admission Date: [**2117-9-11**] ...
1100003546102150-04-21 00:00:004019 2851Admission Date: [**2150-4-17**] ...
210000698952108-04-17 00:00:0051881 486Admission Date: [**2108-4-6**] Discharg...
3100007230182145-04-07 00:00:004019 486Admission Date: [**2145-3-31**] ...
41000095332162-05-21 00:00:004019 41401 25000 2720 2859Admission Date: [**2162-5-16**] ...
\n", 283 | "
" 284 | ], 285 | "text/plain": [ 286 | " HADM_ID SUBJECT_ID DATE ICD9 \\\n", 287 | "0 100001 58526 2117-09-17 00:00:00 5849 \n", 288 | "1 100003 54610 2150-04-21 00:00:00 4019 2851 \n", 289 | "2 100006 9895 2108-04-17 00:00:00 51881 486 \n", 290 | "3 100007 23018 2145-04-07 00:00:00 4019 486 \n", 291 | "4 100009 533 2162-05-21 00:00:00 4019 41401 25000 2720 2859 \n", 292 | "\n", 293 | " TEXT \n", 294 | "0 Admission Date: [**2117-9-11**] ... \n", 295 | "1 Admission Date: [**2150-4-17**] ... \n", 296 | "2 Admission Date: [**2108-4-6**] Discharg... \n", 297 | "3 Admission Date: [**2145-3-31**] ... \n", 298 | "4 Admission Date: [**2162-5-16**] ... " 299 | ] 300 | }, 301 | "execution_count": 71, 302 | "metadata": {}, 303 | "output_type": "execute_result" 304 | } 305 | ], 306 | "source": [ 307 | "df.head()" 308 | ] 309 | }, 310 | { 311 | "cell_type": "markdown", 312 | "metadata": { 313 | "deletable": true, 314 | "editable": true 315 | }, 316 | "source": [ 317 | "## Vectorize ICD9 codes" 318 | ] 319 | }, 320 | { 321 | "cell_type": "markdown", 322 | "metadata": { 323 | "deletable": true, 324 | "editable": true 325 | }, 326 | "source": [ 327 | "Here we vectorize and move it to an `np.array` because that is what TensorFlow prefers" 328 | ] 329 | }, 330 | { 331 | "cell_type": "code", 332 | "execution_count": 72, 333 | "metadata": { 334 | "collapsed": false, 335 | "deletable": true, 336 | "editable": true 337 | }, 338 | "outputs": [], 339 | "source": [ 340 | "labels = vectorization.vectorize_icd_column(df, 'ICD9', top_codes)" 341 | ] 342 | }, 343 | { 344 | "cell_type": "code", 345 | "execution_count": 73, 346 | "metadata": { 347 | "collapsed": false, 348 | "deletable": true, 349 | "editable": true 350 | }, 351 | "outputs": [ 352 | { 353 | "data": { 354 | "text/plain": [ 355 | "array([[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],\n", 356 | " [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],\n", 357 | " [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],\n", 358 | " [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],\n", 359 | " [1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])" 360 | ] 361 | }, 362 | "execution_count": 73, 363 | "metadata": {}, 364 | "output_type": "execute_result" 365 | } 366 | ], 367 | "source": [ 368 | "labels[0:5]" 369 | ] 370 | }, 371 | { 372 | "cell_type": "markdown", 373 | "metadata": { 374 | "deletable": true, 375 | "editable": true 376 | }, 377 | "source": [ 378 | "## Clean, and write text for embedding" 379 | ] 380 | }, 381 | { 382 | "cell_type": "code", 383 | "execution_count": 74, 384 | "metadata": { 385 | "collapsed": true, 386 | "deletable": true, 387 | "editable": true 388 | }, 389 | "outputs": [], 390 | "source": [ 391 | "# Clean\n", 392 | "df.TEXT = vectorization.clean_notes(df, 'TEXT')" 393 | ] 394 | }, 395 | { 396 | "cell_type": "code", 397 | "execution_count": 75, 398 | "metadata": { 399 | "collapsed": false, 400 | "deletable": true, 401 | "editable": true 402 | }, 403 | "outputs": [], 404 | "source": [ 405 | "helpers.write_col(df, 'TEXT', '../data/only_text.csv')" 406 | ] 407 | }, 408 | { 409 | "cell_type": "markdown", 410 | "metadata": { 411 | "deletable": true, 412 | "editable": true 413 | }, 414 | "source": [ 415 | "## Vectorize text and pad sequence" 416 | ] 417 | }, 418 | { 419 | "cell_type": "markdown", 420 | "metadata": { 421 | "deletable": true, 422 | "editable": true 423 | }, 424 | "source": [ 425 | "Here, we vectorize the text and pad with 0s so that notes appear of the same length" 426 | ] 427 
| }, 428 | { 429 | "cell_type": "code", 430 | "execution_count": 76, 431 | "metadata": { 432 | "collapsed": true, 433 | "deletable": true, 434 | "editable": true 435 | }, 436 | "outputs": [], 437 | "source": [ 438 | "# Inputs for tokenization\n", 439 | "MAX_VOCAB = None # to limit original vocabulary to most common words (None if no limit)\n", 440 | "MAX_SEQ_LENGTH = 5000 # to limit length of word sequence (None if no limit)" 441 | ] 442 | }, 443 | { 444 | "cell_type": "code", 445 | "execution_count": 77, 446 | "metadata": { 447 | "collapsed": false, 448 | "deletable": true, 449 | "editable": true 450 | }, 451 | "outputs": [ 452 | { 453 | "name": "stdout", 454 | "output_type": "stream", 455 | "text": [ 456 | "Vocabulary size: 60619\n", 457 | "Average note length: 1623.5809\n", 458 | "Max note length: 8725\n" 459 | ] 460 | } 461 | ], 462 | "source": [ 463 | "# Vectorize\n", 464 | "data, dictionary, MAX_VOCAB = vectorization.vectorize_notes(df.TEXT, MAX_VOCAB, verbose = True)" 465 | ] 466 | }, 467 | { 468 | "cell_type": "code", 469 | "execution_count": 78, 470 | "metadata": { 471 | "collapsed": true, 472 | "deletable": true, 473 | "editable": true 474 | }, 475 | "outputs": [], 476 | "source": [ 477 | "# Pad and turn into a matrix\n", 478 | "data, MAX_SEQ_LENGTH = vectorization.pad_notes(data, MAX_SEQ_LENGTH)" 479 | ] 480 | }, 481 | { 482 | "cell_type": "code", 483 | "execution_count": 79, 484 | "metadata": { 485 | "collapsed": false, 486 | "deletable": true, 487 | "editable": true 488 | }, 489 | "outputs": [ 490 | { 491 | "name": "stdout", 492 | "output_type": "stream", 493 | "text": [ 494 | "Final Vocabulary: 60619\n", 495 | "Final Max Sequence Length: 5000\n" 496 | ] 497 | } 498 | ], 499 | "source": [ 500 | "print(\"Final Vocabulary: %s\" % MAX_VOCAB)\n", 501 | "print(\"Final Max Sequence Length: %s\" % MAX_SEQ_LENGTH)" 502 | ] 503 | }, 504 | { 505 | "cell_type": "code", 506 | "execution_count": 80, 507 | "metadata": { 508 | "collapsed": false, 509 | "deletable": true, 510 | "editable": true 511 | }, 512 | "outputs": [ 513 | { 514 | "data": { 515 | "text/plain": [ 516 | "array([[ 0, 0, 0, ..., 2998, 24, 88],\n", 517 | " [ 0, 0, 0, ..., 1, 374, 35],\n", 518 | " [ 0, 0, 0, ..., 1, 1, 621],\n", 519 | " [ 0, 0, 0, ..., 32, 374, 35],\n", 520 | " [ 0, 0, 0, ..., 67, 374, 35]], dtype=int32)" 521 | ] 522 | }, 523 | "execution_count": 80, 524 | "metadata": {}, 525 | "output_type": "execute_result" 526 | } 527 | ], 528 | "source": [ 529 | "data[0:5] " 530 | ] 531 | }, 532 | { 533 | "cell_type": "markdown", 534 | "metadata": { 535 | "deletable": true, 536 | "editable": true 537 | }, 538 | "source": [ 539 | "## Split into Sets" 540 | ] 541 | }, 542 | { 543 | "cell_type": "markdown", 544 | "metadata": { 545 | "deletable": true, 546 | "editable": true 547 | }, 548 | "source": [ 549 | "Here we split into sets and free up some memory" 550 | ] 551 | }, 552 | { 553 | "cell_type": "code", 554 | "execution_count": 81, 555 | "metadata": { 556 | "collapsed": false, 557 | "deletable": true, 558 | "editable": true 559 | }, 560 | "outputs": [], 561 | "source": [ 562 | "X_train, X_val, X_test, y_train, y_val, y_test = helpers.train_val_test_split(\n", 563 | " data, labels, val_size=0.2, test_size=0.1, random_state=101)" 564 | ] 565 | }, 566 | { 567 | "cell_type": "code", 568 | "execution_count": 82, 569 | "metadata": { 570 | "collapsed": false, 571 | "deletable": true, 572 | "editable": true 573 | }, 574 | "outputs": [ 575 | { 576 | "name": "stdout", 577 | "output_type": "stream", 578 | "text": [ 579 | "Train: 
(6999, 5000) (6999, 20)\n", 580 | "Validation: (2000, 5000) (2000, 20)\n", 581 | "Test: (1001, 5000) (1001, 20)\n" 582 | ] 583 | } 584 | ], 585 | "source": [ 586 | "print(\"Train: \", X_train.shape, y_train.shape)\n", 587 | "print(\"Validation: \", X_val.shape, y_val.shape)\n", 588 | "print(\"Test: \", X_test.shape, y_test.shape)" 589 | ] 590 | }, 591 | { 592 | "cell_type": "code", 593 | "execution_count": 83, 594 | "metadata": { 595 | "collapsed": false, 596 | "deletable": true, 597 | "editable": true 598 | }, 599 | "outputs": [], 600 | "source": [ 601 | "# Delete temporary variables to free some memory\n", 602 | "del df, data, labels" 603 | ] 604 | }, 605 | { 606 | "cell_type": "markdown", 607 | "metadata": { 608 | "deletable": true, 609 | "editable": true 610 | }, 611 | "source": [ 612 | "## Reload Embedding Matrix" 613 | ] 614 | }, 615 | { 616 | "cell_type": "markdown", 617 | "metadata": { 618 | "deletable": true, 619 | "editable": true 620 | }, 621 | "source": [ 622 | "Creates an embedding matrix based on a pretrained vector" 623 | ] 624 | }, 625 | { 626 | "cell_type": "markdown", 627 | "metadata": { 628 | "deletable": true, 629 | "editable": true 630 | }, 631 | "source": [ 632 | "List of pretrained vectors http://ahogrammer.com/2017/01/20/the-list-of-pretrained-word-embeddings/ for embedding Google cannot be downloaded, so I used Glove: \n", 633 | "- Go to https://nlp.stanford.edu/projects/glove/\n", 634 | "- Download a pretrained model, e.g. `glove.6B.zip`, and put the unzipped files in `/data`" 635 | ] 636 | }, 637 | { 638 | "cell_type": "code", 639 | "execution_count": 97, 640 | "metadata": { 641 | "collapsed": false, 642 | "deletable": true, 643 | "editable": true 644 | }, 645 | "outputs": [], 646 | "source": [ 647 | "EMBEDDING_LOC = '../data/notes.100.txt' # location of embedding\n", 648 | "#EMBEDDING_LOC = '../data/glove.6B.100d.txt' # location of embedding\n", 649 | "EMBEDDING_DIM = 100 # given the glove that we chose" 650 | ] 651 | }, 652 | { 653 | "cell_type": "code", 654 | "execution_count": 98, 655 | "metadata": { 656 | "collapsed": false, 657 | "deletable": true, 658 | "editable": true 659 | }, 660 | "outputs": [ 661 | { 662 | "name": "stdout", 663 | "output_type": "stream", 664 | "text": [ 665 | "Vocabulary in notes: 60619\n", 666 | "Vocabulary in original embedding: 21056\n", 667 | "Vocabulary intersection: 20640\n" 668 | ] 669 | } 670 | ], 671 | "source": [ 672 | "embedding_matrix, embedding_dict = vectorization.embedding_matrix(EMBEDDING_LOC,\n", 673 | " dictionary, EMBEDDING_DIM,\n", 674 | " verbose = True, sigma = None)" 675 | ] 676 | }, 677 | { 678 | "cell_type": "code", 679 | "execution_count": 99, 680 | "metadata": { 681 | "collapsed": false, 682 | "deletable": true, 683 | "editable": true 684 | }, 685 | "outputs": [ 686 | { 687 | "data": { 688 | "text/plain": [ 689 | "(60620, 100)" 690 | ] 691 | }, 692 | "execution_count": 99, 693 | "metadata": {}, 694 | "output_type": "execute_result" 695 | } 696 | ], 697 | "source": [ 698 | "embedding_matrix.shape" 699 | ] 700 | }, 701 | { 702 | "cell_type": "code", 703 | "execution_count": 100, 704 | "metadata": { 705 | "collapsed": true, 706 | "deletable": true, 707 | "editable": true 708 | }, 709 | "outputs": [], 710 | "source": [ 711 | "del embedding_dict" 712 | ] 713 | }, 714 | { 715 | "cell_type": "markdown", 716 | "metadata": { 717 | "deletable": true, 718 | "editable": true 719 | }, 720 | "source": [ 721 | "## Simple Neural Network" 722 | ] 723 | }, 724 | { 725 | "cell_type": "markdown", 726 | "metadata": { 727 | 
"deletable": true, 728 | "editable": true 729 | }, 730 | "source": [ 731 | "Simple Neural to show that it works\n", 732 | "- softmax with categorical cross entropy and adam gave f1 = 0.1696042216358839\n", 733 | "- sigmoid with glove original embedding gave 0.27048167970358172" 734 | ] 735 | }, 736 | { 737 | "cell_type": "code", 738 | "execution_count": 101, 739 | "metadata": { 740 | "collapsed": true, 741 | "deletable": true, 742 | "editable": true 743 | }, 744 | "outputs": [], 745 | "source": [ 746 | "EMBEDDING_TRAINABLE = True" 747 | ] 748 | }, 749 | { 750 | "cell_type": "code", 751 | "execution_count": 102, 752 | "metadata": { 753 | "collapsed": false, 754 | "deletable": true, 755 | "editable": true 756 | }, 757 | "outputs": [], 758 | "source": [ 759 | "# We build the embedding layer separately because it's a little more complex than the others\n", 760 | "embedding_layer = Embedding(len(dictionary) + 1,\n", 761 | " EMBEDDING_DIM,\n", 762 | " weights=[embedding_matrix],\n", 763 | " input_length=MAX_SEQ_LENGTH,\n", 764 | " trainable=EMBEDDING_TRAINABLE)" 765 | ] 766 | }, 767 | { 768 | "cell_type": "code", 769 | "execution_count": 103, 770 | "metadata": { 771 | "collapsed": false, 772 | "deletable": true, 773 | "editable": true 774 | }, 775 | "outputs": [], 776 | "source": [ 777 | "sequence_input = Input(shape=(MAX_SEQ_LENGTH,), dtype='int32')\n", 778 | "embedded_sequences = embedding_layer(sequence_input)\n", 779 | "embedded_sequences = Flatten()(embedded_sequences)\n", 780 | "preds = Dense(len(top_codes), activation='sigmoid')(embedded_sequences)\n", 781 | "\n", 782 | "model = Model(sequence_input, preds)\n", 783 | "model.compile(loss='binary_crossentropy',\n", 784 | " optimizer='rmsprop',\n", 785 | " metrics=['acc'])" 786 | ] 787 | }, 788 | { 789 | "cell_type": "code", 790 | "execution_count": 104, 791 | "metadata": { 792 | "collapsed": false, 793 | "deletable": true, 794 | "editable": true 795 | }, 796 | "outputs": [ 797 | { 798 | "name": "stdout", 799 | "output_type": "stream", 800 | "text": [ 801 | "Train on 6999 samples, validate on 2000 samples\n", 802 | "Epoch 1/2\n", 803 | "6999/6999 [==============================] - 81s - loss: 2.1248 - acc: 0.8452 - val_loss: 2.1141 - val_acc: 
0.8571\n", 804 | "Epoch 2/2\n", 805 | "6999/6999 [==============================] - 61s - loss: 1.8115 - acc: 0.8536 - val_loss: 1.8044 - val_acc: 
0.8483\n" 806 | ] 807 | }, 808 | { 809 | "data": { 810 | "text/plain": [ 811 | "" 812 | ] 813 | }, 814 | "execution_count": 104, 815 | "metadata": {}, 816 | "output_type": "execute_result" 817 | } 818 | ], 819 | "source": [ 820 | "model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=2, batch_size=128)" 821 | ] 822 | }, 823 | { 824 | "cell_type": "code", 825 | "execution_count": 105, 826 | "metadata": { 827 | "collapsed": true, 828 | "deletable": true, 829 | "editable": true 830 | }, 831 | "outputs": [], 832 | "source": [ 833 | "pred_val = model.predict(X_val, batch_size=128)" 834 | ] 835 | }, 836 | { 837 | "cell_type": "code", 838 | "execution_count": 106, 839 | "metadata": { 840 | "collapsed": false, 841 | "deletable": true, 842 | "editable": true 843 | }, 844 | "outputs": [ 845 | { 846 | "data": { 847 | "text/plain": [ 848 | "1.0" 849 | ] 850 | }, 851 | "execution_count": 106, 852 | "metadata": {}, 853 | "output_type": "execute_result" 854 | } 855 | ], 856 | "source": [ 857 | "np.max(pred_val)" 858 | ] 859 | }, 860 | { 861 | "cell_type": "code", 862 | "execution_count": 107, 863 | "metadata": { 864 | "collapsed": false, 865 | "deletable": true, 866 | "editable": true 867 | }, 868 | "outputs": [ 869 | { 870 | "data": { 871 | "text/plain": [ 872 | "0.2786826774462014" 873 | ] 874 | }, 875 | "execution_count": 107, 876 | "metadata": {}, 877 | "output_type": "execute_result" 878 | } 879 | ], 880 | "source": [ 881 | "f1_score(y_val, np.where(pred_val>0.5, 1, 0), average = 'micro')" 882 | ] 883 | }, 884 | { 885 | "cell_type": "code", 886 | "execution_count": null, 887 | "metadata": { 888 | "collapsed": true 889 | }, 890 | "outputs": [], 891 | "source": [] 892 | } 893 | ], 894 | "metadata": { 895 | "kernelspec": { 896 | "display_name": "Python 2", 897 | "language": "python", 898 | "name": "python2" 899 | }, 900 | "language_info": { 901 | 
"codemirror_mode": { 902 | "name": "ipython", 903 | "version": 2 904 | }, 905 | "file_extension": ".py", 906 | "mimetype": "text/x-python", 907 | "name": "python", 908 | "nbconvert_exporter": "python", 909 | "pygments_lexer": "ipython2", 910 | "version": "2.7.13" 911 | } 912 | }, 913 | "nbformat": 4, 914 | "nbformat_minor": 2 915 | } 916 | -------------------------------------------------------------------------------- /pipeline/__pycache__/database_selection.cpython-35.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zliendo/AI_MedicalNotes/87e1e8eb574b30c801b951e87b5edccc10d2a231/pipeline/__pycache__/database_selection.cpython-35.pyc -------------------------------------------------------------------------------- /pipeline/__pycache__/helpers.cpython-35.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zliendo/AI_MedicalNotes/87e1e8eb574b30c801b951e87b5edccc10d2a231/pipeline/__pycache__/helpers.cpython-35.pyc -------------------------------------------------------------------------------- /pipeline/__pycache__/vectorization.cpython-35.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zliendo/AI_MedicalNotes/87e1e8eb574b30c801b951e87b5edccc10d2a231/pipeline/__pycache__/vectorization.cpython-35.pyc -------------------------------------------------------------------------------- /pipeline/attention_util.py: -------------------------------------------------------------------------------- 1 | from keras.layers import Dense, Input 2 | from keras.layers import Merge,TimeDistributed 3 | from keras.layers.merge import Concatenate 4 | from keras.layers.core import * 5 | from keras.layers import merge, dot, add 6 | from keras import backend as K 7 | # based on paper: Hierarchical Attention networks for document classification 8 | # starting code from: 9 | # https://groups.google.com/forum/#!msg/keras-users/IWK9opMFavQ/AITppppfAgAJ 10 | 11 | # note: there is a lot of sample codes in the internet that do not work, and their authors do mention that, 12 | # they don't see a difference when applying the attention mechanism 13 | # 14 | # I did have to review closely the formulas presented on the papers about Attention to figure it out what type of 15 | # code will actually work 16 | # Author: Zenobia Liendo 17 | 18 | def attention_layer(inputs, TIME_STEPS,lstm_units, i='1'): 19 | 20 | # inputs.shape = (batch_size, time_steps, input_dim) 21 | #(3) u_it: we first feed the word annotation through a one-layer MLP to get the hidden representation u_it 22 | inputs= Dropout(0.5)(inputs) 23 | u_it = TimeDistributed(Dense(lstm_units, activation='tanh', 24 | kernel_regularizer=regularizers.l2(0.0001), 25 | name='u_it'+i))(inputs) 26 | 27 | u_it= Dropout(0.5)(u_it) 28 | # (4) alpha_it: then we measure the importance of x as the similarity of u_it with a x level 29 | # context vector u_w and get a normalized importance weight alpha_it through a softmax function 30 | # The word context vector uw is randomly initialized and jointly learned during the training process. 
31 | #alpha_it = TimeDistributed(Dense(TIME_STEPS, activation='softmax',use_bias=False))(u_it) 32 | att = TimeDistributed(Dense(1, 33 | kernel_regularizer=regularizers.l2(0.0001), 34 | bias=False))(u_it) 35 | att = Reshape((TIME_STEPS,))(att) 36 | att = Activation('softmax', name='alpha_it_softmax'+i)(att) 37 | 38 | 39 | # (5) s_i: After that, we compute the sentence vector s_i 40 | # as a weighted sum of the word annotations based on the weights alpha_it. 41 | s_i =merge([att, inputs], mode='dot', dot_axes=(1,1), name='s_i_dot'+i) 42 | 43 | 44 | return s_i -------------------------------------------------------------------------------- /pipeline/database_selection.py: -------------------------------------------------------------------------------- 1 | ###This file contains the functions reextracting the database only for the top ICD codes 2 | # Author: Zenobia Liendo 3 | 4 | import numpy as np 5 | import pandas as pd 6 | from collections import Counter 7 | 8 | 9 | def find_top_codes(df, col_name, n): 10 | """ Find the top codes from a columns of strings 11 | Returns a list of strings to make sure codes are treated as classes down the line """ 12 | string_total = df[col_name].str.cat(sep=' ') 13 | counter_total = Counter(string_total.split(' ')) 14 | return [word for word, word_count in counter_total.most_common(n)] 15 | 16 | def select_codes_in_string(string, top_codes): 17 | """ Creates a sring of the codes which are both in the original string 18 | and in the top codes list """ 19 | r = '' 20 | for code in top_codes: 21 | if code in string: 22 | r += ' ' + code 23 | return r.strip() 24 | 25 | def filter_top_codes(df, col_name, n, filter_empty = True): 26 | """ Creates a dataframe with the codes column containing only the top codes 27 | and filters out the lines without any of the top codes if True 28 | 29 | Note: we may actually want to keep even the empty lines """ 30 | r = df.copy() 31 | top_codes = find_top_codes(r, col_name, n) 32 | r[col_name] = r[col_name].apply(lambda x: select_codes_in_string(x, top_codes)) 33 | if filter_empty: 34 | r = r.loc[r[col_name] != ''] 35 | return r, top_codes 36 | -------------------------------------------------------------------------------- /pipeline/hatt_model.py: -------------------------------------------------------------------------------- 1 | from keras.models import Sequential, Model 2 | from keras.layers import Dense, Flatten, Input, Convolution1D 3 | from keras.layers import Conv1D, MaxPooling1D, Embedding, Merge, Dropout, LSTM, GRU, Bidirectional, TimeDistributed 4 | from keras.layers.merge import Concatenate 5 | from keras.layers.core import * 6 | from keras.layers import merge, dot, add 7 | from keras import backend as K 8 | from keras import optimizers 9 | 10 | import attention_util 11 | 12 | # based on paper: Hierarchical Attention networks for document classification 13 | # starting code from: 14 | # * https://github.com/richliao/textClassifier/blob/master/textClassifierHATT.py 15 | # but the github sources above had misteakes in the attention layer (IMO) that had been corrected here 16 | # Author: Zenobia Liendo 17 | 18 | def build_hierarhical_att_model(MAX_SENTS, MAX_SENT_LENGTH, embedding_matrix, 19 | max_vocab, embedding_dim, 20 | num_classes,training_dropout): 21 | 22 | # WORDS in one SENTENCE LAYER 23 | #----------------------------------------- 24 | #Embedding 25 | # note_input [sentences, words_in_a_sentence] 26 | sentence_input = Input(shape=(MAX_SENT_LENGTH,), dtype='int32') 27 | # use embedding_matrix 28 | # (1) embed the 
words to vectors through an embedding matrix 29 | embedded_sequences = Embedding(max_vocab + 1, 30 | embedding_dim, 31 | weights=[embedding_matrix], 32 | input_length=MAX_SENT_LENGTH, embeddings_regularizer=regularizers.l2(0.0001), 33 | trainable=True)(sentence_input) 34 | #embedded_sequences = Embedding(max_vocab + 1, 35 | # embedding_dim, 36 | # input_length=MAX_SENT_LENGTH, embeddings_regularizer=regularizers.l2(0.0001), 37 | # name="embedding")(sentence_input) 38 | 39 | # (2) GRU to get annotations of words by summarizing information 40 | # h_it: We obtain an annotation for a given word by concatenating the forward hidden state and 41 | # backward hidden state 42 | gru_dim = 50 43 | #h_it_sentence_vector = Bidirectional(GRU(gru_dim, return_sequences=True))(embedded_sequences) 44 | h_it_sentence_vector = Bidirectional(LSTM(gru_dim, return_sequences=True))(embedded_sequences) 45 | 46 | # Attention layer 47 | # Not all words contribute equally to the representation of the sentence meaning. 48 | # Hence, we introduce attention mechanism to extract such words that are important to the meaning of the 49 | # sentence and aggregate the representation of those informative words to form a sentence vector 50 | words_attention_vector = attention_util.attention_layer(h_it_sentence_vector,MAX_SENT_LENGTH,gru_dim) 51 | 52 | # Keras model that process words in one sentence 53 | sentEncoder = Model(sentence_input, words_attention_vector) 54 | 55 | print sentEncoder.summary() 56 | 57 | # SENTENCE LAYER 58 | #--------------------------------------------------------------------------------------------------------------------- 59 | note_input = Input(shape=(MAX_SENTS,MAX_SENT_LENGTH), dtype='int32') 60 | # TimeDistributes wrapper applies a layer to every temporal slice of an input. 
61 | # The input should be at least 3D, and the dimension of index one will be considered to be the temporal dimension 62 | # Here the sentEncoder is applied to each input record (a note) 63 | note_encoder = TimeDistributed(sentEncoder)(note_input) 64 | #document_vector = Bidirectional(GRU(gru_dim, return_sequences=True))(note_encoder) 65 | document_vector = Bidirectional(LSTM(gru_dim, return_sequences=True))(note_encoder) 66 | 67 | #attention layer 68 | sentences_attention_vector = attention_util.attention_layer(document_vector,MAX_SENTS,gru_dim) 69 | 70 | # output layer 71 | z = Dropout(training_dropout)(sentences_attention_vector) 72 | preds = Dense(num_classes, activation='sigmoid', name='preds')(z) 73 | 74 | #model 75 | model = Model(note_input, preds) 76 | 77 | model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"]) 78 | #sgd = optimizers.SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=False) 79 | #model.compile(loss="binary_crossentropy", optimizer="sgd", metrics=["accuracy"]) 80 | 81 | print("model fitting - Hierachical Attention GRU") 82 | print model.summary() 83 | 84 | return model -------------------------------------------------------------------------------- /pipeline/helpers.py: -------------------------------------------------------------------------------- 1 | ### This contains helper functions 2 | # Author: Zenobia Liendo 3 | 4 | import numpy as np 5 | import pandas as pd 6 | from sklearn.model_selection import train_test_split 7 | from sklearn.metrics import f1_score 8 | 9 | 10 | 11 | def train_val_test_split(X, y, val_size=0.2, test_size=0.2, random_state=101): 12 | """Splits the input and labels into 3 sets""" 13 | X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=(val_size+test_size), random_state=random_state) 14 | X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=test_size/(val_size+test_size), random_state=random_state) 15 | return X_train, X_val, X_test, y_train, y_val, y_test 16 | 17 | 18 | def replace_with_grandparent_codes(string_codes, ICD9_FIRST_LEVEL): 19 | """replace_with_grandparent_codes takes a list of ICD9 codes and 20 | returns the list of their grandparents ICD9 code in the first level of the ICD9 hierarchy""" 21 | ICD9_RANGES = [x.split('-') for x in ICD9_FIRST_LEVEL] 22 | resulting_codes = [] 23 | for code in string_codes.split(' '): 24 | for i,gparent_range in enumerate(ICD9_RANGES): 25 | range = gparent_range[1] if len(gparent_range) == 2 else gparent_range[0] 26 | if code[0:3] <= range: 27 | resulting_codes.append(ICD9_FIRST_LEVEL[i]) 28 | break 29 | return ' '.join (set(resulting_codes)) 30 | 31 | 32 | def write_col(df, col_name, fname): 33 | df[col_name].to_csv(fname) 34 | 35 | 36 | def get_f1_score(y_true,y_hat,threshold, average): 37 | hot_y = np.where(np.array(y_hat) > threshold, 1, 0) 38 | return f1_score(np.array(y_true), hot_y, average=average) 39 | 40 | def show_f1_score(y_train, pred_train, y_val, pred_dev): 41 | print('F1 scores') 42 | print('threshold | training | dev ') 43 | f1_score_average = 'micro' 44 | for threshold in [ 0.02, 0.03,0.04,0.05,0.055,0.058,0.06, 0.08, 0.1,0.2,0.3, 0.4, 0.5, 0.6,0.7]: 45 | train_f1 = get_f1_score(y_train, pred_train,threshold,f1_score_average) 46 | dev_f1 = get_f1_score(y_val, pred_dev,threshold,f1_score_average) 47 | print('%1.3f: %1.3f %1.3f' % (threshold,train_f1, dev_f1)) -------------------------------------------------------------------------------- /pipeline/icd9_cnn_att.py: 
-------------------------------------------------------------------------------- 1 | from keras.models import Sequential, Model 2 | from keras.layers import Dense, Dropout, Flatten, Input, MaxPooling1D, Convolution1D, Embedding 3 | from keras.layers.merge import Concatenate 4 | from keras import regularizers 5 | import attention_util 6 | from keras import optimizers 7 | 8 | # Author: Zenobia Liendo 9 | 10 | ''' code based on: 11 | https://github.com/alexander-rakhlin/CNN-for-Sentence-Classification-in-Keras/blob/master/sentiment_cnn.py 12 | http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/ 13 | https://github.com/dennybritz/cnn-text-classification-tf/blob/master/text_cnn.py 14 | ''' 15 | 16 | def build_icd9_cnn_model(input_seq_length, 17 | max_vocab, external_embeddings, embedding_dim, embedding_matrix, 18 | num_filters, filter_sizes, 19 | training_dropout, 20 | num_classes,trainable_Embeddings=True): 21 | #Embedding 22 | model_input = Input(shape=(input_seq_length, )) 23 | if external_embeddings: 24 | # use embedding_matrix 25 | z = Embedding(max_vocab + 1, 26 | embedding_dim, 27 | weights=[embedding_matrix], 28 | input_length=input_seq_length, 29 | trainable=trainable_Embeddings,embeddings_regularizer=regularizers.l2(0.0001))(model_input) 30 | else: 31 | # train embeddings 32 | z = Embedding(max_vocab + 1, 33 | embedding_dim, 34 | input_length=input_seq_length, embeddings_regularizer=regularizers.l2(0.0001), 35 | name="embedding")(model_input) 36 | 37 | #z = Dropout(0.1)(z) 38 | # Convolutional block 39 | conv_blocks = [] 40 | for i,sz in enumerate(filter_sizes): 41 | conv = Convolution1D(filters=num_filters, 42 | kernel_size=sz, 43 | padding="valid",kernel_regularizer=regularizers.l2(0.001), 44 | activation="relu", 45 | strides=1)(z) 46 | window_pool_size = input_seq_length - sz + 1 47 | #conv = MaxPooling1D(pool_size=window_pool_size)(conv) 48 | words_attention_vector = attention_util.attention_layer(conv, window_pool_size,50,str(i)) 49 | #conv = Flatten()(words_attention_vector) 50 | conv_blocks.append(words_attention_vector) 51 | 52 | #concatenate 53 | z = Concatenate()(conv_blocks) if len(conv_blocks) > 1 else conv_blocks[0] 54 | 55 | #score prediction 56 | #z = Dense(200, activation="relu")(z) #to avoid overfitting? I don't think this is necessary 57 | z = Dropout(training_dropout)(z) 58 | #model_output = Dense(num_classes, activation="softmax")(z) 59 | model_output = Dense(num_classes, kernel_regularizer=regularizers.l2(0.0001), 60 | activation="sigmoid")(z) 61 | 62 | #creating model 63 | model = Model(model_input, model_output) 64 | # what to use for tf.nn.softmax_cross_entropy_with_logits? 
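# With a sigmoid output layer and multi-hot ICD9 targets, binary_crossentropy is the
# multi-label counterpart of tf.nn.sigmoid_cross_entropy_with_logits (computed on
# probabilities rather than logits); softmax_cross_entropy_with_logits would instead
# pair with a softmax output and categorical_crossentropy, which is not suitable here
# because each note can carry several ICD9 codes at once.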
65 | adam_op = optimizers.Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0) 66 | model.compile(loss="binary_crossentropy", optimizer=adam_op, metrics=["accuracy"]) 67 | #model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"]) 68 | #model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"]) 69 | 70 | print model.summary() 71 | 72 | return model -------------------------------------------------------------------------------- /pipeline/icd9_cnn_model.py: -------------------------------------------------------------------------------- 1 | from keras.models import Sequential, Model 2 | from keras.layers import Dense, Dropout, Flatten, Input, MaxPooling1D, Convolution1D, Embedding 3 | from keras.layers.merge import Concatenate 4 | from keras import regularizers 5 | 6 | # Author: Zenobia Liendo 7 | 8 | ''' code based on: 9 | https://github.com/alexander-rakhlin/CNN-for-Sentence-Classification-in-Keras/blob/master/sentiment_cnn.py 10 | http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/ 11 | https://github.com/dennybritz/cnn-text-classification-tf/blob/master/text_cnn.py 12 | ''' 13 | 14 | def build_icd9_cnn_model(input_seq_length, 15 | max_vocab, external_embeddings, embedding_dim, embedding_matrix, 16 | num_filters, filter_sizes, 17 | training_dropout_keep_prob, 18 | num_classes): 19 | #Embedding 20 | model_input = Input(shape=(input_seq_length, )) 21 | if external_embeddings: 22 | # use embedding_matrix 23 | z = Embedding(max_vocab + 1, 24 | embedding_dim, 25 | weights=[embedding_matrix], 26 | input_length=input_seq_length, 27 | trainable=True)(model_input) 28 | else: 29 | # train embeddings 30 | z = Embedding(max_vocab + 1, 31 | embedding_dim, 32 | input_length=input_seq_length, embeddings_regularizer=regularizers.l2(0.0001), 33 | name="embedding")(model_input) 34 | 35 | # Convolutional block 36 | conv_blocks = [] 37 | for sz in filter_sizes: 38 | conv = Convolution1D(filters=num_filters, 39 | kernel_size=sz, 40 | padding="valid", 41 | activation="relu", 42 | strides=1)(z) 43 | window_pool_size = input_seq_length - sz + 1 44 | conv = MaxPooling1D(pool_size=window_pool_size)(conv) 45 | conv = Flatten()(conv) 46 | conv_blocks.append(conv) 47 | 48 | #concatenate 49 | z = Concatenate()(conv_blocks) if len(conv_blocks) > 1 else conv_blocks[0] 50 | z = Dropout(training_dropout_keep_prob)(z) 51 | 52 | #score prediction 53 | #z = Dense(num_classes, activation="relu")(z) I don't think this is necessary 54 | #model_output = Dense(num_classes, activation="softmax")(z) 55 | model_output = Dense(num_classes, activation="sigmoid")(z) 56 | 57 | #creating model 58 | model = Model(model_input, model_output) 59 | # what to use for tf.nn.softmax_cross_entropy_with_logits? 
60 | model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"]) 61 | #model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"]) 62 | 63 | print model.summary() 64 | 65 | return model -------------------------------------------------------------------------------- /pipeline/icd9_lstm_att_model.py: -------------------------------------------------------------------------------- 1 | from keras.models import Model 2 | from keras.layers import Dense, Dropout, Flatten, Input, Embedding,Bidirectional 3 | from keras.layers.merge import Concatenate 4 | from keras.layers import LSTM 5 | from keras.layers import MaxPooling1D, Embedding, Merge, Dropout, LSTM, Bidirectional 6 | from keras.layers.merge import Concatenate 7 | from keras.layers.core import * 8 | from keras.layers import merge, dot, add 9 | from keras import backend as K 10 | import attention_util 11 | 12 | # Author: Zenobia Liendo 13 | 14 | def build_lstm_att_model(input_seq_length, 15 | max_vocab, external_embeddings, embedding_trainable, embedding_dim, embedding_matrix, 16 | training_dropout_keep_prob,num_classes): 17 | #Embedding 18 | model_input = Input(shape=(input_seq_length, )) 19 | if external_embeddings: 20 | # use embedding_matrix 21 | z = Embedding(max_vocab + 1, 22 | embedding_dim, 23 | weights=[embedding_matrix], 24 | input_length=input_seq_length, 25 | trainable=embedding_trainable,name = "Embeddng")(model_input) 26 | else: 27 | # train embeddings 28 | z = Embedding(max_vocab + 1, 29 | embedding_dim, 30 | input_length=input_seq_length, 31 | name="Embedding")(model_input) 32 | 33 | # LSTM 34 | lstm_units= 50 35 | l_lstm = LSTM(lstm_units,return_sequences=True)(z) 36 | 37 | #attention 38 | words_attention_vector = attention_util.attention_layer(l_lstm,input_seq_length,lstm_units) 39 | 40 | #score prediction 41 | z = Dropout(training_dropout_keep_prob)(words_attention_vector) 42 | model_output = Dense(num_classes, activation="sigmoid", name="Output_Layer")(z) 43 | 44 | #creating model 45 | model = Model(model_input, model_output) 46 | # what to use for tf.nn.softmax_cross_entropy_with_logits? 
47 | model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"]) 48 | #model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"]) 49 | 50 | print model.summary() 51 | 52 | return model 53 | -------------------------------------------------------------------------------- /pipeline/icd9_lstm_cnn.py: -------------------------------------------------------------------------------- 1 | #%matplotlib inline 2 | # General imports 3 | import numpy as np 4 | import pandas as pd 5 | from sklearn.metrics import f1_score 6 | import random 7 | from collections import Counter, defaultdict 8 | from operator import itemgetter 9 | import matplotlib.pyplot as plt 10 | 11 | 12 | #keras 13 | from keras.models import Sequential, Model 14 | from keras.layers import Dense, Dropout, Flatten, Input, MaxPooling1D, Convolution1D, Embedding 15 | from keras.layers.merge import Concatenate 16 | from keras.models import load_model 17 | from IPython.display import SVG 18 | from keras.utils.vis_utils import model_to_dot 19 | 20 | # Custom functions 21 | #%load_ext autoreload 22 | #%autoreload 2 23 | import database_selection 24 | import vectorization 25 | import helpers 26 | import icd9_cnn_model 27 | import lstm_model 28 | import icd9_lstm_att_model 29 | 30 | 31 | # Author: Zenobia Liendo 32 | 33 | df = pd.read_csv('../data/disch_notes_all_icd9.csv', 34 | names = ['HADM_ID', 'SUBJECT_ID', 'DATE', 'ICD9','TEXT']) 35 | 36 | ICD9_FIRST_LEVEL = [ 37 | '001-139','140-239','240-279','290-319', '320-389', '390-459','460-519', '520-579', '580-629', 38 | '630-679', '680-709','710-739', '760-779', '780-789', '790-796', '797', '798', '799', '800-999' ] 39 | N_TOP = len(ICD9_FIRST_LEVEL) 40 | # replacing leave ICD9 codes with the grandparents 41 | df['ICD9'] = df['ICD9'].apply(lambda x: helpers.replace_with_grandparent_codes(x,ICD9_FIRST_LEVEL)) 42 | 43 | #counts by icd9_codes 44 | icd9_codes = Counter() 45 | for label in df['ICD9']: 46 | for icd9_code in label.split(): 47 | icd9_codes[icd9_code] += 1 48 | number_icd9_first_level = len (icd9_codes) 49 | 50 | top_codes = ICD9_FIRST_LEVEL 51 | labels = vectorization.vectorize_icd_column(df, 'ICD9', top_codes) 52 | 53 | #preprocess notes 54 | MAX_VOCAB = None # to limit original number of words (None if no limit) 55 | MAX_SEQ_LENGTH = 5000 # to limit length of word sequence (None if no limit) 56 | df.TEXT = vectorization.clean_notes(df, 'TEXT') 57 | data_vectorized, dictionary, MAX_VOCAB = vectorization.vectorize_notes(df.TEXT, MAX_VOCAB, verbose = True) 58 | data, MAX_SEQ_LENGTH = vectorization.pad_notes(data_vectorized, MAX_SEQ_LENGTH) 59 | 60 | EMBEDDING_DIM = 100 # given the glove that we chose 61 | EMBEDDING_MATRIX= [] 62 | 63 | #creating glove embeddings 64 | EMBEDDING_LOC = '../data/glove.6B.100d.txt' # location of embedding 65 | EMBEDDING_MATRIX, embedding_dict = vectorization.embedding_matrix(EMBEDDING_LOC, 66 | dictionary, EMBEDDING_DIM, verbose = True, sigma=True) 67 | 68 | #split sets 69 | X_train, X_val, X_test, y_train, y_val, y_test = helpers.train_val_test_split( 70 | data, labels, val_size=0.2, test_size=0.1, random_state=101) 71 | print("Train: ", X_train.shape, y_train.shape) 72 | print("Validation: ", X_val.shape, y_val.shape) 73 | print("Test: ", X_test.shape, y_test.shape) 74 | 75 | # Delete temporary variables to free some memory 76 | del df, data, labels 77 | 78 | # finding out the top icd9 codes 79 | top_4_icd9 = icd9_codes.most_common(4) 80 | print "most common 4 icd9_codes: ", top_4_icd9 81 | top_4_icd9_label = 
' '.join(code for code,count in top_4_icd9 ) 82 | print 'label for the top 4 icd9 codes: ', top_4_icd9_label 83 | 84 | #converting ICD9 prediction to a vector 85 | top4_icd9_vector = vectorization.vectorize_icd_string(top_4_icd9_label, ICD9_FIRST_LEVEL) 86 | 87 | ## assign icd9_prediction_vector to every discharge 88 | train_y_hat_baseline = [top4_icd9_vector]* len (y_train) 89 | dev_y_hat_baseline = [top4_icd9_vector]* len (y_val) 90 | 91 | reload(lstm_model) 92 | ##### build model 93 | l_model = lstm_model.build_lstm_model (input_seq_length=MAX_SEQ_LENGTH, max_vocab = MAX_VOCAB, 94 | external_embeddings = True, embedding_trainable =True, 95 | embedding_dim=EMBEDDING_DIM,embedding_matrix=EMBEDDING_MATRIX, 96 | num_classes=N_TOP ) 97 | 98 | l_model.fit(X_train, y_train, batch_size=50, epochs=10, validation_data=(X_val, y_val), verbose=1) 99 | pred_train = l_model.predict(X_train, batch_size=100) 100 | pred_dev = l_model.predict(X_val, batch_size=100) 101 | helpers.show_f1_score(y_train, pred_train, y_val, pred_dev) 102 | 103 | reload(icd9_lstm_att_model) 104 | #### build model 105 | latt_model = icd9_lstm_att_model.build_lstm_att_model (input_seq_length=MAX_SEQ_LENGTH, max_vocab = MAX_VOCAB, 106 | external_embeddings = True, embedding_trainable =True, 107 | embedding_dim=EMBEDDING_DIM,embedding_matrix=EMBEDDING_MATRIX, 108 | num_classes=N_TOP ) 109 | 110 | #model_lst_att_fit = latt_model.fit(X_train, y_train, batch_size=50, epochs=1, validation_data=(X_val, y_val), verbose=1) 111 | 112 | model_lst_att_fit = latt_model.fit(X_train, y_train, batch_size=50, epochs=10, validation_data=(X_val, y_val), verbose=1) 113 | pred_train = latt_model.predict(X_train, batch_size=100) 114 | pred_dev = latt_model.predict(X_val, batch_size=100) 115 | helpers.show_f1_score(y_train, pred_train, y_val, pred_dev) 116 | latt_model.save('models/latt_model_5_epochs_5k.h5') 117 | 118 | 119 | reload(icd9_cnn_model) 120 | #### build model 121 | model = icd9_cnn_model.build_icd9_cnn_model (input_seq_length=MAX_SEQ_LENGTH, max_vocab = MAX_VOCAB, 122 | external_embeddings = False, 123 | embedding_dim=EMBEDDING_DIM,embedding_matrix=EMBEDDING_MATRIX, 124 | num_filters = 100, filter_sizes=[2,3,4,5], 125 | training_dropout_keep_prob=0.5, 126 | num_classes=N_TOP ) 127 | 128 | 129 | 130 | model.fit(X_train, y_train, batch_size=50, epochs=20, validation_data=(X_val, y_val), verbose=2) 131 | 132 | pred_train = model.predict(X_train, batch_size=50) 133 | pred_dev = model.predict(X_val, batch_size=50) 134 | # perform evaluation 135 | helpers.show_f1_score(y_train, pred_train, y_val, pred_dev) 136 | 137 | model.save('models/cnn_20_epochs.h5') 138 | 139 | 140 | 141 | -------------------------------------------------------------------------------- /pipeline/lstm_model.py: -------------------------------------------------------------------------------- 1 | from keras.models import Model 2 | from keras.layers import Dense, Dropout, Flatten, Input, Embedding,Bidirectional 3 | from keras.layers.merge import Concatenate 4 | from keras.layers import LSTM 5 | 6 | # Author: Zenobia Liendo 7 | def build_lstm_model(input_seq_length, 8 | max_vocab, external_embeddings, embedding_trainable, embedding_dim, embedding_matrix, 9 | training_dropout_keep_prob, num_classes): 10 | #Embedding 11 | model_input = Input(shape=(input_seq_length, )) 12 | if external_embeddings: 13 | # use embedding_matrix 14 | z = Embedding(max_vocab + 1, 15 | embedding_dim, 16 | weights=[embedding_matrix], 17 | input_length=input_seq_length, 18 | 
trainable=embedding_trainable)(model_input) 19 | else: 20 | # train embeddings 21 | z = Embedding(max_vocab + 1, 22 | embedding_dim, 23 | input_length=input_seq_length, 24 | name="embedding")(model_input) 25 | 26 | # LSTM 27 | l_lstm = LSTM(50)(z) 28 | 29 | z = Dropout(training_dropout_keep_prob)(l_lstm) 30 | 31 | #score prediction 32 | model_output = Dense(num_classes, activation="sigmoid")(z) 33 | 34 | #creating model 35 | model = Model(model_input, model_output) 36 | # what to use for tf.nn.softmax_cross_entropy_with_logits? 37 | model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"]) 38 | #model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"]) 39 | 40 | print model.summary() 41 | 42 | return model -------------------------------------------------------------------------------- /pipeline/vectorization.py: -------------------------------------------------------------------------------- 1 | ### This file contains the functions necessary to vectorize the ICD labels and text inputs 2 | # Author: Guillaume De Roo 3 | import numpy as np 4 | import pandas as pd 5 | import re 6 | import keras 7 | from keras.preprocessing.text import Tokenizer 8 | from keras.preprocessing.sequence import pad_sequences 9 | 10 | 11 | # Vectorize ICD codes 12 | 13 | def vectorize_icd_string(x, code_list): 14 | """Takes a string with ICD codes and returns an array of the right of 0/1""" 15 | r = [] 16 | for code in code_list: 17 | if code in x: r.append(1) 18 | else: r.append(0) 19 | return np.asarray(r) 20 | 21 | def vectorize_icd_column(df, col_name, code_list): 22 | """Takes a column and applies the """ 23 | r = df[col_name].apply(lambda x: vectorize_icd_string(x, code_list)) 24 | r = np.transpose(np.column_stack(r)) 25 | return r 26 | 27 | 28 | # Clean Text 29 | 30 | def clean_str(string): 31 | """Cleaning of notes""" 32 | 33 | """ Cleaning from Guillaume """ 34 | string = string.lower() 35 | string = string.replace("\n", " ") # remove the lines 36 | string = re.sub("\[\*\*.*?\*\*\]", "", string) # remove the things inside the [** **] 37 | string = re.sub("[^a-zA-Z0-9\ \']+", " ", string) 38 | 39 | """ Tokenization/string cleaning for all datasets except for SST. 40 | Original taken from https://github.com/yoonkim/CNN_sentence/blob/master/process_data.py 41 | """ 42 | #string = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", string) 43 | string = re.sub(r"\'s", " \'s", string) 44 | string = re.sub(r"\'ve", " \'ve", string) 45 | string = re.sub(r"n\'t", " n\'t", string) 46 | string = re.sub(r"\'re", " \'re", string) 47 | string = re.sub(r"\'d", " \'d", string) 48 | string = re.sub(r"\'ll", " \'ll", string) 49 | #string = re.sub(r",", " , ", string) 50 | #string = re.sub(r"!", " ! ", string) 51 | #string = re.sub(r"\(", " \( ", string) 52 | #string = re.sub(r"\)", " \) ", string) 53 | #string = re.sub(r"\?", " \? 
", string) 54 | string = re.sub(r"\s{2,}", " ", string) 55 | 56 | """ Canonize numbers""" 57 | string = re.sub(r"(\d+)", "DG", string) 58 | 59 | return string.strip() 60 | 61 | def clean_notes(df, col_name): 62 | r = df[col_name].apply(lambda x: clean_str(x)) 63 | return r 64 | 65 | 66 | # Vectorize and Pad notes Text 67 | 68 | def vectorize_notes(col, MAX_NB_WORDS, verbose = True): 69 | """Takes a note column and encodes it into a series of integer 70 | Also returns the dictionnary mapping the word to the integer""" 71 | tokenizer = Tokenizer(num_words = MAX_NB_WORDS) 72 | tokenizer.fit_on_texts(col) 73 | data = tokenizer.texts_to_sequences(col) 74 | note_length = [len(x) for x in data] 75 | vocab = tokenizer.word_index 76 | MAX_VOCAB = len(vocab) 77 | if verbose: 78 | print('Vocabulary size: %s' % MAX_VOCAB) 79 | print('Average note length: %s' % np.mean(note_length)) 80 | print('Max note length: %s' % np.max(note_length)) 81 | return data, vocab, MAX_VOCAB 82 | 83 | def pad_notes(data, MAX_SEQ_LENGTH): 84 | data = pad_sequences(data, maxlen = MAX_SEQ_LENGTH) 85 | return data, data.shape[1] 86 | 87 | 88 | # Creates an embedding Matrix 89 | # Based on https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html 90 | 91 | def embedding_matrix(f_name, dictionary, EMBEDDING_DIM, verbose = True, sigma = None): 92 | """Takes a pre-trained embedding and adapts it to the dictionary at hand 93 | Words not found will be all-zeros in the matrix""" 94 | 95 | # Dictionary of words from the pre trained embedding 96 | pretrained_dict = {} 97 | with open(f_name, 'r') as f: 98 | for line in f: 99 | values = line.split() 100 | word = values[0] 101 | coefs = np.asarray(values[1:], dtype='float32') 102 | pretrained_dict[word] = coefs 103 | 104 | # Default values for absent words 105 | if sigma: 106 | pretrained_matrix = sigma * np.random.rand(len(dictionary) + 1, EMBEDDING_DIM) 107 | else: 108 | pretrained_matrix = np.zeros((len(dictionary) + 1, EMBEDDING_DIM)) 109 | 110 | # Substitution of default values by pretrained values when applicable 111 | for word, i in dictionary.items(): 112 | vector = pretrained_dict.get(word) 113 | if vector is not None: 114 | pretrained_matrix[i] = vector 115 | 116 | if verbose: 117 | print('Vocabulary in notes:', len(dictionary)) 118 | print('Vocabulary in original embedding:', len(pretrained_dict)) 119 | inter = list( set(dictionary.keys()) & set(pretrained_dict.keys()) ) 120 | print('Vocabulary intersection:', len(inter)) 121 | 122 | return pretrained_matrix, pretrained_dict 123 | -------------------------------------------------------------------------------- /pre_processing/MIMICERdiagram.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zliendo/AI_MedicalNotes/87e1e8eb574b30c801b951e87b5edccc10d2a231/pre_processing/MIMICERdiagram.png -------------------------------------------------------------------------------- /pre_processing/README.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | ## ICD9-codes 4 | 5 | 6 | ER database diagram of tables used in this project 7 | 8 | 9 | ![database diagram of tables used in this project](MIMICERdiagram.png) 10 | 11 | The DIAGNOSES_ICD table contains the ICD9-codes assigned to a hospital admission. There could be many ICD9-codes assigned to one admission. 12 | There are 57,786 admissions in this table, with a total of 651,047 ICD9-codes assigned. 
13 | 
14 | Related previous research did not work with all ICD9-codes, but only with the ones used most often in diagnosis reports [2][3]. We will not consider ICD9-codes that start with "E" (additional information indicating the cause of injury or adverse event) or "V" (codes used when the visit is due to circumstances other than disease or injury, e.g. newborn codes indicating birth status).
15 | 
16 | * We identify the top 20 labels based on the number of patients with that label.
17 | * We then remove all patients who don’t have at least one of these labels,
18 | * and then filter the set of labels for each patient to only include these labels.
19 | 
20 | As a result, we get 45,293 admissions with 152,299 icd9-codes (counting only the ones in the top 20).
21 | 
22 | Here is the list of the top 20 ICD9-codes that will be used in the baseline:
23 | 
24 | ```
25 | select icd9_code, COUNT(DISTINCT SUBJECT_ID) subjects_qty
26 | from diagnoses_icd where SUBSTRING(icd9_code from 1 for 1) != 'V'
27 | group by icd9_code order by subjects_qty
28 | desc limit 20;
29 | 
30 | icd9_code | subjects_qty
31 | -----------+--------
32 | 4019 | 17613
33 | 41401 | 10775
34 | 42731 | 10271
35 | 4280 | 9843
36 | 5849 | 7687
37 | 2724 | 7465
38 | 25000 | 7370
39 | 51881 | 6719
40 | 5990 | 5779
41 | 2720 | 5335
42 | 53081 | 5272
43 | 2859 | 4993
44 | 486 | 4423
45 | 2851 | 4241
46 | 2762 | 4177
47 | 2449 | 3819
48 | 496 | 3592
49 | 99592 | 3560
50 | 0389 | 3433
51 | 5070 | 3396
52 | (20 rows)
53 | 
54 | 
55 | ```
56 | 
57 | The list of admissions resulting from the filtering above is in this file: baseline\psql_files\diagnoses_icd_codes.csv
58 | (note: we removed all files containing MIMIC data because accessing them requires authorization from MIMIC)
59 | 
60 | Here is the SQL that created that file:
61 | 
62 | ```
63 | select hadm_id, max(subject_id) subject_id, string_agg(icd9_code, ' ') icd9_codes
64 | from diagnoses_icd
65 | where icd9_code IN ( select icd9_code from (select icd9_code, COUNT(DISTINCT SUBJECT_ID) subjects_qty
66 | from diagnoses_icd where SUBSTRING(icd9_code from 1 for 1) != 'V'
67 | group by icd9_code order by subjects_qty desc limit 20) as icd9_subject_list)
68 | group by hadm_id;
69 | ```
70 | 
71 | (note: for the final model, do we consider the code's description and/or hierarchy?)
72 | 
73 | ## Clinical Notes
74 | 
75 | The clinical notes related to an admission are located in the NOTEEVENTS and ADMISSION tables.
76 | The ADMISSION table has only one type of clinical note, the one related to the preliminary diagnoses made during admission.
77 | The NOTEEVENTS table contains all the other types of clinical notes; here is the list of these types:
78 | ```
79 | mimic=# select category from noteevents group by category;
80 | category
81 | -------------------
82 | ECG
83 | Respiratory
84 | Discharge summary
85 | Radiology
86 | Rehab Services
87 | Nursing/other
88 | Nutrition
89 | Pharmacy
90 | Social Work
91 | Case Management
92 | Physician
93 | General
94 | Nursing
95 | Echo
96 | Consult
97 | (15 rows)
98 | 
99 | ```
100 | The baseline will ONLY use the 'Discharge Summary' clinical notes. 
(note: we may use the other clinical notes for the final project)
101 | An example of one discharge summary note can be found at: baseline/psql_files/discharge_note_sample.out
102 | (we removed all data-related files since they require authorization granted by MIMIC)
103 | 
104 | 
105 | It looks like a discharge summary can have addendums; we will not include addendums in the baseline.
106 | ```
107 | mimic=# select category, description from noteevents where category = 'Discharge summary' group by category, description;
108 | category | description
109 | -------------------+-------------
110 | Discharge summary | Report
111 | Discharge summary | Addendum
112 | (2 rows)
113 | ```
114 | 
115 | It looks like there are duplicate entries for some admissions; for example:
116 | ```
117 | mimic=# select HADM_ID, SUBJECT_ID, CHARTDATE, CGID, ISERROR, substring(TEXT from 1 for 20)
118 | mimic-# from noteevents
119 | mimic-# where HADM_ID = '178053' and noteevents.category = 'Discharge summary' and noteevents.DESCRIPTION = 'Report';
120 | 
121 | hadm_id | subject_id | chartdate | cgid | iserror | substring
122 | ---------+------------+---------------------+------+---------+----------------------
123 | 178053 | 18976 | 2120-11-28 00:00:00 | | | Admission Date: [**
124 | 178053 | 18976 | 2120-11-28 00:00:00 | | | Admission Date: [**
125 | 178053 | 18976 | 2120-12-16 00:00:00 | | | Admission Date: [**
126 | 178053 | 18976 | 2120-12-16 00:00:00 | | | Admission Date: [**
127 | 178053 | 18976 | 2120-11-26 00:00:00 | | | Admission Date: [**
128 | 
129 | (5 rows)
130 | ```
131 | In this case the clinical notes are different; it seems multiple entries were made for the same discharge, and the discharge happened on two different dates under the same admission_id (that looks like a mistake, since a returning patient should get a new admission_id).
132 | 
133 | We will handle this situation for the final model.
134 | 
135 | ## Joining information from the clinical notes and the corresponding ICD9 codes
136 | 
137 | This is the SQL statement that created a table joining the discharge summary notes with their ICD9 codes that are in the top 20.
138 | ```
139 | DROP TABLE IF EXISTS W266_DISCHARGE_NOTE_ICD9_CODES;
140 | CREATE TABLE W266_DISCHARGE_NOTE_ICD9_CODES AS
141 | 
142 | select noteevents.HADM_ID, dianoses_top20_icd.SUBJECT_ID, noteevents.CHARTDATE, noteevents.TEXT, dianoses_top20_icd.ICD9_CODES
143 | from noteevents
144 | JOIN
145 | (select HADM_ID, max(SUBJECT_ID) SUBJECT_ID, string_agg(ICD9_CODE, ' ') ICD9_CODES
146 | from diagnoses_icd
147 | where ICD9_CODE IN ( select ICD9_CODE from (select ICD9_CODE, COUNT(DISTINCT SUBJECT_ID) subjects_qty
148 | from diagnoses_icd
149 | where SUBSTRING(ICD9_CODE from 1 for 1) != 'V'
150 | group by ICD9_CODE order by subjects_qty
151 | desc limit 20) as icd9_subject_list)
152 | group by HADM_ID ) as dianoses_top20_icd
153 | ON (noteevents.HADM_ID = dianoses_top20_icd.HADM_ID)
154 | where noteevents.category = 'Discharge summary' and noteevents.DESCRIPTION = 'Report';
155 | 
156 | 
157 | CREATE INDEX W266_DISCHARGE_NOTE_ICD9_CODES_index
158 | ON W266_DISCHARGE_NOTE_ICD9_CODES(HADM_ID) ;
159 | 
160 | 
161 | ```
162 | 
163 | The file is about 474 MB.
164 | 
165 | ## Representing clinical notes for the baseline
166 | 
167 | Previous research represents these documents as bag-of-words vectors [1]. In particular, it takes the 10,000 tokens with the largest tf-idf scores from the training set. 
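As a rough illustration only (not code from this repo), such a representation could be built with scikit-learn; here `train_notes` and `val_notes` are assumed lists of discharge-summary strings, and `max_features=10000` only approximates "the 10,000 tokens with the largest tf-idf scores", since scikit-learn selects the vocabulary by corpus term frequency:

```
# Sketch of a tf-idf bag-of-words baseline representation (assumptions noted above)
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=10000)   # keep roughly the top 10,000 tokens
X_train = vectorizer.fit_transform(train_notes)    # learn the vocabulary on training notes only
X_val = vectorizer.transform(val_notes)            # reuse the same vocabulary for validation
```

(The deep-learning pipeline in this repo does not use this representation; it tokenizes and pads the notes with the Keras Tokenizer in pipeline/vectorization.py.)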
168 | 169 | (note for the final model: we could use POS tagging, parsing, and entity recognition here) 170 | 171 | 172 | ## References 173 | [1] Diagnosis code assignment: models and evaluation metrics. Journal of the American Medical Informatics Association. 174 | [2] Applying Deep Learning to ICD-9 Multi-label Classification from Medical Records. 175 | [3] ICD-9 Coding of Discharge Summaries. 176 | [4] Large-scale Multi-label Text Classification - Revisiting Neural Networks. 177 | -------------------------------------------------------------------------------- /pre_processing/psql_files/create_discharge_notes_all_icd9: -------------------------------------------------------------------------------- 1 | DROP TABLE IF EXISTS W266_DISCHARGE_NOTE_ICD9_CODES_B; 2 | CREATE TABLE W266_DISCHARGE_NOTE_ICD9_CODES_B AS 3 | 4 | select notes.HADM_ID, diagnoses_icd9.SUBJECT_ID, notes.CHARTDATE, diagnoses_icd9.ICD9_CODES, notes.NOTE_TEXT 5 | from 6 | (select HADM_ID, max(SUBJECT_ID), MIN(CHARTDATE) CHARTDATE, string_agg(TEXT, ' ' ORDER BY CHARTDATE) NOTE_TEXT 7 | from noteevents 8 | where category = 'Discharge summary' 9 | group by HADM_ID ) as notes 10 | JOIN 11 | ( select HADM_ID, max(SUBJECT_ID) SUBJECT_ID, string_agg(ICD9_CODE, ' ') ICD9_CODES 12 | from diagnoses_icd 13 | where SUBSTRING(ICD9_CODE from 1 for 1) != 'V' and SUBSTRING(ICD9_CODE from 1 for 1) != 'E' 14 | group by HADM_ID ) as diagnoses_icd9 15 | ON (notes.HADM_ID = diagnoses_icd9.HADM_ID); 16 | 17 | 18 | CREATE INDEX W266_DISCHARGE_NOTE_ICD9_B_CODES_index 19 | ON W266_DISCHARGE_NOTE_ICD9_CODES_B(HADM_ID) ; 20 | -------------------------------------------------------------------------------- /pre_processing/psql_files/query01_top_icd9_codes.sql: -------------------------------------------------------------------------------- 1 | -- We identify the top 20 labels based on the number of patients with that label.
2 | 3 | select icd9_code, COUNT(DISTINCT SUBJECT_ID) subjects_qty 4 | from diagnoses_icd 5 | where SUBSTRING(icd9_code from 1 for 1) != 'V' 6 | group by icd9_code 7 | order by subjects_qty 8 | desc limit 20; 9 | 10 | 11 | -------------------------------------------------------------------------------- /pre_processing/psql_files/query02_filter_diagnoses_by_icd9_code.sql: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zliendo/AI_MedicalNotes/87e1e8eb574b30c801b951e87b5edccc10d2a231/pre_processing/psql_files/query02_filter_diagnoses_by_icd9_code.sql -------------------------------------------------------------------------------- /pre_processing/psql_files/query03C_icd9_codes_by_admission_create_table.sql: -------------------------------------------------------------------------------- 1 | DROP TABLE IF EXISTS W266_DIAGNOSES_TOP_ICD9_CODES; 2 | CREATE TABLE W266_DIAGNOSES_TOP_ICD9_CODES AS 3 | 4 | select hadm_id, max(subject_id) subject_id, string_agg(icd9_code, ' ') icd9_codes 5 | from diagnoses_icd 6 | where icd9_code IN ( select icd9_code from (select icd9_code, COUNT(DISTINCT SUBJECT_ID) subjects_qty 7 | from diagnoses_icd 8 | where SUBSTRING(icd9_code from 1 for 1) != 'V' 9 | group by icd9_code 10 | order by subjects_qty 11 | desc limit 20) as icd9_subject_list) 12 | group by hadm_id; 13 | 14 | CREATE INDEX W266_DIAGNOSES_TOP_ICD9_CODES_index 15 | ON W266_DIAGNOSES_TOP_ICD9_CODES (HADM_ID) ; -------------------------------------------------------------------------------- /pre_processing/psql_files/query03_icd9_codes_by_admission.sql: -------------------------------------------------------------------------------- 1 | --select hadm_id, max(subject_id), string_agg(icd9_code, ',') from diagnoses_icd where hadm_id = '145834' group by hadm_id 2 | 3 | -- aggregates icd9-codes in one row 4 | -- generates diagnoses_icd_codes.csv 5 | 6 | select hadm_id, max(subject_id) subject_id, string_agg(icd9_code, ' ') icd9_codes 7 | from diagnoses_icd 8 | where icd9_code IN ( select icd9_code from (select icd9_code, COUNT(DISTINCT SUBJECT_ID) subjects_qty 9 | from diagnoses_icd 10 | where SUBSTRING(icd9_code from 1 for 1) != 'V' 11 | group by icd9_code 12 | order by subjects_qty 13 | desc limit 20) as icd9_subject_list) 14 | group by hadm_id; -------------------------------------------------------------------------------- /pre_processing/psql_files/query04_filtering_discharge_summary_notes.sql: -------------------------------------------------------------------------------- 1 | select noteevents.HADM_ID, dianoses_top20_icd.SUBJECT_ID, noteevents.CHARTDATE, noteevents.TEXT, dianoses_top20_icd.ICD9_CODES 2 | from noteevents 3 | JOIN 4 | (select HADM_ID, max(SUBJECT_ID) SUBJECT_ID, string_agg(ICD9_CODE, ' ') ICD9_CODES 5 | from diagnoses_icd 6 | where ICD9_CODE IN ( select ICD9_CODE from (select ICD9_CODE, COUNT(DISTINCT SUBJECT_ID) subjects_qty 7 | from diagnoses_icd 8 | where SUBSTRING(ICD9_CODE from 1 for 1) != 'V' 9 | group by ICD9_CODE order by subjects_qty 10 | desc limit 20) as icd9_subject_list) 11 | group by HADM_ID ) as dianoses_top20_icd 12 | ON (noteevents.HADM_ID = dianoses_top20_icd.HADM_ID) 13 | where noteevents.category = 'Discharge summary' and noteevents.DESCRIPTION = 'Report'; 14 | 15 | 16 | -------------------------------------------------------------------------------- /pre_processing/psql_files/query05_discharge_notes_icd9_create_table.sql: -------------------------------------------------------------------------------- 1 
| DROP TABLE IF EXISTS W266_DISCHARGE_NOTE_ICD9_CODES; 2 | CREATE TABLE W266_DISCHARGE_NOTE_ICD9_CODES AS 3 | 4 | select noteevents.HADM_ID, dianoses_top20_icd.SUBJECT_ID, noteevents.CHARTDATE, noteevents.TEXT, dianoses_top20_icd.ICD9_CODES 5 | from noteevents 6 | JOIN 7 | (select HADM_ID, max(SUBJECT_ID) SUBJECT_ID, string_agg(ICD9_CODE, ' ') ICD9_CODES 8 | from diagnoses_icd 9 | where ICD9_CODE IN ( select ICD9_CODE from (select ICD9_CODE, COUNT(DISTINCT SUBJECT_ID) subjects_qty 10 | from diagnoses_icd 11 | where SUBSTRING(ICD9_CODE from 1 for 1) != 'V' 12 | group by ICD9_CODE order by subjects_qty 13 | desc limit 20) as icd9_subject_list) 14 | group by HADM_ID ) as dianoses_top20_icd 15 | ON (noteevents.HADM_ID = dianoses_top20_icd.HADM_ID) 16 | where noteevents.category = 'Discharge summary' and noteevents.DESCRIPTION = 'Report'; 17 | 18 | 19 | CREATE INDEX W266_DISCHARGE_NOTE_ICD9_CODES_index 20 | ON W266_DISCHARGE_NOTE_ICD9_CODES(HADM_ID) ; 21 | 22 | 23 | 24 | -------------------------------------------------------------------------------- /pre_processing/psql_files/query06_export_w266_table.sql: -------------------------------------------------------------------------------- 1 | select HADM_ID, SUBJECT_ID, CHARTDATE, regexp_replace(TEXT, E'[\\n\\r]+', ' ', 'g' ), ICD9_CODES 2 | from W266_DISCHARGE_NOTE_ICD9_CODES; 3 | -------------------------------------------------------------------------------- /pre_processing/psql_files/query_all_discharge_notes: -------------------------------------------------------------------------------- 1 | select HADM_ID, SUBJECT_ID , CHARTDATE , DESCRIPTION, regexp_replace(TEXT, E'[\\n\\r]+', ' ', 'g' ) 2 | from noteevents 3 | where category = 'Discharge summary' 4 | -------------------------------------------------------------------------------- /pre_processing/psql_files/query_icd9_codes: -------------------------------------------------------------------------------- 1 | select HADM_ID, max(SUBJECT_ID) SUBJECT_ID, string_agg(ICD9_CODE, ' ') ICD9_CODES 2 | from diagnoses_icd 3 | where SUBSTRING(ICD9_CODE from 1 for 1) != 'V' and SUBSTRING(ICD9_CODE from 1 for 1) != 'E' 4 | group by HADM_ID 5 | -------------------------------------------------------------------------------- /pre_processing/psql_files/top_icd9_codes.txt: -------------------------------------------------------------------------------- 1 | Top 20 labels based on the number of patients with that label 2 | (not considering ICD9 codes that start with "V" or "E", because they refer to information that is not a diagnosis) 3 | 4 | 5 | icd9_code | subjects_qty 6 | -----------+-------- 7 | 4019 | 17613 8 | 41401 | 10775 9 | 42731 | 10271 10 | 4280 | 9843 11 | 5849 | 7687 12 | 2724 | 7465 13 | 25000 | 7370 14 | 51881 | 6719 15 | 5990 | 5779 16 | 2720 | 5335 17 | 53081 | 5272 18 | 2859 | 4993 19 | 486 | 4423 20 | 2851 | 4241 21 | 2762 | 4177 22 | 2449 | 3819 23 | 496 | 3592 24 | 99592 | 3560 25 | 0389 | 3433 26 | 5070 | 3396 27 | (20 rows) -------------------------------------------------------------------------------- /w266FinalReport_ICD_9_Classification.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/zliendo/AI_MedicalNotes/87e1e8eb574b30c801b951e87b5edccc10d2a231/w266FinalReport_ICD_9_Classification.pdf --------------------------------------------------------------------------------