├── LICENSE
├── process_mimic.py
├── README.md
├── testDoctorAI.py
└── doctorAI.py

/LICENSE:
--------------------------------------------------------------------------------
Copyright (c) 2016, mp2893
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this
  list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright notice,
  this list of conditions and the following disclaimer in the documentation
  and/or other materials provided with the distribution.

* Neither the name of Doctor AI nor the names of its
  contributors may be used to endorse or promote products derived from
  this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
--------------------------------------------------------------------------------
/process_mimic.py:
--------------------------------------------------------------------------------
# This script processes the MIMIC-III dataset and builds longitudinal diagnosis records for patients with at least two visits.
# The output data are cPickled, and suitable for training Doctor AI or RETAIN.
# Written by Edward Choi (mp2893@gatech.edu)
# Usage: Put this script in the folder where the MIMIC-III CSV files are located. Then execute the command below.
# python process_mimic.py ADMISSIONS.csv DIAGNOSES_ICD.csv <output file>

# Output files
# .pids: List of unique Patient IDs. Used for intermediate processing
# .dates: List of List of Python datetime objects. The outer List is for each patient. The inner List is for each visit made by each patient
# .seqs: List of List of List of integer diagnosis codes. The outer List is for each patient. The middle List contains visits made by each patient. The inner List contains the integer diagnosis codes that occurred in each visit
# .types: Python dictionary that maps string diagnosis codes to integer diagnosis codes.

import sys
import cPickle as pickle
from datetime import datetime

# Insert the decimal point into a raw ICD9 code (after 4 characters for E-codes, after 3 otherwise).
def convert_to_icd9(dxStr):
    if dxStr.startswith('E'):
        if len(dxStr) > 4: return dxStr[:4] + '.' + dxStr[4:]
        else: return dxStr
    else:
        if len(dxStr) > 3: return dxStr[:3] + '.' + dxStr[3:]
        else: return dxStr

# Truncate a raw ICD9 code to its category (4 characters for E-codes, 3 otherwise).
def convert_to_3digit_icd9(dxStr):
    if dxStr.startswith('E'):
        if len(dxStr) > 4: return dxStr[:4]
        else: return dxStr
    else:
        if len(dxStr) > 3: return dxStr[:3]
        else: return dxStr

if __name__ == '__main__':
    admissionFile = sys.argv[1]
    diagnosisFile = sys.argv[2]
    outFile = sys.argv[3]

    print 'Building pid-admission mapping, admission-date mapping'
    pidAdmMap = {}
    admDateMap = {}
    infd = open(admissionFile, 'r')
    infd.readline()  # skip the CSV header
    for line in infd:
        tokens = line.strip().split(',')
        pid = int(tokens[1])
        admId = int(tokens[2])
        admTime = datetime.strptime(tokens[3], '%Y-%m-%d %H:%M:%S')
        admDateMap[admId] = admTime
        if pid in pidAdmMap: pidAdmMap[pid].append(admId)
        else: pidAdmMap[pid] = [admId]
    infd.close()

    print 'Building admission-dxList mapping'
    admDxMap = {}
    infd = open(diagnosisFile, 'r')
    infd.readline()  # skip the CSV header
    for line in infd:
        tokens = line.strip().split(',')
        admId = int(tokens[2])
        # The line below keeps the entire ICD9 code (tokens[4][1:-1] strips the surrounding quotes).
        # To use 3-digit ICD9 codes instead, comment it out and uncomment the line after it.
        dxStr = 'D_' + convert_to_icd9(tokens[4][1:-1])
        #dxStr = 'D_' + convert_to_3digit_icd9(tokens[4][1:-1])
        if admId in admDxMap: admDxMap[admId].append(dxStr)
        else: admDxMap[admId] = [dxStr]
    infd.close()

    print 'Building pid-sortedVisits mapping'
    pidSeqMap = {}
    for pid, admIdList in pidAdmMap.iteritems():
        if len(admIdList) < 2: continue  # keep only patients with at least two visits
        sortedList = sorted([(admDateMap[admId], admDxMap[admId]) for admId in admIdList])
        pidSeqMap[pid] = sortedList

    print 'Building pids, dates, strSeqs'
    pids = []
    dates = []
    seqs = []
    for pid, visits in pidSeqMap.iteritems():
        pids.append(pid)
        seq = []
        date = []
        for visit in visits:
            date.append(visit[0])
            seq.append(visit[1])
        dates.append(date)
        seqs.append(seq)

    print 'Converting strSeqs to intSeqs, and making types'
    types = {}
    newSeqs = []
    for patient in seqs:
        newPatient = []
        for visit in patient:
            newVisit = []
            for code in visit:
                if code in types:
                    newVisit.append(types[code])
                else:
                    types[code] = len(types)
                    newVisit.append(types[code])
            newPatient.append(newVisit)
        newSeqs.append(newPatient)

    pickle.dump(pids, open(outFile+'.pids', 'wb'), -1)
    pickle.dump(dates, open(outFile+'.dates', 'wb'), -1)
    pickle.dump(newSeqs, open(outFile+'.seqs', 'wb'), -1)
    pickle.dump(types, open(outFile+'.types', 'wb'), -1)

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
Doctor AI
=========================================

Doctor AI is an automatic diagnosis machine that predicts the medical codes that will occur in the next visit, while also predicting the time duration until the next visit.

#### Relevant Publications

Doctor AI implements an algorithm introduced in the following:

Doctor AI: Predicting Clinical Events via Recurrent Neural Networks
Edward Choi, Mohammad Taha Bahadori, Andy Schuetz, Walter F. Stewart, Jimeng Sun
arXiv preprint arXiv:1511.05942

Medical Concept Representation Learning from Electronic Health Records and its Application on Heart Failure Prediction
Edward Choi, Andy Schuetz, Walter F. Stewart, Jimeng Sun
arXiv preprint arXiv:1602.03686

#### Running Doctor AI

**STEP 1: Installation**

1. Install [Python](https://www.python.org/) and [Theano](http://deeplearning.net/software/theano/index.html). We use Python 2.7 and Theano 0.7. Theano can be easily installed on Ubuntu as suggested [here](http://deeplearning.net/software/theano/install_ubuntu.html#install-ubuntu).

2. If you plan to use GPU computation, install [CUDA](https://developer.nvidia.com/cuda-downloads).

3. Download/clone the Doctor AI code.

**STEP 2: Preparing training data**

0. You can use "process_mimic.py" to process the MIMIC-III dataset and generate a suitable training dataset for Doctor AI. Place the script in the same location as the MIMIC-III CSV files, and run the script. Instructions are described inside the script. However, I recommend reading the following steps to understand the structure of the training data and learn how to prepare your own dataset.

1. Doctor AI's training dataset needs to be a Python pickled list of list of list. From outermost to innermost, the lists correspond to patients, visits, and medical codes (e.g. diagnosis codes, medication codes, procedure codes, etc.).
First, medical codes need to be converted to integers. Then a single visit can be seen as a list of integers, and a patient can be seen as a list of visits.
For example, [5,8,15] means the patient was assigned codes 5, 8, and 15 at a certain visit.
If a patient made two visits [1,2,3] and [4,5,6,7], they can be converted to a list of list [[1,2,3], [4,5,6,7]].
Multiple patients can be represented as [[[1,2,3], [4,5,6,7]], [[2,4], [8,3,1], [3]]], which means there are two patients where the first patient made two visits and the second patient made three visits.
This list of list of list needs to be pickled using cPickle. We will refer to this file as the "visit file".

2. The total number of unique medical codes is required to run Doctor AI.
For example, if the dataset is using 14,000 diagnosis codes and 11,000 procedure codes, the total number is 25,000.

3. The label dataset (let us call this the "label file") needs to have the same format as the "visit file".
The important thing is that the time steps of the "label file" and the "visit file" need to match. DO NOT train Doctor AI with labels that are one time step ahead of the visits. It is tempting, since Doctor AI predicts the labels of the next visit, but this shift is taken care of internally.
You can use the "visit file" as the "label file" if you want Doctor AI to predict the exact codes.
Or you can use grouped codes as the "label file" if you are okay with reasonable predictions and want to save time.
For example, ICD9 diagnosis codes can be grouped into 283 categories by using [CCS](https://www.hcup-us.ahrq.gov/toolssoftware/ccs/ccs.jsp) groupers.
We STRONGLY recommend you do this, because the number of medical codes can be as high as tens of thousands,
which can cause not only low predictive performance but also memory issues. (High-end GPUs typically have only 12GB of VRAM.)

4. As in step 2, you will need to remember the total number of unique codes in the "label file".
If you are using the "visit file" as the "label file", then the number of unique codes will be the same, of course.
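
As a rough sketch of steps 1 through 4, the snippet below converts string codes into integers, counts the unique codes, and pickles the result. The function name `encode_patients`, the sample codes, and the output file name are illustrative assumptions and not part of Doctor AI. It uses the standard `pickle` module with protocol 2, on the assumption that the file will later be read by Python 2's cPickle.

```python
import os
import pickle
import tempfile


def encode_patients(raw_patients):
    """Map string codes to consecutive integers, keeping the patient/visit nesting."""
    code_to_int = {}
    encoded = []
    for visits in raw_patients:
        encoded_visits = []
        for visit in visits:
            encoded_visit = []
            for code in visit:
                # Assign the next free integer the first time a code is seen.
                if code not in code_to_int:
                    code_to_int[code] = len(code_to_int)
                encoded_visit.append(code_to_int[code])
            encoded_visits.append(encoded_visit)
        encoded.append(encoded_visits)
    return encoded, code_to_int


# Hypothetical input: two patients, with two and three visits respectively.
raw_patients = [
    [["D_401.9", "D_250.00"], ["D_428.0"]],
    [["D_250.00"], ["D_401.9", "D_585.9"], ["D_428.0"]],
]

visit_seqs, code_to_int = encode_patients(raw_patients)
num_codes = len(code_to_int)  # number of unique codes in the "visit file"

# Reusing the visit sequences as labels asks the model to predict exact codes.
label_seqs = visit_seqs

# Illustrative output location; protocol 2 stays readable from Python 2.
out_dir = tempfile.mkdtemp()
visit_path = os.path.join(out_dir, "my_visit_sequences.train")
with open(visit_path, "wb") as f:
    pickle.dump(visit_seqs, f, protocol=2)

print(visit_seqs)
print(num_codes)
```

Here `num_codes` is the value to supply as the number of unique codes in the "visit file". Because the labels reuse the same sequences, the count for the "label file" is identical.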

5. The "visit file" and the "label file" each need to come in 3 sets: training set, validation set, and test set.
The file extensions must be ".train", ".valid", and ".test" respectively.
For example, if you want to use a file named "my_visit_sequences" as the "visit file", then Doctor AI will try to load "my_visit_sequences.train", "my_visit_sequences.valid", and "my_visit_sequences.test".
This is also true for the "label file".

6. You can use the time duration between visits as an additional source of information. Let us call this the "time file".
The "time file" needs to be prepared as a Python pickled list of list. The outer list corresponds to patients, and each inner list holds the durations between that patient's visits.
For example, given a "visit file" [[[1,2,3], [4,5,6,7]], [[2,4], [8,3,1], [3]]], its corresponding "time file" should look like [[0, 15], [0, 45, 23]].
Of course, the numbers are fake, but the important thing is that the duration for the first visit needs to be zero.
Use the "--time\_file" option to use the "time file".
Remember that the ".train", ".valid", ".test" rule also applies to the "time file".

**Additional: Predicting time duration until next visit**
In addition to predicting the codes of the next visit, you can make Doctor AI predict the time duration until the next visit.
Use the "--predict\_time" option to do this. Predicting time requires the "time file".
Time prediction also comes with several hyperparameters such as "--tradeoff", "--L2\_time", and "--use\_log\_time".
Refer to "--help" for more detailed information.

**Additional: Using your own medical code representations**
Doctor AI internally learns vector representations of medical codes while training. These vectors are initialized with random values.
You can, however, also provide medical code representations, if you have them. (They can be easily trained by using Skip-gram like algorithms.)
If you want to provide the medical code representations, they have to be a list of list (basically a matrix) of N rows and M columns, where N is the number of unique codes in your "visit file" and M is the size of the code representations.
Specify the path to your code representation file using "--embed\_file".
For more details regarding the training of medical code representations and using them for predictive tasks, please refer to the second paper in the "Relevant Publications" section.
Additionally, even if you provide your own medical code representations, you can re-train (a.k.a. fine-tune) them as you train Doctor AI.
Use the "--embed\_finetune" option to do this. If you are not providing your own medical code representations, Doctor AI will use randomly initialized ones, which require this fine-tuning process. Since fine-tuning is the default, you do not need to worry about this.

**STEP 3: Running Doctor AI**

1. The minimum input you need to run Doctor AI is the "visit file", the number of unique medical codes in the "visit file",
the "label file", the number of unique medical codes in the "label file", and the output path. The output path is where the learned weights will be saved.
`python doctorAI.py <# codes in the visit file>