├── requirements.txt ├── LICENSE ├── mimic-iii-los.rmd ├── .gitignore ├── 07_project_work.ipynb ├── README.md ├── 03_summary_statistics.ipynb ├── 01_explore_patients.ipynb ├── 05_mortality_prediction.ipynb ├── 06_aki_project.ipynb ├── 02_severity_of_illness.ipynb └── mimic-iii-tutorial.ipynb /requirements.txt: -------------------------------------------------------------------------------- 1 | numpy==1.18.1 2 | pandas==0.25.1 3 | scikit-learn==0.22.1 4 | matplotlib==3.1.0 5 | tableone==0.6.6 -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2020 MIT Laboratory for Computational Physiology 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /mimic-iii-los.rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Length of stay in the ICU" 3 | author: "tom pollard" 4 | description: "Length of stay in the ICU for patients in MIMIC-III" 5 | output: pdf_document 6 | date: "10/10/2017" 7 | --- 8 | 9 | ```{r setup, include = FALSE} 10 | knitr::opts_chunk$set(echo = TRUE) 11 | # install.packages("ggplot2") 12 | # install.packages("bigrquery") 13 | library("ggplot2") 14 | library("bigrquery") 15 | ``` 16 | 17 | 18 | ```{r dbconnect, include=FALSE} 19 | # Load configuration settings 20 | project_id <- "hack-aotearoa" 21 | options(httr_oauth_cache=TRUE) 22 | run_query <- function(query){ 23 | data <- query_exec(query, project=project_id, use_legacy_sql = FALSE) 24 | return(data) 25 | } 26 | ``` 27 | 28 | 29 | ```{r load_data, include=FALSE} 30 | sql_query <- "SELECT i.subject_id, i.hadm_id, i.los 31 | FROM `physionet-data.mimiciii_demo.icustays` i;" 32 | data <- run_query(sql_query) 33 | head(data) 34 | ``` 35 | 36 | This document shows how RMarkdown can be used to create a reproducible analysis using MIMIC-III (version 1.4). Let's calculate the median length of stay in the ICU and then include this value in our document. 37 | 38 | ```{r calculate_mean_los, include=FALSE} 39 | avg_los <- median(data$los, na.rm=TRUE) 40 | rounded_avg_los <- round(avg_los, digits = 2) 41 | ``` 42 | 43 | So the median length of stay in the ICU is `r avg_los` days. Rounded to two decimal places, this is `r rounded_avg_los` days. 
We can plot the distribution of length of stay using the qplot function: 44 | 45 | 46 | ```{r plot_los, echo=FALSE, include=TRUE, warning = FALSE} 47 | qplot(data$los, geom="histogram",xlim=c(0,25), binwidth = 1, 48 | xlab = "Length of stay in the ICU, days.",fill=I("#FF9999"), col=I("white")) 49 | ``` -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | *.egg-info/ 24 | .installed.cfg 25 | *.egg 26 | MANIFEST 27 | 28 | # PyInstaller 29 | # Usually these files are written by a python script from a template 30 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 
31 | *.manifest 32 | *.spec 33 | 34 | # Installer logs 35 | pip-log.txt 36 | pip-delete-this-directory.txt 37 | 38 | # Unit test / coverage reports 39 | htmlcov/ 40 | .tox/ 41 | .coverage 42 | .coverage.* 43 | .cache 44 | nosetests.xml 45 | coverage.xml 46 | *.cover 47 | .hypothesis/ 48 | .pytest_cache/ 49 | 50 | # Translations 51 | *.mo 52 | *.pot 53 | 54 | # Django stuff: 55 | *.log 56 | local_settings.py 57 | db.sqlite3 58 | 59 | # Flask stuff: 60 | instance/ 61 | .webassets-cache 62 | 63 | # Scrapy stuff: 64 | .scrapy 65 | 66 | # Sphinx documentation 67 | docs/_build/ 68 | 69 | # PyBuilder 70 | target/ 71 | 72 | # Jupyter Notebook 73 | .ipynb_checkpoints 74 | 75 | # pyenv 76 | .python-version 77 | 78 | # celery beat schedule file 79 | celerybeat-schedule 80 | 81 | # SageMath parsed files 82 | *.sage.py 83 | 84 | # Environments 85 | .env 86 | .venv 87 | env/ 88 | venv/ 89 | ENV/ 90 | env.bak/ 91 | venv.bak/ 92 | 93 | # Spyder project settings 94 | .spyderproject 95 | .spyproject 96 | 97 | # Rope project settings 98 | .ropeproject 99 | 100 | # mkdocs documentation 101 | /site 102 | 103 | # mypy 104 | .mypy_cache/ 105 | 106 | # MacOS file 107 | .DS_Store 108 | -------------------------------------------------------------------------------- /07_project_work.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "colab_type": "text", 7 | "id": "view-in-github" 8 | }, 9 | "source": [ 10 | "\"Open" 11 | ] 12 | }, 13 | { 14 | "cell_type": "markdown", 15 | "metadata": { 16 | "colab_type": "text", 17 | "id": "GsWKSUPhN3es" 18 | }, 19 | "source": [ 20 | "# eICU Collaborative Research Database\n", 21 | "\n", 22 | "# Notebook 7: Project work\n", 23 | "\n", 24 | "This notebook is intended as a starting point for future projects." 
25 | ] 26 | }, 27 | { 28 | "cell_type": "markdown", 29 | "metadata": { 30 | "colab_type": "text", 31 | "id": "Gbgg16I9OIKI" 32 | }, 33 | "source": [ 34 | "## Load libraries and connect to the database" 35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": 0, 40 | "metadata": { 41 | "colab": {}, 42 | "colab_type": "code", 43 | "id": "fUXc8SjTOFMJ" 44 | }, 45 | "outputs": [], 46 | "source": [ 47 | "# Import libraries\n", 48 | "import numpy as np\n", 49 | "import os\n", 50 | "import pandas as pd\n", 51 | "import matplotlib.pyplot as plt\n", 52 | "import matplotlib.patches as patches\n", 53 | "import matplotlib.path as path\n", 54 | "\n", 55 | "# Make pandas dataframes prettier\n", 56 | "from IPython.display import display, HTML\n", 57 | "\n", 58 | "# Access data using Google BigQuery.\n", 59 | "from google.colab import auth\n", 60 | "from google.cloud import bigquery" 61 | ] 62 | }, 63 | { 64 | "cell_type": "code", 65 | "execution_count": 0, 66 | "metadata": { 67 | "colab": {}, 68 | "colab_type": "code", 69 | "id": "tACn8gYaOJqc" 70 | }, 71 | "outputs": [], 72 | "source": [ 73 | "# authenticate\n", 74 | "auth.authenticate_user()" 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": 0, 80 | "metadata": { 81 | "colab": {}, 82 | "colab_type": "code", 83 | "id": "xGzQLXJAOK-c" 84 | }, 85 | "outputs": [], 86 | "source": [ 87 | "# Set up environment variables\n", 88 | "project_id='hack-aotearoa'\n", 89 | "os.environ[\"GOOGLE_CLOUD_PROJECT\"]=project_id" 90 | ] 91 | }, 92 | { 93 | "cell_type": "markdown", 94 | "metadata": { 95 | "colab_type": "text", 96 | "id": "2lQ03IAjORL8" 97 | }, 98 | "source": [ 99 | "## Choose a project, or try your own!\n", 100 | "\n", 101 | "- Congestive heart failure is a common illness for ICU patients, but the severity of the illness can vary substantially. Are there distinct subgroups among patients admitted with congestive heart failure? 
For example, are patients with preserved ejection fraction substantially different than those without?\n", 102 | "- Sepsis is a life-threatening condition usually associated with infection - but little research investigates how septic patients vary based on the source of the infection. As APACHE diagnoses are organ specific (e.g. SEPSISPULM, SEPSISGI, SEPSISUTI), can we find any substantial differences among septic patients based upon the initial location of the infection?\n", 103 | "- Lab measurements take up to 6 hours to measure and can be costly. Can we predict a future lab measurement based upon previous measures and simultaneous non-invasive measurements?" 104 | ] 105 | }, 106 | { 107 | "cell_type": "code", 108 | "execution_count": 0, 109 | "metadata": { 110 | "colab": {}, 111 | "colab_type": "code", 112 | "id": "K7znlFcTONlT" 113 | }, 114 | "outputs": [], 115 | "source": [] 116 | } 117 | ], 118 | "metadata": { 119 | "colab": { 120 | "collapsed_sections": [], 121 | "include_colab_link": true, 122 | "name": "07-project-work", 123 | "provenance": [], 124 | "version": "0.3.2" 125 | }, 126 | "kernelspec": { 127 | "display_name": "Python 3", 128 | "language": "python", 129 | "name": "python3" 130 | }, 131 | "language_info": { 132 | "codemirror_mode": { 133 | "name": "ipython", 134 | "version": 3 135 | }, 136 | "file_extension": ".py", 137 | "mimetype": "text/x-python", 138 | "name": "python", 139 | "nbconvert_exporter": "python", 140 | "pygments_lexer": "ipython3", 141 | "version": "3.7.4" 142 | } 143 | }, 144 | "nbformat": 4, 145 | "nbformat_minor": 1 146 | } 147 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Hack Aotearoa 2023 (March 17th to March 19th 2023) 2 | 3 | This repository contains resources for [Hack Aotearoa 2023](http://hackaotearoa.nz) 4 | 5 | ## Contents 6 | 7 | 1. Getting started 8 | 2. Documentation 9 | 3. 
Databases on BigQuery 10 | 4. Analysing data with Google Colab 11 | 5. Python notebooks that we prepared earlier 12 | 6. An example in R 13 | 7. Sample projects 14 | 8. Miscellaneous resources! 15 | 16 | 17 | ## 1. Getting started 18 | 19 | The datasets are hosted on Google Cloud, which requires a Gmail account to manage permissions. 20 | 21 | 1. Create a [Gmail account](https://www.google.com/gmail/about/), if you don't already have one. It will be used to manage your access to the resources. 22 | 2. Complete the form at: https://uoaevents.eventsair.com/hack-aotearoa-2023/dua2023/Site/Register. 23 | 24 | ## 2. Documentation 25 | 26 | - MIMIC Clinical Database: https://mimic.physionet.org/ 27 | - eICU Collaborative Research Database: https://eicu-crd.mit.edu/ 28 | - MIMIC Code Repository: https://github.com/MIT-LCP/mimic-code (code for reuse!) 29 | 30 | ## 3. Databases on BigQuery 31 | 32 | BigQuery is a database system that makes it easy to explore data with Structured Query Language ("SQL"). There are several datasets on BigQuery available for you to explore, including `eicu_crd` (the eICU Collaborative Research Database) and `mimiciii_clinical` (the MIMIC-III Clinical Database). 33 | 34 | You will also find "derived" databases, which include tables derived from the original data using the code in the [eICU](https://github.com/MIT-LCP/eicu-code) and [MIMIC](https://github.com/MIT-LCP/mimic-code) code repositories. These are helpful if you are looking for something like a sepsis cohort or first-day vital signs. 35 | 36 | 1. [Open BigQuery](https://console.cloud.google.com/bigquery?project=physionet-data). 37 | 2. At the top of the console, select `hack-aotearoa` as the project. This indicates the account used for billing. 38 | 39 | 3. You should be able to preview the data available on these projects using the graphical interface. 40 | 4. Now try running a query.
For example, try counting the number of rows in the demo eICU patient table: 41 | 42 | ```SQL 43 | SELECT count(*) 44 | FROM `physionet-data.eicu_crd_demo.patient` 45 | ``` 46 | 47 | ## 4. Analysing data with Google Colab 48 | 49 | Python is an increasingly popular programming language for analysing data. We will explore the data using Python notebooks, which allow code and text to be combined into executable documents. First, try opening a blank document using the link below: 50 | 51 | - [https://colab.research.google.com/](https://colab.research.google.com/) 52 | 53 | ## 5. Python notebooks that we prepared earlier 54 | 55 | Several tutorials are provided below. Requirements for these notebooks are: (1) you have a Gmail account and (2) your Gmail address has been added to the appropriate Google Group by the workshop hosts. 56 | 57 | Notebook 1 (eICU): Exploring the patient table. Open In Colab 58 | 59 | Notebook 2 (eICU): Severity of illness. Open In Colab 60 | 61 | Notebook 3 (eICU): Summary statistics. Open In Colab 62 | 63 | Notebook 4 (eICU): Timeseries. Open In Colab 64 | 65 | Notebook 5 (eICU): Mortality prediction. Open In Colab 66 | 67 | Notebook 6 (eICU): Acute kidney injury. Open In Colab 68 | 69 | Notebook 7 (eICU): Project work. Open In Colab 70 | 71 | Notebook 8 (MIMIC): MIMIC-III tutorial. Open In Colab 72 | 73 | Notebook 9 (MIMIC): Weekend effect on mortality. Open In Colab 74 | 75 | Notebook 10 (MIMIC): Mortality in septic patients. Open In Colab 76 | 77 | ## 6. An example in R 78 | 79 | If you prefer working in R, then you can connect to Google Cloud from your code in a similar way: 80 | 81 | - https://github.com/MIT-LCP/hack-aotearoa/blob/master/mimic-iii-los.rmd 82 | 83 | ## 7. Sample projects 84 | 85 | These papers and repositories may be helpful for reference. They are **not** perfect! Code may be untidy, poorly documented, buggy, outdated etc. Think about how they can be improved, adapted, etc. 
For example, you could: 86 | 87 | - replicate the study on a different dataset (e.g. MIMIC vs eICU) 88 | - improve the methodology 89 | 90 | 1. The association between mortality among patients admitted to the intensive care unit on a weekend compared to a weekday 91 | 92 | - Python Notebook: https://github.com/MIT-LCP/bhi-bsn-challenge/blob/master/challenge-demo.ipynb 93 | - R Markdown Notebook: https://github.com/MIT-LCP/bhi-bsn-challenge/blob/master/rmarkdown_example_notebook.Rmd 94 | - More reading: https://physionet.org/content/bhi-2018-challenge/1.0/ 95 | 96 | 2. Predicting in-hospital mortality of intensive care patients using decision trees. 97 | 98 | - Python Notebook: https://github.com/MIT-LCP/2019_aarhus_critical_data/blob/master/tutorials/eicu/05-prediction.ipynb 99 | 100 | 3. Comparison of methods for identifying patients with sepsis. 101 | 102 | - Code: https://github.com/alistairewj/sepsis3-mimic 103 | - Paper: https://www.ncbi.nlm.nih.gov/pubmed/29303796 104 | 105 | 4. Evaluating the reproducibility of mortality prediction studies that use the MIMIC-III database. 106 | 107 | - Code: https://github.com/alistairewj/reproducibility-mimic/blob/master/notebooks/reproducibility.ipynb 108 | - Paper: http://proceedings.mlr.press/v68/johnson17a.html 109 | 110 | 5. Optimising treatment of sepsis with reinforcement learning 111 | 112 | - Code: https://github.com/matthieukomorowski/AI_Clinician 113 | - Paper: https://www.nature.com/articles/s41591-018-0213-5 114 | 115 | 6. Association of hypokalemia with an increased risk for medically treated arrhythmia 116 | 117 | - Code: https://github.com/nus-mornin-lab/PotassiumAA 118 | - Paper: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0217432 119 | 120 | ## 8. 
Resources 121 | 122 | - Tutorial on decision trees for mortality prediction: https://carpentries-incubator.github.io/machine-learning-trees-python/ 123 | - Tutorial on responsible machine learning: https://carpentries-incubator.github.io/machine-learning-responsible-python/ 124 | 125 | -------------------------------------------------------------------------------- /03_summary_statistics.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "colab_type": "text", 7 | "id": "view-in-github" 8 | }, 9 | "source": [ 10 | "\"Open" 11 | ] 12 | }, 13 | { 14 | "cell_type": "markdown", 15 | "metadata": { 16 | "colab_type": "text", 17 | "id": "1G_TVh1ybQkl" 18 | }, 19 | "source": [ 20 | "# eICU Collaborative Research Database\n", 21 | "\n", 22 | "# Notebook 3: Summary statistics\n", 23 | "\n", 24 | "This notebook shows how summary statistics can be computed for a patient cohort using the `tableone` package. 
Usage instructions for tableone are at: https://pypi.org/project/tableone/\n" 25 | ] 26 | }, 27 | { 28 | "cell_type": "markdown", 29 | "metadata": { 30 | "colab_type": "text", 31 | "id": "L9XF77F2bnee" 32 | }, 33 | "source": [ 34 | "## Load libraries and connect to the database" 35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": null, 40 | "metadata": { 41 | "colab": {}, 42 | "colab_type": "code", 43 | "id": "wXiSE558bn_w" 44 | }, 45 | "outputs": [], 46 | "source": [ 47 | "# Import libraries\n", 48 | "import numpy as np\n", 49 | "import os\n", 50 | "import pandas as pd\n", 51 | "import matplotlib.pyplot as plt\n", 52 | "\n", 53 | "# Make pandas dataframes prettier\n", 54 | "from IPython.display import display, HTML\n", 55 | "\n", 56 | "# Access data using Google BigQuery.\n", 57 | "from google.colab import auth\n", 58 | "from google.cloud import bigquery" 59 | ] 60 | }, 61 | { 62 | "cell_type": "code", 63 | "execution_count": null, 64 | "metadata": { 65 | "colab": {}, 66 | "colab_type": "code", 67 | "id": "pLGnLAy-bsKb" 68 | }, 69 | "outputs": [], 70 | "source": [ 71 | "# authenticate\n", 72 | "auth.authenticate_user()" 73 | ] 74 | }, 75 | { 76 | "cell_type": "code", 77 | "execution_count": null, 78 | "metadata": { 79 | "colab": {}, 80 | "colab_type": "code", 81 | "id": "PUjFDFdobszs" 82 | }, 83 | "outputs": [], 84 | "source": [ 85 | "# Set up environment variables\n", 86 | "project_id='hack-aotearoa'\n", 87 | "os.environ[\"GOOGLE_CLOUD_PROJECT\"]=project_id" 88 | ] 89 | }, 90 | { 91 | "cell_type": "markdown", 92 | "metadata": { 93 | "colab_type": "text", 94 | "id": "iWDUCA5Nb5BK" 95 | }, 96 | "source": [ 97 | "## Install and load the `tableone` package\n", 98 | "\n", 99 | "The tableone package can be used to compute summary statistics for a patient cohort. Unlike the previous packages, it is not installed by default in Colab, so we will need to install it first."
100 | ] 101 | }, 102 | { 103 | "cell_type": "code", 104 | "execution_count": null, 105 | "metadata": { 106 | "colab": {}, 107 | "colab_type": "code", 108 | "id": "F9doCgtscOJd" 109 | }, 110 | "outputs": [], 111 | "source": [ 112 | "!pip install tableone" 113 | ] 114 | }, 115 | { 116 | "cell_type": "code", 117 | "execution_count": null, 118 | "metadata": { 119 | "colab": {}, 120 | "colab_type": "code", 121 | "id": "SDI_Q7W0b4Le" 122 | }, 123 | "outputs": [], 124 | "source": [ 125 | "# Import the tableone class\n", 126 | "from tableone import TableOne" 127 | ] 128 | }, 129 | { 130 | "cell_type": "markdown", 131 | "metadata": { 132 | "colab_type": "text", 133 | "id": "14TU4lcrdD7I" 134 | }, 135 | "source": [ 136 | "## Load the patient cohort\n", 137 | "\n", 138 | "In this example, we will load data from the `patient` table and link it to APACHE data to provide richer summary information." 139 | ] 140 | }, 141 | { 142 | "cell_type": "code", 143 | "execution_count": null, 144 | "metadata": { 145 | "colab": {}, 146 | "colab_type": "code", 147 | "id": "HF5WF5EObwfw" 148 | }, 149 | "outputs": [], 150 | "source": [ 151 | "%%bigquery cohort\n", 152 | "-- The cell magic must be the first line of the cell.\n", 153 | "-- Link the patient and apachepatientresult tables on patientunitstayid using an inner join.\n", 154 | "\n", 155 | "SELECT p.unitadmitsource, p.gender, p.age, p.ethnicity, p.admissionweight, \n", 156 | " p.unittype, p.unitstaytype, a.acutephysiologyscore,\n", 157 | " a.apachescore, a.actualiculos, a.actualhospitalmortality,\n", 158 | " a.unabridgedunitlos, a.unabridgedhosplos\n", 159 | "FROM `physionet-data.eicu_crd_demo.patient` p\n", 160 | "INNER JOIN `physionet-data.eicu_crd_demo.apachepatientresult` a\n", 161 | "ON p.patientunitstayid = a.patientunitstayid\n", 162 | "WHERE apacheversion LIKE 'IVa'" 163 | ] 164 | }, 165 | { 166 | "cell_type": "code", 167 | "execution_count": null, 168 | "metadata": { 169 | "colab": {}, 170 | "colab_type": "code", 171 | "id": "k3hURHFihHNA" 172 | }, 173 | "outputs": [], 174 | 
"source": [ 175 | "cohort.head()" 176 | ] 177 | }, 178 | { 179 | "cell_type": "markdown", 180 | "metadata": { 181 | "colab_type": "text", 182 | "id": "qnG8dVb2iHSn" 183 | }, 184 | "source": [ 185 | "## Calculate summary statistics\n", 186 | "\n", 187 | "Before summarizing the data, we will need to convert the ages to numerical values." 188 | ] 189 | }, 190 | { 191 | "cell_type": "code", 192 | "execution_count": null, 193 | "metadata": { 194 | "colab": {}, 195 | "colab_type": "code", 196 | "id": "oKHpqwAPkx6U" 197 | }, 198 | "outputs": [], 199 | "source": [ 200 | "cohort['agenum'] = pd.to_numeric(cohort['age'], errors='coerce')\n", 201 | "\n", 202 | "# temporary bug fix: deal with data type issue!\n", 203 | "cohort = cohort.astype({\"apachescore\": \"int64\",\n", 204 | " \"actualiculos\": \"int64\"})" 205 | ] 206 | }, 207 | { 208 | "cell_type": "code", 209 | "execution_count": null, 210 | "metadata": { 211 | "colab": {}, 212 | "colab_type": "code", 213 | "id": "FQT-u8EXhXRG" 214 | }, 215 | "outputs": [], 216 | "source": [ 217 | "columns = ['unitadmitsource', 'gender', 'agenum', 'ethnicity',\n", 218 | " 'admissionweight','unittype','unitstaytype',\n", 219 | " 'acutephysiologyscore','apachescore','actualiculos',\n", 220 | " 'unabridgedunitlos','unabridgedhosplos']" 221 | ] 222 | }, 223 | { 224 | "cell_type": "code", 225 | "execution_count": null, 226 | "metadata": { 227 | "colab": {}, 228 | "colab_type": "code", 229 | "id": "3ETr3NCzielL" 230 | }, 231 | "outputs": [], 232 | "source": [ 233 | "TableOne(cohort, columns=columns, labels={'agenum': 'age'}, \n", 234 | " groupby='actualhospitalmortality',\n", 235 | " label_suffix=True, limit=4)" 236 | ] 237 | }, 238 | { 239 | "cell_type": "markdown", 240 | "metadata": { 241 | "colab_type": "text", 242 | "id": "LCBcpJ9bZpDp" 243 | }, 244 | "source": [ 245 | "## Questions\n", 246 | "\n", 247 | "- Are the severity of illness measures higher in the survival or non-survival group?\n", 248 | "- What issues suggest that some of the 
summary statistics might be misleading?\n", 249 | "- How might you address these issues?" 250 | ] 251 | }, 252 | { 253 | "cell_type": "markdown", 254 | "metadata": { 255 | "colab_type": "text", 256 | "id": "2_8z1CIVahWg" 257 | }, 258 | "source": [ 259 | "## Visualizing the data\n", 260 | "\n", 261 | "Plotting the distribution of each variable by group level via histograms, kernel density estimates and boxplots is a crucial component of data analysis pipelines. Visualization is often the only way to detect problematic variables in many real-life scenarios. We'll review a couple of the variables." 262 | ] 263 | }, 264 | { 265 | "cell_type": "code", 266 | "execution_count": null, 267 | "metadata": { 268 | "colab": {}, 269 | "colab_type": "code", 270 | "id": "81yp2bSUigzh" 271 | }, 272 | "outputs": [], 273 | "source": [ 274 | "# Plot distributions to review possible multimodality\n", 275 | "cohort[['acutephysiologyscore','agenum']].dropna().plot.kde(figsize=[12,8])\n", 276 | "plt.legend(['APS Score', 'Age (years)'])\n", 277 | "plt.xlim([-30,250])" 278 | ] 279 | }, 280 | { 281 | "cell_type": "markdown", 282 | "metadata": { 283 | "colab_type": "text", 284 | "id": "kZDUZB5sdhhU" 285 | }, 286 | "source": [ 287 | "## Questions\n", 288 | "\n", 289 | "- Do the plots change your view on how these variables should be reported?" 
290 | ] 291 | } 292 | ], 293 | "metadata": { 294 | "colab": { 295 | "collapsed_sections": [], 296 | "include_colab_link": true, 297 | "name": "03-summary-statistics", 298 | "provenance": [], 299 | "version": "0.3.2" 300 | }, 301 | "kernelspec": { 302 | "display_name": "Python 3", 303 | "language": "python", 304 | "name": "python3" 305 | }, 306 | "language_info": { 307 | "codemirror_mode": { 308 | "name": "ipython", 309 | "version": 3 310 | }, 311 | "file_extension": ".py", 312 | "mimetype": "text/x-python", 313 | "name": "python", 314 | "nbconvert_exporter": "python", 315 | "pygments_lexer": "ipython3", 316 | "version": "3.7.4" 317 | } 318 | }, 319 | "nbformat": 4, 320 | "nbformat_minor": 1 321 | } 322 | -------------------------------------------------------------------------------- /01_explore_patients.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "colab_type": "text", 7 | "id": "view-in-github" 8 | }, 9 | "source": [ 10 | "\"Open" 11 | ] 12 | }, 13 | { 14 | "cell_type": "markdown", 15 | "metadata": { 16 | "colab_type": "text", 17 | "id": "NCI19_Ix7xuI" 18 | }, 19 | "source": [ 20 | "# eICU Collaborative Research Database\n", 21 | "\n", 22 | "# Notebook 1: Exploring the patient table\n", 23 | "\n", 24 | "The aim of this notebook is to get set up with access to a demo version of the [eICU Collaborative Research Database](http://eicu-crd.mit.edu/). The demo is a subset of the full database, limited to ~1000 patients.\n", 25 | "\n", 26 | "We begin by exploring the `patient` table, which contains patient demographics and admission and discharge details for hospital and ICU stays. 
For more detail, see: http://eicu-crd.mit.edu/eicutables/patient/" 27 | ] 28 | }, 29 | { 30 | "cell_type": "markdown", 31 | "metadata": {}, 32 | "source": [ 33 | "## Prerequisites\n", 34 | "\n", 35 | "- If you do not have a Gmail account, please create one at http://www.gmail.com. \n", 36 | "- If you have not yet signed the data use agreement (DUA) sent by the organizers, please do so now to get access to the dataset." 37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "metadata": { 42 | "colab_type": "text", 43 | "id": "l_CmlcBu8Wei" 44 | }, 45 | "source": [ 46 | "## Load libraries and connect to the data\n", 47 | "\n", 48 | "Run the following cells to import some libraries and then connect to the database." 49 | ] 50 | }, 51 | { 52 | "cell_type": "code", 53 | "execution_count": null, 54 | "metadata": { 55 | "colab": {}, 56 | "colab_type": "code", 57 | "id": "3WQsJiAj8B5L" 58 | }, 59 | "outputs": [], 60 | "source": [ 61 | "# Import libraries\n", 62 | "import numpy as np\n", 63 | "import os\n", 64 | "import pandas as pd\n", 65 | "import matplotlib.pyplot as plt\n", 66 | "import matplotlib.patches as patches\n", 67 | "import matplotlib.path as path\n", 68 | "\n", 69 | "# Make pandas dataframes prettier\n", 70 | "from IPython.display import display, HTML\n", 71 | "\n", 72 | "# Access data using Google BigQuery.\n", 73 | "from google.colab import auth\n", 74 | "from google.cloud import bigquery" 75 | ] 76 | }, 77 | { 78 | "cell_type": "markdown", 79 | "metadata": { 80 | "colab_type": "text", 81 | "id": "Ld59KZ0W9E4v" 82 | }, 83 | "source": [ 84 | "Before running any queries, you first need to authenticate yourself by running the following cell. If you are running it for the first time, it will ask you to follow a link, log in using your Gmail account, and accept the data access requests to your profile. Once this is done, it will generate a verification code, which you should paste into the cell below before pressing Enter." 
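85 | ] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "execution_count": null, 90 | "metadata": { 91 | "colab": {}, 92 | "colab_type": "code", 93 | "id": "ABh4hMt288yg" 94 | }, 95 | "outputs": [], 96 | "source": [ 97 | "auth.authenticate_user()" 98 | ] 99 | }, 100 | { 101 | "cell_type": "markdown", 102 | "metadata": { 103 | "colab_type": "text", 104 | "id": "BPoHP2a8_eni" 105 | }, 106 | "source": [ 107 | "We'll also set the project details." 108 | ] 109 | }, 110 | { 111 | "cell_type": "code", 112 | "execution_count": null, 113 | "metadata": { 114 | "colab": {}, 115 | "colab_type": "code", 116 | "id": "P0fdtVMa_di9" 117 | }, 118 | "outputs": [], 119 | "source": [ 120 | "project_id='hack-aotearoa'\n", 121 | "os.environ[\"GOOGLE_CLOUD_PROJECT\"]=project_id" 122 | ] 123 | }, 124 | { 125 | "cell_type": "markdown", 126 | "metadata": { 127 | "colab_type": "text", 128 | "id": "5bHZALFP9VN1" 129 | }, 130 | "source": [ 131 | "# \"Querying\" our database with SQL\n", 132 | "\n", 133 | "Now we can start exploring the data. We'll begin by running a simple query to load all columns of the `patient` table to a Pandas DataFrame. The query is written in SQL, a common language for extracting data from databases. The structure of an SQL query is:\n", 134 | "\n", 135 | "```sql\n", 136 | "SELECT \n", 137 | "FROM \n", 138 | "WHERE \n", 139 | "```\n", 140 | "\n", 141 | "`*` is a wildcard that indicates all columns" 142 | ] 143 | }, 144 | { 145 | "cell_type": "markdown", 146 | "metadata": {}, 147 | "source": [ 148 | "# BigQuery\n", 149 | "\n", 150 | "Our dataset is stored on BigQuery, Google's database engine. We can run our query on the database using some special (\"magic\") [BigQuery syntax](https://googleapis.dev/python/bigquery/latest/magics.html)." 151 | ] 152 | }, 153 | { 154 | "cell_type": "code", 155 | "execution_count": null, 156 | "metadata": { 157 | "colab": {}, 158 | "colab_type": "code", 159 | "id": "RE-UZAPG_rHq" 160 | }, 161 | "outputs": [], 162 | "source": [ 163 | "%%bigquery patient\n", 164 | "\n", 165 | "SELECT *\n", 166 | "FROM `physionet-data.eicu_crd_demo.patient`" 167 | ] 168 | }, 169 | { 170 | "cell_type": "markdown", 171 | "metadata": { 172 | "colab_type": "text", 173 | "id": "YbnkcCZxBkdK" 174 | }, 175 | "source": [ 176 | "We have now assigned the output of our query to a variable called `patient`. Let's use the `head` method to view the first few rows of our data." 177 | ] 178 | }, 179 | { 180 | "cell_type": "code", 181 | "execution_count": null, 182 | "metadata": { 183 | "colab": {}, 184 | "colab_type": "code", 185 | "id": "GZph0FPDASEs" 186 | }, 187 | "outputs": [], 188 | "source": [ 189 | "# view the top few rows of the patient data\n", 190 | "patient.head()" 191 | ] 192 | }, 193 | { 194 | "cell_type": "markdown", 195 | "metadata": { 196 | "colab_type": "text", 197 | "id": "TlxaXLevC_Rz" 198 | }, 199 | "source": [ 200 | "## Questions\n", 201 | "\n", 202 | "- What does `patientunitstayid` represent? (hint, see: http://eicu-crd.mit.edu/eicutables/patient/)\n", 203 | "- What does `patienthealthsystemstayid` represent?\n", 204 | "- What does `uniquepid` represent?" 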
151 | ] 152 | }, 153 | { 154 | "cell_type": "code", 155 | "execution_count": null, 156 | "metadata": { 157 | "colab": {}, 158 | "colab_type": "code", 159 | "id": "RE-UZAPG_rHq" 160 | }, 161 | "outputs": [], 162 | "source": [ 163 | "%%bigquery patient\n", 164 | "\n", 165 | "SELECT *\n", 166 | "FROM `physionet-data.eicu_crd_demo.patient`" 167 | ] 168 | }, 169 | { 170 | "cell_type": "markdown", 171 | "metadata": { 172 | "colab_type": "text", 173 | "id": "YbnkcCZxBkdK" 174 | }, 175 | "source": [ 176 | "We have now assigned the output to our query to a variable called `patient`. Let's use the `head` method to view the first few rows of our data." 177 | ] 178 | }, 179 | { 180 | "cell_type": "code", 181 | "execution_count": null, 182 | "metadata": { 183 | "colab": {}, 184 | "colab_type": "code", 185 | "id": "GZph0FPDASEs" 186 | }, 187 | "outputs": [], 188 | "source": [ 189 | "# view the top few rows of the patient data\n", 190 | "patient.head()" 191 | ] 192 | }, 193 | { 194 | "cell_type": "markdown", 195 | "metadata": { 196 | "colab_type": "text", 197 | "id": "TlxaXLevC_Rz" 198 | }, 199 | "source": [ 200 | "## Questions\n", 201 | "\n", 202 | "- What does `patientunitstayid` represent? (hint, see: http://eicu-crd.mit.edu/eicutables/patient/)\n", 203 | "- What does `patienthealthsystemstayid` represent?\n", 204 | "- What does `uniquepid` represent?" 
205 | ] 206 | }, 207 | { 208 | "cell_type": "code", 209 | "execution_count": null, 210 | "metadata": { 211 | "colab": {}, 212 | "colab_type": "code", 213 | "id": "2rLY0WyCBzp9" 214 | }, 215 | "outputs": [], 216 | "source": [ 217 | "# select a limited number of columns to view\n", 218 | "columns = ['uniquepid', 'patientunitstayid','gender','age','unitdischargestatus']\n", 219 | "patient[columns].head()" 220 | ] 221 | }, 222 | { 223 | "cell_type": "markdown", 224 | "metadata": { 225 | "colab_type": "text", 226 | "id": "FSdS2hS4EWtb" 227 | }, 228 | "source": [ 229 | "- Try running the following query, which lists unique values in the age column. What do you notice?" 230 | ] 231 | }, 232 | { 233 | "cell_type": "code", 234 | "execution_count": null, 235 | "metadata": { 236 | "colab": {}, 237 | "colab_type": "code", 238 | "id": "0Aom69ftDxBN" 239 | }, 240 | "outputs": [], 241 | "source": [ 242 | "# what are the unique values for age?\n", 243 | "age_col = 'age'\n", 244 | "patient[age_col].sort_values().unique()" 245 | ] 246 | }, 247 | { 248 | "cell_type": "markdown", 249 | "metadata": { 250 | "colab_type": "text", 251 | "id": "Y_qJL94jE0k8" 252 | }, 253 | "source": [ 254 | "- Try plotting a histogram of ages using the command in the cell below. What happens? Why?" 255 | ] 256 | }, 257 | { 258 | "cell_type": "code", 259 | "execution_count": null, 260 | "metadata": { 261 | "colab": {}, 262 | "colab_type": "code", 263 | "id": "1zad3Gr4D4LE" 264 | }, 265 | "outputs": [], 266 | "source": [ 267 | "# try plotting a histogram of ages\n", 268 | "patient[age_col].plot(kind='hist', bins=15)" 269 | ] 270 | }, 271 | { 272 | "cell_type": "markdown", 273 | "metadata": { 274 | "colab_type": "text", 275 | "id": "xIdwVEEPF25H" 276 | }, 277 | "source": [ 278 | "Let's create a new column named `age_num`, then try again." 
279 | ] 280 | }, 281 | { 282 | "cell_type": "code", 283 | "execution_count": null, 284 | "metadata": { 285 | "colab": {}, 286 | "colab_type": "code", 287 | "id": "-rwc-28oFF6R" 288 | }, 289 | "outputs": [], 290 | "source": [ 291 | "# create a column containing numerical ages\n", 292 | "# If ‘coerce’, then invalid parsing will be set as NaN\n", 293 | "agenum_col = 'age_num'\n", 294 | "patient[agenum_col] = pd.to_numeric(patient[age_col], errors='coerce')\n", 295 | "patient[agenum_col].sort_values().unique()" 296 | ] 297 | }, 298 | { 299 | "cell_type": "code", 300 | "execution_count": null, 301 | "metadata": { 302 | "colab": {}, 303 | "colab_type": "code", 304 | "id": "uTFMqqWqFMjG" 305 | }, 306 | "outputs": [], 307 | "source": [ 308 | "patient[agenum_col].plot(kind='hist', bins=15)" 309 | ] 310 | }, 311 | { 312 | "cell_type": "markdown", 313 | "metadata": { 314 | "colab_type": "text", 315 | "id": "FrbR8rV3GlR1" 316 | }, 317 | "source": [ 318 | "## Questions\n", 319 | "\n", 320 | "- Use the `mean()` method to find the average age. Why do we expect this to be lower than the true mean?\n", 321 | "- In the same way that you use `mean()`, you can use `describe()`, `max()`, and `min()`. Look at the admission heights (`admissionheight`) of patients in cm. What issue do you see? How can you deal with this issue?" 
322 | ] 323 | }, 324 | { 325 | "cell_type": "code", 326 | "execution_count": null, 327 | "metadata": { 328 | "colab": {}, 329 | "colab_type": "code", 330 | "id": "TPps13DZG6Ac" 331 | }, 332 | "outputs": [], 333 | "source": [ 334 | "adheight_col = 'admissionheight'\n", 335 | "patient[adheight_col].describe()" 336 | ] 337 | }, 338 | { 339 | "cell_type": "code", 340 | "execution_count": null, 341 | "metadata": { 342 | "colab": {}, 343 | "colab_type": "code", 344 | "id": "9jhV9xQoGRJq" 345 | }, 346 | "outputs": [], 347 | "source": [ 348 | "# treat implausibly small heights (below 10 cm) as missing, leaving other columns untouched\n", 349 | "adheight_col = 'admissionheight'\n", 350 | "patient.loc[patient[adheight_col] < 10, adheight_col] = None" 351 | ] 352 | } 353 | ], 354 | "metadata": { 355 | "colab": { 356 | "collapsed_sections": [], 357 | "include_colab_link": true, 358 | "name": "01-explore-patient-table", 359 | "provenance": [], 360 | "version": "0.3.2" 361 | }, 362 | "kernelspec": { 363 | "display_name": "Python 3", 364 | "language": "python", 365 | "name": "python3" 366 | }, 367 | "language_info": { 368 | "codemirror_mode": { 369 | "name": "ipython", 370 | "version": 3 371 | }, 372 | "file_extension": ".py", 373 | "mimetype": "text/x-python", 374 | "name": "python", 375 | "nbconvert_exporter": "python", 376 | "pygments_lexer": "ipython3", 377 | "version": "3.7.4" 378 | } 379 | }, 380 | "nbformat": 4, 381 | "nbformat_minor": 1 382 | } 383 | -------------------------------------------------------------------------------- /05_mortality_prediction.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "colab_type": "text", 7 | "id": "view-in-github" 8 | }, 9 | "source": [ 10 | "\"Open" 11 | ] 12 | }, 13 | { 14 | "cell_type": "markdown", 15 | "metadata": { 16 | "colab_type": "text", 17 | "id": "T3wdKZCPklNq" 18 | }, 19 | "source": [ 20 | "# eICU Collaborative Research Database\n", 21 | "\n", 22 | "# Notebook 5: Mortality prediction\n", 23 | "\n", 24 
| "This notebook explores how a logistic regression can be trained to predict in-hospital mortality of patients.\n" 25 | ] 26 | }, 27 | { 28 | "cell_type": "markdown", 29 | "metadata": { 30 | "colab_type": "text", 31 | "id": "rG3HrM7GkwCH" 32 | }, 33 | "source": [ 34 | "## Load libraries and connect to the database" 35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": 0, 40 | "metadata": { 41 | "colab": {}, 42 | "colab_type": "code", 43 | "id": "s-MoFA6NkkbZ" 44 | }, 45 | "outputs": [], 46 | "source": [ 47 | "# Import libraries\n", 48 | "import numpy as np\n", 49 | "import os\n", 50 | "import pandas as pd\n", 51 | "import matplotlib.pyplot as plt\n", 52 | "\n", 53 | "# model building\n", 54 | "from sklearn.model_selection import train_test_split\n", 55 | "from sklearn.linear_model import LogisticRegression\n", 56 | "from sklearn.pipeline import Pipeline\n", 57 | "from sklearn import preprocessing\n", 58 | "from sklearn import metrics\n", 59 | "from sklearn import impute\n", 60 | "\n", 61 | "# Make pandas dataframes prettier\n", 62 | "from IPython.display import display, HTML\n", 63 | "\n", 64 | "# Access data using Google BigQuery.\n", 65 | "from google.colab import auth\n", 66 | "from google.cloud import bigquery" 67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": 0, 72 | "metadata": { 73 | "colab": {}, 74 | "colab_type": "code", 75 | "id": "jyBV_Q9DkyD3" 76 | }, 77 | "outputs": [], 78 | "source": [ 79 | "# authenticate\n", 80 | "auth.authenticate_user()" 81 | ] 82 | }, 83 | { 84 | "cell_type": "code", 85 | "execution_count": 0, 86 | "metadata": { 87 | "colab": {}, 88 | "colab_type": "code", 89 | "id": "cF1udJKhkzYq" 90 | }, 91 | "outputs": [], 92 | "source": [ 93 | "# Set up environment variables\n", 94 | "project_id='hack-aotearoa'\n", 95 | "os.environ[\"GOOGLE_CLOUD_PROJECT\"]=project_id" 96 | ] 97 | }, 98 | { 99 | "cell_type": "markdown", 100 | "metadata": { 101 | "colab_type": "text", 102 | "id": "LgcRCqxCk3HC" 103 | }, 
104 | "source": [ 105 | "## Load the patient cohort\n", 106 | "\n", 107 | "In this example, we will load data from the patient table and link it to APACHE data to provide richer summary information." 108 | ] 109 | }, 110 | { 111 | "cell_type": "code", 112 | "execution_count": 0, 113 | "metadata": { 114 | "colab": {}, 115 | "colab_type": "code", 116 | "id": "ReCl7-aek1-k" 117 | }, 118 | "outputs": [], 119 | "source": [ 120 | "%%bigquery cohort\n", 121 | "-- Link the patient and apachepatientresult tables on patientunitstayid\n", 122 | "-- using an inner join.\n", 123 | "\n", 124 | "SELECT p.unitadmitsource, p.gender, p.age, p.admissionweight, \n", 125 | " p.unittype, p.unitstaytype, a.acutephysiologyscore,\n", 126 | " a.apachescore, a.actualhospitalmortality\n", 127 | "FROM `physionet-data.eicu_crd_demo.patient` p\n", 128 | "INNER JOIN `physionet-data.eicu_crd_demo.apachepatientresult` a\n", 129 | "ON p.patientunitstayid = a.patientunitstayid\n", 130 | "WHERE apacheversion LIKE 'IVa'" 131 | ] 132 | }, 133 | { 134 | "cell_type": "code", 135 | "execution_count": 0, 136 | "metadata": { 137 | "colab": {}, 138 | "colab_type": "code", 139 | "id": "yxLctVBpk9sO" 140 | }, 141 | "outputs": [], 142 | "source": [ 143 | "cohort.head()" 144 | ] 145 | }, 146 | { 147 | "cell_type": "markdown", 148 | "metadata": { 149 | "colab_type": "text", 150 | "id": "NPlwRV2buYb1" 151 | }, 152 | "source": [ 153 | "## Prepare the data for analysis" 154 | ] 155 | }, 156 | { 157 | "cell_type": "code", 158 | "execution_count": 0, 159 | "metadata": { 160 | "colab": {}, 161 | "colab_type": "code", 162 | "id": "v3OJ4LDvueKu" 163 | }, 164 | "outputs": [], 165 | "source": [ 166 | "# Review the dataset\n", 167 | "print(cohort.info())" 168 | ] 169 | }, 170 | { 171 | "cell_type": "code", 172 | "execution_count": 0, 173 | "metadata": { 174 | "colab": {}, 175 | "colab_type": "code", 176 | "id": "s4wQ6o_RvLph" 177 | }, 178 | "outputs": [], 179 | "source": [ 180 | "# Encode the categorical data\n", 181 
| "encoder = preprocessing.LabelEncoder()\n", 182 | "cohort['gender_code'] = encoder.fit_transform(cohort['gender'])\n", 183 | "cohort['admissionweight_code'] = encoder.fit_transform(cohort['admissionweight'])\n", 184 | "cohort['unittype_code'] = encoder.fit_transform(cohort['unittype'])\n", 185 | "cohort['apachescore_code'] = encoder.fit_transform(cohort['apachescore'])\n", 186 | "cohort['actualhospitalmortality_code'] = encoder.fit_transform(cohort['actualhospitalmortality'])" 187 | ] 188 | }, 189 | { 190 | "cell_type": "code", 191 | "execution_count": 0, 192 | "metadata": { 193 | "colab": {}, 194 | "colab_type": "code", 195 | "id": "4ogi_ns-ylnP" 196 | }, 197 | "outputs": [], 198 | "source": [ 199 | "# Handle the deidentified ages\n", 200 | "cohort['agenum'] = pd.to_numeric(cohort['age'], downcast='integer', errors='coerce')" 201 | ] 202 | }, 203 | { 204 | "cell_type": "code", 205 | "execution_count": 0, 206 | "metadata": { 207 | "colab": {}, 208 | "colab_type": "code", 209 | "id": "77M0QJQ5wcPQ" 210 | }, 211 | "outputs": [], 212 | "source": [ 213 | "# Preview the encoded data\n", 214 | "cohort[['gender','gender_code']].head()" 215 | ] 216 | }, 217 | { 218 | "cell_type": "code", 219 | "execution_count": 0, 220 | "metadata": { 221 | "colab": {}, 222 | "colab_type": "code", 223 | "id": "GqvwTNPN3KZz" 224 | }, 225 | "outputs": [], 226 | "source": [ 227 | "# Check the outcome variable\n", 228 | "cohort['actualhospitalmortality_code'].unique()" 229 | ] 230 | }, 231 | { 232 | "cell_type": "markdown", 233 | "metadata": { 234 | "colab_type": "text", 235 | "id": "ze7y5J4Ioz8u" 236 | }, 237 | "source": [ 238 | "## Create our train and test sets" 239 | ] 240 | }, 241 | { 242 | "cell_type": "code", 243 | "execution_count": 0, 244 | "metadata": { 245 | "colab": {}, 246 | "colab_type": "code", 247 | "id": "i5zXkn_AlDJW" 248 | }, 249 | "outputs": [], 250 | "source": [ 251 | "predictors = ['gender_code','agenum','apachescore_code','unittype_code',\n", 252 | " 
'admissionweight_code']\n", 253 | "outcome = 'actualhospitalmortality_code'\n", 254 | "\n", 255 | "X = cohort[predictors]\n", 256 | "y = cohort[outcome]" 257 | ] 258 | }, 259 | { 260 | "cell_type": "code", 261 | "execution_count": 0, 262 | "metadata": { 263 | "colab": {}, 264 | "colab_type": "code", 265 | "id": "IHhIgDUwocmA" 266 | }, 267 | "outputs": [], 268 | "source": [ 269 | "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)" 270 | ] 271 | }, 272 | { 273 | "cell_type": "code", 274 | "execution_count": 0, 275 | "metadata": { 276 | "colab": {}, 277 | "colab_type": "code", 278 | "id": "NvQWkuY6nkZ8" 279 | }, 280 | "outputs": [], 281 | "source": [ 282 | "# Review the number of cases in each set\n", 283 | "print(\"Train data: {}\".format(len(X_train)))\n", 284 | "print(\"Test data: {}\".format(len(X_test)))\n" 285 | ] 286 | }, 287 | { 288 | "cell_type": "markdown", 289 | "metadata": { 290 | "colab_type": "text", 291 | "id": "b2waK5qBqanC" 292 | }, 293 | "source": [ 294 | "## Build the model" 295 | ] 296 | }, 297 | { 298 | "cell_type": "code", 299 | "execution_count": 0, 300 | "metadata": { 301 | "colab": {}, 302 | "colab_type": "code", 303 | "id": "EVoWS8HX3Sek" 304 | }, 305 | "outputs": [], 306 | "source": [ 307 | "# Create an instance of the model\n", 308 | "model = LogisticRegression(solver='lbfgs')\n" 309 | ] 310 | }, 311 | { 312 | "cell_type": "code", 313 | "execution_count": 0, 314 | "metadata": { 315 | "colab": {}, 316 | "colab_type": "code", 317 | "id": "sCWwZnc30ahA" 318 | }, 319 | "outputs": [], 320 | "source": [ 321 | "# Impute missing values and scale using a pipeline\n", 322 | "estimator = Pipeline([(\"imputer\", impute.SimpleImputer(missing_values=np.nan, strategy=\"mean\")),\n", 323 | " (\"scaler\", preprocessing.StandardScaler()),\n", 324 | " (\"logistic_regression\", model)])\n" 325 | ] 326 | }, 327 | { 328 | "cell_type": "code", 329 | "execution_count": 0, 330 | "metadata": { 331 | "colab": {}, 332 | 
"colab_type": "code", 333 | "id": "k_-lrq1A0rCh" 334 | }, 335 | "outputs": [], 336 | "source": [ 337 | "# Fit the model to the training data\n", 338 | "estimator.fit(X_train, y_train)" 339 | ] 340 | }, 341 | { 342 | "cell_type": "markdown", 343 | "metadata": { 344 | "colab_type": "text", 345 | "id": "NMDmhhnlzr9Z" 346 | }, 347 | "source": [ 348 | "## Testing" 349 | ] 350 | }, 351 | { 352 | "cell_type": "code", 353 | "execution_count": 0, 354 | "metadata": { 355 | "colab": {}, 356 | "colab_type": "code", 357 | "id": "Pf4G20-VGdvm" 358 | }, 359 | "outputs": [], 360 | "source": [ 361 | "y_pred = estimator.predict(X_test)\n", 362 | "print('Accuracy of logistic regression classifier on the test set: {:.2f}'.format(estimator.score(X_test, y_test)))" 363 | ] 364 | }, 365 | { 366 | "cell_type": "code", 367 | "execution_count": 0, 368 | "metadata": { 369 | "colab": {}, 370 | "colab_type": "code", 371 | "id": "akOjuNz8qgMd" 372 | }, 373 | "outputs": [], 374 | "source": [ 375 | "print(metrics.classification_report(y_test, y_pred))" 376 | ] 377 | }, 378 | { 379 | "cell_type": "code", 380 | "execution_count": 0, 381 | "metadata": { 382 | "colab": {}, 383 | "colab_type": "code", 384 | "id": "gH1w0RBH1JjT" 385 | }, 386 | "outputs": [], 387 | "source": [ 388 | "logit_roc_auc = metrics.roc_auc_score(y_test, estimator.predict(X_test))\n", 389 | "fpr, tpr, thresholds = metrics.roc_curve(y_test, estimator.predict_proba(X_test)[:,1])\n", 390 | "\n", 391 | "plt.figure()\n", 392 | "plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)\n", 393 | "plt.plot([0, 1], [0, 1],'r--')\n", 394 | "plt.xlim([0.0, 1.0])\n", 395 | "plt.ylim([0.0, 1.05])\n", 396 | "\n", 397 | "plt.xlabel('False Positives')\n", 398 | "plt.ylabel('True Positives')\n", 399 | "plt.title('Receiver operating characteristic')\n", 400 | "plt.legend(loc=\"lower right\")\n", 401 | "plt.savefig('Log_ROC')\n", 402 | "plt.show()" 403 | ] 404 | } 405 | ], 406 | "metadata": { 407 | "colab": { 408 | 
"collapsed_sections": [], 409 | "include_colab_link": true, 410 | "name": "05-mortality-prediction", 411 | "provenance": [], 412 | "version": "0.3.2" 413 | }, 414 | "kernelspec": { 415 | "display_name": "Python 3", 416 | "language": "python", 417 | "name": "python3" 418 | }, 419 | "language_info": { 420 | "codemirror_mode": { 421 | "name": "ipython", 422 | "version": 3 423 | }, 424 | "file_extension": ".py", 425 | "mimetype": "text/x-python", 426 | "name": "python", 427 | "nbconvert_exporter": "python", 428 | "pygments_lexer": "ipython3", 429 | "version": "3.7.4" 430 | } 431 | }, 432 | "nbformat": 4, 433 | "nbformat_minor": 1 434 | } 435 | -------------------------------------------------------------------------------- /06_aki_project.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "colab_type": "text", 7 | "id": "view-in-github" 8 | }, 9 | "source": [ 10 | "\"Open" 11 | ] 12 | }, 13 | { 14 | "cell_type": "markdown", 15 | "metadata": { 16 | "colab_type": "text", 17 | "id": "7HXE-dGyLHYa" 18 | }, 19 | "source": [ 20 | "# eICU Collaborative Research Database\n", 21 | "\n", 22 | "# Notebook 6: An example project\n", 23 | "\n", 24 | "This notebook introduces a project focused on acute kidney injury, quantifying differences between patients with and without the condition." 
25 | ] 26 | }, 27 | { 28 | "cell_type": "markdown", 29 | "metadata": { 30 | "colab_type": "text", 31 | "id": "2hTYG_w4Lzfg" 32 | }, 33 | "source": [ 34 | "## Load libraries and connect to the database" 35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": 0, 40 | "metadata": { 41 | "colab": {}, 42 | "colab_type": "code", 43 | "id": "_Z_G2UCcLoii" 44 | }, 45 | "outputs": [], 46 | "source": [ 47 | "# Import libraries\n", 48 | "import numpy as np\n", 49 | "import os\n", 50 | "import pandas as pd\n", 51 | "import matplotlib.pyplot as plt\n", 52 | "import matplotlib.patches as patches\n", 53 | "import matplotlib.path as path\n", 54 | "\n", 55 | "# Make pandas dataframes prettier\n", 56 | "from IPython.display import display, HTML\n", 57 | "\n", 58 | "# Access data using Google BigQuery.\n", 59 | "from google.colab import auth\n", 60 | "from google.cloud import bigquery" 61 | ] 62 | }, 63 | { 64 | "cell_type": "code", 65 | "execution_count": 0, 66 | "metadata": { 67 | "colab": {}, 68 | "colab_type": "code", 69 | "id": "1f3Ahq0hL1xv" 70 | }, 71 | "outputs": [], 72 | "source": [ 73 | "# authenticate\n", 74 | "auth.authenticate_user()" 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": 0, 80 | "metadata": { 81 | "colab": {}, 82 | "colab_type": "code", 83 | "id": "DbwAi_e2L3eO" 84 | }, 85 | "outputs": [], 86 | "source": [ 87 | "# Set up environment variables\n", 88 | "project_id='hack-aotearoa'\n", 89 | "os.environ[\"GOOGLE_CLOUD_PROJECT\"]=project_id" 90 | ] 91 | }, 92 | { 93 | "cell_type": "markdown", 94 | "metadata": { 95 | "colab_type": "text", 96 | "id": "Pown2uTtL9kz" 97 | }, 98 | "source": [ 99 | "## Define the cohort\n", 100 | "\n", 101 | "Our first step is to define the patient population we are interested in. 
For this project, we'd like to identify those patients with any past history of renal failure and compare them with the remaining patients.\n", 102 | "\n", 103 | "First, we extract all patient unit stays from the patient table.\n" 104 | ] 105 | }, 106 | { 107 | "cell_type": "code", 108 | "execution_count": 0, 109 | "metadata": { 110 | "colab": {}, 111 | "colab_type": "code", 112 | "id": "Qa3TYl2PL7-i" 113 | }, 114 | "outputs": [], 115 | "source": [ 116 | "%%bigquery patient\n", 117 | "-- Extract all patient unit stays\n", 118 | "-- from the patient table.\n", 119 | "\n", 120 | "SELECT *\n", 121 | "FROM `physionet-data.eicu_crd_demo.patient`" 122 | ] 123 | }, 124 | { 125 | "cell_type": "markdown", 126 | "metadata": { 127 | "colab_type": "text", 128 | "id": "NiJ7V6QBMUuX" 129 | }, 130 | "source": [ 131 | "Now we investigate the pasthistory table and look at all mentions of past history that contain the phrase 'Renal (R)'. Note that % is a wildcard character in SQL.\n", 132 | "\n" 133 | ] 134 | }, 135 | { 136 | "cell_type": "code", 137 | "execution_count": null, 138 | "metadata": {}, 139 | "outputs": [], 140 | "source": [ 141 | "%%bigquery ph\n", 142 | "\n", 143 | "SELECT pasthistorypath, count(*) as n\n", 144 | "FROM `physionet-data.eicu_crd_demo.pasthistory`\n", 145 | "WHERE pasthistorypath LIKE '%Renal (R)%'\n", 146 | "GROUP BY pasthistorypath\n", 147 | "ORDER BY n DESC;" 148 | ] 149 | }, 150 | { 151 | "cell_type": "code", 152 | "execution_count": 0, 153 | "metadata": { 154 | "colab": {}, 155 | "colab_type": "code", 156 | "id": "V4JrlnenMSJ-" 157 | }, 158 | "outputs": [], 159 | "source": [ 160 | "for row in ph.iterrows():\n", 161 | " r = row[1]\n", 162 | " print('{:3g} - {:20s}'.format(r['n'],r['pasthistorypath'][48:]))" 163 | ] 164 | }, 165 | { 166 | "cell_type": "markdown", 167 | "metadata": { 168 | "colab_type": "text", 169 | "id": "oLA3lat9MirI" 170 | }, 171 | "source": [ 172 | "These all seem like reasonable 
surrogates for renal insufficiency (note: for a real clinical study, you'd want to be a lot more thorough!).\n", 173 | "\n" 174 | ] 175 | }, 176 | { 177 | "cell_type": "code", 178 | "execution_count": 0, 179 | "metadata": { 180 | "colab": {}, 181 | "colab_type": "code", 182 | "id": "fygnwv0OMfZg" 183 | }, 184 | "outputs": [], 185 | "source": [ 186 | "%%bigquery df_have_crf\n", 187 | "-- identify patients with renal insufficiency\n", 188 | "\n", 189 | "SELECT DISTINCT patientunitstayid\n", 190 | "FROM `physionet-data.eicu_crd_demo.pasthistory`\n", 191 | "WHERE pasthistorypath LIKE '%Renal (R)%'" 192 | ] 193 | }, 194 | { 195 | "cell_type": "code", 196 | "execution_count": null, 197 | "metadata": {}, 198 | "outputs": [], 199 | "source": [ 200 | "df_have_crf['crf'] = 1" 201 | ] 202 | }, 203 | { 204 | "cell_type": "code", 205 | "execution_count": 0, 206 | "metadata": { 207 | "colab": {}, 208 | "colab_type": "code", 209 | "id": "JUZ60JVnMpFd" 210 | }, 211 | "outputs": [], 212 | "source": [ 213 | "# merge the data above into our original dataframe\n", 214 | "df = patient.merge(df_have_crf, \n", 215 | " how='left', \n", 216 | " left_on='patientunitstayid', \n", 217 | " right_on='patientunitstayid')\n", 218 | "\n", 219 | "df.head()" 220 | ] 221 | }, 222 | { 223 | "cell_type": "code", 224 | "execution_count": 0, 225 | "metadata": { 226 | "colab": {}, 227 | "colab_type": "code", 228 | "id": "idLXqNGrMvQF" 229 | }, 230 | "outputs": [], 231 | "source": [ 232 | "# impute 0s for the missing CRF values only (other columns keep their missing values)\n", 233 | "df['crf'] = df['crf'].fillna(0)\n", 234 | "df.head()" 235 | ] 236 | }, 237 | { 238 | "cell_type": "code", 239 | "execution_count": 0, 240 | "metadata": { 241 | "colab": {}, 242 | "colab_type": "code", 243 | "id": "dRCZ1KoTM7uw" 244 | }, 245 | "outputs": [], 246 | "source": [ 247 | "# set patientunitstayid as the index - convenient for indexing later\n", 248 | "df.set_index('patientunitstayid',inplace=True)" 249 | ] 250 | }, 251 | { 252 | "cell_type": "markdown", 253 | 
"metadata": { 254 | "colab_type": "text", 255 | "id": "GPDavuXVM_0G" 256 | }, 257 | "source": [ 258 | "## Load creatinine from lab table\n" 259 | ] 260 | }, 261 | { 262 | "cell_type": "code", 263 | "execution_count": 0, 264 | "metadata": { 265 | "colab": {}, 266 | "colab_type": "code", 267 | "id": "XGHj_sVJM96D" 268 | }, 269 | "outputs": [], 270 | "source": [ 271 | "%%bigquery lab\n", 272 | "\n", 273 | "SELECT patientunitstayid, labresult\n", 274 | "FROM `physionet-data.eicu_crd_demo.lab`\n", 275 | "WHERE labname = 'creatinine'" 276 | ] 277 | }, 278 | { 279 | "cell_type": "code", 280 | "execution_count": null, 281 | "metadata": {}, 282 | "outputs": [], 283 | "source": [ 284 | "# set patientunitstayid as the index\n", 285 | "lab.set_index('patientunitstayid', inplace=True)" 286 | ] 287 | }, 288 | { 289 | "cell_type": "code", 290 | "execution_count": 0, 291 | "metadata": { 292 | "colab": {}, 293 | "colab_type": "code", 294 | "id": "swgESA5TNGiS" 295 | }, 296 | "outputs": [], 297 | "source": [ 298 | "# get first creatinine by grouping by the index (level=0)\n", 299 | "cr_first = lab.groupby(level=0).first()\n", 300 | "\n", 301 | "# similarly get maximum creatinine\n", 302 | "cr_max = lab.groupby(level=0).max()" 303 | ] 304 | }, 305 | { 306 | "cell_type": "markdown", 307 | "metadata": { 308 | "colab_type": "text", 309 | "id": "PYEWT8IfNQry" 310 | }, 311 | "source": [ 312 | "## Plot distributions of creatinine in both groups\n" 313 | ] 314 | }, 315 | { 316 | "cell_type": "code", 317 | "execution_count": 0, 318 | "metadata": { 319 | "colab": {}, 320 | "colab_type": "code", 321 | "id": "5N4XEIbcNO9Q" 322 | }, 323 | "outputs": [], 324 | "source": [ 325 | "plt.figure(figsize=[10,6])\n", 326 | "\n", 327 | "xi = np.arange(0,10,0.1)\n", 328 | "\n", 329 | "# get patients who had CRF and plot a histogram\n", 330 | "idx = df.loc[df['crf']==1,:].index\n", 331 | "plt.hist( cr_first.loc[idx,'labresult'].dropna(), bins=xi, label='With CRF' )\n", 332 | "\n", 333 | "# get patients who 
did not have CRF\n", 334 | "idx = df.loc[df['crf']==0,:].index\n", 335 | "plt.hist( cr_first.loc[idx,'labresult'].dropna(), alpha=0.5, bins=xi, label='No CRF' )\n", 336 | "\n", 337 | "plt.legend()\n", 338 | "\n", 339 | "plt.show()" 340 | ] 341 | }, 342 | { 343 | "cell_type": "markdown", 344 | "metadata": { 345 | "colab_type": "text", 346 | "id": "UFIOPs8TNYJV" 347 | }, 348 | "source": [ 349 | "While it appears that patients in the 'With CRF' group have higher creatinines, the 'No CRF' group contains far more patients than the 'With CRF' group. To allow a fairer comparison, we can normalize the histograms.\n", 350 | "\n" 351 | ] 352 | }, 353 | { 354 | "cell_type": "code", 355 | "execution_count": 0, 356 | "metadata": { 357 | "colab": {}, 358 | "colab_type": "code", 359 | "id": "5P3B_uUeNUiR" 360 | }, 361 | "outputs": [], 362 | "source": [ 363 | "plt.figure(figsize=[10,6])\n", 364 | "\n", 365 | "xi = np.arange(0,10,0.1)\n", 366 | "\n", 367 | "# get patients who had CRF and plot a histogram\n", 368 | "idx = df.loc[df['crf']==1,:].index\n", 369 | "plt.hist( cr_first.loc[idx,'labresult'].dropna(), bins=xi, density=True,\n", 370 | " label='With CRF' )\n", 371 | "\n", 372 | "# get patients who did not have CRF\n", 373 | "idx = df.loc[df['crf']==0,:].index\n", 374 | "plt.hist( cr_first.loc[idx,'labresult'].dropna(), alpha=0.5, bins=xi, density=True,\n", 375 | " label='No CRF' )\n", 376 | "\n", 377 | "plt.legend()\n", 378 | "\n", 379 | "plt.show()" 380 | ] 381 | }, 382 | { 383 | "cell_type": "markdown", 384 | "metadata": { 385 | "colab_type": "text", 386 | "id": "oagJ1XQ8NcKg" 387 | }, 388 | "source": [ 389 | "Here we can very clearly see that the first creatinine measured is a lot higher for patients with some baseline kidney dysfunction when compared to those without. 
Let's try it with the highest value.\n", 390 | "\n" 391 | ] 392 | }, 393 | { 394 | "cell_type": "code", 395 | "execution_count": 0, 396 | "metadata": { 397 | "colab": {}, 398 | "colab_type": "code", 399 | "id": "-aBkodMGNZ_O" 400 | }, 401 | "outputs": [], 402 | "source": [ 403 | "plt.figure(figsize=[10,6])\n", 404 | "\n", 405 | "xi = np.arange(0,10,0.1)\n", 406 | "\n", 407 | "# get patients who had CRF and plot a histogram\n", 408 | "idx = df.loc[df['crf']==1,:].index\n", 409 | "plt.hist( cr_max.loc[idx,'labresult'].dropna(), bins=xi, density=True,\n", 410 | " label='With CRF' )\n", 411 | "\n", 412 | "# get patients who did not have CRF\n", 413 | "idx = df.loc[df['crf']==0,:].index\n", 414 | "plt.hist( cr_max.loc[idx,'labresult'].dropna(), alpha=0.5, bins=xi, density=True,\n", 415 | " label='No CRF' )\n", 416 | "\n", 417 | "plt.legend()\n", 418 | "\n", 419 | "plt.show()" 420 | ] 421 | }, 422 | { 423 | "cell_type": "markdown", 424 | "metadata": { 425 | "colab_type": "text", 426 | "id": "F2upj7XaNf7b" 427 | }, 428 | "source": [ 429 | "Unsurprisingly, a very similar story!" 
430 | ] 431 | } 432 | ], 433 | "metadata": { 434 | "colab": { 435 | "collapsed_sections": [], 436 | "include_colab_link": true, 437 | "name": "06-aki-project.ipynb", 438 | "provenance": [], 439 | "version": "0.3.2" 440 | }, 441 | "kernelspec": { 442 | "display_name": "Python 3", 443 | "language": "python", 444 | "name": "python3" 445 | }, 446 | "language_info": { 447 | "codemirror_mode": { 448 | "name": "ipython", 449 | "version": 3 450 | }, 451 | "file_extension": ".py", 452 | "mimetype": "text/x-python", 453 | "name": "python", 454 | "nbconvert_exporter": "python", 455 | "pygments_lexer": "ipython3", 456 | "version": "3.7.4" 457 | } 458 | }, 459 | "nbformat": 4, 460 | "nbformat_minor": 1 461 | } 462 | -------------------------------------------------------------------------------- /02_severity_of_illness.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "colab_type": "text", 7 | "id": "view-in-github" 8 | }, 9 | "source": [ 10 | "\"Open" 11 | ] 12 | }, 13 | { 14 | "cell_type": "markdown", 15 | "metadata": { 16 | "colab_type": "text", 17 | "id": "y4AOVdliM8gm" 18 | }, 19 | "source": [ 20 | "# eICU Collaborative Research Database\n", 21 | "\n", 22 | "# Notebook 2: Severity of illness\n", 23 | "\n", 24 | "This notebook introduces high level admission details relating to a single patient stay, using the following tables:\n", 25 | "\n", 26 | "- patient\n", 27 | "- admissiondx\n", 28 | "- apacheapsvar\n", 29 | "- apachepredvar\n", 30 | "- apachepatientresult\n", 31 | "\n" 32 | ] 33 | }, 34 | { 35 | "cell_type": "markdown", 36 | "metadata": { 37 | "colab_type": "text", 38 | "id": "e0lUnIkYOyv4" 39 | }, 40 | "source": [ 41 | "## Load libraries and connect to the database" 42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "execution_count": 0, 47 | "metadata": { 48 | "colab": {}, 49 | "colab_type": "code", 50 | "id": "SJ6l1i3fOL4j" 51 | }, 52 | "outputs": 
[], 53 | "source": [ 54 | "# Import libraries\n", 55 | "import numpy as np\n", 56 | "import os\n", 57 | "import pandas as pd\n", 58 | "import matplotlib.pyplot as plt\n", 59 | "import matplotlib.patches as patches\n", 60 | "import matplotlib.path as path\n", 61 | "\n", 62 | "# Make pandas dataframes prettier\n", 63 | "from IPython.display import display, HTML\n", 64 | "\n", 65 | "# Access data using Google BigQuery.\n", 66 | "from google.colab import auth\n", 67 | "from google.cloud import bigquery" 68 | ] 69 | }, 70 | { 71 | "cell_type": "code", 72 | "execution_count": 0, 73 | "metadata": { 74 | "colab": {}, 75 | "colab_type": "code", 76 | "id": "TE4JYS8aO-69" 77 | }, 78 | "outputs": [], 79 | "source": [ 80 | "# authenticate\n", 81 | "auth.authenticate_user()" 82 | ] 83 | }, 84 | { 85 | "cell_type": "code", 86 | "execution_count": 0, 87 | "metadata": { 88 | "colab": {}, 89 | "colab_type": "code", 90 | "id": "oVavf-ujPOAv" 91 | }, 92 | "outputs": [], 93 | "source": [ 94 | "# Set up environment variables\n", 95 | "project_id='hack-aotearoa'\n", 96 | "os.environ[\"GOOGLE_CLOUD_PROJECT\"]=project_id" 97 | ] 98 | }, 99 | { 100 | "cell_type": "markdown", 101 | "metadata": { 102 | "colab_type": "text", 103 | "id": "a1CAI3GjQYE0" 104 | }, 105 | "source": [ 106 | "## Selecting a single patient stay\n", 107 | "\n", 108 | "As we have seen, the patient table includes general information about the patient admissions (for example, demographics, admission and discharge details). See: http://eicu-crd.mit.edu/eicutables/patient/\n", 109 | "\n", 110 | "## Questions\n", 111 | "\n", 112 | "Use your knowledge from the previous notebook and the online documentation (http://eicu-crd.mit.edu/) to answer the following questions:\n", 113 | "\n", 114 | "- Which column in the patient table is distinct for each stay in the ICU (similar to `icustay_id` in MIMIC-III)?\n", 115 | "- Which column is unique for each patient (similar to `subject_id` in MIMIC-III)?" 
116 | ] 117 | }, 118 | { 119 | "cell_type": "code", 120 | "execution_count": 0, 121 | "metadata": { 122 | "colab": {}, 123 | "colab_type": "code", 124 | "id": "R6huFICkSQAd" 125 | }, 126 | "outputs": [], 127 | "source": [ 128 | "%%bigquery\n", 129 | "-- view distinct ids\n", 130 | "\n", 131 | "SELECT DISTINCT(patientunitstayid)\n", 132 | "FROM `physionet-data.eicu_crd_demo.patient`" 133 | ] 134 | }, 135 | { 136 | "cell_type": "code", 137 | "execution_count": 0, 138 | "metadata": { 139 | "colab": {}, 140 | "colab_type": "code", 141 | "id": "yEBIFRBqRo4y" 142 | }, 143 | "outputs": [], 144 | "source": [ 145 | "%%bigquery patient\n", 146 | "-- set the where clause to select the stay of interest\n", 147 | "\n", 148 | "SELECT *\n", 149 | "FROM `physionet-data.eicu_crd_demo.patient`\n", 150 | "WHERE patientunitstayid = " 151 | ] 152 | }, 153 | { 154 | "cell_type": "code", 155 | "execution_count": 0, 156 | "metadata": { 157 | "colab": {}, 158 | "colab_type": "code", 159 | "id": "LjIL2XR6TAyp" 160 | }, 161 | "outputs": [], 162 | "source": [ 163 | "patient" 164 | ] 165 | }, 166 | { 167 | "cell_type": "markdown", 168 | "metadata": { 169 | "colab_type": "text", 170 | "id": "QSbKYqF0TQ1n" 171 | }, 172 | "source": [ 173 | "## Questions\n", 174 | "\n", 175 | "- Which type of unit was the patient admitted to? Hint: Try `patient['unittype']` or `patient.unittype`\n", 176 | "- What year was the patient discharged from the ICU? Hint: You can view the table columns with `patient.columns`\n", 177 | "- What was the status of the patient upon discharge from the unit?" 178 | ] 179 | }, 180 | { 181 | "cell_type": "markdown", 182 | "metadata": { 183 | "colab_type": "text", 184 | "id": "izaH0XwwUxDD" 185 | }, 186 | "source": [ 187 | "## The admissiondx table\n", 188 | "\n", 189 | "The `admissiondx` table contains the primary diagnosis for admission to the ICU according to the APACHE scoring criteria. 
For more detail, see: http://eicu-crd.mit.edu/eicutables/admissiondx/" 190 | ] 191 | }, 192 | { 193 | "cell_type": "code", 194 | "execution_count": 0, 195 | "metadata": { 196 | "colab": {}, 197 | "colab_type": "code", 198 | "id": "dlj3UCDTTEjj" 199 | }, 200 | "outputs": [], 201 | "source": [ 202 | "%%bigquery admissiondx\n", 203 | "-- set the where clause to select the stay of interest\n", 204 | "\n", 205 | "SELECT *\n", 206 | "FROM `physionet-data.eicu_crd_demo.admissiondx`\n", 207 | "WHERE patientunitstayid = " 208 | ] 209 | }, 210 | { 211 | "cell_type": "code", 212 | "execution_count": 0, 213 | "metadata": { 214 | "colab": {}, 215 | "colab_type": "code", 216 | "id": "3wdEHFLJVMKm" 217 | }, 218 | "outputs": [], 219 | "source": [ 220 | "# View the columns in this data\n", 221 | "admissiondx.columns" 222 | ] 223 | }, 224 | { 225 | "cell_type": "code", 226 | "execution_count": 0, 227 | "metadata": { 228 | "colab": {}, 229 | "colab_type": "code", 230 | "id": "tbOA44lAVNLr" 231 | }, 232 | "outputs": [], 233 | "source": [ 234 | "# View the data\n", 235 | "admissiondx.head()" 236 | ] 237 | }, 238 | { 239 | "cell_type": "code", 240 | "execution_count": 0, 241 | "metadata": { 242 | "colab": {}, 243 | "colab_type": "code", 244 | "id": "Hc0y4ueOVWOk" 245 | }, 246 | "outputs": [], 247 | "source": [ 248 | "# Set the display options to avoid truncating the text\n", 249 | "pd.set_option('display.max_colwidth', -1)\n", 250 | "admissiondx.admitdxpath" 251 | ] 252 | }, 253 | { 254 | "cell_type": "markdown", 255 | "metadata": { 256 | "colab_type": "text", 257 | "id": "mSb_BrgvWDdD" 258 | }, 259 | "source": [ 260 | "## Questions\n", 261 | "\n", 262 | "- What was the primary reason for admission?\n", 263 | "- How soon after admission to the ICU was the diagnosis recorded in eCareManager? Hint: The `offset` columns indicate the time in minutes after admission to the ICU. 
" 264 | ] 265 | }, 266 | { 267 | "cell_type": "markdown", 268 | "metadata": { 269 | "colab_type": "text", 270 | "id": "rd3Tw6_kWwlS" 271 | }, 272 | "source": [ 273 | "## The apacheapsvar table\n", 274 | "\n", 275 | "The apacheapsvar table contains the variables used to calculate the Acute Physiology Score (APS) III for patients. APS-III is an established method of summarizing patient severity of illness on admission to the ICU, taking the \"worst\" observations for a patient in a 24-hour period.\n", 276 | "\n", 277 | "The score is part of the Acute Physiology Age Chronic Health Evaluation (APACHE) system of equations for predicting outcomes for ICU patients. See: http://eicu-crd.mit.edu/eicutables/apacheApsVar/" 278 | ] 279 | }, 280 | { 281 | "cell_type": "code", 282 | "execution_count": 0, 283 | "metadata": { 284 | "colab": {}, 285 | "colab_type": "code", 286 | "id": "fXOzR5XWVdNa" 287 | }, 288 | "outputs": [], 289 | "source": [ 290 | "%%bigquery apacheapsvar\n", 291 | "# set the where clause to select the stay of interest\n", 292 | "\n", 293 | "SELECT *\n", 294 | "FROM `physionet-data.eicu_crd_demo.apacheapsvar`\n", 295 | "WHERE patientunitstayid = " 296 | ] 297 | }, 298 | { 299 | "cell_type": "code", 300 | "execution_count": 0, 301 | "metadata": { 302 | "colab": {}, 303 | "colab_type": "code", 304 | "id": "mL_lVORdXDIg" 305 | }, 306 | "outputs": [], 307 | "source": [ 308 | "apacheapsvar.head()" 309 | ] 310 | }, 311 | { 312 | "cell_type": "markdown", 313 | "metadata": { 314 | "colab_type": "text", 315 | "id": "8x_Z8q4jXH7D" 316 | }, 317 | "source": [ 318 | "## Questions\n", 319 | "\n", 320 | "- What was the 'worst' heart rate recorded for the patient during the scoring period?\n", 321 | "- Was the patient oriented and able to converse normally on the day of admission? (hint: the verbal element refers to the Glasgow Coma Scale)."
322 | ] 323 | }, 324 | { 325 | "cell_type": "markdown", 326 | "metadata": { 327 | "colab_type": "text", 328 | "id": "XplJvhIYX432" 329 | }, 330 | "source": [ 331 | "## The apachepredvar table\n", 332 | "\n", 333 | "The apachepredvar table provides variables underlying the APACHE predictions. Acute Physiology Age Chronic Health Evaluation (APACHE) consists of a group of equations used for predicting outcomes in critically ill patients. See: http://eicu-crd.mit.edu/eicutables/apachePredVar/" 334 | ] 335 | }, 336 | { 337 | "cell_type": "code", 338 | "execution_count": 0, 339 | "metadata": { 340 | "colab": {}, 341 | "colab_type": "code", 342 | "id": "iAIFESy9XFhC" 343 | }, 344 | "outputs": [], 345 | "source": [ 346 | "%%bigquery apachepredvar\n", 347 | "# set the where clause to select the stay of interest\n", 348 | "\n", 349 | "SELECT *\n", 350 | "FROM `physionet-data.eicu_crd_demo.apachepredvar`\n", 351 | "WHERE patientunitstayid = " 352 | ] 353 | }, 354 | { 355 | "cell_type": "code", 356 | "execution_count": 0, 357 | "metadata": { 358 | "colab": {}, 359 | "colab_type": "code", 360 | "id": "LAu7G72cYEY1" 361 | }, 362 | "outputs": [], 363 | "source": [ 364 | "apachepredvar.columns" 365 | ] 366 | }, 367 | { 368 | "cell_type": "markdown", 369 | "metadata": { 370 | "colab_type": "text", 371 | "id": "IEaS6L9OY0vJ" 372 | }, 373 | "source": [ 374 | "## Questions\n", 375 | "\n", 376 | "- Was the patient ventilated during (APACHE) day 1 of their stay?\n", 377 | "- Is the patient recorded as having diabetes?" 378 | ] 379 | }, 380 | { 381 | "cell_type": "markdown", 382 | "metadata": { 383 | "colab_type": "text", 384 | "id": "nrTEkjxqZD2l" 385 | }, 386 | "source": [ 387 | "## The `apachepatientresult` table\n", 388 | "\n", 389 | "The `apachepatientresult` table provides predictions made by the APACHE score (versions IV and IVa), including probability of mortality, length of stay, and ventilation days. 
See: http://eicu-crd.mit.edu/eicutables/apachePatientResult/" 390 | ] 391 | }, 392 | { 393 | "cell_type": "code", 394 | "execution_count": 0, 395 | "metadata": { 396 | "colab": {}, 397 | "colab_type": "code", 398 | "id": "M2RCJNBgZOJ2" 399 | }, 400 | "outputs": [], 401 | "source": [ 402 | "%%bigquery apachepatientresult\n", 403 | "# set the where clause to select the stay of interest\n", 404 | "\n", 405 | "SELECT *\n", 406 | "FROM `physionet-data.eicu_crd_demo.apachepatientresult`\n", 407 | "WHERE patientunitstayid = " 408 | ] 409 | }, 410 | { 411 | "cell_type": "code", 412 | "execution_count": 0, 413 | "metadata": { 414 | "colab": {}, 415 | "colab_type": "code", 416 | "id": "4whVaOP1Za8f" 417 | }, 418 | "outputs": [], 419 | "source": [ 420 | "apachepatientresult" 421 | ] 422 | }, 423 | { 424 | "cell_type": "markdown", 425 | "metadata": { 426 | "colab_type": "text", 427 | "id": "5YO_GQcNZUWR" 428 | }, 429 | "source": [ 430 | "## Questions\n", 431 | "\n", 432 | "- What versions of the APACHE score are computed?\n", 433 | "- How many days during the stay was the patient ventilated?\n", 434 | "- How long was the patient predicted to stay in hospital?\n", 435 | "- Was this prediction close to the truth?"
436 | ] 437 | } 438 | ], 439 | "metadata": { 440 | "colab": { 441 | "collapsed_sections": [], 442 | "include_colab_link": true, 443 | "name": "02-severity-of-illness", 444 | "provenance": [], 445 | "version": "0.3.2" 446 | }, 447 | "kernelspec": { 448 | "display_name": "Python 3", 449 | "language": "python", 450 | "name": "python3" 451 | }, 452 | "language_info": { 453 | "codemirror_mode": { 454 | "name": "ipython", 455 | "version": 3 456 | }, 457 | "file_extension": ".py", 458 | "mimetype": "text/x-python", 459 | "name": "python", 460 | "nbconvert_exporter": "python", 461 | "pygments_lexer": "ipython3", 462 | "version": "3.7.4" 463 | } 464 | }, 465 | "nbformat": 4, 466 | "nbformat_minor": 1 467 | } 468 | -------------------------------------------------------------------------------- /mimic-iii-tutorial.ipynb: -------------------------------------------------------------------------------- 1 | {"cells":[{"cell_type":"markdown","metadata":{"colab_type":"text","id":"6fr_A5J1tVFQ"},"source":["Copyright 2019 Google Inc.\n","\n","Licensed under the Apache License, Version 2.0 (the \"License\"); you may not use\n","this file except in compliance with the License. You may obtain a copy of the\n","License at\n","\n","> https://www.apache.org/licenses/LICENSE-2.0\n","\n","Unless required by applicable law or agreed to in writing, software distributed\n","under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR\n","CONDITIONS OF ANY KIND, either express or implied. See the License for the\n","specific language governing permissions and limitations under the License."]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"QT2sukPDtrWQ"},"source":["# Datathon Tutorial\n","\n","The aim of this tutorial is to get you familiarized with BigQuery to\n","query/filter/aggregate/export data with Python.\n","\n","## Prerequisites\n","\n","You should already have had a valid Gmail account registered with the datathon\n","organizers. 
* If you do not have a Gmail account, you can create one at\n","http://www.gmail.com. You need to notify datathon organizers to register your\n","new account for data access. * If you have not yet signed the data use agreement\n","(DUA) sent by the organizers, please do so immediately to get access to the\n","MIMIC-III dataset."]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"xks2nlrtt-um"},"source":["## Setup\n","\n","To be able to run the queries in this tutorial, you need to create a copy of\n","this Colab notebook by clicking \"File\" > \"Save a copy in Drive...\" menu. You can\n","share your copy with your teammates by clicking on the \"SHARE\" button on the\n","top-right corner of your Colab notebook copy. Everyone with \"Edit\" permission is\n","able to modify the notebook at the same time, so it is a great way for team\n","collaboration. Before running any cell in this colab, please make sure there is\n","a green check mark before \"CONNECTED\" on top right corner, if not, please click\n","\"CONNECTED\" button to connect to a random backend.\n","\n","Now that you have done the initial setup, let us start playing with the data.\n","First, you need to run some initialization code. 
You can run the following cell\n","by clicking on the triangle button when you hover over the [ ] space on the\n","top-left corner of the code cell below."]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{},"colab_type":"code","id":"rS9g4-r7uohe","vscode":{"languageId":"python"}},"outputs":[],"source":["# Import libraries\n","import numpy as np\n","import os\n","import pandas as pd\n","import matplotlib.pyplot as plt\n","import matplotlib.patches as patches\n","import matplotlib.path as path\n","import tensorflow as tf\n","\n","# Below imports are used to print out pretty pandas dataframes\n","from IPython.display import display, HTML\n","\n","# Imports for accessing Datathon data using Google BigQuery.\n","from google.colab import auth\n","from google.cloud import bigquery"]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"rlP2b-PKvUMk"},"source":["Before running any queries using BigQuery, you need to first authenticate\n","yourself by running the following cell. If you are running it for the first\n","time, it will ask you to follow a link to log in using your Gmail account, and\n","accept the data access requests to your profile. Once this is done, it will\n","generate a string of verification code, which you should paste back to the cell\n","below and press enter."]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{},"colab_type":"code","id":"HDL3CjUKvddl","vscode":{"languageId":"python"}},"outputs":[],"source":["auth.authenticate_user()"]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"0qezUBvxH7_6"},"source":["The data-hosting project `physionet-data` has read-only access, as a result, you\n","need to set a default project that you have BigQuery access to. 
A shared project\n","should be created by the event organizers, and we will be using it throughout\n","this tutorial.\n","\n","Note that during the datathon, all participants will be divided into teams and a\n","Google Cloud project will be created for each team specifically. That project\n","would be the preferred project to use. For now we'll stick with the shared\n","project for the purpose of the tutorial.\n","\n","After the datathon is finished, the shared project may be locked down or\n","deleted. It will still be possible to run queries from a project you own personally as\n","long as you have access to the dataset hosting project.\n","\n","**Change the variable project_id below to the project you are using.**"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{},"colab_type":"code","id":"nx4ZDlJ6we9j","vscode":{"languageId":"python"}},"outputs":[],"source":["# Note that this should be the project for the datathon work,\n","# not the physionet-data project which is for data hosting.\n","project_id = 'hack-aotearoa'\n","os.environ['GOOGLE_CLOUD_PROJECT'] = project_id"]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"ZondovXiw-zq"},"source":["Let's define a helper function to wrap the BigQuery call, so that we don't have to\n","write the configuration again and again."]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{},"colab_type":"code","id":"DIbbCE3YxLdM","vscode":{"languageId":"python"}},"outputs":[],"source":["# Read data from BigQuery into pandas dataframes.\n","def run_query(query):\n"," return pd.io.gbq.read_gbq(\n"," query,\n"," project_id=project_id,\n"," dialect='standard')"]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"VDfGS-0VxjpC"},"source":["OK, that's it for setup, now let's get our hands on the MIMIC demo data!"]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"7wh4aG6fRubd"},"source":["## Analysis\n","\n","Let's now run some queries adapted from 
the\n","[MIMIC cohort selection tutorial](https://github.com/MIT-LCP/mimic-code/blob/master/tutorials/cohort-selection.ipynb).\n","\n","First, let's run the following query to produce the data for a histogram\n","showing the distribution of patient ages in ten-year buckets (i.e. [0,\n","10), [10, 20), ..., [90, ∞))."]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{},"colab_type":"code","id":"0BKEaZ0mAS_a","vscode":{"languageId":"python"}},"outputs":[],"source":["df = run_query(\"\"\"\n","WITH ps AS (\n"," SELECT\n"," icu.subject_id,\n"," icu.hadm_id,\n"," icu.icustay_id,\n"," pat.dob,\n"," DATETIME_DIFF(icu.outtime, icu.intime, DAY) AS icu_length_of_stay,\n"," DATE_DIFF(DATE(icu.intime), DATE(pat.dob), YEAR) AS age\n"," FROM `physionet-data.mimiciii_demo.icustays` AS icu\n"," INNER JOIN `physionet-data.mimiciii_demo.patients` AS pat\n"," ON icu.subject_id = pat.subject_id),\n","bu AS (\n"," SELECT\n"," CAST(FLOOR(age / 10) AS INT64) AS bucket\n"," FROM ps)\n","SELECT\n"," COUNT(bucket) AS num_icu_stays,\n"," IF(bucket >= 9, \">= 90\", FORMAT(\"%d - %d\", bucket * 10, (bucket + 1) * 10)) AS age_bucket\n","FROM bu\n","GROUP BY bucket\n","ORDER BY bucket ASC\n","\"\"\")\n","\n","df.set_index('age_bucket').plot(title='stay - age', kind='bar', legend=False)"]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"v1IquJ4LTQzi"},"source":["The query consists of 3 parts:\n","\n","1. First we join the `icustays` and `patients` tables to compute the length of\n"," each ICU stay in days and the patient's age at admission, saved in a temporary table `ps`;\n","2. Next we assign each stay to a bucket based on the patient's age at the time of\n"," ICU admission, in the `bu` table;\n","3. 
Finally, the stays in each bucket are counted, returning only the columns required\n"," to plot the chart, i.e. `age_bucket` and `num_icu_stays`."]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"QenZBv-rxYEm"},"source":["**Note**: If you are having a hard time following the queries in this colab, or\n","you want to know more about the table structures of the MIMIC-III dataset, please\n","consult\n","[our colab for a previous Datathon held in Sydney](../../anzics18/tutorial.ipynb)."]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"3TVRWd6JAYHS"},"source":["Now let's see whether there is a correlation between age and length of stay in\n","hours. Because we use the age of the patient at each admission, we\n","don't need to worry about patients with multiple admissions. Note that we treat\n","the redacted ages (> 90) as noise and filter them out."]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{},"colab_type":"code","id":"Me7i5Z5pAZ4s","vscode":{"languageId":"python"}},"outputs":[],"source":["df = run_query(\"\"\"\n","WITH re AS (\n","SELECT\n"," DATETIME_DIFF(icu.outtime, icu.intime, HOUR) AS icu_length_of_stay,\n"," DATE_DIFF(DATE(icu.intime), DATE(pat.dob), YEAR) AS age\n","FROM `physionet-data.mimiciii_demo.icustays` AS icu\n","INNER JOIN `physionet-data.mimiciii_demo.patients` AS pat\n"," ON icu.subject_id = pat.subject_id)\n","SELECT\n"," icu_length_of_stay AS stay,\n"," age\n","FROM re\n","WHERE age < 100\n","\"\"\")\n","\n","df.plot(kind='scatter', x='age', y='stay')"]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"W3l1HyDeBVvW"},"source":["Let's take a look at another query, which uses a filter we often need:\n","the service that the ICU patient is currently under."]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{},"colab_type":"code","id":"iIw3ykjHOY-Y","vscode":{"languageId":"python"}},"outputs":[],"source":["df = run_query(\"\"\"\n","WITH co AS (\n"," SELECT\n"," 
icu.subject_id,\n"," icu.hadm_id,\n"," icu.icustay_id,\n"," pat.dob,\n"," DATETIME_DIFF(icu.outtime, icu.intime, DAY) AS icu_length_of_stay,\n"," DATE_DIFF(DATE(icu.intime), DATE(pat.dob), YEAR) AS age,\n"," RANK() OVER (PARTITION BY icu.subject_id ORDER BY icu.intime) AS icustay_id_order\n"," FROM `physionet-data.mimiciii_demo.icustays` AS icu\n"," INNER JOIN `physionet-data.mimiciii_demo.patients` AS pat\n"," ON icu.subject_id = pat.subject_id\n"," ORDER BY hadm_id DESC),\n","serv AS (\n"," SELECT\n"," icu.hadm_id,\n"," icu.icustay_id,\n"," se.curr_service,\n"," IF(curr_service like '%SURG' OR curr_service = 'ORTHO', 1, 0) AS surgical,\n"," RANK() OVER (PARTITION BY icu.hadm_id ORDER BY se.transfertime DESC) as rank\n"," FROM `physionet-data.mimiciii_demo.icustays` AS icu\n"," LEFT JOIN `physionet-data.mimiciii_demo.services` AS se\n"," ON icu.hadm_id = se.hadm_id\n"," AND se.transfertime < DATETIME_ADD(icu.intime, INTERVAL 12 HOUR)\n"," ORDER BY icustay_id)\n","SELECT\n"," co.subject_id,\n"," co.hadm_id,\n"," co.icustay_id,\n"," co.icu_length_of_stay,\n"," co.age,\n"," IF(co.icu_length_of_stay < 2, 1, 0) AS short_stay,\n"," IF(co.icustay_id_order = 1, 0, 1) AS first_stay,\n"," IF(serv.surgical = 1, 1, 0) AS surgical\n","FROM co\n","LEFT JOIN serv USING (icustay_id, hadm_id)\n","WHERE\n"," serv.rank = 1 AND age < 100\n","ORDER BY subject_id, icustay_id_order\n","\"\"\")\n","\n","print(f'Number of rows in dataframe: {len(df)}')\n","df.head()"]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"d4nxxe69VDqH"},"source":["This is a long query, but is pretty simple if we take a closer look. It consists\n","of 3 steps as well:\n","\n","1. We are trying to know how many ICU admissions each patient has by joining\n"," `icustays` and `patients`. Note that since each patient may be admitted\n"," multiple times, we usually filter out follow-up ICU stays, and only keep the\n"," first one to minimize unwanted data correlation. 
This is achieved by\n"," partitioning over `subject_id`, ordering by admission time, and ranking\n"," the stays with the `RANK` function; the result is saved to a temporary\n"," table `co`;\n","2. Next we look up the services recorded for each ICU stay, and add a label\n"," indicating whether the most recent service (up to 12 hours after ICU admission) was\n"," surgical; the result is saved to `serv`;\n","3. Lastly, we join the two tables, `co` and `serv`, to attach the surgical\n"," label to the cohort table. For the\n"," convenience of later analysis, we rename some columns, and filter out\n"," patients aged 100 or over."]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"Fi3isyuRSgmg"},"source":["## Useful Tips"]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"ycbTUnFEY_3H"},"source":["### Working with DATETIME\n","\n","The times in the tables are stored as DATETIME values. Comparison operators\n","like `<`, `=`, or `>` work on them directly (the query above compares\n","`se.transfertime` with a shifted `icu.intime`), but to shift a time by an\n","interval or to measure the gap between two times you need functions.\n","\n","* Use the\n"," [DATETIME functions](https://cloud.google.com/bigquery/docs/reference/standard-sql/datetime_functions)\n"," in BigQuery. An example would be if you were trying to find things within 1\n"," hour of another event. In that case, you could use the native\n"," `DATETIME_SUB()` function. In the example below, we are looking for stays of\n"," at most 1 hour (where the admit time is no more than 1 hour before the\n"," discharge time).\n","\n","> ```\n","> [...] 
WHERE ADMITTIME BETWEEN DATETIME_SUB(DISCHTIME, INTERVAL 1 HOUR) AND DISCHTIME\n","> ```\n","\n","* If you are more comfortable working with timestamps, you can cast the\n"," DATETIME object to a TIMESTAMP object and then use the\n"," [TIMESTAMP functions](https://cloud.google.com/bigquery/docs/reference/standard-sql/timestamp_functions)."]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"OTYmwNrmZEN2"},"source":["### Input / Output Options\n","\n","There are a few cases where you may want to work with files outside of BigQuery.\n","Examples include importing your own custom Python library or saving a dataframe.\n","[This tutorial](https://colab.research.google.com/notebooks/io.ipynb) covers\n","importing and exporting from local filesystem, Google Drive, Google Sheets, and\n","Google Cloud Storage."]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"t7IRBm4EBau8"},"source":["## ML Model Training\n","\n","Next we will show an example of using [Tensorflow](https://www.tensorflow.org/)\n","([getting started doc](https://www.tensorflow.org/get_started/)) to build a\n","simple predictor, where we use the patient's age and whether it is the first ICU\n","stay to predict whether the ICU stay will be a short one. 
With only 127 data\n","points in total, we don't expect to actually build an accurate or useful\n","predictor, but it should serve the purpose of showing how a model can be trained\n","and used with TensorFlow within Colab.\n","\n","First, let us shuffle the 127 data points and split them into a training set with 100 records and\n","a validation set with 27, and examine the distribution of the split sets to make\n","sure that the distribution is similar."]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{},"colab_type":"code","id":"QPP17fQL3LDa","vscode":{"languageId":"python"}},"outputs":[],"source":["data = df[['age', 'first_stay', 'short_stay']]\n","# Shuffle the rows; reindex returns a new dataframe, so assign it back.\n","data = data.reindex(np.random.permutation(data.index))\n","training_df = data.head(100)\n","validation_df = data.tail(27)\n","\n","print('Training data summary:')\n","display(training_df.describe())\n","\n","print('Validation data summary:')\n","display(validation_df.describe())"]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"v6uU-mRh3PAS"},"source":["Let's also quickly check how the label is distributed across the features."]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{},"colab_type":"code","id":"KkfdyF7K3Q3z","vscode":{"languageId":"python"}},"outputs":[],"source":["display(training_df.groupby(['short_stay', 'first_stay']).count())\n","\n","fig, ax = plt.subplots()\n","shorts = training_df[training_df.short_stay == 1].age\n","longs = training_df[training_df.short_stay == 0].age\n","colors = ['b', 'g']\n","ax.hist([shorts, longs],\n"," bins=10,\n"," color=colors,\n"," label=['short_stay=1', 'short_stay=0'])\n","ax.set_xlabel('Age')\n","ax.set_ylabel('Number of Patients')\n","plt.legend(loc='upper left')\n","plt.show()"]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"MrpVGVYx3S7_"},"source":["Let's first build a linear regression model to predict the numeric value of\n","\"short_stay\" based on the age and first_stay features. 
You can tune the parameters\n","on the right-hand side and observe differences in the evaluation result."]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{},"colab_type":"code","id":"R-7VB9pc3Vll","vscode":{"languageId":"python"}},"outputs":[],"source":["#@title Linear Regression Parameters {display-mode:\"both\"}\n","BATCH_SIZE = 5 # @param\n","NUM_EPOCHS = 100 # @param\n","\n","first_stay = tf.feature_column.numeric_column('first_stay')\n","age = tf.feature_column.numeric_column('age')\n","\n","# Build linear regressor\n","linear_regressor = tf.estimator.LinearRegressor(\n"," feature_columns=[first_stay, age])\n","\n","# Train the Model.\n","model = linear_regressor.train(\n"," input_fn=tf.compat.v1.estimator.inputs.pandas_input_fn(\n"," x=training_df,\n"," y=training_df['short_stay'],\n"," num_epochs=NUM_EPOCHS,\n"," batch_size=BATCH_SIZE,\n"," shuffle=True),\n"," steps=100)\n","\n","# Evaluate the model.\n","eval_result = linear_regressor.evaluate(\n"," input_fn=tf.compat.v1.estimator.inputs.pandas_input_fn(\n"," x=validation_df,\n"," y=validation_df['short_stay'],\n"," batch_size=BATCH_SIZE,\n"," shuffle=False))\n","\n","display(eval_result)"]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"8SsUXasz3YUS"},"source":["Remember that the label `short_stay` is actually categorical, with the value 1\n","for an ICU stay of 1 day or less and value 0 for stays of length 2 days or more.\n","So a classification model better fits this task. Here we try a deep neural\n","network model using the `DNNClassifier` estimator. 
Notice how few changes are needed\n","from the regression code above."]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{},"colab_type":"code","id":"Ie7BzB_f3aFk","vscode":{"languageId":"python"}},"outputs":[],"source":["#@title ML Training example {display-mode:\"both\"}\n","BATCH_SIZE = 5 # @param\n","NUM_EPOCHS = 100 # @param\n","HIDDEN_UNITS = [10, 10] # @param\n","\n","# Build DNN classifier\n","classifier = tf.estimator.DNNClassifier(\n"," feature_columns=[first_stay, age], hidden_units=HIDDEN_UNITS)\n","\n","# Train the Model.\n","model = classifier.train(\n"," input_fn=tf.compat.v1.estimator.inputs.pandas_input_fn(\n"," x=training_df,\n"," y=training_df['short_stay'],\n"," num_epochs=NUM_EPOCHS,\n"," batch_size=BATCH_SIZE,\n"," shuffle=True),\n"," steps=100)\n","\n","# Evaluate the model.\n","eval_result = classifier.evaluate(\n"," input_fn=tf.compat.v1.estimator.inputs.pandas_input_fn(\n"," x=validation_df,\n"," y=validation_df['short_stay'],\n"," batch_size=BATCH_SIZE,\n"," shuffle=False))\n","\n","display(eval_result)"]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"dtoC3w63BcIV"},"source":["## Closing\n","\n","Congratulations! You have finished this datathon tutorial and are ready to\n","explore the real data by querying Google BigQuery. To do so, simply use\n","`mimiciii_clinical` as the dataset name. For example, the table\n","`mimiciii_demo.icustays` becomes `mimiciii_clinical.icustays` when you need the\n","actual MIMIC data. 
One thing to note, though: it is highly recommended to\n","aggregate data aggressively wherever possible, because large dataframes may\n","slow Colab down drastically or even cause out-of-memory errors.\n","\n","Now, let's do the substitution, and start the real datathon exploration."]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"kTRDaXL1a2TZ"},"source":["## Troubleshooting\n","\n","Below are some tips for troubleshooting the most frequently seen issues."]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"jVIVlrKvbGPI"},"source":["### Common Errors\n","\n","* **Error after authenticating while trying to run a query**\n","\n","```\n","ERROR:root:An unexpected error occurred while tokenizing input\n","The following traceback may be corrupted or invalid\n","The error message is: ('EOF in multi-line string', (1, 0))\n","```\n","\n","> If you try to run a query and see this error message, scroll to the bottom of\n","> the error text. The very last row of the error will show the specific error\n","> message, which is usually related to having the wrong project_id or not having\n","> access to the project/dataset.\n","\n","* **Colab has stopped working, is running slowly, or the top right no longer\n"," has a green check mark saying \"Connected\", but shows 3 dots and says\n"," \"Busy\"**\n","\n","> Reset the runtime to reinitialize it. Note that this will clear any local\n","> variables or uploaded files. 
Do this by clicking the `Runtime` menu at the\n","> top, then `Reset all runtimes`"]}],"metadata":{"colab":{"collapsed_sections":[],"name":"MIMIC-III AarhusCritical 2019 Tutorial","provenance":[{"file_id":"https://github.com/GoogleCloudPlatform/healthcare/blob/master/datathon/mimic_eicu/tutorials/bigquery_tutorial.ipynb","timestamp":1565343265814},{"file_id":"1feOtwLH7t-lHuKvDlQ0iR61rDMVvf7Wj","timestamp":1527116244197},{"file_id":"16EHw62feMPU-FI-HOhtS1y3zQjRmXrPV","timestamp":1527113933826}],"toc_visible":true,"version":"0.3.2"},"kernelspec":{"display_name":"Python 3","name":"python3"}},"nbformat":4,"nbformat_minor":0} 2 | --------------------------------------------------------------------------------