├── tutorials ├── .DS_Store ├── mimic-iii │ ├── mimic-iii-los.Rmd │ └── mimic-iii-tutorial.ipynb ├── eicu │ ├── 04-summary-statistics.ipynb │ ├── 01-accessing-the-data.ipynb │ ├── 03-severity-of-illness.ipynb │ ├── 02-explore-patients.ipynb │ └── 05-prediction.ipynb └── mimic-cxr │ └── mimic-cxr-train.ipynb ├── .gitignore ├── README.md └── LICENSE /tutorials/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MIT-LCP/2019-hst-953/master/tutorials/.DS_Store -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | *.egg-info/ 24 | .installed.cfg 25 | *.egg 26 | MANIFEST 27 | 28 | # PyInstaller 29 | # Usually these files are written by a python script from a template 30 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 31 | *.manifest 32 | *.spec 33 | 34 | # Installer logs 35 | pip-log.txt 36 | pip-delete-this-directory.txt 37 | 38 | # Unit test / coverage reports 39 | htmlcov/ 40 | .tox/ 41 | .coverage 42 | .coverage.* 43 | .cache 44 | nosetests.xml 45 | coverage.xml 46 | *.cover 47 | .hypothesis/ 48 | .pytest_cache/ 49 | 50 | # Translations 51 | *.mo 52 | *.pot 53 | 54 | # Django stuff: 55 | *.log 56 | local_settings.py 57 | db.sqlite3 58 | 59 | # Flask stuff: 60 | instance/ 61 | .webassets-cache 62 | 63 | # Scrapy stuff: 64 | .scrapy 65 | 66 | # Sphinx documentation 67 | docs/_build/ 68 | 69 | # PyBuilder 70 | target/ 71 | 72 | # Jupyter Notebook 73 | .ipynb_checkpoints 74 | 75 | # pyenv 76 | .python-version 77 | 78 | # celery beat schedule file 79 | celerybeat-schedule 80 | 81 | # SageMath parsed files 82 | *.sage.py 83 | 84 | # Environments 85 | .env 86 | .venv 87 | env/ 88 | venv/ 89 | ENV/ 90 | env.bak/ 91 | venv.bak/ 92 | 93 | # Spyder project settings 94 | .spyderproject 95 | .spyproject 96 | 97 | # Rope project settings 98 | .ropeproject 99 | 100 | # mkdocs documentation 101 | /site 102 | 103 | # mypy 104 | .mypy_cache/ 105 | 106 | # MacOS 107 | .DS_Store -------------------------------------------------------------------------------- /tutorials/mimic-iii/mimic-iii-los.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Length of stay in the ICU" 3 | author: "tom pollard" 4 | description: "Length of stay in the ICU for patients in MIMIC-III" 5 | output: pdf_document 6 | date: "10/10/2017" 7 | --- 8 | 9 | ```{r setup, include = FALSE} 10 | knitr::opts_chunk$set(echo = TRUE) 11 | 12 | # install.packages("ggplot2") 13 | # install.packages("bigrquery") 14 | 15 | library("ggplot2") 16 | library("bigrquery") 17 | ``` 18 | 19 | 20 | ```{r dbconnect, include=FALSE} 21 | # Load configuration settings 22 | project_id <- "aarhus-critical-2019-team" 23 | options(httr_oauth_cache=TRUE) 24 | 25 | run_query <- function(query){ 26 | data <- query_exec(query, project=project_id, use_legacy_sql = FALSE) 27 | return(data) 28 | } 29 | ``` 30 | 31 | 32 | ```{r load_data, include=FALSE} 33 | sql_query <- "SELECT i.subject_id, i.hadm_id, i.los 34 | FROM `physionet-data.mimiciii_demo.icustays` i;" 35 | 36 | data <- 
run_query(sql_query) 37 | 38 | head(data) 39 | ``` 40 | 41 | This document shows how RMarkdown can be used to create a reproducible analysis using MIMIC-III (version 1.4). Let's calculate the median length of stay in the ICU and then include this value in our document. 42 | 43 | ```{r calculate_mean_los, include=FALSE} 44 | avg_los <- median(data$los, na.rm=TRUE) 45 | rounded_avg_los <- round(avg_los, digits = 2) 46 | ``` 47 | 48 | So the median length of stay in the ICU is `r avg_los` days. Rounded to two decimal places, this is `r rounded_avg_los` days. We can plot the distribution of length of stay using the qplot function: 49 | 50 | 51 | ```{r plot_los, echo=FALSE, include=TRUE, warning = FALSE} 52 | qplot(data$los, geom="histogram",xlim=c(0,25), binwidth = 1, 53 | xlab = "Length of stay in the ICU, days.",fill=I("#FF9999"), col=I("white")) 54 | ``` 55 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # hst-953-2019 2 | 3 | ## Data Access 4 | 5 | Participants of the Datathon are added to the Google Group `hst-953-2019`. Members of this group have access to computation resources through the Google Cloud Platform (GCP) project `hst-953-2019` and to datasets via the GCP project `physionet-data`. 6 | 7 | You can access the datasets directly on GCP: 8 | 1. [Create a free account on GCP](cloud.google.com) 9 | 2. [Open `hst-953-2019` in the console](https://console.cloud.google.com/home/dashboard?project=hst-953-2019) 10 | 3. [Open BigQuery](https://console.cloud.google.com/bigquery?project=hst-953-2019) 11 | 4. Run a query. E.g. count the number of hospital admissions: 12 | 13 | ```SQL 14 | SELECT count(*) 15 | FROM `physionet-data.mimiciii_clinical.admissions` 16 | ``` 17 | 18 | ## Tutorial Notebooks 19 | 20 | You can open the following tutorial notebooks on Colab and get started instantly. Requirements for these notebooks are: (1) you have a Google account, and (2) your Google account has been added to the `hst-953-2019` Google group. 21 | 22 | ### eICU-CRD 23 | * [01-accessing-the-data.ipynb](https://colab.research.google.com/github/MIT-LCP/2019-hst-953/blob/master/tutorials/eicu/01-accessing-the-data.ipynb) 24 | * [02-explore-patients.ipynb](https://colab.research.google.com/github/MIT-LCP/2019-hst-953/blob/master/tutorials/eicu/02-explore-patients.ipynb) 25 | * [03-severity-of-illness.ipynb](https://colab.research.google.com/github/MIT-LCP/2019-hst-953/blob/master/tutorials/eicu/03-severity-of-illness.ipynb) 26 | * [04-summary-statistics.ipynb](https://colab.research.google.com/github/MIT-LCP/2019-hst-953/blob/master/tutorials/eicu/04-summary-statistics.ipynb) 27 | * [05-prediction.ipynb](https://colab.research.google.com/github/MIT-LCP/2019-hst-953/blob/master/tutorials/eicu/05-prediction.ipynb) 28 | 29 | ### MIMIC-III 30 | * [mimic-iii-tutorial.ipynb](https://colab.research.google.com/github/MIT-LCP/2019-hst-953/blob/master/tutorials/mimic-iii/mimic-iii-tutorial.ipynb) 31 | 32 | ### MIMIC-CXR 33 | * [mimic-cxr-train.ipynb](https://colab.research.google.com/github/MIT-LCP/2019-hst-953/blob/master/tutorials/mimic-cxr/mimic-cxr-train.ipynb) 34 | 35 | ## R 36 | 37 | Datasets can also be queried directly from R. 
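For example, a minimal query against the MIMIC-III demo using the `bigrquery` package might look like the sketch below. It simply mirrors the query pattern used in the R Markdown notebook linked underneath; the project ID shown is a placeholder for whichever GCP project you are billing queries to.

```r
# Minimal sketch, mirroring the query pattern in the notebook linked below.
# Replace the project ID with the GCP project you are working in.
library(bigrquery)

project_id <- "hst-953-2019"

sql <- "SELECT subject_id, hadm_id, los
        FROM `physionet-data.mimiciii_demo.icustays`"

# use_legacy_sql = FALSE selects standard SQL, matching the queries above
data <- query_exec(sql, project = project_id, use_legacy_sql = FALSE)
head(data)
```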
This is exemplified in the following R markdown notebook: [mimic-iii-los.Rmd](https://colab.research.google.com/github/MIT-LCP/2019-hst-953/blob/master/tutorials/mimic-iii/mimic-iii-los.Rmd) 38 | 39 | ## Accessing MIMIC-CXR 40 | 41 | The dataset used in the MIMIC-CXR tutorial is preprocessed for optimal use with [Tensorflow](https://www.tensorflow.org/). If you use a different library or just want a simpler representation of the data, the MIMIC-CXR dataset is available as JPEG images and CSV tables in the following GCP bucket: `gs://physionet-data-mimic-cxr-jpg`. The URL for the bucket is 42 | -------------------------------------------------------------------------------- /tutorials/eicu/04-summary-statistics.ipynb: -------------------------------------------------------------------------------- 1 | {"nbformat":4,"nbformat_minor":0,"metadata":{"colab":{"name":"04-summary-statistics","version":"0.3.2","provenance":[{"file_id":"https://colab.research.google.com/github/MIT-LCP/2019-hst-953/blob/master/tutorials/eicu/04-summary-statistics.ipynb","timestamp":1566463494741}],"collapsed_sections":[]},"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.7.3"}},"cells":[{"cell_type":"markdown","metadata":{"colab_type":"text","id":"1G_TVh1ybQkl"},"source":["# eICU Collaborative Research Database\n","\n","# Notebook 4: Summary statistics\n","\n","This notebook shows how summary statistics can be computed for a patient cohort using the `tableone` package. Usage instructions for tableone are at: https://pypi.org/project/tableone/\n"]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"L9XF77F2bnee"},"source":["## Load libraries and connect to the database"]},{"cell_type":"code","metadata":{"colab_type":"code","id":"wXiSE558bn_w","colab":{}},"source":["# Import libraries\n","import numpy as np\n","import os\n","import pandas as pd\n","import matplotlib.pyplot as plt\n","import matplotlib.patches as patches\n","import matplotlib.path as path\n","\n","# Make pandas dataframes prettier\n","from IPython.display import display, HTML\n","\n","# Access data using Google BigQuery.\n","from google.colab import auth\n","from google.cloud import bigquery"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"colab_type":"code","id":"pLGnLAy-bsKb","colab":{}},"source":["# authenticate\n","auth.authenticate_user()"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"colab_type":"code","id":"PUjFDFdobszs","colab":{}},"source":["# Set up environment variables\n","project_id='2019-hst-953'\n","os.environ[\"GOOGLE_CLOUD_PROJECT\"]=project_id"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"colab_type":"code","id":"bkJUF8HBbvWe","colab":{}},"source":["# Helper function to read data from BigQuery into a DataFrame.\n","def run_query(query):\n"," return pd.io.gbq.read_gbq(query, project_id=project_id, dialect=\"standard\")"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"iWDUCA5Nb5BK"},"source":["## Install and load the `tableone` package\n","\n","The tableone package can be used to compute summary statistics for a patient cohort. 
Unlike the previous packages, it is not installed by default in Colab, so will need to install it first."]},{"cell_type":"code","metadata":{"colab_type":"code","id":"F9doCgtscOJd","colab":{}},"source":["!pip install tableone"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"colab_type":"code","id":"SDI_Q7W0b4Le","colab":{}},"source":["# Import the tableone class\n","from tableone import TableOne"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"14TU4lcrdD7I"},"source":["## Load the patient cohort\n","\n","In this example, we will load all data from the patient data, and link it to APACHE data to provide richer summary information."]},{"cell_type":"code","metadata":{"colab_type":"code","id":"HF5WF5EObwfw","colab":{}},"source":["# Link the patient and apachepatientresult tables on patientunitstayid\n","# using an inner join.\n","query = \"\"\"\n","SELECT p.unitadmitsource, p.gender, p.age, p.ethnicity, p.admissionweight, \n"," p.unittype, p.unitstaytype, a.acutephysiologyscore,\n"," a.apachescore, a.actualiculos, a.actualhospitalmortality,\n"," a.unabridgedunitlos, a.unabridgedhosplos\n","FROM `physionet-data.eicu_crd_demo.patient` p\n","INNER JOIN `physionet-data.eicu_crd_demo.apachepatientresult` a\n","ON p.patientunitstayid = a.patientunitstayid\n","WHERE apacheversion LIKE 'IVa'\n","\"\"\"\n","\n","cohort = run_query(query)"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"colab_type":"code","id":"k3hURHFihHNA","colab":{}},"source":["cohort.head()"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"qnG8dVb2iHSn"},"source":["## Calculate summary statistics\n","\n","Before summarizing the data, we will need to convert the ages to numerical values."]},{"cell_type":"code","metadata":{"colab_type":"code","id":"oKHpqwAPkx6U","colab":{}},"source":["cohort['agenum'] = pd.to_numeric(cohort['age'], errors='coerce')"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"colab_type":"code","id":"FQT-u8EXhXRG","colab":{}},"source":["columns = ['unitadmitsource', 'gender', 'agenum', 'ethnicity',\n"," 'admissionweight','unittype','unitstaytype',\n"," 'acutephysiologyscore','apachescore','actualiculos',\n"," 'unabridgedunitlos','unabridgedhosplos']"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"colab_type":"code","id":"3ETr3NCzielL","colab":{}},"source":["TableOne(cohort, columns=columns, labels={'agenum': 'age'}, \n"," groupby='actualhospitalmortality',\n"," label_suffix=True, limit=4)"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"LCBcpJ9bZpDp"},"source":["## Questions\n","\n","- Are the severity of illness measures higher in the survival or non-survival group?\n","- What issues suggest that some of the summary statistics might be misleading?\n","- How might you address these issues?"]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"2_8z1CIVahWg"},"source":["## Visualizing the data\n","\n","Plotting the distribution of each variable by group level via histograms, kernel density estimates and boxplots is a crucial component to data analysis pipelines. Vizualisation is often is the only way to detect problematic variables in many real-life scenarios. 
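As one illustration (a sketch added here for clarity, not part of the original notebook), the cohort dataframe loaded above could be split by hospital mortality and examined with a grouped boxplot; the column names come from the query earlier in this notebook.

```python
# Illustrative sketch: boxplot of the acute physiology score, grouped by
# hospital mortality, using the `cohort` dataframe loaded above.
cohort.boxplot(column='acutephysiologyscore',
               by='actualhospitalmortality', figsize=[8, 6])
plt.ylabel('Acute physiology score')
plt.show()
```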
We'll review a couple of the variables."]},{"cell_type":"code","metadata":{"colab_type":"code","id":"81yp2bSUigzh","colab":{}},"source":["# Plot distributions to review possible multimodality\n","cohort[['acutephysiologyscore','agenum']].dropna().plot.kde(figsize=[12,8])\n","plt.legend(['APS Score', 'Age (years)'])\n","plt.xlim([-30,250])"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"kZDUZB5sdhhU"},"source":["## Questions\n","\n","- Do the plots change your view on how these variable should be reported?"]}]} -------------------------------------------------------------------------------- /tutorials/eicu/01-accessing-the-data.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "name": "01-accessing-the-data.ipynb", 7 | "version": "0.3.2", 8 | "provenance": [], 9 | "collapsed_sections": [] 10 | }, 11 | "kernelspec": { 12 | "display_name": "Python 3", 13 | "language": "python", 14 | "name": "python3" 15 | }, 16 | "language_info": { 17 | "codemirror_mode": { 18 | "name": "ipython", 19 | "version": 3 20 | }, 21 | "file_extension": ".py", 22 | "mimetype": "text/x-python", 23 | "name": "python", 24 | "nbconvert_exporter": "python", 25 | "pygments_lexer": "ipython3", 26 | "version": "3.7.3" 27 | } 28 | }, 29 | "cells": [ 30 | { 31 | "cell_type": "markdown", 32 | "metadata": { 33 | "colab_type": "text", 34 | "id": "wIzJcXV1t-mr" 35 | }, 36 | "source": [ 37 | "# eICU Collaborative Research Database\n", 38 | "\n", 39 | "# Notebook 1: Accessing the data\n", 40 | "\n", 41 | "The aim of this notebook is to get set up with access to a demo version of the [eICU Collaborative Research Database](http://eicu-crd.mit.edu/). The demo is a subset of the full database, limited to 100 patients." 42 | ] 43 | }, 44 | { 45 | "cell_type": "markdown", 46 | "metadata": { 47 | "colab_type": "text", 48 | "id": "MSr3SoebuvwM" 49 | }, 50 | "source": [ 51 | "## Prerequisites\n", 52 | "\n", 53 | "- If you do not have a Gmail account, please create one at http://www.gmail.com. \n", 54 | "- If you have not yet signed the data use agreement (DUA) sent by the organizers, please do so now to get access to the dataset." 55 | ] 56 | }, 57 | { 58 | "cell_type": "markdown", 59 | "metadata": { 60 | "colab_type": "text", 61 | "id": "wgyxL8owvPvA" 62 | }, 63 | "source": [ 64 | "## Setup\n", 65 | "\n", 66 | "To run the queries in this notebook, you will need to create a copy by clicking *File\" > \"Save a copy in Drive...\"* from the menu. Before running a cell in the notebook, check for the green \"CONNECTED\" check mark in top right corner.\n", 67 | "\n", 68 | "Next we'll start playing with the data. First, you need to run some initialization code. You can run the following cell by clicking on the triangle button when you hover over the [ ] space on the top-left corner of the code cell below." 
69 | ] 70 | }, 71 | { 72 | "cell_type": "code", 73 | "metadata": { 74 | "colab_type": "code", 75 | "id": "tc8ew3HzwwCi", 76 | "colab": {} 77 | }, 78 | "source": [ 79 | "# Import libraries\n", 80 | "import numpy as np\n", 81 | "import os\n", 82 | "import pandas as pd\n", 83 | "import matplotlib.pyplot as plt\n", 84 | "import matplotlib.patches as patches\n", 85 | "import matplotlib.path as path\n", 86 | "\n", 87 | "# Make pandas dataframes prettier\n", 88 | "from IPython.display import display, HTML\n", 89 | "\n", 90 | "# Access data using Google BigQuery.\n", 91 | "from google.colab import auth\n", 92 | "from google.cloud import bigquery" 93 | ], 94 | "execution_count": 0, 95 | "outputs": [] 96 | }, 97 | { 98 | "cell_type": "markdown", 99 | "metadata": { 100 | "colab_type": "text", 101 | "id": "BJqeFUJlxmCj" 102 | }, 103 | "source": [ 104 | "Before running any queries, you need to first authenticate yourself by running the following cell. If you are running it for the first time, it will ask you to follow a link to log in using your Gmail account, and accept the data access requests to your profile. Once this is done, it will generate a string of verification code, which you should paste back to the cell below and press enter.\n" 105 | ] 106 | }, 107 | { 108 | "cell_type": "code", 109 | "metadata": { 110 | "colab_type": "code", 111 | "id": "IK7lJh-hxSRQ", 112 | "colab": {} 113 | }, 114 | "source": [ 115 | "auth.authenticate_user()" 116 | ], 117 | "execution_count": 0, 118 | "outputs": [] 119 | }, 120 | { 121 | "cell_type": "markdown", 122 | "metadata": { 123 | "colab_type": "text", 124 | "id": "jjRRnhuTx6yx" 125 | }, 126 | "source": [ 127 | "## Querying the dataset\n", 128 | "\n", 129 | "Now we are ready to load the data from the cloud server. The data-hosting project `physionet-data` allows you read-only access to the eICU Collaborative Research Database demo (as well as the complete eICU database, the MIMIC-III database and the MIMIC-CXR database). Let's see which datasets are available in this project. 
" 130 | ] 131 | }, 132 | { 133 | "cell_type": "code", 134 | "metadata": { 135 | "colab_type": "code", 136 | "id": "wLau3cWFxuYC", 137 | "colab": {} 138 | }, 139 | "source": [ 140 | "project_id='hst-953-2019'\n", 141 | "os.environ[\"GOOGLE_CLOUD_PROJECT\"]=project_id" 142 | ], 143 | "execution_count": 0, 144 | "outputs": [] 145 | }, 146 | { 147 | "cell_type": "code", 148 | "metadata": { 149 | "colab_type": "code", 150 | "id": "SE5Dc2tpztSq", 151 | "colab": {} 152 | }, 153 | "source": [ 154 | "# create a connection to the database\n", 155 | "client = bigquery.Client(project='physionet-data')\n", 156 | "\n", 157 | "# load the dataset list\n", 158 | "datasets = client.list_datasets()\n", 159 | "\n", 160 | "# iterate the datasets list\n", 161 | "for dataset in datasets:\n", 162 | " did = dataset.dataset_id\n", 163 | " # print the dataset name\n", 164 | " print('Dataset \"{}\" has the following tables: '.format(did))\n", 165 | " # iterate the tables on the dataset\n", 166 | " for table in client.list_tables(client.dataset(did)):\n", 167 | " # print the table name\n", 168 | " print('- {}'.format(table.table_id))" 169 | ], 170 | "execution_count": 0, 171 | "outputs": [] 172 | } 173 | ] 174 | } -------------------------------------------------------------------------------- /tutorials/eicu/03-severity-of-illness.ipynb: -------------------------------------------------------------------------------- 1 | {"nbformat":4,"nbformat_minor":0,"metadata":{"colab":{"name":"03-severity-of-illness","version":"0.3.2","provenance":[{"file_id":"https://colab.research.google.com/github/MIT-LCP/2019_hst-953/blob/master/tutorials/eicu/03-severity-of-illness.ipynb","timestamp":1566463280234}],"collapsed_sections":[]},"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.7.3"}},"cells":[{"cell_type":"markdown","metadata":{"colab_type":"text","id":"y4AOVdliM8gm"},"source":["# eICU Collaborative Research Database\n","\n","# Notebook 3: Severity of illness\n","\n","This notebook introduces high level admission details relating to a single patient stay, using the following tables:\n","\n","- patient\n","- admissiondx\n","- apacheapsvar\n","- apachepredvar\n","- apachepatientresult\n","\n"]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"e0lUnIkYOyv4"},"source":["## Load libraries and connect to the database"]},{"cell_type":"code","metadata":{"colab_type":"code","id":"SJ6l1i3fOL4j","colab":{}},"source":["# Import libraries\n","import numpy as np\n","import os\n","import pandas as pd\n","import matplotlib.pyplot as plt\n","import matplotlib.patches as patches\n","import matplotlib.path as path\n","\n","# Make pandas dataframes prettier\n","from IPython.display import display, HTML\n","\n","# Access data using Google BigQuery.\n","from google.colab import auth\n","from google.cloud import bigquery"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"colab_type":"code","id":"TE4JYS8aO-69","colab":{}},"source":["# authenticate\n","auth.authenticate_user()"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"colab_type":"code","id":"oVavf-ujPOAv","colab":{}},"source":["# Set up environment 
variables\n","project_id='hst-953-2019'\n","os.environ[\"GOOGLE_CLOUD_PROJECT\"]=project_id"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"colab_type":"code","id":"aBc7PA0KSIFM","colab":{}},"source":["# Helper function to read data from BigQuery into a DataFrame.\n","def run_query(query):\n"," return pd.io.gbq.read_gbq(query, project_id=project_id, dialect=\"standard\")"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"a1CAI3GjQYE0"},"source":["## Selecting a single patient stay¶\n","\n","As we have seen, the patient table includes general information about the patient admissions (for example, demographics, admission and discharge details). See: http://eicu-crd.mit.edu/eicutables/patient/\n","\n","## Questions\n","\n","Use your knowledge from the previous notebook and the online documentation (http://eicu-crd.mit.edu/) to answer the following questions:\n","\n","- Which column in the patient table is distinct for each stay in the ICU (similar to `icustay_id` in MIMIC-III)?\n","- Which column is unique for each patient (similar to `subject_id` in MIMIC-III)?"]},{"cell_type":"code","metadata":{"colab_type":"code","id":"R6huFICkSQAd","colab":{}},"source":["# view distinct ids\n","query = \"\"\"\n","SELECT DISTINCT(patientunitstayid)\n","FROM `physionet-data.eicu_crd_demo.patient`\n","\"\"\"\n","\n","run_query(query)\n"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"colab_type":"code","id":"lfeQwFlvRly7","colab":{}},"source":["# select a single ICU stay\n","patientunitstayid = "],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"colab_type":"code","id":"yEBIFRBqRo4y","colab":{}},"source":["# set the where clause to select the stay of interest\n","query = \"\"\"\n","SELECT *\n","FROM `physionet-data.eicu_crd_demo.patient`\n","WHERE patientunitstayid = {}\n","\"\"\".format(patientunitstayid)\n","\n","patient = run_query(query)"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"colab_type":"code","id":"LjIL2XR6TAyp","colab":{}},"source":["patient"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"QSbKYqF0TQ1n"},"source":["## Questions\n","\n","- Which type of unit was the patient admitted to? Hint: Try `patient['unittype']` or `patient.unittype`\n","- What year was the patient discharged from the ICU? Hint: You can view the table columns with `patient.columns`\n","- What was the status of the patient upon discharge from the unit?"]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"izaH0XwwUxDD"},"source":["## The admissiondx table\n","\n","The `admissiondx` table contains the primary diagnosis for admission to the ICU according to the APACHE scoring criteria. 
For more detail, see: http://eicu-crd.mit.edu/eicutables/admissiondx/"]},{"cell_type":"code","metadata":{"colab_type":"code","id":"dlj3UCDTTEjj","colab":{}},"source":["# set the where clause to select the stay of interest\n","query = \"\"\"\n","SELECT *\n","FROM `physionet-data.eicu_crd_demo.admissiondx`\n","WHERE patientunitstayid = {}\n","\"\"\".format(patientunitstayid)\n","\n","admissiondx = run_query(query)"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"colab_type":"code","id":"3wdEHFLJVMKm","colab":{}},"source":["# View the columns in this data\n","admissiondx.columns"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"colab_type":"code","id":"tbOA44lAVNLr","colab":{}},"source":["# View the data\n","admissiondx.head()"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"colab_type":"code","id":"Hc0y4ueOVWOk","colab":{}},"source":["# Set the display options to avoid truncating the text\n","pd.set_option('display.max_colwidth', -1)\n","admissiondx.admitdxpath"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"mSb_BrgvWDdD"},"source":["## Questions\n","\n","- What was the primary reason for admission?\n","- How soon after admission to the ICU was the diagnoses recorded in eCareManager? Hint: The `offset` columns indicate the time in minutes after admission to the ICU. "]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"rd3Tw6_kWwlS"},"source":["## The apacheapsvar table\n","\n","The apacheapsvar table contains the variables used to calculate the Acute Physiology Score (APS) III for patients. APS-III is an established method of summarizing patient severity of illness on admission to the ICU, taking the \"worst\" observations for a patient in a 24 hour period.\n","\n","The score is part of the Acute Physiology Age Chronic Health Evaluation (APACHE) system of equations for predicting outcomes for ICU patients. See: http://eicu-crd.mit.edu/eicutables/apacheApsVar/"]},{"cell_type":"code","metadata":{"colab_type":"code","id":"fXOzR5XWVdNa","colab":{}},"source":["# set the where clause to select the stay of interest\n","query = \"\"\"\n","SELECT *\n","FROM `physionet-data.eicu_crd_demo.apacheapsvar`\n","WHERE patientunitstayid = {}\n","\"\"\".format(patientunitstayid)\n","\n","apacheapsvar = run_query(query)"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"colab_type":"code","id":"mL_lVORdXDIg","colab":{}},"source":["apacheapsvar.head()"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"8x_Z8q4jXH7D"},"source":["## Questions\n","\n","- What was the 'worst' heart rate recorded for the patient during the scoring period?\n","- Was the patient oriented and able to converse normally on the day of admission? (hint: the verbal element refers to the Glasgow Coma Scale)."]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"XplJvhIYX432"},"source":["# apachepredvar table\n","\n","The apachepredvar table provides variables underlying the APACHE predictions. Acute Physiology Age Chronic Health Evaluation (APACHE) consists of a groups of equations used for predicting outcomes in critically ill patients. 
See: http://eicu-crd.mit.edu/eicutables/apachePredVar/"]},{"cell_type":"code","metadata":{"colab_type":"code","id":"iAIFESy9XFhC","colab":{}},"source":["# set the where clause to select the stay of interest\n","query = \"\"\"\n","SELECT *\n","FROM `physionet-data.eicu_crd_demo.apachepredvar`\n","WHERE patientunitstayid = {}\n","\"\"\".format(patientunitstayid)\n","\n","apachepredvar = run_query(query)"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"colab_type":"code","id":"LAu7G72cYEY1","colab":{}},"source":["apachepredvar.columns"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"IEaS6L9OY0vJ"},"source":["## Questions\n","\n","- Was the patient ventilated during (APACHE) day 1 of their stay?\n","- Is the patient recorded as having diabetes?"]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"nrTEkjxqZD2l"},"source":["# `apachepatientresult` table\n","\n","The `apachepatientresult` table provides predictions made by the APACHE score (versions IV and IVa), including probability of mortality, length of stay, and ventilation days. See: http://eicu-crd.mit.edu/eicutables/apachePatientResult/"]},{"cell_type":"code","metadata":{"colab_type":"code","id":"M2RCJNBgZOJ2","colab":{}},"source":["# set the where clause to select the stay of interest\n","query = \"\"\"\n","SELECT *\n","FROM `physionet-data.eicu_crd_demo.apachepatientresult`\n","WHERE patientunitstayid = {}\n","\"\"\".format(patientunitstayid)\n","\n","apachepatientresult = run_query(query)"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"colab_type":"code","id":"4whVaOP1Za8f","colab":{}},"source":["apachepatientresult"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"5YO_GQcNZUWR"},"source":["## Questions\n","\n","- What versions of the APACHE score are computed?\n","- How many days during the stay was the patient ventilated?\n","- How long was the patient predicted to stay in hospital?\n","- Was this prediction close to the truth?"]}]} -------------------------------------------------------------------------------- /tutorials/eicu/02-explore-patients.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "name": "02-explore-patients", 7 | "version": "0.3.2", 8 | "provenance": [], 9 | "collapsed_sections": [] 10 | }, 11 | "kernelspec": { 12 | "display_name": "Python 3", 13 | "language": "python", 14 | "name": "python3" 15 | }, 16 | "language_info": { 17 | "codemirror_mode": { 18 | "name": "ipython", 19 | "version": 3 20 | }, 21 | "file_extension": ".py", 22 | "mimetype": "text/x-python", 23 | "name": "python", 24 | "nbconvert_exporter": "python", 25 | "pygments_lexer": "ipython3", 26 | "version": "3.7.3" 27 | } 28 | }, 29 | "cells": [ 30 | { 31 | "cell_type": "markdown", 32 | "metadata": { 33 | "colab_type": "text", 34 | "id": "NCI19_Ix7xuI" 35 | }, 36 | "source": [ 37 | "# eICU Collaborative Research Database\n", 38 | "\n", 39 | "# Notebook 2: Exploring the patient table\n", 40 | "\n", 41 | "In this notebook we introduce the patient table, a key table in the [eICU Collaborative Research Database](http://eicu-crd.mit.edu/). The patient table contains patient demographics and admission and discharge details for hospital and ICU stays. 
For more detail, see: http://eicu-crd.mit.edu/eicutables/patient/" 42 | ] 43 | }, 44 | { 45 | "cell_type": "markdown", 46 | "metadata": { 47 | "colab_type": "text", 48 | "id": "l_CmlcBu8Wei" 49 | }, 50 | "source": [ 51 | "## Load libraries and connect to the data\n", 52 | "\n", 53 | "Run the following cells to import some libraries and then connect to the database." 54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "metadata": { 59 | "colab_type": "code", 60 | "id": "3WQsJiAj8B5L", 61 | "colab": {} 62 | }, 63 | "source": [ 64 | "# Import libraries\n", 65 | "import numpy as np\n", 66 | "import os\n", 67 | "import pandas as pd\n", 68 | "import matplotlib.pyplot as plt\n", 69 | "import matplotlib.patches as patches\n", 70 | "import matplotlib.path as path\n", 71 | "\n", 72 | "# Make pandas dataframes prettier\n", 73 | "from IPython.display import display, HTML\n", 74 | "\n", 75 | "# Access data using Google BigQuery.\n", 76 | "from google.colab import auth\n", 77 | "from google.cloud import bigquery" 78 | ], 79 | "execution_count": 0, 80 | "outputs": [] 81 | }, 82 | { 83 | "cell_type": "markdown", 84 | "metadata": { 85 | "colab_type": "text", 86 | "id": "Ld59KZ0W9E4v" 87 | }, 88 | "source": [ 89 | "As before, you need to first authenticate yourself by running the following cell. If you are running it for the first time, it will ask you to follow a link to log in using your Gmail account, and accept the data access requests to your profile. Once this is done, it will generate a string of verification code, which you should paste back to the cell below and press enter." 90 | ] 91 | }, 92 | { 93 | "cell_type": "code", 94 | "metadata": { 95 | "colab_type": "code", 96 | "id": "ABh4hMt288yg", 97 | "colab": {} 98 | }, 99 | "source": [ 100 | "auth.authenticate_user()" 101 | ], 102 | "execution_count": 0, 103 | "outputs": [] 104 | }, 105 | { 106 | "cell_type": "markdown", 107 | "metadata": { 108 | "colab_type": "text", 109 | "id": "BPoHP2a8_eni" 110 | }, 111 | "source": [ 112 | "We'll also set the project details." 113 | ] 114 | }, 115 | { 116 | "cell_type": "code", 117 | "metadata": { 118 | "colab_type": "code", 119 | "id": "P0fdtVMa_di9", 120 | "colab": {} 121 | }, 122 | "source": [ 123 | "project_id='hst-953-2019'\n", 124 | "os.environ[\"GOOGLE_CLOUD_PROJECT\"]=project_id" 125 | ], 126 | "execution_count": 0, 127 | "outputs": [] 128 | }, 129 | { 130 | "cell_type": "markdown", 131 | "metadata": { 132 | "colab_type": "text", 133 | "id": "5bHZALFP9VN1" 134 | }, 135 | "source": [ 136 | "# Load data from the `patient` table\n", 137 | "\n", 138 | "Now we can start exploring the data. We'll begin by running a simple query on the database to load all columns of the `patient` table to a Pandas DataFrame. The query is written in SQL, a common language for extracting data from databases. 
The structure of an SQL query is:\n", 139 | "\n", 140 | "```sql\n", 141 | "SELECT \n", 142 | "FROM \n", 143 | "WHERE \n", 144 | "```\n", 145 | "\n", 146 | "`*` is a wildcard that indicates all columns" 147 | ] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "metadata": { 152 | "colab_type": "code", 153 | "id": "3cddF8qc-7h4", 154 | "colab": {} 155 | }, 156 | "source": [ 157 | "# Helper function to read data from BigQuery into a DataFrame.\n", 158 | "def run_query(query):\n", 159 | " return pd.io.gbq.read_gbq(query, project_id=project_id, dialect=\"standard\")" 160 | ], 161 | "execution_count": 0, 162 | "outputs": [] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "metadata": { 167 | "colab_type": "code", 168 | "id": "RE-UZAPG_rHq", 169 | "colab": {} 170 | }, 171 | "source": [ 172 | "query = \"\"\"\n", 173 | "SELECT *\n", 174 | "FROM `physionet-data.eicu_crd_demo.patient`\n", 175 | "\"\"\"\n", 176 | "\n", 177 | "patient = run_query(query)" 178 | ], 179 | "execution_count": 0, 180 | "outputs": [] 181 | }, 182 | { 183 | "cell_type": "markdown", 184 | "metadata": { 185 | "colab_type": "text", 186 | "id": "YbnkcCZxBkdK" 187 | }, 188 | "source": [ 189 | "We have now assigned the output to our query to a variable called `patient`. Let's use the `head` method to view the first few rows of our data." 190 | ] 191 | }, 192 | { 193 | "cell_type": "code", 194 | "metadata": { 195 | "colab_type": "code", 196 | "id": "GZph0FPDASEs", 197 | "colab": {} 198 | }, 199 | "source": [ 200 | "# view the top few rows of the patient data\n", 201 | "patient.head()" 202 | ], 203 | "execution_count": 0, 204 | "outputs": [] 205 | }, 206 | { 207 | "cell_type": "markdown", 208 | "metadata": { 209 | "colab_type": "text", 210 | "id": "TlxaXLevC_Rz" 211 | }, 212 | "source": [ 213 | "## Questions\n", 214 | "\n", 215 | "- What does `patientunitstayid` represent? (hint, see: http://eicu-crd.mit.edu/eicutables/patient/)\n", 216 | "- What does `patienthealthsystemstayid` represent?\n", 217 | "- What does `uniquepid` represent?" 218 | ] 219 | }, 220 | { 221 | "cell_type": "code", 222 | "metadata": { 223 | "colab_type": "code", 224 | "id": "2rLY0WyCBzp9", 225 | "colab": {} 226 | }, 227 | "source": [ 228 | "# select a limited number of columns to view\n", 229 | "columns = ['uniquepid', 'patientunitstayid','gender','age','unitdischargestatus']\n", 230 | "patient[columns].head()" 231 | ], 232 | "execution_count": 0, 233 | "outputs": [] 234 | }, 235 | { 236 | "cell_type": "markdown", 237 | "metadata": { 238 | "colab_type": "text", 239 | "id": "FSdS2hS4EWtb" 240 | }, 241 | "source": [ 242 | "- Try running the following query, which lists unique values in the age column. What do you notice?" 243 | ] 244 | }, 245 | { 246 | "cell_type": "code", 247 | "metadata": { 248 | "colab_type": "code", 249 | "id": "0Aom69ftDxBN", 250 | "colab": {} 251 | }, 252 | "source": [ 253 | "# what are the unique values for age?\n", 254 | "age_col = 'age'\n", 255 | "patient[age_col].sort_values().unique()" 256 | ], 257 | "execution_count": 0, 258 | "outputs": [] 259 | }, 260 | { 261 | "cell_type": "markdown", 262 | "metadata": { 263 | "colab_type": "text", 264 | "id": "Y_qJL94jE0k8" 265 | }, 266 | "source": [ 267 | "- Try plotting a histogram of ages using the command in the cell below. What happens? Why?" 
268 | ] 269 | }, 270 | { 271 | "cell_type": "code", 272 | "metadata": { 273 | "colab_type": "code", 274 | "id": "1zad3Gr4D4LE", 275 | "colab": {} 276 | }, 277 | "source": [ 278 | "# try plotting a histogram of ages\n", 279 | "patient[age_col].plot(kind='hist', bins=15)" 280 | ], 281 | "execution_count": 0, 282 | "outputs": [] 283 | }, 284 | { 285 | "cell_type": "markdown", 286 | "metadata": { 287 | "colab_type": "text", 288 | "id": "xIdwVEEPF25H" 289 | }, 290 | "source": [ 291 | "Let's create a new column named `age_num`, then try again." 292 | ] 293 | }, 294 | { 295 | "cell_type": "code", 296 | "metadata": { 297 | "colab_type": "code", 298 | "id": "-rwc-28oFF6R", 299 | "colab": {} 300 | }, 301 | "source": [ 302 | "# create a column containing numerical ages\n", 303 | "# If ‘coerce’, then invalid parsing will be set as NaN\n", 304 | "agenum_col = 'age_num'\n", 305 | "patient[agenum_col] = pd.to_numeric(patient[age_col], errors='coerce')\n", 306 | "patient[agenum_col].sort_values().unique()" 307 | ], 308 | "execution_count": 0, 309 | "outputs": [] 310 | }, 311 | { 312 | "cell_type": "code", 313 | "metadata": { 314 | "colab_type": "code", 315 | "id": "uTFMqqWqFMjG", 316 | "colab": {} 317 | }, 318 | "source": [ 319 | "patient[agenum_col].plot(kind='hist', bins=15)" 320 | ], 321 | "execution_count": 0, 322 | "outputs": [] 323 | }, 324 | { 325 | "cell_type": "markdown", 326 | "metadata": { 327 | "colab_type": "text", 328 | "id": "FrbR8rV3GlR1" 329 | }, 330 | "source": [ 331 | "## Questions\n", 332 | "\n", 333 | "- Use the `mean()` method to find the average age. Why do we expect this to be lower than the true mean?\n", 334 | "- In the same way that you use `mean()`, you can use `describe()`, `max()`, and `min()`. Look at the admission heights (`admissionheight`) of patients in cm. What issue do you see? How can you deal with this issue?" 335 | ] 336 | }, 337 | { 338 | "cell_type": "code", 339 | "metadata": { 340 | "colab_type": "code", 341 | "id": "TPps13DZG6Ac", 342 | "colab": {} 343 | }, 344 | "source": [ 345 | "adheight_col = 'admissionheight'\n", 346 | "patient[adheight_col].describe()" 347 | ], 348 | "execution_count": 0, 349 | "outputs": [] 350 | }, 351 | { 352 | "cell_type": "code", 353 | "metadata": { 354 | "colab_type": "code", 355 | "id": "9jhV9xQoGRJq", 356 | "colab": {} 357 | }, 358 | "source": [ 359 | "# set threshold\n", 360 | "adheight_col = 'admissionheight'\n", 361 | "patient[patient[adheight_col] < 10] = None" 362 | ], 363 | "execution_count": 0, 364 | "outputs": [] 365 | } 366 | ] 367 | } -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. 
For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. 
Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 
134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright [yyyy] [name of copyright owner] 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 
193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 202 | -------------------------------------------------------------------------------- /tutorials/mimic-iii/mimic-iii-tutorial.ipynb: -------------------------------------------------------------------------------- 1 | {"nbformat":4,"nbformat_minor":0,"metadata":{"colab":{"name":"MIMIC-III AarhusCritical 2019 Tutorial","version":"0.3.2","provenance":[{"file_id":"https://github.com/GoogleCloudPlatform/healthcare/blob/master/datathon/mimic_eicu/tutorials/bigquery_tutorial.ipynb","timestamp":1565343265814},{"file_id":"1feOtwLH7t-lHuKvDlQ0iR61rDMVvf7Wj","timestamp":1527116244197},{"file_id":"16EHw62feMPU-FI-HOhtS1y3zQjRmXrPV","timestamp":1527113933826}],"collapsed_sections":[],"toc_visible":true},"kernelspec":{"name":"python3","display_name":"Python 3"}},"cells":[{"cell_type":"markdown","metadata":{"colab_type":"text","id":"6fr_A5J1tVFQ"},"source":["Copyright 2019 Google Inc.\n","\n","Licensed under the Apache License, Version 2.0 (the \"License\"); you may not use\n","this file except in compliance with the License. You may obtain a copy of the\n","License at\n","\n","> https://www.apache.org/licenses/LICENSE-2.0\n","\n","Unless required by applicable law or agreed to in writing, software distributed\n","under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR\n","CONDITIONS OF ANY KIND, either express or implied. See the License for the\n","specific language governing permissions and limitations under the License."]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"QT2sukPDtrWQ"},"source":["# Datathon Tutorial\n","\n","The aim of this tutorial is to get you familiarized with BigQuery to\n","query/filter/aggregate/export data with Python.\n","\n","## Prerequisites\n","\n","You should already have had a valid Gmail account registered with the datathon\n","organizers. * If you do not have a Gmail account, you can create one at\n","http://www.gmail.com. You need to notify datathon organizers to register your\n","new account for data access. * If you have not yet signed the data use agreement\n","(DUA) sent by the organizers, please do so immediately to get access to the\n","MIMIC-III dataset."]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"xks2nlrtt-um"},"source":["## Setup\n","\n","To be able to run the queries in this tutorial, you need to create a copy of\n","this Colab notebook by clicking \"File\" > \"Save a copy in Drive...\" menu. You can\n","share your copy with your teammates by clicking on the \"SHARE\" button on the\n","top-right corner of your Colab notebook copy. Everyone with \"Edit\" permission is\n","able to modify the notebook at the same time, so it is a great way for team\n","collaboration. Before running any cell in this colab, please make sure there is\n","a green check mark before \"CONNECTED\" on top right corner, if not, please click\n","\"CONNECTED\" button to connect to a random backend.\n","\n","Now that you have done the initial setup, let us start playing with the data.\n","First, you need to run some initialization code. 
You can run the following cell\n","by clicking on the triangle button when you hover over the [ ] space on the\n","top-left corner of the code cell below."]},{"cell_type":"code","metadata":{"colab_type":"code","id":"rS9g4-r7uohe","colab":{}},"source":["# Import libraries\n","import numpy as np\n","import os\n","import pandas as pd\n","import matplotlib.pyplot as plt\n","import matplotlib.patches as patches\n","import matplotlib.path as path\n","import tensorflow as tf\n","\n","# Below imports are used to print out pretty pandas dataframes\n","from IPython.display import display, HTML\n","\n","# Imports for accessing Datathon data using Google BigQuery.\n","from google.colab import auth\n","from google.cloud import bigquery"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"rlP2b-PKvUMk"},"source":["Before running any queries using BigQuery, you need to first authenticate\n","yourself by running the following cell. If you are running it for the first\n","time, it will ask you to follow a link to log in using your Gmail account, and\n","accept the data access requests to your profile. Once this is done, it will\n","generate a string of verification code, which you should paste back to the cell\n","below and press enter."]},{"cell_type":"code","metadata":{"colab_type":"code","id":"HDL3CjUKvddl","colab":{}},"source":["auth.authenticate_user()"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"0qezUBvxH7_6"},"source":["The data-hosting project `physionet-data` has read-only access, as a result, you\n","need to set a default project that you have BigQuery access to. A shared project\n","should be created by the event organizers, and we will be using it throughout\n","this tutorial.\n","\n","Note that during the datathon, all participants will be divided into teams and a\n","Google Cloud project will be created for each team specifically. That project\n","would be the preferred project to use. 
For now we'll stick with the shared\n","project for the purpose of the tutorial.\n","\n","After datathon is finished, the shared project may either lock down access or be\n","deleted, it's still possible to run queries from a project you own personally as\n","long as you have access to the dataset hosting project.\n","\n","**Change the variable project_id below to list the project you are using.**"]},{"cell_type":"code","metadata":{"colab_type":"code","id":"nx4ZDlJ6we9j","colab":{}},"source":["# Note that this should be the project for the datathon work,\n","# not the physionet-data project which is for data hosting.\n","project_id = 'aarhus-critical-2019-team'\n","os.environ['GOOGLE_CLOUD_PROJECT'] = project_id"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"ZondovXiw-zq"},"source":["Let's define a few methods to wrap BigQuery operations, so that we don't have to\n","write the configurations again and again."]},{"cell_type":"code","metadata":{"colab_type":"code","id":"DIbbCE3YxLdM","colab":{}},"source":["# Read data from BigQuery into pandas dataframes.\n","def run_query(query):\n"," return pd.io.gbq.read_gbq(\n"," query,\n"," project_id=project_id,\n"," dialect='standard')"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"VDfGS-0VxjpC"},"source":["OK, that's it for setup, now let's get our hands on the MIMIC demo data!"]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"7wh4aG6fRubd"},"source":["## Analysis\n","\n","Let's now run some queries adapted from the\n","[MIMIC cohort selection tutorial](https://github.com/MIT-LCP/mimic-code/blob/master/tutorials/cohort-selection.ipynb).\n","\n","First let's run the following query to produce data to generate a histrogram\n","graph to show the distribution of patient ages in ten-year buckets (i.e. [0,\n","10), [10, 20), ..., [90, ∞)."]},{"cell_type":"code","metadata":{"colab_type":"code","id":"0BKEaZ0mAS_a","colab":{}},"source":["df = run_query(\"\"\"\n","WITH ps AS (\n"," SELECT\n"," icu.subject_id,\n"," icu.hadm_id,\n"," icu.icustay_id,\n"," pat.dob,\n"," DATETIME_DIFF(icu.outtime, icu.intime, DAY) AS icu_length_of_stay,\n"," DATE_DIFF(DATE(icu.intime), DATE(pat.dob), YEAR) AS age\n"," FROM `physionet-data.mimiciii_demo.icustays` AS icu\n"," INNER JOIN `physionet-data.mimiciii_demo.patients` AS pat\n"," ON icu.subject_id = pat.subject_id),\n","bu AS (\n"," SELECT\n"," CAST(FLOOR(age / 10) AS INT64) AS bucket\n"," FROM ps)\n","SELECT\n"," COUNT(bucket) AS num_icu_stays,\n"," IF(bucket >= 9, \">= 90\", FORMAT(\"%d - %d\", bucket * 10, (bucket + 1) * 10)) AS age_bucket\n","FROM bu\n","GROUP BY bucket\n","ORDER BY bucket ASC\n","\"\"\")\n","\n","df.set_index('age_bucket').plot(title='stay - age', kind='bar', legend=False)"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"v1IquJ4LTQzi"},"source":["The query consists of 3 parts:\n","\n","1. First we join `icustays` and `patients` tables to produce length of ICU\n"," stays in days for each patient, which is saved in a temporary table `ps`;\n","2. Next we put patients into buckets based on their ages at the time they got\n"," admitted into ICU in `bu` table;\n","3. 
The result data is filtered to include only the information required, i.e.\n"," `age_bucket` and `num_icu_stays`, to plot the chart."]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"QenZBv-rxYEm"},"source":["**Note**: If you are having a hard time following the queries in this colab, or\n","you want to know more about the table structures of MIMIC-III dataset, please\n","consult\n","[our colab for a previous Datathon held in Sydney](../../anzics18/tutorial.ipynb)."]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"3TVRWd6JAYHS"},"source":["Now let's see if there is correlation between age and average length of stay in\n","hours. Since we are using the age of patients when they get admitted, so we\n","don't need to worry about multiple admissions of patients. Note that we treat\n","the redacted ages (> 90) as noises and filter them out."]},{"cell_type":"code","metadata":{"colab_type":"code","id":"Me7i5Z5pAZ4s","colab":{}},"source":["df = run_query(\"\"\"\n","WITH re AS (\n","SELECT\n"," DATETIME_DIFF(icu.outtime, icu.intime, HOUR) AS icu_length_of_stay,\n"," DATE_DIFF(DATE(icu.intime), DATE(pat.dob), YEAR) AS age\n","FROM `physionet-data.mimiciii_demo.icustays` AS icu\n","INNER JOIN `physionet-data.mimiciii_demo.patients` AS pat\n"," ON icu.subject_id = pat.subject_id)\n","SELECT\n"," icu_length_of_stay AS stay,\n"," age\n","FROM re\n","WHERE age < 100\n","\"\"\")\n","\n","df.plot(kind='scatter', x='age', y='stay')"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"W3l1HyDeBVvW"},"source":["Let's take a look at another query which uses a filter that we often use, which\n","is the current service that ICU patients are undergoing."]},{"cell_type":"code","metadata":{"colab_type":"code","id":"iIw3ykjHOY-Y","colab":{}},"source":["df = run_query(\"\"\"\n","WITH co AS (\n"," SELECT\n"," icu.subject_id,\n"," icu.hadm_id,\n"," icu.icustay_id,\n"," pat.dob,\n"," DATETIME_DIFF(icu.outtime, icu.intime, DAY) AS icu_length_of_stay,\n"," DATE_DIFF(DATE(icu.intime), DATE(pat.dob), YEAR) AS age,\n"," RANK() OVER (PARTITION BY icu.subject_id ORDER BY icu.intime) AS icustay_id_order\n"," FROM `physionet-data.mimiciii_demo.icustays` AS icu\n"," INNER JOIN `physionet-data.mimiciii_demo.patients` AS pat\n"," ON icu.subject_id = pat.subject_id\n"," ORDER BY hadm_id DESC),\n","serv AS (\n"," SELECT\n"," icu.hadm_id,\n"," icu.icustay_id,\n"," se.curr_service,\n"," IF(curr_service like '%SURG' OR curr_service = 'ORTHO', 1, 0) AS surgical,\n"," RANK() OVER (PARTITION BY icu.hadm_id ORDER BY se.transfertime DESC) as rank\n"," FROM `physionet-data.mimiciii_demo.icustays` AS icu\n"," LEFT JOIN `physionet-data.mimiciii_demo.services` AS se\n"," ON icu.hadm_id = se.hadm_id\n"," AND se.transfertime < DATETIME_ADD(icu.intime, INTERVAL 12 HOUR)\n"," ORDER BY icustay_id)\n","SELECT\n"," co.subject_id,\n"," co.hadm_id,\n"," co.icustay_id,\n"," co.icu_length_of_stay,\n"," co.age,\n"," IF(co.icu_length_of_stay < 2, 1, 0) AS short_stay,\n"," IF(co.icustay_id_order = 1, 0, 1) AS first_stay,\n"," IF(serv.surgical = 1, 1, 0) AS surgical\n","FROM co\n","LEFT JOIN serv USING (icustay_id, hadm_id)\n","WHERE\n"," serv.rank = 1 AND age < 100\n","ORDER BY subject_id, icustay_id_order\n","\"\"\")\n","\n","print(f'Number of rows in dataframe: {len(df)}')\n","df.head()"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"d4nxxe69VDqH"},"source":["This is a long query, but is pretty simple if we take a closer look. 
It consists\n","of 3 steps as well:\n","\n","1. We are trying to know how many ICU admissions each patient has by joining\n"," `icustays` and `patients`. Note that since each patient may be admitted\n"," multiple times, we usually filter out follow-up ICU stays, and only keep the\n"," first one to minimize unwanted data correlation. This is achieved by\n"," partitioning over `subject_id`, and ordering by admission time, then choose\n"," only the first one with `RANK` function, the result is saved to a temporary\n"," table `co`;\n","2. Next we are looking for first services in ICU stays for patients, and also\n"," adding a label to indicate whether last services before ICU admission were\n"," surgical, similarly the result is saved to `serv`;\n","3. Lastly, we are ready to save this surgical exclusion label to a cohort\n"," generation table by joining the two tables, `co` and `serv`. For the\n"," convenience of later analysis, we rename some columns, and filter out\n"," patients more than 100 years old."]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"Fi3isyuRSgmg"},"source":["## Useful Tips"]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"ycbTUnFEY_3H"},"source":["### Working with DATETIME\n","\n","The times in the tables are stored as DATETIME objects. This means you cannot\n","use operators like `<`, `=`, or `>` for comparing them.\n","\n","* Use the\n"," [DATETIME functions](https://cloud.google.com/bigquery/docs/reference/standard-sql/datetime_functions)\n"," in BigQuery. An example would be if you were trying to find things within 1\n"," hour of another event. In that case, you could use the native\n"," `DATETIME_SUB()` function. In the example below, we are looking for stays of\n"," less than 1 hour (where the admit time is less than 1 hour away from the\n"," discharge time).\n","\n","> ```\n","> [...] WHERE ADMITTIME BETWEEN DATETIME_SUB(DISCHTIME, INTERVAL 1 HOUR) AND DISCHTIME\n","> ```\n","\n","* If you are more comfortable working with timestamps, you can cast the\n"," DATETIME object to a TIMESTAMP object and then use the\n"," [TIMESTAMP functions](https://cloud.google.com/bigquery/docs/reference/standard-sql/timestamp_functions)."]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"OTYmwNrmZEN2"},"source":["### Input / Output Options\n","\n","There are a few cases where you may want to work with files outside of BigQuery.\n","Examples include importing your own custom Python library or saving a dataframe.\n","[This tutorial](https://colab.research.google.com/notebooks/io.ipynb) covers\n","importing and exporting from local filesystem, Google Drive, Google Sheets, and\n","Google Cloud Storage."]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"t7IRBm4EBau8"},"source":["## ML Model Training\n","\n","Next we will show an example of using [Tensorflow](https://www.tensorflow.org/)\n","([getting started doc](https://www.tensorflow.org/get_started/)) to build a\n","simple predictor, where we use the patient's age and whether it is the first ICU\n","stay to predict whether the ICU stay will be a short one. 
With only 127 data\n","points in total, we don't expect to actually build an accurate or useful\n","predictor, but it should serve the purpose of showing how a model can be trained\n","and used using Tensorflow within Colab.\n","\n","First, let us split the 127 data points into a training set with 100 records and\n","a testing set with 27, and examine the distribution of the split sets to make\n","sure that the distribution is similar."]},{"cell_type":"code","metadata":{"colab_type":"code","id":"QPP17fQL3LDa","colab":{}},"source":["data = df[['age', 'first_stay', 'short_stay']]\n","data.reindex(np.random.permutation(data.index))\n","training_df = data.head(100)\n","validation_df = data.tail(27)\n","\n","print('Training data summary:')\n","display(training_df.describe())\n","\n","print('Validation data summary:')\n","display(validation_df.describe())"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"v6uU-mRh3PAS"},"source":["And let's quickly check the label distribution for the features."]},{"cell_type":"code","metadata":{"colab_type":"code","id":"KkfdyF7K3Q3z","colab":{}},"source":["display(training_df.groupby(['short_stay', 'first_stay']).count())\n","\n","fig, ax = plt.subplots()\n","shorts = training_df[training_df.short_stay == 1].age\n","longs = training_df[training_df.short_stay == 0].age\n","colors = ['b', 'g']\n","ax.hist([shorts, longs],\n"," bins=10,\n"," color=colors,\n"," label=['short_stay=1', 'short_stay=0'])\n","ax.set_xlabel('Age')\n","ax.set_ylabel('Number of Patients')\n","plt.legend(loc='upper left')\n","plt.show()"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"MrpVGVYx3S7_"},"source":["Let's first build a linear regression model to predict the numeric value of\n","\"short_stay\" based on age and first_stay features. You can tune the parameters\n","on the right-hand side and observe differences in the evaluation result."]},{"cell_type":"code","metadata":{"colab_type":"code","id":"R-7VB9pc3Vll","colab":{}},"source":["#@title Linear Regression Parameters {display-mode:\"both\"}\n","BATCH_SIZE = 5 # @param\n","NUM_EPOCHS = 100 # @param\n","\n","first_stay = tf.feature_column.numeric_column('first_stay')\n","age = tf.feature_column.numeric_column('age')\n","\n","# Build linear regressor\n","linear_regressor = tf.estimator.LinearRegressor(\n"," feature_columns=[first_stay, age])\n","\n","# Train the Model.\n","model = linear_regressor.train(\n"," input_fn=tf.compat.v1.estimator.inputs.pandas_input_fn(\n"," x=training_df,\n"," y=training_df['short_stay'],\n"," num_epochs=100,\n"," batch_size=BATCH_SIZE,\n"," shuffle=True),\n"," steps=100)\n","\n","# Evaluate the model.\n","eval_result = linear_regressor.evaluate(\n"," input_fn=tf.compat.v1.estimator.inputs.pandas_input_fn(\n"," x=validation_df,\n"," y=validation_df['short_stay'],\n"," batch_size=BATCH_SIZE,\n"," shuffle=False))\n","\n","display(eval_result)"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"8SsUXasz3YUS"},"source":["Remember that the label `short_stay` is actually categorical, with the value 1\n","for an ICU stay of 1 day or less and value 0 for stays of length 2 days or more.\n","So a classification model better fits this task. Here we try a deep neural\n","networks model using the `DNNClassifier` estimator. 
Notice the little changes\n","from the regression code above."]},{"cell_type":"code","metadata":{"colab_type":"code","id":"Ie7BzB_f3aFk","colab":{}},"source":["#@title ML Training example {display-mode:\"both\"}\n","BATCH_SIZE = 5 # @param\n","NUM_EPOCHS = 100 # @param\n","HIDDEN_UNITS = [10, 10] # @param\n","\n","# Build linear regressor\n","classifier = tf.estimator.DNNClassifier(\n"," feature_columns=[first_stay, age], hidden_units=HIDDEN_UNITS)\n","\n","# Train the Model.\n","model = classifier.train(\n"," input_fn=tf.compat.v1.estimator.inputs.pandas_input_fn(\n"," x=training_df,\n"," y=training_df['short_stay'],\n"," num_epochs=100,\n"," batch_size=BATCH_SIZE,\n"," shuffle=True),\n"," steps=100)\n","\n","# Evaluate the model.\n","eval_result = classifier.evaluate(\n"," input_fn=tf.compat.v1.estimator.inputs.pandas_input_fn(\n"," x=validation_df,\n"," y=validation_df['short_stay'],\n"," batch_size=BATCH_SIZE,\n"," shuffle=False))\n","\n","display(eval_result)"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"dtoC3w63BcIV"},"source":["## Closing\n","\n","Congratulations! Now you have finished this datathon tutorial, and ready to\n","explore the real data by querying Google BigQuery. To do so, simply use\n","`mimiciii_clinical` as the dataset name. For example, the table\n","`mimiciii_demo.icustays` becomes `mimiciii_clinical.icustays` when you need the\n","actual MIMIC data. One thing to note though, is that it is highly recommended to\n","aggregate data aggressively wherever possible, because large dataframes may\n","cause the performance of colab to drop drastically or even out of memory errors.\n","\n","Now, let's do the substitution, and start the real datathon exploration."]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"kTRDaXL1a2TZ"},"source":["## Troubleshooting\n","\n","Below are some tips for troubleshooting more frequently seen issues"]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"jVIVlrKvbGPI"},"source":["### Common Errors\n","\n","* **Error after authenticating while trying to run a query**\n","\n","```\n","ERROR:root:An unexpected error occurred while tokenizing input\n","The following traceback may be corrupted or invalid\n","The error message is: ('EOF in multi-line string', (1, 0))\n","```\n","\n","> If you try to run a query and see this error message, scroll to the bottom of\n","> the error text. The very last row of the error will show the specific error\n","> message, which is usually related to having the wrong project_id or not having\n","> access to the project/dataset.\n","\n","* **Colab has stopped working, is running slowly, or the top right no longer\n"," has a green check mark saying \"Connected\", but shows 3 dots and says\n"," \"Busy\"**\n","\n","> Reset the runtime, to reinitialize. Note that this will clear any local\n","> variables or uploaded files. 
Do this by clicking the `Runtime` menu at the\n","> top, then `Reset all runtimes`"]}]} -------------------------------------------------------------------------------- /tutorials/mimic-cxr/mimic-cxr-train.ipynb: -------------------------------------------------------------------------------- 1 | {"nbformat":4,"nbformat_minor":0,"metadata":{"accelerator":"GPU","colab":{"name":"mimic-cxr-train-aarhus.ipynb","version":"0.3.2","provenance":[{"file_id":"https://github.com/GoogleCloudPlatform/healthcare/blob/master/datathon/mimic_cxr/mimic_cxr_train.ipynb","timestamp":1565763637468},{"file_id":"1yzMXIgWTQoGnY2OB-Gc-9slwsTXl0OS3","timestamp":1563517370612},{"file_id":"/piper/depot/google3/third_party/cloud/healthcare/datathon/mimic_cxr/mimic_cxr_train.ipynb?workspaceId=reidhayes:sing_datathon::citc","timestamp":1563515495313},{"file_id":"1dFno1Tp3-18P85Tr8TBoeCzr-Ya_jKjv","timestamp":1563513898474},{"file_id":"https://github.com/GoogleCloudPlatform/healthcare/blob/master/datathon/mimic_cxr/mimic_cxr_train.ipynb","timestamp":1563511677200},{"file_id":"1wJF3IQEczrGvX_ysdcSNYphQXUNN1UvN","timestamp":1558983612786},{"file_id":"/piper/depot/google3/third_party/cloud/healthcare/datathon/mimic_cxr/mimic_cxr_train.ipynb?workspaceId=reidhayes:imaging-datathon-int::citc","timestamp":1558727646133},{"file_id":"1gN9P9owTsXPMLlcP3lonYLGI4ERQqlk1","timestamp":1557769834852}],"collapsed_sections":[],"last_runtime":{"build_target":"//learning/brain/python/client:tpu_hw_notebook","kind":"private"}},"kernelspec":{"name":"python3","display_name":"Python 3"}},"cells":[{"cell_type":"markdown","metadata":{"colab_type":"text","id":"e4TNC-3LszlF"},"source":["# Training a Convolutional Neural Network to Classify Chest X-rays"]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"JsNl7d9Cy6X4"},"source":["## Introduction"]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"T5_Riz79y_ZO"},"source":["This notebook shows how to train a state of the art Convolutional Neural Network\n","(CNN) to classify chest X-rays images from the MIMIC CXR Dataset. Its approach\n","is influenced by [CheXpert: A Large Chest Radiograph Dataset with Uncertainty\n","Labels and Expert Comparison](https://arxiv.org/abs/1901.07031).\n","\n","You can run this notebook from [Colab](https://colab.research.google.com/) or\n","[Cloud AI Platform Notebook](https://cloud.google.com/ai-platform-notebooks/).\n","If you're serious about training your own models, you'll definitely want to use\n","a Cloud AI Platform notebook with one or more TPUs or GPUs. If you're just\n","interested in learning how to train a CNN, you can run this notebook in Colab.\n","Your Colab session will probably timeout before it can finish training the model\n","(Cloud AI Platform notebooks are more powerful and never timeout). 
In any case,\n","don't worry, several pretrained models are available along with the data."]},{"cell_type":"code","metadata":{"colab_type":"code","id":"hwfrAocTs92n","colab":{}},"source":["from __future__ import division\n","from __future__ import print_function\n","\n","import datetime\n","import os\n","import tensorflow as tf\n","import multiprocessing\n","from enum import Enum\n","from google.cloud import bigquery\n","import seaborn as sns\n","import matplotlib.pyplot as plt\n","import sklearn.metrics\n","import numpy as np\n","import subprocess\n","import re\n","\n","try:\n"," from google.colab import auth\n"," IN_COLAB = True\n"," auth.authenticate_user()\n","except ImportError:\n"," IN_COLAB = False\n","\n","account = subprocess.check_output(\n"," ['gcloud', 'config', 'list', 'account', '--format',\n"," 'value(core.account)']).decode().strip()\n","MY_DIRECTORY = re.sub(r'[^\\w]', '_', account)[:128]\n","\n","%config InlineBackend.figure_format = 'svg'"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"WJ7qFWUisQJD"},"source":["## Understanding the dataset"]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"L2RbPSlZ-bqi"},"source":["First, we need to specify where the training and validation datasets are\n","located. Labelled images are provided in\n","[TFRecord](https://www.tensorflow.org/guide/datasets#consuming_tfrecord_data)\n","format. TFRecords are a great choice for performant and convenient training. You\n","also have access to BigQuery tables that contain the labels for each image,\n","which we'll use to get a broad understanding of how the labels are distributed\n","before we dive into training our model.\n","\n","There are separate TFRecords for X-rays taken from frontal or lateral views. You\n","can choose either type of dataset, but make sure the validation and training\n","dataset correspond to the same view. There are pretrained models available for\n","both views."]},{"cell_type":"code","metadata":{"colab_type":"code","id":"s6i-MM779FbU","colab":{}},"source":["#@title Input Datasets {run: \"auto\"}\n","GCP_ANALYSIS_PROJECT = 'aarhus-critical-2019-team' #@param {type: \"string\"}\n","TRAIN_TFRECORDS = 'gs://mimic_cxr_derived/tfrecords/train/frontal*' #@param {type: \"string\"}\n","VALID_TFRECORDS = 'gs://mimic_cxr_derived/tfrecords/valid/frontal*' #@param {type: \"string\"}\n","# VIEW should be one of 'frontal', or 'lateral'\n","VIEW = 'frontal' #@param [\"frontal\", \"lateral\"] {type: \"string\"}\n","TRAIN_BIGQUERY = 'physionet-data.mimic_cxr_derived.labels_train' #@param {type: \"string\"}\n","VALID_BIGQUERY = 'physionet-data.mimic_cxr_derived.labels_valid' #@param {type: \"string\"}"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"vaN3NKPk-3s8"},"source":["The dataset consists of labelled images. These labels were determined by\n","analyzing radiologist notes. A label is given the value of\n","\n","* `0` (`not_mentioned`) if the note made no mention of it\n","* `1` (`negative`) if the note said the label wasn't present in the image\n","* `2` (`uncertain`) if the note expressed uncertainty about the label's\n"," presence\n","* `3` (`positive`) if the label was mentioned with certainty in the note\n","\n","for more details about how these labels were generated, you can check out\n","[this paper](https://arxiv.org/abs/1901.07031). For our classifier we'll treat\n","`not_mentioned` (`0`) and `negative` (`1`) as the same thing. 
There's some\n","choice in how we handle `uncertain` (`2`). We will investigate the uncertain labels in the next section."]},{"cell_type":"code","metadata":{"colab_type":"code","id":"4R8Amh9EyNvG","colab":{}},"source":["class Labels(Enum):\n"," no_finding = 0\n"," enlarged_cardiomediastinum = 1\n"," cardiomegaly = 2\n"," airspace_opacity = 3\n"," lung_lesion = 4\n"," edema = 5\n"," consolidation = 6\n"," pneumonia = 7\n"," atelectasis = 8\n"," pneumothorax = 9\n"," pleural_effusion = 10\n"," pleural_other = 11\n"," fracture = 12\n"," support_devices = 13\n","\n","\n","class LabelValues(Enum):\n"," not_mentioned = 0\n"," negative = 1\n"," uncertain = 2\n"," positive = 3\n","\n","\n","class Views(Enum):\n"," frontal = 0\n"," lateral = 1\n"," other = 2\n","\n","\n","class Datasets(Enum):\n"," train = 0\n"," valid = 1"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"csxgv426CP6B"},"source":["Before we start building our model, let's check out the distribution of the\n","data. We'll do this by writing a BigQuery StandardSQL statement that counts the\n","number of `not_mentioned`, `negative`, `uncertain` and `positive` values for\n","each label."]},{"cell_type":"code","metadata":{"colab_type":"code","id":"uYoxsUQvEV6_","colab":{}},"source":["bq_client = bigquery.Client(project=GCP_ANALYSIS_PROJECT)\n","\n","queries = []\n","for label in Labels:\n"," queries.append(\"\"\"\n"," SELECT\n"," \"{label}\" AS label,\n"," {label} AS label_value,\n"," COUNT(DISTINCT path) AS cnt,\n"," dataset\n"," FROM\n"," (SELECT * FROM `{TRAIN_BIGQUERY}`\n"," UNION ALL\n"," SELECT * FROM `{VALID_BIGQUERY}`)\n"," WHERE view = {view_value}\n"," GROUP BY {label}, dataset\n"," \"\"\".format(\n"," TRAIN_BIGQUERY=TRAIN_BIGQUERY,\n"," VALID_BIGQUERY=VALID_BIGQUERY,\n"," label=label.name,\n"," view_value=Views[VIEW].value))\n","\n","barplot_df = bq_client.query('UNION ALL'.join(queries)).to_dataframe()\n","# Convert integer label values into strings\n","barplot_df.label_value = barplot_df.label_value.apply(\n"," lambda v: LabelValues(v).name)\n","barplot_df.dataset = barplot_df.dataset.apply(lambda v: Datasets(v).name)\n","print('Query succeeded!')"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"lqxqYe1Z27wR"},"source":["Our StandardSQL statement returns a pandas Dataframe in 'long' format, which\n","makes it really easy to visualize with [seaborn](https://seaborn.pydata.org/)\n","(or [ggplot2](https://ggplot2.tidyverse.org/) for R users)."]},{"cell_type":"code","metadata":{"colab_type":"code","id":"9qNznWt9Dzbg","colab":{}},"source":["sns.catplot(\n"," y='label',\n"," x='cnt',\n"," hue='label_value',\n"," data=barplot_df,\n"," col='dataset',\n"," kind='bar',\n"," sharex=False)"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"colab_type":"code","id":"OmW0g8kI7JHR","colab":{}},"source":["N_TRAIN = barplot_df.cnt[(barplot_df.dataset == 'train')\n"," & (barplot_df.label == Labels(0).name)].sum()\n","N_VALID = barplot_df.cnt[(barplot_df.dataset == 'valid')\n"," & (barplot_df.label == Labels(0).name)].sum()\n","print('training examples: {:,}\\nvalidation examples: {:,}'.format(\n"," N_TRAIN, N_VALID))"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"vzyafcOWE7uF"},"source":["What have we learned from this?\n","\n","* we have a medium sized dataset for training a CNN. 
For comparison\n"," [ImageNet](http://www.image-net.org/) has 14 million images, and\n"," [CIFAR-10](https://www.cs.toronto.edu/~kriz/cifar.html) has 60,000 images.\n","* the distribution of the values varies for each label, the only constant is\n"," that the majority of values are `not_mentioned`.\n","\n","This informs our expectations for any model's performance with each label. It\n","also suggests that we find a way to take advantage of the `uncertain` labels,\n","since in some cases they actually outnumber the `positive` labels."]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"F1-b9NXMyykj"},"source":["## Creating an input pipeline"]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"UeKhixlozB9w"},"source":["### Performance considerations\n","\n","One of the most important factors in determining how long it takes to train a\n","model is the process that loads data into the model. TPUs and GPUs are so fast\n","that keeping them busy with the next batch of data isn't easy.\n","\n","A few tips for fast input pipelines:\n","\n","1. Use TFRecords: they allow for data to be read in contiguous blocks, which is\n"," much faster than reading a bunch of small files.\n","1. Train your models and store your data in the cloud so you can take advantage\n"," of Google's fast internal networks.\n","1. Perform expensive transformations, including resizing large images, ahead of\n"," time. We've already done this for you using\n"," [Cloud Dataflow](https://cloud.google.com/dataflow/).\n","1. Use the largest batch size that will fit in your device's memory.\n","\n","You can find more tips\n","[here](https://www.tensorflow.org/alpha/guide/data_performance). And if you're\n","using a cloud TPU, you can also use the\n","[`cloud_tpu_profiler`](https://cloud.google.com/tpu/docs/cloud-tpu-tools#profile_tab),\n","which is an incredibly helpful tool for improving your model's performance.\n","\n","### Dealing with uncertain labels (advanced)\n","\n","Our second issue is how to assign values to the uncertain labels. This is a\n","little technical, so feel free to skim through this section.\n","\n","Almost all multi-label neural networks use a loss function like this:\n","\n","$$\n","L = - \\sum_{n,i} l\\left(y_{ni}, t_{ni}\\right)\n","$$\n","\n","Where $t_{ni}$ is the true value of the $i^\\text{th}$ label of the $n^\\text{th}$\n","sample and $y_{ni}$ is the corresponding prediction from our model.\n","\n","One way to incorporate uncertain labels is to assign them a value $u \\in [0, 1]$\n","and a weight $w \\in [0, 1]$ to that our loss function becomes\n","\n","$$\n","L = - \\sum_{n,i} \\begin{cases}\n","l\\left(y_{ni}, t_{ni}\\right) & (n, i)^{\\text{th}} \\text{ label is certain} \\\\\n","w \\cdot l\\left(y_{ni}, u\\right) & (n, i)^{\\text{th}} \\text{ label is uncertain}\n","\\end{cases}\n","$$\n","\n","$w = 0$ corresponds to ignoring the uncertain labels. $w = 1, u = 0$ corresponds\n","to counting all uncertain labels as negative. $w = 1, u = 1$ counts the\n","uncertain labels as positive. Something like $w = 0.5, u = 0.25$ is a hybrid of\n","the others. You can think of $w$ as playing the role of $\\sigma^2$ if $l$ was\n","the log-likelihood of a normal distribution.\n","\n","Other approaches for incorporating uncertain labels into a model are discussed\n","[here](https://arxiv.org/abs/1901.07031). You could also experiment with using\n","different values of $u$ and $w$ for each label. You could even try to optimize\n","these hyperparameters. 
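As an illustrative aside (not part of the original notebook), the contribution of a single uncertain label to the loss above can be written out directly: it is the sigmoid cross-entropy against the assigned probability $u$, scaled by the weight $w$. The numbers below are hypothetical and only meant to make the formula concrete.

```python
# Illustrative sketch only: the loss term w * l(y, u) for one uncertain label,
# where l is sigmoid cross-entropy and `logit` stands in for the model output.
import numpy as np

def sigmoid_xent(logit, target):
    # -[t*log(p) + (1-t)*log(1-p)] with p = sigmoid(logit)
    p = 1.0 / (1.0 + np.exp(-logit))
    return -(target * np.log(p) + (1.0 - target) * np.log(1.0 - p))

u, w = 0.4, 0.75   # example values for U_VALUE and W_VALUE
logit = 0.2        # hypothetical model output for this label
print(w * sigmoid_xent(logit, u))
```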
The per label values of $u$ for example could be learned\n","with gradient descent, while $w$ could be updated every epoch to optimize the\n","loss on the `certain` labels.\n","\n","In summary:\n","\n","* `U_VALUE` ($u$) is the probability of being `positive` that you assign to\n"," uncertain labels\n","* `W_VALUE` ($w$) is the weight that you assign to uncertain labels during\n"," training."]},{"cell_type":"code","metadata":{"cellView":"both","colab_type":"code","id":"udjTuUIy_wUy","colab":{}},"source":["#@title Input pipeline parameters {run: \"auto\"}\n","BATCH_SIZE = 32 #@param {type: \"integer\"}\n","NUM_EPOCHS = 3 #@param {type: \"integer\"}\n","U_VALUE = 0.4 #@param {type:\"slider\", min:0, max:1, step:0.01}\n","W_VALUE = 0.75 #@param {type:\"slider\", min:0, max:1, step:0.01}"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"colab_type":"code","id":"ctKPP8DXAHfS","colab":{}},"source":["# label -> probability table: 0 -> 0, 1 -> 0, 2 -> u, 3 -> 1\n","probabs_lookup = tf.constant([0.0, 0.0, U_VALUE, 1.0])\n","# label -> weight table: 0 -> 1, 1 -> 1, 2 -> w, 3 -> 1\n","weights_lookup = tf.constant([1.0, 1.0, W_VALUE, 1.0])"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"colab_type":"code","id":"q_4UP-Wwyx3t","colab":{}},"source":["feature_description = {'jpg_bytes': tf.io.FixedLenFeature([], tf.string)}\n","for l in Labels:\n"," feature_description[l.name] = tf.io.FixedLenFeature([], tf.int64)\n","\n","# The height, width, and number of channels of the input images\n","INPUT_HWC = (320, 320, 1)\n","\n","\n","def parse_function(example):\n"," \"\"\"Convert a TFExample from a TFRecord into an input and its true label.\n","\n"," Args:\n"," example (tf.train.Example): A training example read from a TFRecord.\n","\n"," Returns:\n"," Tuple[tf.Tensor, tf.Tensor]: The X-ray image and its labels. The labels\n"," are represented as two stacked arrays. 
One array is the probability\n"," that this label exists in the image, the other is how much weight this\n"," label should have when training the model.\n"," \"\"\"\n"," parsed = tf.io.parse_single_example(example, feature_description)\n"," # Turn the JPEG data into a matrix of pixel intensities\n"," image = tf.io.decode_jpeg(parsed['jpg_bytes'], channels=1)\n"," # Give the image a definite size, which is needed by TPUs\n"," image = tf.reshape(image, INPUT_HWC)\n"," # Normalize the pixel values to be between 0 and 1\n"," scaled_image = (1.0 / 255.0) * tf.cast(image, tf.float32)\n"," # Combine the labels into an array\n"," labels = tf.stack([parsed[l.name] for l in Labels], axis=0)\n"," # Convert the labels into probabilities and weights using lookup tables.\n"," probs = tf.gather(probabs_lookup, labels)\n"," weights = tf.gather(weights_lookup, labels)\n"," # Return the input to the model and the true labels\n"," return scaled_image, tf.stack([probs, weights], axis=0)\n","\n","\n","def get_dataset(valid=False):\n"," \"\"\"Construct a pipeline for loading the data.\n","\n"," Args:\n"," valid (bool): If this is True, use the validation dataset instead of the\n"," training dataset.\n","\n"," Returns:\n"," tf.data.Dataset: A dataset loading pipeline ready for training.\n"," \"\"\"\n"," n_cpu = multiprocessing.cpu_count()\n"," tf_records = VALID_TFRECORDS if valid else TRAIN_TFRECORDS\n"," dataset = tf.data.TFRecordDataset(\n"," tf.io.gfile.glob(tf_records),\n"," buffer_size=16 * 1024 * 1024,\n"," num_parallel_reads=n_cpu)\n"," if not valid:\n"," dataset = dataset.shuffle(256)\n"," dataset = dataset.repeat()\n"," dataset = dataset.map(parse_function, num_parallel_calls=n_cpu)\n"," dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)\n"," dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)\n"," return dataset"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"-iFsTYJNT7mA"},"source":["## Using accelerators (TPUs and GPUs)"]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"rG3LEZ_Jmdcj"},"source":["One of the most exciting things about training neural networks on the cloud is\n","the ability to scale your compute power. Tensorflow makes it easy to scale from\n","a single GPU to a multi-GPU system, to a network of distributed GPU systems.\n","TPUs are even easier. 
Since all TPUs are distributed systems (with 8 cores and 4\n","chips per board), you can scale from a single TPU to a TPU pod without changing\n","any of your code."]},{"cell_type":"code","metadata":{"colab_type":"code","id":"798l9Ht1UAOd","colab":{}},"source":["#@title Accelerators {run: \"auto\"}\n","ACCELERATOR_TYPE = 'Single-GPU' #@param [\"Single/Multi-TPU\", \"Single-GPU\", \"Multi-GPU\", \"CPU\"] {type: \"string\"}"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"colab_type":"code","id":"iKFe1cWM7FB8","colab":{}},"source":["if ACCELERATOR_TYPE == 'Single/Multi-TPU':\n"," if IN_COLAB:\n"," tpu_name = 'grpc://' + os.environ['COLAB_TPU_ADDR']\n"," else:\n"," tpu_name = os.environ['TPU_NAME']\n"," resolver = tf.contrib.cluster_resolver.TPUClusterResolver(tpu=tpu_name)\n"," tf.contrib.distribute.initialize_tpu_system(resolver)\n"," strategy = tf.contrib.distribute.TPUStrategy(resolver, steps_per_run=100)\n","elif ACCELERATOR_TYPE == 'Multi-GPU':\n"," strategy = tf.distribute.MirroredStrategy()\n","else:\n"," strategy = tf.distribute.get_strategy() # Default strategy"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"8udz2vQ_QhWK"},"source":["## Defining the model"]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"9etia5dIQr2d"},"source":["The [keras applications module](https://keras.io/applications/) offers several\n","different CNN architectures for us to choose from, all of which are very good.\n","\n","All we have to do is add the last layer which produces a value for each of our\n","labels. Notice that the activation function for this layer is `'linear'`. That's\n","because our loss function applies its own `sigmoid` nonlinearity to `y_pred`.\n","\n","We define a custom loss function, that unpacks the true probabilities and\n","assigned weights from our input pipeline before using these to compute a\n","weighted cross-entropy loss."]},{"cell_type":"code","metadata":{"colab_type":"code","id":"i0YyPiGB9vxj","colab":{}},"source":["with strategy.scope():\n"," base_model = tf.keras.applications.densenet.DenseNet121(\n"," include_top=False, weights=None, input_shape=INPUT_HWC, pooling='max')\n","\n"," predictions = tf.keras.layers.Dense(\n"," len(Labels), activation='linear')(\n"," base_model.output)\n","\n"," model = tf.keras.Model(inputs=base_model.input, outputs=predictions)\n","\n"," def weighted_binary_crossentropy(prob_weight_y_true, y_pred):\n"," \"\"\"Binary cross-entropy loss function with per-sample weights.\"\"\"\n"," prob_weight_y_true = tf.reshape(prob_weight_y_true, (-1, 2, len(Labels)))\n"," # Unpack the second output of our data pipeline into true probabilities and\n"," # weights for each label.\n"," probs = prob_weight_y_true[:, 0]\n"," weights = prob_weight_y_true[:, 1]\n"," return tf.compat.v1.losses.sigmoid_cross_entropy(\n"," probs,\n"," y_pred,\n"," weights,\n"," reduction=tf.compat.v1.losses.Reduction.SUM_OVER_BATCH_SIZE)\n","\n"," model.compile(\n"," optimizer=tf.train.AdamOptimizer(),\n"," loss=weighted_binary_crossentropy,\n"," )"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"4zuXzbe4tGGv"},"source":["## Training the model"]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"QBagIQwEAs8z"},"source":["You can keep track of your loss function and weights during training by saving\n","the training log files to Google Cloud Storage and running 
a\n","[TensorBoard](https://www.tensorflow.org/guide/summaries_and_tensorboard)\n","session in a Google Cloud Shell or on your local machine."]},{"cell_type":"code","metadata":{"cellView":"both","colab_type":"code","id":"eCu8LYTxA0tJ","colab":{}},"source":["#@title GCS Tensorboard log directory {run: \"auto\"}\n","GCS_LOGS = 'gs://aarhus-critical-2019-team-shared-files/train_log' #@param {\"type\": \"string\"}\n","if not GCS_LOGS.endswith('/'):\n"," GCS_LOGS += '/'\n","GCS_LOGS += MY_DIRECTORY + '/'"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"EplL6hJcmoMV"},"source":["Training your model will take hours. You don't have time to waste, so feel free\n","to stop training at any time. You can do this by interrupting the execution of\n","the below cell, either by pressing `⌘/Ctrl+m i` on your keyboard or by clicking\n","the stop sign in the upper left corner with your mouse.\n","\n","We've already trained and generated predictions for several models that you can\n","use throughout the datathon. We'll show you how to use them in the next\n","sections."]},{"cell_type":"code","metadata":{"colab_type":"code","id":"ArdQITx5BAVm","colab":{}},"source":["now_str = datetime.datetime.now().strftime('%Y%m%d-%H%M%S')\n","callbacks = []\n","if GCS_LOGS:\n"," LOGDIR = GCS_LOGS + ('' if GCS_LOGS.endswith('/') else '/') + now_str\n"," callbacks.append(tf.keras.callbacks.TensorBoard(LOGDIR, update_freq=500))\n"," print('Run `tensorboard --logdir {}` in cloud shell or your local machine.'\n"," .format(GCS_LOGS))\n","\n","model.fit(\n"," get_dataset(),\n"," epochs=NUM_EPOCHS,\n"," steps_per_epoch=N_TRAIN // BATCH_SIZE,\n"," validation_data=get_dataset(valid=True),\n"," validation_steps=N_VALID // BATCH_SIZE,\n"," callbacks=callbacks)"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"O3TQzOPGtQwV"},"source":["## Saving the model"]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"OE6nSWMsC6fs"},"source":["Let's save our model and copy it to Google Cloud Storage.\n","\n","First we'll give our model a descriptive name and decide where to put it."]},{"cell_type":"code","metadata":{"colab_type":"code","id":"YuAurxVe_nqW","colab":{}},"source":["#@title Saved model files {run: \"auto\"}\n","MODEL_NAME = 'my_model.h5' #@param {type: \"string\"}\n","GCS_SAVED_MODEL_DIR = 'gs://aarhus-critical-2019-team-shared-files/mimic-cxr-models/' #@param {type: \"string\"}\n","GCS_MODEL_PATH = GCS_SAVED_MODEL_DIR\n","if not GCS_MODEL_PATH.endswith('/'):\n"," GCS_MODEL_PATH += '/'\n","GCS_MODEL_PATH += MY_DIRECTORY + '/'\n","GCS_MODEL_PATH += MODEL_NAME"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"AdPjQPF5uh26"},"source":["Since `tf.data` and `tf.io.gfile` treat files stored in Google Cloud Storage the\n","same as if they were on your local hard drive, uploading and downloading to GCS\n","is the same as reading and writing from your disk in python."]},{"cell_type":"code","metadata":{"colab_type":"code","id":"Ei4Pa16psimM","colab":{}},"source":["def upload_to_gcs(local_path, gcs_path):\n"," with tf.io.gfile.GFile(gcs_path, 'wb') as gcs_file:\n"," with tf.io.gfile.GFile(local_path, 'rb') as local_file:\n"," gcs_file.write(local_file.read())\n","\n","\n","def download_from_gcs(gcs_path, local_path):\n"," with tf.io.gfile.GFile(gcs_path, 'rb') as gcs_file:\n"," with tf.io.gfile.GFile(local_path, 'wb') as local_file:\n"," 
local_file.write(gcs_file.read())"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"OAUDT4QvpaK7"},"source":["`model.save` save the entire model, including its architecture, learned weights,\n","optimizer, and loss function."]},{"cell_type":"code","metadata":{"colab_type":"code","id":"teO8ts3jAohN","colab":{}},"source":["model.save(MODEL_NAME)\n","upload_to_gcs(MODEL_NAME, GCS_MODEL_PATH)"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"EsCwiZLTzDfN"},"source":["## Making and evaluating predictions"]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"PI81sHUDgy7A"},"source":["We've used this notebook to pretrain eight different models for you\n","to experiment with.\n","\n","We trained these models on two different image views and four different\n","imputation strategies.\n","\n","* View:\n"," * frontal\n"," * lateral\n","* Strategy for uncertain labels, $u$, $w$:\n"," * Ignore uncertain labels, $u = 0$, $w = 0$\n"," * Assign uncertain labels to negative, $u = 0$, $w = 1$\n"," * Assign uncertain labels to positive, $u = 1$, $w = 1$\n"," * Hybrid, $u = 0.5$, $w = 0.25$\n","\n","These pretrained keras models are hosted in Google Cloud Storage, and are\n","available for you to use throughout the datathon.\n","\n","The predictions given by each model are also available in BigQuery. We suggest\n","using these if your team is interested in combining multiple viewpoints or\n","building an ensemble model.\n","\n","First, choose the location of your pretrained keras model on GCS and its table\n","of predictions in BigQuery."]},{"cell_type":"code","metadata":{"colab_type":"code","id":"Xe6-vX_3ob7a","colab":{}},"source":["PRETRAINED_KERAS_MODEL = 'gs://mimic_cxr_derived/models/densenet_frontal_u0_0__w0_0.h5' #@param {type: \"string\"}\n","VALID_PREDICTIONS_BIGQUERY = 'physionet-data.mimic_cxr_derived.densenet_frontal_u0_0__w0_0_predictions' #@param {type: \"string\"}"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"DzI75fUeA1yS"},"source":["Then we'll download and load the keras model."]},{"cell_type":"code","metadata":{"colab_type":"code","id":"3hixJb9dG4_K","colab":{}},"source":["# Download the model from GCS\n","download_from_gcs(PRETRAINED_KERAS_MODEL, 'pretrained_model.h5')\n","\n","# Load the model, strategy.scope allows us to exploit multiple accelerators,\n","# just like we did during training.\n","with strategy.scope():\n"," model = tf.keras.models.load_model('pretrained_model.h5', compile=False)\n"," model.compile(\n"," optimizer=tf.train.AdamOptimizer(), loss=weighted_binary_crossentropy)"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"SbRKVYi2A6RD"},"source":["We can make predictions with this Keras model by calling\n","[`.predict()`](https://www.tensorflow.org/api_docs/python/tf/keras/models/Model#predict).\n","`.predict()` accepts several different input data types including\n","`tf.data.Dataset`, `np.array`, and more."]},{"cell_type":"code","metadata":{"colab_type":"code","id":"4p7629BQ6mPP","colab":{}},"source":["example_predictions = model.predict(get_dataset(valid=False), steps=3)\n","print('shape: {}, min: {}, max: {}'.format(example_predictions.shape,\n"," example_predictions.min(),\n"," example_predictions.max()))"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"Oxk0fURkmt88"},"source":["To compare the model's accuracy against 
the true labels, let's write a SQL query\n","to join the predicted and true labels."]},{"cell_type":"code","metadata":{"colab_type":"code","id":"1_6FLeJNP1sR","colab":{}},"source":["pred_true_df = bq_client.query(\"\"\"SELECT * FROM\n","(SELECT {0} FROM `{1}` WHERE dataset = {4})\n","INNER JOIN\n","(SELECT {2} FROM `{3}` WHERE dataset = {4})\n","USING (path)\n","\"\"\".format(\n"," 'path, ' + ', '.join('{0} AS predicted_{0}'.format(l.name) for l in Labels),\n"," VALID_PREDICTIONS_BIGQUERY,\n"," 'path, ' + ', '.join('{0} AS true_{0}'.format(l.name) for l in Labels),\n"," VALID_BIGQUERY, Datasets.valid.value)).to_dataframe()"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"P7kFknQdnGyg"},"source":["With this dataframe, we can plot the precision-recall and ROC curves for each\n","label."]},{"cell_type":"code","metadata":{"cellView":"both","colab_type":"code","id":"yCZ-s2eAkQ4p","colab":{}},"source":["#@title Evaluation {run: \"auto\"}\n","label = 'edema' #@param ['no_finding', 'enlarged_cardiomediastinum', 'cardiomegaly', 'airspace_opacity', 'lung_lesion', 'edema', 'consolidation', 'pneumonia', 'atelectasis', 'pneumothorax', 'pleural_effusion', 'pleural_other', 'fracture', 'support_devices'] {type:\"string\"}\n","\n","predicted = pred_true_df['predicted_{}'.format(label)]\n","true = pred_true_df['true_{}'.format(label)]\n","\n","# Ignore uncertain labels for calculating metrics\n","certain_mask = (true != LabelValues.uncertain.value)\n","true = true[certain_mask]\n","predicted = predicted[certain_mask]\n","# Use the same encodings as during training:\n","# not_mentioned, negative -> 0\n","# positive -> 1\n","true = np.array([float(LabelValues.positive == v) for v in LabelValues])[true]\n","\n","fig, axes = plt.subplots(1, 2)\n","\n","# Plot the precision-recall curve\n","precision, recall, thresholds = sklearn.metrics.precision_recall_curve(\n"," true, predicted)\n","average_precision = sklearn.metrics.average_precision_score(true, predicted)\n","\n","pr_axis = axes[0]\n","pr_axis.plot(recall, precision)\n","pr_axis.set_aspect('equal')\n","pr_axis.set_xlim(0, 1)\n","pr_axis.set_ylim(0, 1)\n","pr_axis.set_xlabel(r'Recall $\\left(\\frac{T_p}{T_p + F_n} \\right)$')\n","pr_axis.set_ylabel(r'Precision $\\left(\\frac{T_p}{T_p + F_p} \\right)$')\n","pr_axis.set_title('Precision-Recall, AP={:.2f}'.format(average_precision))\n","pr_axis.grid(True)\n","\n","# Plot the ROC curve\n","fpr, tpr, thresholds = sklearn.metrics.roc_curve(true, predicted)\n","roc_auc = sklearn.metrics.roc_auc_score(true, predicted)\n","\n","roc_axis = axes[1]\n","roc_axis.plot(fpr, tpr)\n","roc_axis.set_aspect('equal')\n","roc_axis.set_xlim(0, 1)\n","roc_axis.set_ylim(0, 1)\n","roc_axis.set_xlabel(\n"," r'False positive rate $\\left(\\frac{F_p}{F_p + T_n} \\right)$')\n","roc_axis.set_ylabel(r'True positive rate $\\left(\\frac{T_p}{T_p + F_n} \\right)$')\n","roc_axis.set_title('ROC, AUC={:.2f}'.format(roc_auc))\n","roc_axis.grid(True)\n","\n","plt.tight_layout()"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"0dXMQQ1hjSQQ"},"source":["## Project Ideas"]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"6BWSMKIRjYgy"},"source":["Ready to go off and explore MIMIC CXR on your own? Here are some ideas for\n","projects your team could work on.\n","\n","1. Multiple models are provided as pretrained models. How will your team choose\n"," the best one? Could you combine them into an ensemble method?\n","1. 
The pretrained models handle \"uncertain\" labels differently. How does this\n"," choice impact the model's predictions? Are the results what you would\n"," expect? Can we use these results to learn something about how each of the\n"," different labels were generated?\n","1. Most imaging studies have multiple frontal and lateral images. How can you\n"," combine the predictions from each image into a prediction for each study?\n"," Would a logistic-linear model work? What about using the maximum prediction\n"," in each category?\n","1. Investigate the images that are taken from a view other than frontal or\n"," lateral. What do these images look like? Could you design and/or train a\n"," model that uses these images?"]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"15WI0XNhDRc7"},"source":["## Conclusion"]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"cJcDKdkejWrk"},"source":["That's it! You've trained a Convolutional Neural Network to classify chest X-ray\n","images.\n","\n","We've just scratched the surface of machine learning on GCP. The\n","[Cloud AI Platform](https://cloud.google.com/ai-platform/) has a full suite of\n","tools for research to production ML development including preprocessing\n","pipelines, Jupyter notebooks, distributed training, and model serving."]}]} -------------------------------------------------------------------------------- /tutorials/eicu/05-prediction.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "name": "05-prediction.ipynb", 7 | "version": "0.3.2", 8 | "provenance": [], 9 | "collapsed_sections": [] 10 | }, 11 | "kernelspec": { 12 | "display_name": "Python 3", 13 | "language": "python", 14 | "name": "python3" 15 | }, 16 | "language_info": { 17 | "codemirror_mode": { 18 | "name": "ipython", 19 | "version": 3 20 | }, 21 | "file_extension": ".py", 22 | "mimetype": "text/x-python", 23 | "name": "python", 24 | "nbconvert_exporter": "python", 25 | "pygments_lexer": "ipython3", 26 | "version": "3.7.3" 27 | } 28 | }, 29 | "cells": [ 30 | { 31 | "cell_type": "markdown", 32 | "metadata": { 33 | "colab_type": "text", 34 | "id": "T3wdKZCPklNq" 35 | }, 36 | "source": [ 37 | "# eICU Collaborative Research Database\n", 38 | "\n", 39 | "# Notebook 5: Prediction\n", 40 | "\n", 41 | "This notebook explores how a decision trees can be trained to predict in-hospital mortality of patients.\n" 42 | ] 43 | }, 44 | { 45 | "cell_type": "markdown", 46 | "metadata": { 47 | "colab_type": "text", 48 | "id": "rG3HrM7GkwCH" 49 | }, 50 | "source": [ 51 | "## Load libraries and connect to the database" 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "metadata": { 57 | "colab_type": "code", 58 | "id": "s-MoFA6NkkbZ", 59 | "colab": {} 60 | }, 61 | "source": [ 62 | "# Import libraries\n", 63 | "import os\n", 64 | "import numpy as np\n", 65 | "import pandas as pd\n", 66 | "import matplotlib.pyplot as plt\n", 67 | "\n", 68 | "# model building\n", 69 | "from sklearn import ensemble, impute, metrics, preprocessing, tree\n", 70 | "from sklearn.model_selection import cross_val_score, train_test_split\n", 71 | "from sklearn.pipeline import Pipeline\n", 72 | "\n", 73 | "# Make pandas dataframes prettier\n", 74 | "from IPython.display import display, HTML, Image\n", 75 | "plt.rcParams.update({'font.size': 20})\n", 76 | "%matplotlib inline\n", 77 | "plt.style.use('ggplot')\n", 78 | "\n", 79 | "# Access data using Google BigQuery.\n", 80 | 
"from google.colab import auth\n", 81 | "from google.cloud import bigquery" 82 | ], 83 | "execution_count": 0, 84 | "outputs": [] 85 | }, 86 | { 87 | "cell_type": "code", 88 | "metadata": { 89 | "colab_type": "code", 90 | "id": "jyBV_Q9DkyD3", 91 | "colab": {} 92 | }, 93 | "source": [ 94 | "# authenticate\n", 95 | "auth.authenticate_user()" 96 | ], 97 | "execution_count": 0, 98 | "outputs": [] 99 | }, 100 | { 101 | "cell_type": "code", 102 | "metadata": { 103 | "colab_type": "code", 104 | "id": "cF1udJKhkzYq", 105 | "colab": {} 106 | }, 107 | "source": [ 108 | "# Set up environment variables\n", 109 | "project_id='hst-953-2019'\n", 110 | "os.environ[\"GOOGLE_CLOUD_PROJECT\"]=project_id" 111 | ], 112 | "execution_count": 0, 113 | "outputs": [] 114 | }, 115 | { 116 | "cell_type": "markdown", 117 | "metadata": { 118 | "colab_type": "text", 119 | "id": "xGurBAQIUDTt" 120 | }, 121 | "source": [ 122 | "To make our lives easier, we'll also install and import a set of helper functions from the `datathon2` package. We will be using the following functions from the package:\n", 123 | "- `plot_model_pred_2d`: to visualize our data, helping to display a class split assigned by a tree vs the true class.\n", 124 | "- `run_query()`: to run an SQL query against our BigQuery database and assign the results to a dataframe. \n" 125 | ] 126 | }, 127 | { 128 | "cell_type": "code", 129 | "metadata": { 130 | "colab_type": "code", 131 | "id": "GDEewAlvk0oT", 132 | "colab": {} 133 | }, 134 | "source": [ 135 | "!pip install glowyr" 136 | ], 137 | "execution_count": 0, 138 | "outputs": [] 139 | }, 140 | { 141 | "cell_type": "code", 142 | "metadata": { 143 | "colab_type": "code", 144 | "id": "JM6O5GPAUI89", 145 | "colab": {} 146 | }, 147 | "source": [ 148 | "import glowyr as dtn\n", 149 | "import pydotplus\n", 150 | "from tableone import TableOne" 151 | ], 152 | "execution_count": 0, 153 | "outputs": [] 154 | }, 155 | { 156 | "cell_type": "markdown", 157 | "metadata": { 158 | "colab_type": "text", 159 | "id": "hq_09Hh-y17k" 160 | }, 161 | "source": [ 162 | "In this notebook we'll be looking at tree models, so we'll now install a package for visualizing these models." 163 | ] 164 | }, 165 | { 166 | "cell_type": "code", 167 | "metadata": { 168 | "colab_type": "code", 169 | "id": "jBMOwgwszGOw", 170 | "colab": {} 171 | }, 172 | "source": [ 173 | "!apt-get install graphviz -y" 174 | ], 175 | "execution_count": 0, 176 | "outputs": [] 177 | }, 178 | { 179 | "cell_type": "markdown", 180 | "metadata": { 181 | "colab_type": "text", 182 | "id": "LgcRCqxCk3HC" 183 | }, 184 | "source": [ 185 | "## Load the patient cohort\n", 186 | "\n", 187 | "Let's extract a cohort of patients admitted to the ICU from the emergency department. We link demographics data from the `patient` table to severity of illness score data in the `apachepatientresult` table. We exclude readmissions and neurological patients to help create a population suitable for our demonstration." 
188 | ] 189 | }, 190 | { 191 | "cell_type": "code", 192 | "metadata": { 193 | "colab_type": "code", 194 | "id": "ReCl7-aek1-k", 195 | "colab": {} 196 | }, 197 | "source": [ 198 | "# Link the patient, apachepatientresult, and apacheapsvar tables on patientunitstayid\n", 199 | "# using an inner join.\n", 200 | "query = \"\"\"\n", 201 | "SELECT p.unitadmitsource, p.gender, p.age, p.unittype, p.unitstaytype, \n", 202 | " a.actualhospitalmortality, a.acutePhysiologyScore, a.apacheScore\n", 203 | "FROM `physionet-data.eicu_crd_demo.patient` p\n", 204 | "INNER JOIN `physionet-data.eicu_crd_demo.apachepatientresult` a\n", 205 | "ON p.patientunitstayid = a.patientunitstayid\n", 206 | "WHERE a.apacheversion LIKE 'IVa'\n", 207 | "AND LOWER(p.unitadmitsource) LIKE \"%emergency%\"\n", 208 | "AND LOWER(p.unitstaytype) LIKE \"admit%\"\n", 209 | "AND LOWER(p.unittype) NOT LIKE \"%neuro%\";\n", 210 | "\"\"\"\n", 211 | "\n", 212 | "cohort = dtn.run_query(query,project_id)" 213 | ], 214 | "execution_count": 0, 215 | "outputs": [] 216 | }, 217 | { 218 | "cell_type": "code", 219 | "metadata": { 220 | "colab_type": "code", 221 | "id": "yxLctVBpk9sO", 222 | "colab": {} 223 | }, 224 | "source": [ 225 | "cohort.head()" 226 | ], 227 | "execution_count": 0, 228 | "outputs": [] 229 | }, 230 | { 231 | "cell_type": "markdown", 232 | "metadata": { 233 | "colab_type": "text", 234 | "id": "NPlwRV2buYb1" 235 | }, 236 | "source": [ 237 | "## Preparing the data for analysis\n", 238 | "\n", 239 | "Before continuing, we want to review our data, paying attention to factors such as:\n", 240 | "- data types (for example, are values recorded as characters or numerical values?) \n", 241 | "- missing data\n", 242 | "- distribution of values" 243 | ] 244 | }, 245 | { 246 | "cell_type": "code", 247 | "metadata": { 248 | "colab_type": "code", 249 | "id": "v3OJ4LDvueKu", 250 | "colab": {} 251 | }, 252 | "source": [ 253 | "# dataset info\n", 254 | "print(cohort.info())" 255 | ], 256 | "execution_count": 0, 257 | "outputs": [] 258 | }, 259 | { 260 | "cell_type": "code", 261 | "metadata": { 262 | "colab_type": "code", 263 | "id": "s4wQ6o_RvLph", 264 | "colab": {} 265 | }, 266 | "source": [ 267 | "# Encode the categorical data\n", 268 | "encoder = preprocessing.LabelEncoder()\n", 269 | "cohort['gender_code'] = encoder.fit_transform(cohort['gender'])\n", 270 | "cohort['actualhospitalmortality_code'] = encoder.fit_transform(cohort['actualhospitalmortality'])\n" 271 | ], 272 | "execution_count": 0, 273 | "outputs": [] 274 | }, 275 | { 276 | "cell_type": "markdown", 277 | "metadata": { 278 | "colab_type": "text", 279 | "id": "_1LYcNUdjQA5" 280 | }, 281 | "source": [ 282 | "In the eICU Collaborative Research Database, ages >89 years have been removed to comply with data sharing regulations. We will need to decide how to handle these ages. For simplicity, we will assign an age of 91.5 years to these patients." 
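One caveat before running the next cell: `errors='coerce'` turns every non-numeric entry into `NaN`, so the subsequent `fillna(91.5)` also fills any genuinely missing ages. The optional sketch below (an addition, not part of the original notebook, and assuming the deidentified demo entries are stored as a non-numeric string such as "> 89") flags those rows explicitly first so they can be told apart later.

```python
# Optional sketch: record which ages were deidentified (assumed to appear as a
# non-numeric string like "> 89") before coercing the column to numeric.
cohort['age_deidentified'] = cohort['age'].astype(str).str.contains('>').astype(int)
```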
283 | ] 284 | }, 285 | { 286 | "cell_type": "code", 287 | "metadata": { 288 | "colab_type": "code", 289 | "id": "4ogi_ns-ylnP", 290 | "colab": {} 291 | }, 292 | "source": [ 293 | "# Handle the deidentified ages\n", 294 | "cohort['age'] = pd.to_numeric(cohort['age'], downcast='integer', errors='coerce')\n", 295 | "cohort['age'] = cohort['age'].fillna(value=91.5)" 296 | ], 297 | "execution_count": 0, 298 | "outputs": [] 299 | }, 300 | { 301 | "cell_type": "code", 302 | "metadata": { 303 | "colab_type": "code", 304 | "id": "77M0QJQ5wcPQ", 305 | "colab": {} 306 | }, 307 | "source": [ 308 | "# Preview the encoded data\n", 309 | "cohort[['gender','gender_code']].head()" 310 | ], 311 | "execution_count": 0, 312 | "outputs": [] 313 | }, 314 | { 315 | "cell_type": "code", 316 | "metadata": { 317 | "colab_type": "code", 318 | "id": "GqvwTNPN3KZz", 319 | "colab": {} 320 | }, 321 | "source": [ 322 | "# Check the outcome variable\n", 323 | "cohort['actualhospitalmortality_code'].unique()" 324 | ], 325 | "execution_count": 0, 326 | "outputs": [] 327 | }, 328 | { 329 | "cell_type": "markdown", 330 | "metadata": { 331 | "colab_type": "text", 332 | "id": "OdGX1qWdkTgY" 333 | }, 334 | "source": [ 335 | "Now let's use the [tableone package](https://doi.org/10.1093/jamiaopen/ooy012\n", 336 | ") to review our dataset." 337 | ] 338 | }, 339 | { 340 | "cell_type": "code", 341 | "metadata": { 342 | "colab_type": "code", 343 | "id": "gIIsthy1WK3i", 344 | "colab": {} 345 | }, 346 | "source": [ 347 | "# View summary statistics\n", 348 | "pd.set_option('display.max_rows', 500)\n", 349 | "TableOne(cohort,groupby='actualhospitalmortality')" 350 | ], 351 | "execution_count": 0, 352 | "outputs": [] 353 | }, 354 | { 355 | "cell_type": "markdown", 356 | "metadata": { 357 | "colab_type": "text", 358 | "id": "IGtKlTG1gvRf" 359 | }, 360 | "source": [ 361 | "From these summary statistics, we can see that the average age is higher in the group of patients who do not survive. What other differences do you see?" 362 | ] 363 | }, 364 | { 365 | "cell_type": "markdown", 366 | "metadata": { 367 | "colab_type": "text", 368 | "id": "ze7y5J4Ioz8u" 369 | }, 370 | "source": [ 371 | "## Creating our train and test sets\n", 372 | "\n", 373 | "We only focus on two variables for our analysis, age and acute physiology score. Limiting ourselves to two variables will make it easier to visualize our models." 
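A small optional variant for the split created in the cells below (not used in this notebook): because in-hospital mortality is relatively infrequent in this cohort, you may want to stratify the split on the outcome so the training and test sets keep a similar event rate. The sketch assumes `X` and `y` have been defined as in the next cells.

```python
# Optional stratified variant of the split used below (sketch only).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=10)
```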
374 | ] 375 | }, 376 | { 377 | "cell_type": "code", 378 | "metadata": { 379 | "colab_type": "code", 380 | "id": "i5zXkn_AlDJW", 381 | "colab": {} 382 | }, 383 | "source": [ 384 | "features = ['age','acutePhysiologyScore']\n", 385 | "outcome = 'actualhospitalmortality_code'\n", 386 | "\n", 387 | "X = cohort[features]\n", 388 | "y = cohort[outcome]" 389 | ], 390 | "execution_count": 0, 391 | "outputs": [] 392 | }, 393 | { 394 | "cell_type": "code", 395 | "metadata": { 396 | "colab_type": "code", 397 | "id": "IHhIgDUwocmA", 398 | "colab": {} 399 | }, 400 | "source": [ 401 | "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)" 402 | ], 403 | "execution_count": 0, 404 | "outputs": [] 405 | }, 406 | { 407 | "cell_type": "code", 408 | "metadata": { 409 | "colab_type": "code", 410 | "id": "NvQWkuY6nkZ8", 411 | "colab": {} 412 | }, 413 | "source": [ 414 | "# Review the number of cases in each set\n", 415 | "print(\"Train data: {}\".format(len(X_train)))\n", 416 | "print(\"Test data: {}\".format(len(X_test)))" 417 | ], 418 | "execution_count": 0, 419 | "outputs": [] 420 | }, 421 | { 422 | "cell_type": "markdown", 423 | "metadata": { 424 | "colab_type": "text", 425 | "id": "b2waK5qBqanC" 426 | }, 427 | "source": [ 428 | "## Decision trees\n", 429 | "\n", 430 | "Let's build the simplest tree model we can think of: a classification tree with only one split. Decision trees of this form are commonly referred to under the umbrella term Classification and Regression Trees (CART) [1]. \n", 431 | "\n", 432 | "While we will only be looking at classification here, regression isn't too different. After grouping the data (which is essentially what a decision tree does), classification involves assigning all members of the group to the majority class of that group during training. Regression is the same, except you would assign the average value, not the majority. \n", 433 | "\n", 434 | "In the case of a decision tree with one split, often called a \"stump\", the model will partition the data into two groups, and assign classes for those two groups based on majority vote. There are many parameters available for the DecisionTreeClassifier class; by specifying max_depth=1 we will build a decision tree with only one split - i.e. of depth 1.\n", 435 | "\n", 436 | "[1] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth, Belmont, CA, 1984." 437 | ] 438 | }, 439 | { 440 | "cell_type": "code", 441 | "metadata": { 442 | "colab_type": "code", 443 | "id": "RlG3N3OYBqAm", 444 | "colab": {} 445 | }, 446 | "source": [ 447 | "# specify max_depth=1 so we train a stump, i.e. a tree with only 1 split\n", 448 | "mdl = tree.DecisionTreeClassifier(max_depth=1)\n", 449 | "\n", 450 | "# fit the model to the data - trying to predict y from X\n", 451 | "mdl = mdl.fit(X_train,y_train)" 452 | ], 453 | "execution_count": 0, 454 | "outputs": [] 455 | }, 456 | { 457 | "cell_type": "markdown", 458 | "metadata": { 459 | "colab_type": "text", 460 | "id": "8RlioUw8B_0O" 461 | }, 462 | "source": [ 463 | "Our model is so simple that we can look at the full decision tree." 
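The next cell draws the tree with graphviz. As a lightweight alternative (not part of the original notebook), scikit-learn can also print the same fitted stump as plain text:

```python
# Text view of the fitted stump; useful if graphviz/pydotplus is unavailable.
from sklearn.tree import export_text

print(export_text(mdl, feature_names=features))
```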
464 | ] 465 | }, 466 | { 467 | "cell_type": "code", 468 | "metadata": { 469 | "colab_type": "code", 470 | "id": "G2t9Nz8pBqEb", 471 | "colab": {} 472 | }, 473 | "source": [ 474 | "graph = dtn.create_graph(mdl,feature_names=features)\n", 475 | "Image(graph.create_png())" 476 | ], 477 | "execution_count": 0, 478 | "outputs": [] 479 | }, 480 | { 481 | "cell_type": "markdown", 482 | "metadata": { 483 | "colab_type": "text", 484 | "id": "E-iPwWWKCGY9" 485 | }, 486 | "source": [ 487 | "Here we see three nodes: a node at the top, a node in the lower left, and a node in the lower right.\n", 488 | "\n", 489 | "The top node is the root of the tree: it contains all the data. Let's read this node bottom to top:\n", 490 | "- `value = [384, 44]`: Current class balance. There are 384 observations of class 0 and 44 observations of class 1.\n", 491 | "- `samples = 428`: Number of samples assessed at this node.\n", 492 | "- `gini = 0.184`: Gini impurity, a measure of how mixed the classes are at this node (here, 1 - (384/428)^2 - (44/428)^2 = 0.184). The higher the value, the bigger the mix of classes. A 50/50 split of two classes would result in an index of 0.5.\n", 493 | "- `acutePhysiologyScore <= 78.5`: Decision rule learned by the node. In this case, patients with a score of <= 78.5 are moved into the left node and > 78.5 to the right. " 494 | ] 495 | }, 496 | { 497 | "cell_type": "markdown", 498 | "metadata": { 499 | "colab_type": "text", 500 | "id": "KS0UcZqUeJKz" 501 | }, 502 | "source": [ 503 | "The gini impurity is actually used by the algorithm to determine a split. The model evaluates every feature (in our case, age and score) at every possible split (46, 47, 48, ...) to find the point with the lowest gini impurity in the two resulting nodes. \n", 504 | "\n", 505 | "The approach is referred to as \"greedy\" because we are choosing the optimal split given our current state. Let's take a closer look at our decision boundary." 506 | ] 507 | }, 508 | { 509 | "cell_type": "code", 510 | "metadata": { 511 | "colab_type": "code", 512 | "id": "uXl22sNTtpHa", 513 | "colab": {} 514 | }, 515 | "source": [ 516 | "# look at the regions in a 2d plot\n", 517 | "# based on scikit-learn tutorial plot_iris.html\n", 518 | "plt.figure(figsize=[10,8])\n", 519 | "dtn.plot_model_pred_2d(mdl, X_train, y_train, \n", 520 | " title=\"Decision tree (depth 1)\")" 521 | ], 522 | "execution_count": 0, 523 | "outputs": [] 524 | }, 525 | { 526 | "cell_type": "markdown", 527 | "metadata": { 528 | "colab_type": "text", 529 | "id": "25zSX-inCNOJ" 530 | }, 531 | "source": [ 532 | "In this plot we can see the decision boundary on the y-axis, separating the predicted classes. The true classes are indicated at each point. Where the background and point colours are mismatched, there has been misclassification. Of course we are using a very simple model. Let's see what happens when we increase the depth."
533 | ] 534 | }, 535 | { 536 | "cell_type": "code", 537 | "metadata": { 538 | "colab_type": "code", 539 | "id": "ZuO62CL3CSGm", 540 | "colab": {} 541 | }, 542 | "source": [ 543 | "mdl = tree.DecisionTreeClassifier(max_depth=5)\n", 544 | "mdl = mdl.fit(X_train,y_train)" 545 | ], 546 | "execution_count": 0, 547 | "outputs": [] 548 | }, 549 | { 550 | "cell_type": "code", 551 | "metadata": { 552 | "colab_type": "code", 553 | "id": "A88Vi83LCSJ6", 554 | "colab": {} 555 | }, 556 | "source": [ 557 | "plt.figure(figsize=[10,8])\n", 558 | "dtn.plot_model_pred_2d(mdl, X_train, y_train,\n", 559 | " title=\"Decision tree (depth 5)\")" 560 | ], 561 | "execution_count": 0, 562 | "outputs": [] 563 | }, 564 | { 565 | "cell_type": "markdown", 566 | "metadata": { 567 | "colab_type": "text", 568 | "id": "B88XlKDtCYmn" 569 | }, 570 | "source": [ 571 | "Now our tree is more complicated! We can see a few vertical boundaries as well as the horizontal one from before. Some of these we may like, but some appear unnatural. Let's look at the tree itself." 572 | ] 573 | }, 574 | { 575 | "cell_type": "code", 576 | "metadata": { 577 | "colab_type": "code", 578 | "id": "V1VLrOJJCcWo", 579 | "colab": {} 580 | }, 581 | "source": [ 582 | "graph = dtn.create_graph(mdl,feature_names=features)\n", 583 | "Image(graph.create_png())" 584 | ], 585 | "execution_count": 0, 586 | "outputs": [] 587 | }, 588 | { 589 | "cell_type": "markdown", 590 | "metadata": { 591 | "colab_type": "text", 592 | "id": "Ton_EnvFqHIO" 593 | }, 594 | "source": [ 595 | "Looking at the tree, we can see that there are some very specific rules. Consider our patient aged 65 years with an acute physiology score of 87. From the top of the tree, we would work our way down:\n", 596 | "\n", 597 | "- acutePhysiologyScore <= 78.5? No.\n", 598 | "- acutePhysiologyScore <= 106.5? Yes.\n", 599 | "- age <= 75.5? Yes.\n", 600 | "- age <= 66? Yes.\n", 601 | "- age <= 62.5? No. \n", 602 | "\n", 603 | "This leads us to our single node with a gini impurity of 0. Having an entire rule based upon this one observation seems silly, but it is perfectly logical at the moment: the only objective the algorithm cares about is minimizing the gini impurity. \n", 604 | "\n", 605 | "We are at risk of overfitting our data! This is where \"pruning\" comes in." 606 | ] 607 | }, 608 | { 609 | "cell_type": "code", 610 | "metadata": { 611 | "colab_type": "code", 612 | "id": "VvsNIjCDDIo_", 613 | "colab": {} 614 | }, 615 | "source": [ 616 | "# let's prune the model and look again\n", 617 | "mdl = dtn.prune(mdl, min_samples_leaf = 10)\n", 618 | "graph = dtn.create_graph(mdl,feature_names=features)\n", 619 | "Image(graph.create_png()) " 620 | ], 621 | "execution_count": 0, 622 | "outputs": [] 623 | }, 624 | { 625 | "cell_type": "markdown", 626 | "metadata": { 627 | "colab_type": "text", 628 | "id": "8pRzzV2VvdxP" 629 | }, 630 | "source": [ 631 | "Above, we can see that our second tree is (1) smaller in depth, and (2) never splits a node with <= 10 samples. 
We can look at the decision surface for this tree:" 632 | ] 633 | }, 634 | { 635 | "cell_type": "code", 636 | "metadata": { 637 | "colab_type": "code", 638 | "id": "5LyGDz-Cr-mU", 639 | "colab": {} 640 | }, 641 | "source": [ 642 | "plt.figure(figsize=[10,8])\n", 643 | "dtn.plot_model_pred_2d(mdl, X_train, y_train, title=\"Pruned decision tree\")" 644 | ], 645 | "execution_count": 0, 646 | "outputs": [] 647 | }, 648 | { 649 | "cell_type": "markdown", 650 | "metadata": { 651 | "colab_type": "text", 652 | "id": "xAnqmD_Dv_dh" 653 | }, 654 | "source": [ 655 | "Our pruned decision tree has a much more intuitive boundary, but does make some errors. We have reduced our performance in an effort to simplify the tree. This is the classic machine learning problem of trading off complexity with error.\n", 656 | "\n", 657 | "Note that, in order to do this, we \"invented\" the minimum samples per leaf node of 10. Why 10? Why not 5? Why not 20? The answer is: it depends on the dataset. Heuristically choosing these parameters can be time consuming, and we will see later on how gradient boosting elegantly handles this task." 658 | ] 659 | }, 660 | { 661 | "cell_type": "markdown", 662 | "metadata": { 663 | "colab_type": "text", 664 | "id": "2EFINpj-wD7H" 665 | }, 666 | "source": [ 667 | "## Decision trees have high \"variance\"\n", 668 | "\n", 669 | "Before we move on to boosting, it will be useful to demonstrate how decision trees have high \"variance\". In this context, variance refers to the tendency of some models to have a wide range of performance given random samples of data. Let's take a look at randomly slicing the data we have to see what that means." 670 | ] 671 | }, 672 | { 673 | "cell_type": "code", 674 | "metadata": { 675 | "colab_type": "code", 676 | "id": "JT7fuuj6vjKB", 677 | "colab": {} 678 | }, 679 | "source": [ 680 | "np.random.seed(123)\n", 681 | "\n", 682 | "fig = plt.figure(figsize=[12,3])\n", 683 | "\n", 684 | "for i in range(3):\n", 685 | " ax = fig.add_subplot(1,3,i+1)\n", 686 | "\n", 687 | " # generate indices in a random order\n", 688 | " idx = np.random.permutation(X_train.shape[0])\n", 689 | " \n", 690 | " # only use the first 50\n", 691 | " idx = idx[:50]\n", 692 | " X_temp = X_train.iloc[idx]\n", 693 | " y_temp = y_train.values[idx]\n", 694 | " \n", 695 | " # initialize the model\n", 696 | " mdl = tree.DecisionTreeClassifier(max_depth=5)\n", 697 | " \n", 698 | " # train the model using the dataset\n", 699 | " mdl = mdl.fit(X_temp, y_temp)\n", 700 | " txt = 'Random sample {}'.format(i)\n", 701 | " dtn.plot_model_pred_2d(mdl, X_temp, y_temp, title=txt)" 702 | ], 703 | "execution_count": 0, 704 | "outputs": [] 705 | }, 706 | { 707 | "cell_type": "markdown", 708 | "metadata": { 709 | "colab_type": "text", 710 | "id": "j6VTIDr-yRRZ" 711 | }, 712 | "source": [ 713 | "Above we can see that we are using random subsets of data, and as a result, our decision boundary can change quite a bit. As you could guess, we actually don't want a model that randomly works well and randomly works poorly, so you may wonder why this is useful. \n", 714 | "\n", 715 | "The trick is that by combining many instances of \"high variance\" classifiers (decision trees), we can end up with a single classifier with low variance. There is an old joke: two farmers and a statistician go hunting. They see a deer: the first farmer shoots, and misses to the left. The next farmer shoots, and misses to the right. 
The statistician yells \"We got it!!\".\n", 716 | "\n", 717 | "While it doesn't quite hold in real life, it turns out that this principle does hold for decision trees. Combining them in the right way ends up building powerful models." 718 | ] 719 | }, 720 | { 721 | "cell_type": "markdown", 722 | "metadata": { 723 | "colab_type": "text", 724 | "id": "iWnKvx6myf9Z" 725 | }, 726 | "source": [ 727 | "## Boosting\n", 728 | "\n", 729 | "The premise of boosting is the combination of many weak learners to form a single \"strong\" learner. In a nutshell, boosting involves building models iteratively. At each step we focus on the data on which we performed poorly. \n", 730 | "\n", 731 | "In our context, we'll use decision trees, so the first step would be to build a tree using the data. Next, we'd look at the data that we misclassified, and re-weight the data so that we really want to classify those observations correctly, at the cost of maybe getting some of the other data wrong this time. Let's see how this works in practice." 732 | ] 733 | }, 734 | { 735 | "cell_type": "code", 736 | "metadata": { 737 | "colab_type": "code", 738 | "id": "YJWxu0bTwRzD", 739 | "colab": {} 740 | }, 741 | "source": [ 742 | "# build the model\n", 743 | "clf = tree.DecisionTreeClassifier(max_depth=1)\n", 744 | "mdl = ensemble.AdaBoostClassifier(base_estimator=clf,n_estimators=6)\n", 745 | "mdl = mdl.fit(X_train,y_train)\n", 746 | "\n", 747 | "# plot each individual decision tree\n", 748 | "fig = plt.figure(figsize=[12,6])\n", 749 | "for i, estimator in enumerate(mdl.estimators_):\n", 750 | " ax = fig.add_subplot(2,3,i+1)\n", 751 | " txt = 'Tree {}'.format(i+1)\n", 752 | " dtn.plot_model_pred_2d(estimator, X_train, y_train, title=txt)" 753 | ], 754 | "execution_count": 0, 755 | "outputs": [] 756 | }, 757 | { 758 | "cell_type": "markdown", 759 | "metadata": { 760 | "colab_type": "text", 761 | "id": "5zNfvDjTzh2U" 762 | }, 763 | "source": [ 764 | "Looking at our example above, we can see that the first iteration builds the exact same simple decision tree as we had seen earlier. This makes sense. It is using the entire dataset with no special weighting. \n", 765 | "\n", 766 | "In the next iteration we can see the model shift. It misclassified several observations in class 1, and now these are the most important observations. Consequently, it picks a boundary that prioritizes classifying these observations correctly, while still trying to classify the rest of the data as well as possible. \n", 767 | "\n", 768 | "The iteration process continues until the model is apparently creating boundaries to capture just one or two observations (see, for example, Tree 6 on the bottom right). \n", 769 | "\n", 770 | "One important point is that each tree is weighted by its global error. So, for example, Tree 6 would carry less weight in the final model. It is clear that we wouldn't want Tree 6 to carry the same importance as Tree 1, when Tree 1 is doing so much better overall. 
It turns out that weighting each tree by the inverse of its error is a pretty good way to do this.\n", 771 | "\n", 772 | "Let's look at the final model's decision surface.\n" 773 | ] 774 | }, 775 | { 776 | "cell_type": "code", 777 | "metadata": { 778 | "colab_type": "code", 779 | "id": "3pVG5ytfzp_B", 780 | "colab": {} 781 | }, 782 | "source": [ 783 | "# plot the final prediction\n", 784 | "plt.figure(figsize=[9,5])\n", 785 | "txt = 'Boosted tree (final decision surface)'\n", 786 | "dtn.plot_model_pred_2d(mdl, X_train, y_train, title=txt)" 787 | ], 788 | "execution_count": 0, 789 | "outputs": [] 790 | }, 791 | { 792 | "cell_type": "markdown", 793 | "metadata": { 794 | "colab_type": "text", 795 | "id": "YRGRFjRgz26h" 796 | }, 797 | "source": [ 798 | "And that's AdaBoost! There are a few tricks we have glossed over here, but you understand the general principle. Now we'll move on to a different approach. With boosting, we iteratively changed the dataset to have new trees focus on the \"difficult\" observations. The next approach we discuss is similar, in that it also involves using modified versions of our dataset to build new trees." 799 | ] 800 | }, 801 | { 802 | "cell_type": "markdown", 803 | "metadata": { 804 | "colab_type": "text", 805 | "id": "EFNDNsIpfP7j" 806 | }, 807 | "source": [ 808 | "## Bagging\n", 809 | "\n", 810 | "Bootstrap aggregation, or \"Bagging\", is another form of *ensemble learning* where we aim to build a single good model by combining many models together. With AdaBoost, we modified the data to focus on hard-to-classify observations. We can imagine this as a form of resampling the data for each new tree. For example, say we have three observations: A, B, and C, `[A, B, C]`. If we correctly classify observations `[A, B]`, but incorrectly classify `C`, then AdaBoost involves building a new tree that focuses on `C`. Equivalently, we could say AdaBoost builds a new tree using the dataset `[A, B, C, C, C]`, where we have *intentionally* included observation `C` 3 times so that the algorithm thinks it is 3 times as important as the other observations. Make sense?\n", 811 | "\n", 812 | "Bagging involves the same approach, except we don't selectively choose which observations to focus on, but rather we *randomly select subsets of data each time*. As you can see, while this is a similar process to AdaBoost, the concept is quite different. Whereas before we aimed to iteratively improve our overall model with new trees, we now build trees on what we hope are independent datasets.\n", 813 | "\n", 814 | "Let's take a step back, and think about a practical example. Say we wanted a good model of heart disease. If we saw researchers build a model from a dataset of patients from their hospital, we would be happy. If they then acquired a new dataset from new patients, and built a new model, we'd be inclined to feel that the combination of the two models would be better than any one individually. This exact scenario is what bagging aims to replicate, except instead of actually going out and collecting new datasets, we instead use bootstrapping to create new sets of data from our current dataset. If you are unfamiliar with bootstrapping, you can treat it as \"magic\" for now (and if you are familiar with the bootstrap, you already know that it is magic).\n", 815 | "\n", 816 | "Let's take a look at a simple bootstrap model." 
817 | ] 818 | }, 819 | { 820 | "cell_type": "code", 821 | "metadata": { 822 | "colab_type": "code", 823 | "id": "JrXAspvrzv8x", 824 | "colab": {} 825 | }, 826 | "source": [ 827 | "np.random.seed(321)\n", 828 | "clf = tree.DecisionTreeClassifier(max_depth=5)\n", 829 | "mdl = ensemble.BaggingClassifier(base_estimator=clf, n_estimators=6)\n", 830 | "mdl = mdl.fit(X_train, y_train)\n", 831 | "\n", 832 | "fig = plt.figure(figsize=[12,6])\n", 833 | "for i, estimator in enumerate(mdl.estimators_): \n", 834 | " ax = fig.add_subplot(2,3,i+1)\n", 835 | " txt = 'Tree {}'.format(i+1)\n", 836 | " dtn.plot_model_pred_2d(estimator, X_train, y_train, \n", 837 | " title=txt)" 838 | ], 839 | "execution_count": 0, 840 | "outputs": [] 841 | }, 842 | { 843 | "cell_type": "markdown", 844 | "metadata": { 845 | "colab_type": "text", 846 | "id": "s3kKUPORfW9F" 847 | }, 848 | "source": [ 849 | "We can see that each individual tree is quite variable. This is a result of using a random set of data to train the classifier." 850 | ] 851 | }, 852 | { 853 | "cell_type": "code", 854 | "metadata": { 855 | "colab_type": "code", 856 | "id": "w_D7_-0HfVMy", 857 | "colab": {} 858 | }, 859 | "source": [ 860 | "# plot the final prediction\n", 861 | "plt.figure(figsize=[8,5])\n", 862 | "txt = 'Bagged tree (final decision surface)'\n", 863 | "dtn.plot_model_pred_2d(mdl, X_train, y_train, title=txt)" 864 | ], 865 | "execution_count": 0, 866 | "outputs": [] 867 | }, 868 | { 869 | "cell_type": "markdown", 870 | "metadata": { 871 | "colab_type": "text", 872 | "id": "AOFnG0r6faLS" 873 | }, 874 | "source": [ 875 | "Not bad! Of course, since this is a simple dataset, we are not seeing that many dramatic changes between different models. Don't worry, we'll quantitatively evaluate them later. \n", 876 | "\n", 877 | "Next up, a minor addition creates one of the most popular models in machine learning." 878 | ] 879 | }, 880 | { 881 | "cell_type": "markdown", 882 | "metadata": { 883 | "colab_type": "text", 884 | "id": "aiqrVfYtfcYk" 885 | }, 886 | "source": [ 887 | "## Random Forest\n", 888 | "\n", 889 | "In the previous example, we used bagging to randomly resample our data to generate \"new\" datasets. The Random Forest takes this one step further: instead of just resampling our data, we also select only a fraction of the features to include. \n", 890 | "\n", 891 | "It turns out that this subselection tends to improve the performance of our models. The odds of an individual tree being very good or very bad are higher (i.e. the variance of the trees is increased), and this ends up giving us a final model with better overall performance (lower bias).\n", 892 | "\n", 893 | "Let's train the model." 
894 | ] 895 | }, 896 | { 897 | "cell_type": "code", 898 | "metadata": { 899 | "colab_type": "code", 900 | "id": "u27LS36_fglG", 901 | "colab": {} 902 | }, 903 | "source": [ 904 | "np.random.seed(321)\n", 905 | "mdl = ensemble.RandomForestClassifier(max_depth=5, n_estimators=6, max_features=1)\n", 906 | "mdl = mdl.fit(X_train,y_train)\n", 907 | "\n", 908 | "fig = plt.figure(figsize=[12,6])\n", 909 | "for i, estimator in enumerate(mdl.estimators_): \n", 910 | " ax = fig.add_subplot(2,3,i+1)\n", 911 | " txt = 'Tree {}'.format(i+1)\n", 912 | " dtn.plot_model_pred_2d(estimator, X_train, y_train, title=txt)" 913 | ], 914 | "execution_count": 0, 915 | "outputs": [] 916 | }, 917 | { 918 | "cell_type": "code", 919 | "metadata": { 920 | "colab_type": "code", 921 | "id": "5aG0PI8lruGN", 922 | "colab": {} 923 | }, 924 | "source": [ 925 | "plt.figure(figsize=[9,5])\n", 926 | "txt = 'Random forest (final decision surface)'\n", 927 | "dtn.plot_model_pred_2d(mdl, X_train, y_train, title=txt)" 928 | ], 929 | "execution_count": 0, 930 | "outputs": [] 931 | }, 932 | { 933 | "cell_type": "markdown", 934 | "metadata": { 935 | "colab_type": "text", 936 | "id": "2KmJuztXfjzm" 937 | }, 938 | "source": [ 939 | "Again, the visualization doesn't *really* show us the power of Random Forests, but we'll quantitatively evaluate them soon enough.\n", 940 | "\n", 941 | "Last, and not least, we move on to gradient boosting." 942 | ] 943 | }, 944 | { 945 | "cell_type": "markdown", 946 | "metadata": { 947 | "colab_type": "text", 948 | "id": "LTP8zFIofl2v" 949 | }, 950 | "source": [ 951 | "## Gradient Boosting\n", 952 | "\n", 953 | "Gradient boosting, our last topic, elegantly combines concepts from the previous methods. As a \"boosting\" method, gradient boosting involves iteratively building trees, aiming to improve upon misclassifications of the previous tree. Gradient boosting also borrows the concept of sub-sampling the variables (just like Random Forests), which can help to prevent overfitting.\n", 954 | "\n", 955 | "While it is hard to express in this non-technical tutorial, the biggest innovation in gradient boosting is that it provides a unifying mathematical framework for boosting models. The approach explicitly casts the problem of building a tree as an optimization problem, defining mathematical functions for how well a tree is performing (which we had before) *and* how complex a tree is. In this light, one can actually treat AdaBoost as a \"special case\" of gradient boosting, where the loss function is chosen to be the exponential loss.\n", 956 | "\n", 957 | "Let's build a gradient boosting model." 958 | ] 959 | }, 960 | { 961 | "cell_type": "code", 962 | "metadata": { 963 | "colab_type": "code", 964 | "id": "L_QVZ9oNfnqk", 965 | "colab": {} 966 | }, 967 | "source": [ 968 | "np.random.seed(321)\n", 969 | "mdl = ensemble.GradientBoostingClassifier(n_estimators=10)\n", 970 | "mdl = mdl.fit(X_train, y_train)\n", 971 | "\n", 972 | "plt.figure(figsize=[9,5])\n", 973 | "txt = 'Gradient boosted tree (final decision surface)'\n", 974 | "dtn.plot_model_pred_2d(mdl, X_train, y_train, title=txt)" 975 | ], 976 | "execution_count": 0, 977 | "outputs": [] 978 | }, 979 | { 980 | "cell_type": "markdown", 981 | "metadata": { 982 | "colab_type": "text", 983 | "id": "tcCzP4gAsd7L" 984 | }, 985 | "source": [ 986 | "## Comparing model performance\n", 987 | "\n", 988 | "We've now learned the basics of the various tree methods and have visualized most of them. 
Let's finish by comparing the performance of our models on our held-out test data. Our goal, remember, is to predict whether or not a patient will survive their hospital stay using the patient's age and acute physiology score computed on the first day of their ICU stay." 989 | ] 990 | }, 991 | { 992 | "cell_type": "code", 993 | "metadata": { 994 | "colab_type": "code", 995 | "id": "tQST4TQAtHmU", 996 | "colab": {} 997 | }, 998 | "source": [ 999 | "clf = dict()\n", 1000 | "clf['Decision Tree'] = tree.DecisionTreeClassifier(criterion='entropy', splitter='best').fit(X_train,y_train)\n", 1001 | "clf['Gradient Boosting'] = ensemble.GradientBoostingClassifier(n_estimators=10).fit(X_train, y_train)\n", 1002 | "clf['Random Forest'] = ensemble.RandomForestClassifier(n_estimators=10).fit(X_train, y_train)\n", 1003 | "clf['Bagging'] = ensemble.BaggingClassifier(n_estimators=10).fit(X_train, y_train)\n", 1004 | "clf['AdaBoost'] = ensemble.AdaBoostClassifier(n_estimators=10).fit(X_train, y_train)\n", 1005 | "\n", 1006 | "fig = plt.figure(figsize=[10,10])\n", 1007 | "\n", 1008 | "print('AUROC\\tModel')\n", 1009 | "for i, curr_mdl in enumerate(clf): \n", 1010 | " yhat = clf[curr_mdl].predict_proba(X_test)[:,1]\n", 1011 | " score = metrics.roc_auc_score(y_test, yhat)\n", 1012 | " print('{:0.3f}\\t{}'.format(score, curr_mdl))\n", 1013 | " ax = fig.add_subplot(3,2,i+1)\n", 1014 | " dtn.plot_model_pred_2d(clf[curr_mdl], X_test, y_test, title=curr_mdl)\n", 1015 | " " 1016 | ], 1017 | "execution_count": 0, 1018 | "outputs": [] 1019 | }, 1020 | { 1021 | "cell_type": "markdown", 1022 | "metadata": { 1023 | "colab_type": "text", 1024 | "id": "osr6iM6ltLAP" 1025 | }, 1026 | "source": [ 1027 | "Here we can see that quantitatively, gradient boosting has produced the highest discrimination among all the models (~0.91). You'll see that some of the models appear to have simpler decision surfaces, which tends to result in improved generalization on a held-out test set (though not always!).\n", 1028 | "\n", 1029 | "To make appropriate comparisons, we should calculate 95% confidence intervals on these performance estimates. This can be done a number of ways. A simple but effective approach is to use bootstrapping, a resampling technique. In bootstrapping, we generate multiple datasets from the test set (allowing the same data point to be sampled multiple times). Using these datasets, we can then estimate the confidence intervals." 1030 | ] 1031 | } 1032 | ] 1033 | } --------------------------------------------------------------------------------
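
The notebook's closing paragraph describes bootstrap confidence intervals but stops before computing them. Below is a minimal sketch of that final step, reusing the `clf` dictionary, `X_test`, and `y_test` from the comparison cell above; the helper name `bootstrap_auroc_ci`, the 1,000 resamples, and the fixed random seed are illustrative choices rather than part of the original notebook.

```python
import numpy as np
from sklearn import metrics

def bootstrap_auroc_ci(model, X_test, y_test, n_boot=1000, alpha=0.05, seed=123):
    """Estimate a (1 - alpha) confidence interval for AUROC by resampling the test set."""
    rng = np.random.RandomState(seed)
    y_test = np.asarray(y_test)
    scores = []
    for _ in range(n_boot):
        # draw a bootstrap sample of test-set indices (sampling with replacement)
        idx = rng.randint(0, len(y_test), len(y_test))
        # skip resamples containing a single class, where AUROC is undefined
        if len(np.unique(y_test[idx])) < 2:
            continue
        yhat = model.predict_proba(X_test.iloc[idx])[:, 1]
        scores.append(metrics.roc_auc_score(y_test[idx], yhat))
    lower, upper = np.percentile(scores, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lower, upper

# report a 95% confidence interval for each of the fitted models
for name, model in clf.items():
    lo, hi = bootstrap_auroc_ci(model, X_test, y_test)
    print('{}: 95% CI {:.3f} to {:.3f}'.format(name, lo, hi))
```

Overlapping intervals would suggest that the apparent differences in AUROC between the models may not be meaningful on a test set of this size.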