├── mimic_r
│   ├── README.md
│   └── length_of_stay.Rmd
├── mimic_python
│   ├── README.md
│   └── summary_stats.ipynb
├── requirements.txt
├── eicu_r
│   └── README.md
├── eicu_python
│   ├── README.md
│   ├── 01_explore_patients.ipynb
│   ├── 01_explore_patients_satoshi.ipynb
│   ├── 02_severity_of_illness.ipynb
│   ├── 02_severity_of_illness_satoshi.ipynb
│   ├── 03_summary_statistics.ipynb
│   ├── 03_summary_statistics_satoshi.ipynb
│   ├── 04_prediction.ipynb
│   └── 04_prediction_satoshi.ipynb
├── README.md
├── LICENSE
└── .gitignore
/mimic_r/README.md:
--------------------------------------------------------------------------------
1 | # Materials for analyzing MIMIC in R
2 |
3 | Coming soon...
--------------------------------------------------------------------------------
/mimic_python/README.md:
--------------------------------------------------------------------------------
1 | # Materials for analyzing MIMIC in Python
2 |
3 | Coming soon...
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | matplotlib==3.0.2
2 | numpy==1.15.4
3 | pandas==0.23.4
4 | tableone==0.6.0
--------------------------------------------------------------------------------
/eicu_r/README.md:
--------------------------------------------------------------------------------
1 | # Materials for analyzing the eICU Collaborative Database in R
2 |
3 | Coming soon...
--------------------------------------------------------------------------------
/eicu_python/README.md:
--------------------------------------------------------------------------------
1 | # Materials for analyzing the eICU Collaborative Database in Python
2 |
3 | Coming soon...
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # 2019 Tokyo Datathon: Interactive workshop on prediction using the eICU Collaborative Research Database
2 |
3 | ## Tom Pollard, Stephanie Ko, and Satoshi Kimura
4 |
5 | Materials for the [2019 Tokyo Datathon](http://datathon-japan.jp/2019/).
6 |
7 | ## Setup details
8 |
9 | - Project name: datathonjapan2019
10 | - Permissions group: datathonjapan2019
11 |
12 | ## Introduction
13 |
14 | The folders in this repository contain notebooks demonstrating how to connect to and explore the [MIMIC-III](https://mimic.physionet.org/) and [eICU Collaborative Research](https://eicu-crd.mit.edu/) databases.
15 |
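16 | ## Querying the data
17 |
18 | Each notebook authenticates with Google, sets the billing project, and then runs SQL against BigQuery. Below is a minimal sketch of that pattern, assuming the `pandas-gbq` backend is installed; the notebooks themselves wrap the final step in a `run_query` helper from the `datathon2` package.
19 |
20 | ```python
21 | import os
22 | import pandas as pd
23 |
24 | # On Colab, authenticate before querying:
25 | #   from google.colab import auth
26 | #   auth.authenticate_user()
27 |
28 | # Queries are billed to this project (see "Setup details" above)
29 | project_id = "datathonjapan2019"
30 | os.environ["GOOGLE_CLOUD_PROJECT"] = project_id
31 |
32 | # Standard SQL query against the eICU-CRD demo dataset
33 | query = """
34 | SELECT *
35 | FROM `physionet-data.eicu_crd_demo.patient`
36 | """
37 |
38 | # Run the query on BigQuery and return the result as a DataFrame
39 | patient = pd.read_gbq(query, project_id=project_id, dialect="standard")
40 | print(patient.head())
41 | ```
42 |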
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2019 MIT Laboratory for Computational Physiology
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/mimic_r/length_of_stay.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Length of stay in the ICU"
3 | author: "Tom Pollard"
4 | description: "Length of stay in the ICU for patients in MIMIC-III"
5 | output: pdf_document
6 | date: "10/10/2017"
7 | ---
8 |
9 | ```{r setup, include = FALSE}
10 | knitr::opts_chunk$set(echo = TRUE)
11 |
12 | # install.packages("ggplot2")
13 | # install.packages("bigrquery")
14 |
15 | library("ggplot2")
16 | library("bigrquery")
17 | ```
18 |
19 |
20 | ```{r dbconnect, include=FALSE}
21 | # Load configuration settings
22 | project_id <- "hst-953-2018"
23 | options(httr_oauth_cache=TRUE)
24 |
25 | run_query <- function(query){
26 | data <- query_exec(query, project=project_id, use_legacy_sql = FALSE)
27 | return(data)
28 | }
29 | ```
30 |
31 |
32 | ```{r load_data, include=FALSE}
33 | sql_query <- "SELECT i.subject_id, i.hadm_id, i.los
34 | FROM `physionet-data.mimiciii_demo.icustays` i;"
35 |
36 | data <- run_query(sql_query)
37 |
38 | head(data)
39 | ```
40 |
41 | This document shows how R Markdown can be used to create a reproducible analysis using MIMIC-III (version 1.4). Let's calculate the median length of stay in the ICU and then include this value in our document.
42 |
43 | ```{r calculate_median_los, include=FALSE}
44 | median_los <- median(data$los, na.rm = TRUE)
45 | rounded_median_los <- round(median_los, digits = 2)
46 | ```
47 |
48 | So the median length of stay in the ICU is `r median_los` days. Rounded to two decimal places, this is `r rounded_median_los` days. We can plot the distribution of length of stay using the `qplot` function:
49 |
50 |
51 | ```{r plot_los, echo=FALSE, include=TRUE, warning = FALSE}
52 | qplot(data$los, geom = "histogram", xlim = c(0, 25), binwidth = 1,
53 |       xlab = "Length of stay in the ICU, days", fill = I("#FF9999"), col = I("white"))
54 | ```
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | # Byte-compiled / optimized / DLL files
2 | __pycache__/
3 | *.py[cod]
4 | *$py.class
5 |
6 | # C extensions
7 | *.so
8 |
9 | # Distribution / packaging
10 | .Python
11 | build/
12 | develop-eggs/
13 | dist/
14 | downloads/
15 | eggs/
16 | .eggs/
17 | lib/
18 | lib64/
19 | parts/
20 | sdist/
21 | var/
22 | wheels/
23 | *.egg-info/
24 | .installed.cfg
25 | *.egg
26 | MANIFEST
27 |
28 | # PyInstaller
29 | # Usually these files are written by a python script from a template
30 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
31 | *.manifest
32 | *.spec
33 |
34 | # Installer logs
35 | pip-log.txt
36 | pip-delete-this-directory.txt
37 |
38 | # Unit test / coverage reports
39 | htmlcov/
40 | .tox/
41 | .coverage
42 | .coverage.*
43 | .cache
44 | nosetests.xml
45 | coverage.xml
46 | *.cover
47 | .hypothesis/
48 | .pytest_cache/
49 |
50 | # Translations
51 | *.mo
52 | *.pot
53 |
54 | # Django stuff:
55 | *.log
56 | local_settings.py
57 | db.sqlite3
58 |
59 | # Flask stuff:
60 | instance/
61 | .webassets-cache
62 |
63 | # Scrapy stuff:
64 | .scrapy
65 |
66 | # Sphinx documentation
67 | docs/_build/
68 |
69 | # PyBuilder
70 | target/
71 |
72 | # Jupyter Notebook
73 | .ipynb_checkpoints
74 |
75 | # pyenv
76 | .python-version
77 |
78 | # celery beat schedule file
79 | celerybeat-schedule
80 |
81 | # SageMath parsed files
82 | *.sage.py
83 |
84 | # Environments
85 | .env
86 | .venv
87 | env/
88 | venv/
89 | ENV/
90 | env.bak/
91 | venv.bak/
92 |
93 | # Spyder project settings
94 | .spyderproject
95 | .spyproject
96 |
97 | # Rope project settings
98 | .ropeproject
99 |
100 | # mkdocs documentation
101 | /site
102 |
103 | # mypy
104 | .mypy_cache/
105 |
106 | # Mac OSX
107 | .DS_Store
108 |
--------------------------------------------------------------------------------
/mimic_python/summary_stats.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "nbformat": 4,
3 | "nbformat_minor": 0,
4 | "metadata": {
5 | "colab": {
6 | "name": "03-summary-statistics",
7 | "version": "0.3.2",
8 | "provenance": [],
9 | "collapsed_sections": [],
10 | "include_colab_link": true
11 | },
12 | "kernelspec": {
13 | "display_name": "Python 3",
14 | "language": "python",
15 | "name": "python3"
16 | }
17 | },
18 | "cells": [
19 | {
20 | "cell_type": "markdown",
21 | "metadata": {
22 | "id": "view-in-github",
23 | "colab_type": "text"
24 | },
25 | "source": [
26 |         "Open in Colab"
27 | ]
28 | },
29 | {
30 | "metadata": {
31 | "colab_type": "text",
32 | "id": "1G_TVh1ybQkl"
33 | },
34 | "cell_type": "markdown",
35 | "source": [
36 | "# MIMIC-III\n",
37 | "\n",
38 | "# Summary statistics\n",
39 | "\n",
40 | "This notebook shows how summary statistics can be computed for a patient cohort using the `tableone` package. Usage instructions for tableone are at: https://pypi.org/project/tableone/\n",
41 | "\n",
42 | "このノートブックでは、`tableone`というパッケージを用いて、データの分布などの詳細について見ていきます。"
43 | ]
44 | },
45 | {
46 | "metadata": {
47 | "colab_type": "text",
48 | "id": "L9XF77F2bnee"
49 | },
50 | "cell_type": "markdown",
51 | "source": [
52 | "## Load libraries and connect to the database"
53 | ]
54 | },
55 | {
56 | "metadata": {
57 | "colab_type": "code",
58 | "id": "wXiSE558bn_w",
59 | "colab": {}
60 | },
61 | "cell_type": "code",
62 | "source": [
63 | "# Import libraries\n",
64 | "import numpy as np\n",
65 | "import os\n",
66 | "import pandas as pd\n",
67 | "import matplotlib.pyplot as plt\n",
68 | "import matplotlib.patches as patches\n",
69 | "import matplotlib.path as path\n",
70 | "\n",
71 | "# Make pandas dataframes prettier\n",
72 | "from IPython.display import display, HTML\n",
73 | "\n",
74 | "# Access data using Google BigQuery.\n",
75 | "from google.colab import auth\n",
76 | "from google.cloud import bigquery"
77 | ],
78 | "execution_count": 0,
79 | "outputs": []
80 | },
81 | {
82 | "metadata": {
83 | "colab_type": "code",
84 | "id": "pLGnLAy-bsKb",
85 | "colab": {}
86 | },
87 | "cell_type": "code",
88 | "source": [
89 | "# authenticate\n",
90 | "auth.authenticate_user()"
91 | ],
92 | "execution_count": 0,
93 | "outputs": []
94 | },
95 | {
96 | "metadata": {
97 | "colab_type": "code",
98 | "id": "PUjFDFdobszs",
99 | "colab": {}
100 | },
101 | "cell_type": "code",
102 | "source": [
103 | "# Set up environment variables\n",
104 | "project_id='datathonjapan2019'\n",
105 | "os.environ[\"GOOGLE_CLOUD_PROJECT\"]=project_id"
106 | ],
107 | "execution_count": 0,
108 | "outputs": []
109 | },
110 | {
111 | "metadata": {
112 | "colab_type": "text",
113 | "id": "ObJnK4WTUBKo"
114 | },
115 | "cell_type": "markdown",
116 | "source": [
117 | "To make our lives easier, we'll also install and import a set of helper functions from the `datathon2` package."
118 | ]
119 | },
120 | {
121 | "metadata": {
122 | "colab_type": "code",
123 | "id": "bkJUF8HBbvWe",
124 | "colab": {}
125 | },
126 | "cell_type": "code",
127 | "source": [
128 | "!pip install datathon2"
129 | ],
130 | "execution_count": 0,
131 | "outputs": []
132 | },
133 | {
134 | "metadata": {
135 | "colab_type": "code",
136 | "id": "dzPzFZykUIIx",
137 | "colab": {}
138 | },
139 | "cell_type": "code",
140 | "source": [
141 | "import datathon2 as dtn"
142 | ],
143 | "execution_count": 0,
144 | "outputs": []
145 | },
146 | {
147 | "metadata": {
148 | "colab_type": "text",
149 | "id": "iWDUCA5Nb5BK"
150 | },
151 | "cell_type": "markdown",
152 | "source": [
153 | "## Install and load the `tableone` package\n",
154 | "\n",
155 | "The tableone package can be used to compute summary statistics for a patient cohort. Unlike the previous packages, it is not installed by default in Colab, so will need to install it first.\n",
156 | "\n",
157 | "これまで使ったパッケージに加え、`tableone`もインストールします。"
158 | ]
159 | },
160 | {
161 | "metadata": {
162 | "colab_type": "code",
163 | "id": "F9doCgtscOJd",
164 | "colab": {}
165 | },
166 | "cell_type": "code",
167 | "source": [
168 | "!pip install tableone"
169 | ],
170 | "execution_count": 0,
171 | "outputs": []
172 | },
173 | {
174 | "metadata": {
175 | "colab_type": "code",
176 | "id": "SDI_Q7W0b4Le",
177 | "colab": {}
178 | },
179 | "cell_type": "code",
180 | "source": [
181 | "# Import the tableone class\n",
182 | "from tableone import TableOne"
183 | ],
184 | "execution_count": 0,
185 | "outputs": []
186 | },
187 | {
188 | "metadata": {
189 | "colab_type": "text",
190 | "id": "14TU4lcrdD7I"
191 | },
192 | "cell_type": "markdown",
193 | "source": [
194 | "## Load the patient cohort\n",
195 | "\n",
196 | "In this example, we will load data from the `admissions` table, taking the first hospital admission for each patient."
197 | ]
198 | },
199 | {
200 | "metadata": {
201 | "colab_type": "code",
202 | "id": "HF5WF5EObwfw",
203 | "colab": {}
204 | },
205 | "cell_type": "code",
206 | "source": [
207 | "# Link the patient and apachepatientresult tables on patientunitstayid\n",
208 | "# using an inner join.\n",
209 | "query = \"\"\"\n",
210 | "WITH tmp AS (\n",
211 | "SELECT a.subject_id, a.hadm_id, a.admission_type, a.admission_location, a.discharge_location,\n",
212 | " a.insurance, a.ethnicity, a.diagnosis, a.hospital_expire_flag,\n",
213 | " DENSE_RANK() OVER (PARTITION BY a.subject_id ORDER BY a.admittime) AS hospstay_seq,\n",
214 | " DATETIME_DIFF(a.dischtime, a.admittime, DAY) AS los_hospital_days,\n",
215 | " DATETIME_DIFF(a.edouttime, a.edregtime, HOUR) AS los_emergency_hrs\n",
216 | "FROM `physionet-data.mimiciii_demo.admissions` a)\n",
217 | "SELECT *\n",
218 | "FROM tmp\n",
219 | "WHERE hospstay_seq = 1;\n",
220 | "\"\"\"\n",
221 | "\n",
222 | "cohort = dtn.run_query(query,project_id)"
223 | ],
224 | "execution_count": 0,
225 | "outputs": []
226 | },
227 | {
228 | "metadata": {
229 | "colab_type": "code",
230 | "id": "k3hURHFihHNA",
231 | "colab": {}
232 | },
233 | "cell_type": "code",
234 | "source": [
235 | "cohort.head()"
236 | ],
237 | "execution_count": 0,
238 | "outputs": []
239 | },
240 | {
241 | "metadata": {
242 | "colab_type": "text",
243 | "id": "qnG8dVb2iHSn"
244 | },
245 | "cell_type": "markdown",
246 | "source": [
247 | "## Summary statistics"
248 | ]
249 | },
250 | {
251 | "metadata": {
252 | "colab_type": "code",
253 | "id": "FQT-u8EXhXRG",
254 | "colab": {}
255 | },
256 | "cell_type": "code",
257 | "source": [
258 | "columns = ['admission_type', 'admission_location', 'discharge_location', 'insurance',\n",
259 | " 'ethnicity','los_hospital_days','los_emergency_hrs']\n",
260 | "\n",
261 | "categorical = ['admission_type', 'admission_location', 'discharge_location', 'insurance',\n",
262 | " 'ethnicity']"
263 | ],
264 | "execution_count": 0,
265 | "outputs": []
266 | },
267 | {
268 | "metadata": {
269 | "colab_type": "code",
270 | "id": "3ETr3NCzielL",
271 | "colab": {}
272 | },
273 | "cell_type": "code",
274 | "source": [
275 | "TableOne(cohort, columns=columns, categorical = categorical, \n",
276 | " groupby='hospital_expire_flag',\n",
277 | " label_suffix=True, limit=4)"
278 | ],
279 | "execution_count": 0,
280 | "outputs": []
281 | },
282 | {
283 | "metadata": {
284 | "colab_type": "text",
285 | "id": "2_8z1CIVahWg"
286 | },
287 | "cell_type": "markdown",
288 | "source": [
289 | "## Visualizing the data\n",
290 | "\n",
291 | "Plotting the distribution of each variable by group level via histograms, kernel density estimates and boxplots is a crucial component to data analysis pipelines. Vizualisation is often is the only way to detect problematic variables in many real-life scenarios. We'll review a couple of the variables.\n",
292 | "\n",
293 | "データの分布を視覚化することは、データの問題点を把握するために非常に重要な方法です。以下にその例をみてみましょう。"
294 | ]
295 | },
296 | {
297 | "metadata": {
298 | "colab_type": "code",
299 | "id": "81yp2bSUigzh",
300 | "colab": {}
301 | },
302 | "cell_type": "code",
303 | "source": [
304 | "# Plot distributions to review possible multimodality\n",
305 | "cohort[['los_emergency_hrs','los_hospital_days']].dropna().plot.kde(figsize=[12,8])\n",
306 | "plt.legend(['ED time,Hours', 'Hospital LOS'])\n",
307 | "plt.xlim([-30,50])"
308 | ],
309 | "execution_count": 0,
310 | "outputs": []
311 | }
312 | ]
313 | }
--------------------------------------------------------------------------------
/eicu_python/03_summary_statistics.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "nbformat": 4,
3 | "nbformat_minor": 0,
4 | "metadata": {
5 | "colab": {
6 | "name": "03-summary-statistics",
7 | "version": "0.3.2",
8 | "provenance": [],
9 | "collapsed_sections": [],
10 | "include_colab_link": true
11 | },
12 | "kernelspec": {
13 | "name": "python3",
14 | "display_name": "Python 3"
15 | }
16 | },
17 | "cells": [
18 | {
19 | "cell_type": "markdown",
20 | "metadata": {
21 | "id": "view-in-github",
22 | "colab_type": "text"
23 | },
24 | "source": [
25 |         "Open in Colab"
26 | ]
27 | },
28 | {
29 | "metadata": {
30 | "id": "1G_TVh1ybQkl",
31 | "colab_type": "text"
32 | },
33 | "cell_type": "markdown",
34 | "source": [
35 | "# eICU Collaborative Research Database\n",
36 | "\n",
37 | "# Notebook 3: Summary statistics\n",
38 | "\n",
39 | "This notebook shows how summary statistics can be computed for a patient cohort using the `tableone` package. Usage instructions for tableone are at: https://pypi.org/project/tableone/\n"
40 | ]
41 | },
42 | {
43 | "metadata": {
44 | "id": "L9XF77F2bnee",
45 | "colab_type": "text"
46 | },
47 | "cell_type": "markdown",
48 | "source": [
49 | "## Load libraries and connect to the database"
50 | ]
51 | },
52 | {
53 | "metadata": {
54 | "id": "wXiSE558bn_w",
55 | "colab_type": "code",
56 | "colab": {}
57 | },
58 | "cell_type": "code",
59 | "source": [
60 | "# Import libraries\n",
61 | "import numpy as np\n",
62 | "import os\n",
63 | "import pandas as pd\n",
64 | "import matplotlib.pyplot as plt\n",
65 | "import matplotlib.patches as patches\n",
66 | "import matplotlib.path as path\n",
67 | "\n",
68 | "# Make pandas dataframes prettier\n",
69 | "from IPython.display import display, HTML\n",
70 | "\n",
71 | "# Access data using Google BigQuery.\n",
72 | "from google.colab import auth\n",
73 | "from google.cloud import bigquery"
74 | ],
75 | "execution_count": 0,
76 | "outputs": []
77 | },
78 | {
79 | "metadata": {
80 | "id": "pLGnLAy-bsKb",
81 | "colab_type": "code",
82 | "colab": {}
83 | },
84 | "cell_type": "code",
85 | "source": [
86 | "# authenticate\n",
87 | "auth.authenticate_user()"
88 | ],
89 | "execution_count": 0,
90 | "outputs": []
91 | },
92 | {
93 | "metadata": {
94 | "id": "PUjFDFdobszs",
95 | "colab_type": "code",
96 | "colab": {}
97 | },
98 | "cell_type": "code",
99 | "source": [
100 | "# Set up environment variables\n",
101 | "project_id='datathonjapan2019'\n",
102 | "os.environ[\"GOOGLE_CLOUD_PROJECT\"]=project_id"
103 | ],
104 | "execution_count": 0,
105 | "outputs": []
106 | },
107 | {
108 | "metadata": {
109 | "id": "ObJnK4WTUBKo",
110 | "colab_type": "text"
111 | },
112 | "cell_type": "markdown",
113 | "source": [
114 | "To make our lives easier, we'll also install and import a set of helper functions from the `datathon2` package."
115 | ]
116 | },
117 | {
118 | "metadata": {
119 | "id": "bkJUF8HBbvWe",
120 | "colab_type": "code",
121 | "colab": {}
122 | },
123 | "cell_type": "code",
124 | "source": [
125 | "!pip install datathon2"
126 | ],
127 | "execution_count": 0,
128 | "outputs": []
129 | },
130 | {
131 | "metadata": {
132 | "id": "dzPzFZykUIIx",
133 | "colab_type": "code",
134 | "colab": {}
135 | },
136 | "cell_type": "code",
137 | "source": [
138 | "import datathon2 as dtn"
139 | ],
140 | "execution_count": 0,
141 | "outputs": []
142 | },
143 | {
144 | "metadata": {
145 | "id": "iWDUCA5Nb5BK",
146 | "colab_type": "text"
147 | },
148 | "cell_type": "markdown",
149 | "source": [
150 | "## Install and load the `tableone` package\n",
151 | "\n",
152 | "The tableone package can be used to compute summary statistics for a patient cohort. Unlike the previous packages, it is not installed by default in Colab, so will need to install it first."
153 | ]
154 | },
155 | {
156 | "metadata": {
157 | "id": "F9doCgtscOJd",
158 | "colab_type": "code",
159 | "colab": {}
160 | },
161 | "cell_type": "code",
162 | "source": [
163 | "!pip install tableone"
164 | ],
165 | "execution_count": 0,
166 | "outputs": []
167 | },
168 | {
169 | "metadata": {
170 | "id": "SDI_Q7W0b4Le",
171 | "colab_type": "code",
172 | "colab": {}
173 | },
174 | "cell_type": "code",
175 | "source": [
176 | "# Import the tableone class\n",
177 | "from tableone import TableOne"
178 | ],
179 | "execution_count": 0,
180 | "outputs": []
181 | },
182 | {
183 | "metadata": {
184 | "id": "14TU4lcrdD7I",
185 | "colab_type": "text"
186 | },
187 | "cell_type": "markdown",
188 | "source": [
189 | "## Load the patient cohort\n",
190 | "\n",
191 | "In this example, we will load all data from the patient data, and link it to APACHE data to provide richer summary information."
192 | ]
193 | },
194 | {
195 | "metadata": {
196 | "id": "HF5WF5EObwfw",
197 | "colab_type": "code",
198 | "colab": {}
199 | },
200 | "cell_type": "code",
201 | "source": [
202 | "# Link the patient and apachepatientresult tables on patientunitstayid\n",
203 | "# using an inner join.\n",
204 | "query = \"\"\"\n",
205 | "SELECT p.unitadmitsource, p.gender, p.age, p.ethnicity, p.admissionweight, \n",
206 | " p.unittype, p.unitstaytype, a.acutephysiologyscore,\n",
207 | " a.apachescore, a.actualiculos, a.actualhospitalmortality,\n",
208 | " a.unabridgedunitlos, a.unabridgedhosplos\n",
209 | "FROM `physionet-data.eicu_crd_demo.patient` p\n",
210 | "INNER JOIN `physionet-data.eicu_crd_demo.apachepatientresult` a\n",
211 | "ON p.patientunitstayid = a.patientunitstayid\n",
212 | "WHERE apacheversion LIKE 'IVa'\n",
213 | "\"\"\"\n",
214 | "\n",
215 | "cohort = dtn.run_query(query,project_id)"
216 | ],
217 | "execution_count": 0,
218 | "outputs": []
219 | },
220 | {
221 | "metadata": {
222 | "id": "k3hURHFihHNA",
223 | "colab_type": "code",
224 | "colab": {}
225 | },
226 | "cell_type": "code",
227 | "source": [
228 | "cohort.head()"
229 | ],
230 | "execution_count": 0,
231 | "outputs": []
232 | },
233 | {
234 | "metadata": {
235 | "id": "qnG8dVb2iHSn",
236 | "colab_type": "text"
237 | },
238 | "cell_type": "markdown",
239 | "source": [
240 | "## Calculate summary statistics\n",
241 | "\n",
242 | "Before summarizing the data, we will need to convert the ages to numerical values."
243 | ]
244 | },
245 | {
246 | "metadata": {
247 | "id": "oKHpqwAPkx6U",
248 | "colab_type": "code",
249 | "colab": {}
250 | },
251 | "cell_type": "code",
252 | "source": [
253 | "cohort['agenum'] = pd.to_numeric(cohort['age'], errors='coerce')"
254 | ],
255 | "execution_count": 0,
256 | "outputs": []
257 | },
258 | {
259 | "metadata": {
260 | "id": "FQT-u8EXhXRG",
261 | "colab_type": "code",
262 | "colab": {}
263 | },
264 | "cell_type": "code",
265 | "source": [
266 | "columns = ['unitadmitsource', 'gender', 'agenum', 'ethnicity',\n",
267 | " 'admissionweight','unittype','unitstaytype',\n",
268 | " 'acutephysiologyscore','apachescore','actualiculos',\n",
269 | " 'unabridgedunitlos','unabridgedhosplos']"
270 | ],
271 | "execution_count": 0,
272 | "outputs": []
273 | },
274 | {
275 | "metadata": {
276 | "id": "3ETr3NCzielL",
277 | "colab_type": "code",
278 | "colab": {}
279 | },
280 | "cell_type": "code",
281 | "source": [
282 | "TableOne(cohort, columns=columns, labels={'agenum': 'age'}, \n",
283 | " groupby='actualhospitalmortality',\n",
284 | " label_suffix=True, limit=4)"
285 | ],
286 | "execution_count": 0,
287 | "outputs": []
288 | },
289 | {
290 | "metadata": {
291 | "id": "LCBcpJ9bZpDp",
292 | "colab_type": "text"
293 | },
294 | "cell_type": "markdown",
295 | "source": [
296 | "## Questions\n",
297 | "\n",
298 | "- Are the severity of illness measures higher in the survival or non-survival group?\n",
299 | "- What issues suggest that some of the summary statistics might be misleading?\n",
300 | "- How might you address these issues?"
301 | ]
302 | },
303 | {
304 | "metadata": {
305 | "id": "2_8z1CIVahWg",
306 | "colab_type": "text"
307 | },
308 | "cell_type": "markdown",
309 | "source": [
310 | "## Visualizing the data\n",
311 | "\n",
312 | "Plotting the distribution of each variable by group level via histograms, kernel density estimates and boxplots is a crucial component to data analysis pipelines. Vizualisation is often is the only way to detect problematic variables in many real-life scenarios. We'll review a couple of the variables."
313 | ]
314 | },
315 | {
316 | "metadata": {
317 | "id": "81yp2bSUigzh",
318 | "colab_type": "code",
319 | "colab": {}
320 | },
321 | "cell_type": "code",
322 | "source": [
323 | "# Plot distributions to review possible multimodality\n",
324 | "cohort[['acutephysiologyscore','agenum']].dropna().plot.kde(figsize=[12,8])\n",
325 | "plt.legend(['APS Score', 'Age (years)'])\n",
326 | "plt.xlim([-30,250])"
327 | ],
328 | "execution_count": 0,
329 | "outputs": []
330 | },
331 | {
332 | "metadata": {
333 | "id": "kZDUZB5sdhhU",
334 | "colab_type": "text"
335 | },
336 | "cell_type": "markdown",
337 | "source": [
338 | "## Questions\n",
339 | "\n",
340 | "- Do the plots change your view on how these variable should be reported?"
341 | ]
342 | }
343 | ]
344 | }
--------------------------------------------------------------------------------
/eicu_python/03_summary_statistics_satoshi.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {
6 | "colab_type": "text",
7 | "id": "view-in-github"
8 | },
9 | "source": [
10 |         "Open in Colab"
11 | ]
12 | },
13 | {
14 | "cell_type": "markdown",
15 | "metadata": {
16 | "colab_type": "text",
17 | "id": "1G_TVh1ybQkl"
18 | },
19 | "source": [
20 | "# eICU Collaborative Research Database\n",
21 | "\n",
22 | "# Notebook 3: Summary statistics\n",
23 | "\n",
24 | "This notebook shows how summary statistics can be computed for a patient cohort using the `tableone` package. Usage instructions for tableone are at: https://pypi.org/project/tableone/\n",
25 | "\n",
26 | "このノートブックでは、`tableone`というパッケージを用いて、データの分布などの詳細について見ていきます。\n"
27 | ]
28 | },
29 | {
30 | "cell_type": "markdown",
31 | "metadata": {
32 | "colab_type": "text",
33 | "id": "L9XF77F2bnee"
34 | },
35 | "source": [
36 | "## Load libraries and connect to the database"
37 | ]
38 | },
39 | {
40 | "cell_type": "code",
41 | "execution_count": 0,
42 | "metadata": {
43 | "colab": {},
44 | "colab_type": "code",
45 | "id": "wXiSE558bn_w"
46 | },
47 | "outputs": [],
48 | "source": [
49 | "# Import libraries\n",
50 | "import numpy as np\n",
51 | "import os\n",
52 | "import pandas as pd\n",
53 | "import matplotlib.pyplot as plt\n",
54 | "import matplotlib.patches as patches\n",
55 | "import matplotlib.path as path\n",
56 | "\n",
57 | "# Make pandas dataframes prettier\n",
58 | "from IPython.display import display, HTML\n",
59 | "\n",
60 | "# Access data using Google BigQuery.\n",
61 | "from google.colab import auth\n",
62 | "from google.cloud import bigquery"
63 | ]
64 | },
65 | {
66 | "cell_type": "code",
67 | "execution_count": 0,
68 | "metadata": {
69 | "colab": {},
70 | "colab_type": "code",
71 | "id": "pLGnLAy-bsKb"
72 | },
73 | "outputs": [],
74 | "source": [
75 | "# authenticate\n",
76 | "auth.authenticate_user()"
77 | ]
78 | },
79 | {
80 | "cell_type": "code",
81 | "execution_count": 0,
82 | "metadata": {
83 | "colab": {},
84 | "colab_type": "code",
85 | "id": "PUjFDFdobszs"
86 | },
87 | "outputs": [],
88 | "source": [
89 | "# Set up environment variables\n",
90 | "project_id='datathonjapan2019'\n",
91 | "os.environ[\"GOOGLE_CLOUD_PROJECT\"]=project_id"
92 | ]
93 | },
94 | {
95 | "cell_type": "markdown",
96 | "metadata": {
97 | "colab_type": "text",
98 | "id": "ObJnK4WTUBKo"
99 | },
100 | "source": [
101 | "To make our lives easier, we'll also install and import a set of helper functions from the `datathon2` package."
102 | ]
103 | },
104 | {
105 | "cell_type": "code",
106 | "execution_count": 0,
107 | "metadata": {
108 | "colab": {},
109 | "colab_type": "code",
110 | "id": "bkJUF8HBbvWe"
111 | },
112 | "outputs": [],
113 | "source": [
114 | "!pip install datathon2"
115 | ]
116 | },
117 | {
118 | "cell_type": "code",
119 | "execution_count": 0,
120 | "metadata": {
121 | "colab": {},
122 | "colab_type": "code",
123 | "id": "dzPzFZykUIIx"
124 | },
125 | "outputs": [],
126 | "source": [
127 | "import datathon2 as dtn"
128 | ]
129 | },
130 | {
131 | "cell_type": "markdown",
132 | "metadata": {
133 | "colab_type": "text",
134 | "id": "iWDUCA5Nb5BK"
135 | },
136 | "source": [
137 | "## Install and load the `tableone` package\n",
138 | "\n",
139 | "The tableone package can be used to compute summary statistics for a patient cohort. Unlike the previous packages, it is not installed by default in Colab, so will need to install it first.\n",
140 | "\n",
141 | "これまで使ったパッケージに加え、`tableone`もインストールします。"
142 | ]
143 | },
144 | {
145 | "cell_type": "code",
146 | "execution_count": 0,
147 | "metadata": {
148 | "colab": {},
149 | "colab_type": "code",
150 | "id": "F9doCgtscOJd"
151 | },
152 | "outputs": [],
153 | "source": [
154 | "!pip install tableone"
155 | ]
156 | },
157 | {
158 | "cell_type": "code",
159 | "execution_count": 0,
160 | "metadata": {
161 | "colab": {},
162 | "colab_type": "code",
163 | "id": "SDI_Q7W0b4Le"
164 | },
165 | "outputs": [],
166 | "source": [
167 | "# Import the tableone class\n",
168 | "from tableone import TableOne"
169 | ]
170 | },
171 | {
172 | "cell_type": "markdown",
173 | "metadata": {
174 | "colab_type": "text",
175 | "id": "14TU4lcrdD7I"
176 | },
177 | "source": [
178 | "## Load the patient cohort\n",
179 | "\n",
180 | "In this example, we will load all data from the patient data, and link it to APACHE data to provide richer summary information.\n",
181 | "\n",
182 | "`patient`テーブル(患者情報)と`apachepatientresult`テーブル(APACHEスコアに関する情報)を一つにまとめます。"
183 | ]
184 | },
185 | {
186 | "cell_type": "code",
187 | "execution_count": 2,
188 | "metadata": {
189 | "colab": {},
190 | "colab_type": "code",
191 | "id": "HF5WF5EObwfw"
192 | },
193 | "outputs": [
194 | {
195 | "ename": "NameError",
196 | "evalue": "name 'dtn' is not defined",
197 | "output_type": "error",
198 | "traceback": [
199 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
200 | "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
201 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 13\u001b[0m \"\"\"\n\u001b[1;32m 14\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 15\u001b[0;31m \u001b[0mcohort\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mdtn\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrun_query\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mquery\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0mproject_id\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
202 | "\u001b[0;31mNameError\u001b[0m: name 'dtn' is not defined"
203 | ]
204 | }
205 | ],
206 | "source": [
207 | "# Link the patient and apachepatientresult tables on patientunitstayid\n",
208 | "# using an inner join.\n",
209 | "query = \"\"\"\n",
210 | "SELECT p.unitadmitsource, p.gender, p.age, p.ethnicity, p.admissionweight, \n",
211 | " p.unittype, p.unitstaytype, a.acutephysiologyscore,\n",
212 | " a.apachescore, a.actualiculos, a.actualhospitalmortality,\n",
213 | " a.unabridgedunitlos, a.unabridgedhosplos\n",
214 | "FROM `physionet-data.eicu_crd_demo.patient` p\n",
215 | "INNER JOIN `physionet-data.eicu_crd_demo.apachepatientresult` a\n",
216 | "ON p.patientunitstayid = a.patientunitstayid\n",
217 | "WHERE apacheversion LIKE 'IVa'\n",
218 | "\"\"\"\n",
219 | "\n",
220 | "cohort = dtn.run_query(query,project_id)"
221 | ]
222 | },
223 | {
224 | "cell_type": "code",
225 | "execution_count": 0,
226 | "metadata": {
227 | "colab": {},
228 | "colab_type": "code",
229 | "id": "k3hURHFihHNA"
230 | },
231 | "outputs": [],
232 | "source": [
233 | "cohort.head()"
234 | ]
235 | },
236 | {
237 | "cell_type": "markdown",
238 | "metadata": {
239 | "colab_type": "text",
240 | "id": "qnG8dVb2iHSn"
241 | },
242 | "source": [
243 | "## Calculate summary statistics\n",
244 | "\n",
245 | "Before summarizing the data, we will need to convert the ages to numerical values.\n",
246 | "\n",
247 | "まず、`age`を数値に変えて`agenum`という列を作ります。"
248 | ]
249 | },
250 | {
251 | "cell_type": "code",
252 | "execution_count": 0,
253 | "metadata": {
254 | "colab": {},
255 | "colab_type": "code",
256 | "id": "oKHpqwAPkx6U"
257 | },
258 | "outputs": [],
259 | "source": [
260 | "cohort['agenum'] = pd.to_numeric(cohort['age'], errors='coerce')"
261 | ]
262 | },
263 | {
264 | "cell_type": "code",
265 | "execution_count": 0,
266 | "metadata": {
267 | "colab": {},
268 | "colab_type": "code",
269 | "id": "FQT-u8EXhXRG"
270 | },
271 | "outputs": [],
272 | "source": [
273 | "columns = ['unitadmitsource', 'gender', 'agenum', 'ethnicity',\n",
274 | " 'admissionweight','unittype','unitstaytype',\n",
275 | " 'acutephysiologyscore','apachescore','actualiculos',\n",
276 | " 'unabridgedunitlos','unabridgedhosplos']"
277 | ]
278 | },
279 | {
280 | "cell_type": "code",
281 | "execution_count": 0,
282 | "metadata": {
283 | "colab": {},
284 | "colab_type": "code",
285 | "id": "3ETr3NCzielL"
286 | },
287 | "outputs": [],
288 | "source": [
289 | "TableOne(cohort, columns=columns, labels={'agenum': 'age'}, \n",
290 | " groupby='actualhospitalmortality',\n",
291 | " label_suffix=True, limit=4)"
292 | ]
293 | },
294 | {
295 | "cell_type": "markdown",
296 | "metadata": {
297 | "colab_type": "text",
298 | "id": "LCBcpJ9bZpDp"
299 | },
300 | "source": [
301 | "## Questions\n",
302 | "\n",
303 | "- Are the severity of illness measures higher in the survival or non-survival group?(生存群 vs. 死亡群での重症度スコアの比較)\n",
304 | "- What issues suggest that some of the summary statistics might be misleading?(このサマリーで奇妙な点を指摘してください)\n",
305 | "- How might you address these issues?"
306 | ]
307 | },
308 | {
309 | "cell_type": "markdown",
310 | "metadata": {
311 | "colab_type": "text",
312 | "id": "2_8z1CIVahWg"
313 | },
314 | "source": [
315 | "## Visualizing the data\n",
316 | "\n",
317 | "Plotting the distribution of each variable by group level via histograms, kernel density estimates and boxplots is a crucial component to data analysis pipelines. Vizualisation is often is the only way to detect problematic variables in many real-life scenarios. We'll review a couple of the variables.\n",
318 | "\n",
319 | "データの分布を視覚化することは、データの問題点を把握するために非常に重要な方法です。以下にその例をみてみましょう。"
320 | ]
321 | },
322 | {
323 | "cell_type": "code",
324 | "execution_count": 0,
325 | "metadata": {
326 | "colab": {},
327 | "colab_type": "code",
328 | "id": "81yp2bSUigzh"
329 | },
330 | "outputs": [],
331 | "source": [
332 | "# Plot distributions to review possible multimodality\n",
333 | "cohort[['acutephysiologyscore','agenum']].dropna().plot.kde(figsize=[12,8])\n",
334 | "plt.legend(['APS Score', 'Age (years)'])\n",
335 | "plt.xlim([-30,250])"
336 | ]
337 | },
338 | {
339 | "cell_type": "markdown",
340 | "metadata": {
341 | "colab_type": "text",
342 | "id": "kZDUZB5sdhhU"
343 | },
344 | "source": [
345 | "## Questions\n",
346 | "\n",
347 | "- Do the plots change your view on how these variable should be reported?"
348 | ]
349 | }
350 | ],
351 | "metadata": {
352 | "colab": {
353 | "collapsed_sections": [],
354 | "include_colab_link": true,
355 | "name": "03-summary-statistics",
356 | "provenance": [],
357 | "version": "0.3.2"
358 | },
359 | "kernelspec": {
360 | "display_name": "Python 3",
361 | "language": "python",
362 | "name": "python3"
363 | },
364 | "language_info": {
365 | "codemirror_mode": {
366 | "name": "ipython",
367 | "version": 3
368 | },
369 | "file_extension": ".py",
370 | "mimetype": "text/x-python",
371 | "name": "python",
372 | "nbconvert_exporter": "python",
373 | "pygments_lexer": "ipython3",
374 | "version": "3.6.5"
375 | }
376 | },
377 | "nbformat": 4,
378 | "nbformat_minor": 1
379 | }
380 |
--------------------------------------------------------------------------------
/eicu_python/01_explore_patients.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "nbformat": 4,
3 | "nbformat_minor": 0,
4 | "metadata": {
5 | "colab": {
6 | "name": "01-explore-patient-table",
7 | "version": "0.3.2",
8 | "provenance": [],
9 | "collapsed_sections": [],
10 | "include_colab_link": true
11 | },
12 | "kernelspec": {
13 | "name": "python3",
14 | "display_name": "Python 3"
15 | }
16 | },
17 | "cells": [
18 | {
19 | "cell_type": "markdown",
20 | "metadata": {
21 | "id": "view-in-github",
22 | "colab_type": "text"
23 | },
24 | "source": [
25 |         "Open in Colab"
26 | ]
27 | },
28 | {
29 | "metadata": {
30 | "id": "NCI19_Ix7xuI",
31 | "colab_type": "text"
32 | },
33 | "cell_type": "markdown",
34 | "source": [
35 | "# eICU Collaborative Research Database\n",
36 | "\n",
37 | "# Notebook 1: Exploring the patient table\n",
38 | "\n",
39 | "In this notebook we introduce the patient table, a key table in the [eICU Collaborative Research Database](http://eicu-crd.mit.edu/). The patient table contains patient demographics and admission and discharge details for hospital and ICU stays. For more detail, see: http://eicu-crd.mit.edu/eicutables/patient/"
40 | ]
41 | },
42 | {
43 | "metadata": {
44 | "id": "l_CmlcBu8Wei",
45 | "colab_type": "text"
46 | },
47 | "cell_type": "markdown",
48 | "source": [
49 | "## Load libraries and connect to the data\n",
50 | "\n",
51 | "Run the following cells to import some libraries and then connect to the database."
52 | ]
53 | },
54 | {
55 | "metadata": {
56 | "id": "3WQsJiAj8B5L",
57 | "colab_type": "code",
58 | "colab": {}
59 | },
60 | "cell_type": "code",
61 | "source": [
62 | "# Import libraries\n",
63 | "import numpy as np\n",
64 | "import os\n",
65 | "import pandas as pd\n",
66 | "import matplotlib.pyplot as plt\n",
67 | "import matplotlib.patches as patches\n",
68 | "import matplotlib.path as path\n",
69 | "\n",
70 | "# Make pandas dataframes prettier\n",
71 | "from IPython.display import display, HTML\n",
72 | "\n",
73 | "# Access data using Google BigQuery.\n",
74 | "from google.colab import auth\n",
75 | "from google.cloud import bigquery"
76 | ],
77 | "execution_count": 0,
78 | "outputs": []
79 | },
80 | {
81 | "metadata": {
82 | "id": "Ld59KZ0W9E4v",
83 | "colab_type": "text"
84 | },
85 | "cell_type": "markdown",
86 | "source": [
87 | "As before, you need to first authenticate yourself by running the following cell. If you are running it for the first time, it will ask you to follow a link to log in using your Gmail account, and accept the data access requests to your profile. Once this is done, it will generate a string of verification code, which you should paste back to the cell below and press enter."
88 | ]
89 | },
90 | {
91 | "metadata": {
92 | "id": "ABh4hMt288yg",
93 | "colab_type": "code",
94 | "colab": {}
95 | },
96 | "cell_type": "code",
97 | "source": [
98 | "auth.authenticate_user()"
99 | ],
100 | "execution_count": 0,
101 | "outputs": []
102 | },
103 | {
104 | "metadata": {
105 | "id": "BPoHP2a8_eni",
106 | "colab_type": "text"
107 | },
108 | "cell_type": "markdown",
109 | "source": [
110 | "We'll also set the project details."
111 | ]
112 | },
113 | {
114 | "metadata": {
115 | "id": "P0fdtVMa_di9",
116 | "colab_type": "code",
117 | "colab": {}
118 | },
119 | "cell_type": "code",
120 | "source": [
121 | "project_id='datathonjapan2019'\n",
122 | "os.environ[\"GOOGLE_CLOUD_PROJECT\"]=project_id"
123 | ],
124 | "execution_count": 0,
125 | "outputs": []
126 | },
127 | {
128 | "metadata": {
129 | "id": "loOu79n7P-j9",
130 | "colab_type": "text"
131 | },
132 | "cell_type": "markdown",
133 | "source": [
134 | "To make our lives easier, finally we'll install and import a set of helper functions from the `datathon2` package."
135 | ]
136 | },
137 | {
138 | "metadata": {
139 | "id": "8TgC_2rDP-_R",
140 | "colab_type": "code",
141 | "colab": {}
142 | },
143 | "cell_type": "code",
144 | "source": [
145 | "!pip install datathon2"
146 | ],
147 | "execution_count": 0,
148 | "outputs": []
149 | },
150 | {
151 | "metadata": {
152 | "id": "a7IeYb4TQUMQ",
153 | "colab_type": "code",
154 | "colab": {}
155 | },
156 | "cell_type": "code",
157 | "source": [
158 | "import datathon2 as dtn"
159 | ],
160 | "execution_count": 0,
161 | "outputs": []
162 | },
163 | {
164 | "metadata": {
165 | "id": "5bHZALFP9VN1",
166 | "colab_type": "text"
167 | },
168 | "cell_type": "markdown",
169 | "source": [
170 | "# Load data from the `patient` table\n",
171 | "\n",
172 | "Now we can start exploring the data. We'll begin by running a simple query on the database to load all columns of the `patient` table to a Pandas DataFrame. The query is written in SQL, a common language for extracting data from databases. The structure of an SQL query is:\n",
173 | "\n",
174 | "```sql\n",
175 | "SELECT \n",
176 | "FROM \n",
177 | "WHERE \n",
178 | "```\n",
179 | "\n",
180 | "`*` is a wildcard that indicates all columns"
181 | ]
182 | },
183 | {
184 | "metadata": {
185 | "id": "RE-UZAPG_rHq",
186 | "colab_type": "code",
187 | "colab": {}
188 | },
189 | "cell_type": "code",
190 | "source": [
191 | "query = \"\"\"\n",
192 | "SELECT *\n",
193 | "FROM `physionet-data.eicu_crd_demo.patient`\n",
194 | "\"\"\"\n",
195 | "\n",
196 | "patient = dtn.run_query(query,project_id)"
197 | ],
198 | "execution_count": 0,
199 | "outputs": []
200 | },
201 | {
202 | "metadata": {
203 | "id": "YbnkcCZxBkdK",
204 | "colab_type": "text"
205 | },
206 | "cell_type": "markdown",
207 | "source": [
208 | "We have now assigned the output to our query to a variable called `patient`. Let's use the `head` method to view the first few rows of our data."
209 | ]
210 | },
211 | {
212 | "metadata": {
213 | "id": "GZph0FPDASEs",
214 | "colab_type": "code",
215 | "colab": {}
216 | },
217 | "cell_type": "code",
218 | "source": [
219 | "# view the top few rows of the patient data\n",
220 | "patient.head()"
221 | ],
222 | "execution_count": 0,
223 | "outputs": []
224 | },
225 | {
226 | "metadata": {
227 | "id": "TlxaXLevC_Rz",
228 | "colab_type": "text"
229 | },
230 | "cell_type": "markdown",
231 | "source": [
232 | "## Questions\n",
233 | "\n",
234 | "- What does `patientunitstayid` represent? (hint, see: http://eicu-crd.mit.edu/eicutables/patient/)\n",
235 | "- What does `patienthealthsystemstayid` represent?\n",
236 | "- What does `uniquepid` represent?"
237 | ]
238 | },
239 | {
240 | "metadata": {
241 | "id": "2rLY0WyCBzp9",
242 | "colab_type": "code",
243 | "colab": {}
244 | },
245 | "cell_type": "code",
246 | "source": [
247 | "# select a limited number of columns to view\n",
248 | "columns = ['uniquepid', 'patientunitstayid','gender','age','unitdischargestatus']\n",
249 | "patient[columns].head()"
250 | ],
251 | "execution_count": 0,
252 | "outputs": []
253 | },
254 | {
255 | "metadata": {
256 | "id": "FSdS2hS4EWtb",
257 | "colab_type": "text"
258 | },
259 | "cell_type": "markdown",
260 | "source": [
261 | "- Try running the following query, which lists unique values in the age column. What do you notice?"
262 | ]
263 | },
264 | {
265 | "metadata": {
266 | "id": "0Aom69ftDxBN",
267 | "colab_type": "code",
268 | "colab": {}
269 | },
270 | "cell_type": "code",
271 | "source": [
272 | "# what are the unique values for age?\n",
273 | "age_col = 'age'\n",
274 | "patient[age_col].sort_values().unique()"
275 | ],
276 | "execution_count": 0,
277 | "outputs": []
278 | },
279 | {
280 | "metadata": {
281 | "id": "Y_qJL94jE0k8",
282 | "colab_type": "text"
283 | },
284 | "cell_type": "markdown",
285 | "source": [
286 | "- Try plotting a histogram of ages using the command in the cell below. What happens? Why?"
287 | ]
288 | },
289 | {
290 | "metadata": {
291 | "id": "1zad3Gr4D4LE",
292 | "colab_type": "code",
293 | "colab": {}
294 | },
295 | "cell_type": "code",
296 | "source": [
297 | "# try plotting a histogram of ages\n",
298 | "patient[age_col].plot(kind='hist', bins=15)"
299 | ],
300 | "execution_count": 0,
301 | "outputs": []
302 | },
303 | {
304 | "metadata": {
305 | "id": "xIdwVEEPF25H",
306 | "colab_type": "text"
307 | },
308 | "cell_type": "markdown",
309 | "source": [
310 | "Let's create a new column named `age_num`, then try again."
311 | ]
312 | },
313 | {
314 | "metadata": {
315 | "id": "-rwc-28oFF6R",
316 | "colab_type": "code",
317 | "colab": {}
318 | },
319 | "cell_type": "code",
320 | "source": [
321 | "# create a column containing numerical ages\n",
322 | "# If ‘coerce’, then invalid parsing will be set as NaN\n",
323 | "agenum_col = 'age_num'\n",
324 | "patient[agenum_col] = pd.to_numeric(patient[age_col], errors='coerce')\n",
325 | "patient[agenum_col].sort_values().unique()"
326 | ],
327 | "execution_count": 0,
328 | "outputs": []
329 | },
330 | {
331 | "metadata": {
332 | "id": "uTFMqqWqFMjG",
333 | "colab_type": "code",
334 | "colab": {}
335 | },
336 | "cell_type": "code",
337 | "source": [
338 | "patient[agenum_col].plot(kind='hist', bins=15)"
339 | ],
340 | "execution_count": 0,
341 | "outputs": []
342 | },
343 | {
344 | "metadata": {
345 | "id": "FrbR8rV3GlR1",
346 | "colab_type": "text"
347 | },
348 | "cell_type": "markdown",
349 | "source": [
350 | "## Questions\n",
351 | "\n",
352 | "- Use the `mean()` method to find the average age. Why do we expect this to be lower than the true mean?\n",
353 | "- In the same way that you use `mean()`, you can use `describe()`, `max()`, and `min()`. Look at the admission heights (`admissionheight`) of patients in cm. What issue do you see? How can you deal with this issue?"
354 | ]
355 | },
356 | {
357 | "metadata": {
358 | "id": "TPps13DZG6Ac",
359 | "colab_type": "code",
360 | "colab": {}
361 | },
362 | "cell_type": "code",
363 | "source": [
364 | "adheight_col = 'admissionheight'\n",
365 | "patient[adheight_col].describe()"
366 | ],
367 | "execution_count": 0,
368 | "outputs": []
369 | },
370 | {
371 | "metadata": {
372 | "id": "9jhV9xQoGRJq",
373 | "colab_type": "code",
374 | "colab": {}
375 | },
376 | "cell_type": "code",
377 | "source": [
378 | "# set threshold\n",
379 | "adheight_col = 'admissionheight'\n",
380 | "patient[patient[adheight_col] < 10] = None"
381 | ],
382 | "execution_count": 0,
383 | "outputs": []
384 | }
385 | ]
386 | }
--------------------------------------------------------------------------------
/eicu_python/01_explore_patients_satoshi.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {
6 | "colab_type": "text",
7 | "id": "view-in-github"
8 | },
9 | "source": [
10 |         "Open in Colab"
11 | ]
12 | },
13 | {
14 | "cell_type": "markdown",
15 | "metadata": {
16 | "colab_type": "text",
17 | "id": "NCI19_Ix7xuI"
18 | },
19 | "source": [
20 | "# eICU Collaborative Research Database\n",
21 | "\n",
22 | "# Notebook 1: Exploring the patient table:表の作り方\n",
23 | "\n",
24 | "In this notebook we introduce the patient table, a key table in the [eICU Collaborative Research Database](http://eicu-crd.mit.edu/). The patient table contains patient demographics and admission and discharge details for hospital and ICU stays. For more detail, see: http://eicu-crd.mit.edu/eicutables/patient/\n",
25 | "\n",
26 | "患者背景、入室、退室、ICU滞在期間などの表の作り方を説明します。"
27 | ]
28 | },
29 | {
30 | "cell_type": "markdown",
31 | "metadata": {
32 | "colab_type": "text",
33 | "id": "l_CmlcBu8Wei"
34 | },
35 | "source": [
36 | "## Load libraries and connect to the data\n",
37 | "Run the following cells to import some libraries and then connect to the database.\n",
38 | "\n",
39 | "以下の操作により、必要なlibraryとデータベースへのアクセスが可能となります。"
40 | ]
41 | },
42 | {
43 | "cell_type": "code",
44 | "execution_count": 0,
45 | "metadata": {
46 | "colab": {},
47 | "colab_type": "code",
48 | "id": "3WQsJiAj8B5L"
49 | },
50 | "outputs": [],
51 | "source": [
52 | "# Import libraries\n",
53 | "import numpy as np\n",
54 | "import os\n",
55 | "import pandas as pd\n",
56 | "import matplotlib.pyplot as plt\n",
57 | "import matplotlib.patches as patches\n",
58 | "import matplotlib.path as path\n",
59 | "\n",
60 | "# Make pandas dataframes prettier\n",
61 | "from IPython.display import display, HTML\n",
62 | "\n",
63 | "# Access data using Google BigQuery.\n",
64 | "from google.colab import auth\n",
65 | "from google.cloud import bigquery"
66 | ]
67 | },
68 | {
69 | "cell_type": "markdown",
70 | "metadata": {
71 | "colab_type": "text",
72 | "id": "Ld59KZ0W9E4v"
73 | },
74 | "source": [
75 | "As before, you need to first authenticate yourself by running the following cell. If you are running it for the first time, it will ask you to follow a link to log in using your Gmail account, and accept the data access requests to your profile. Once this is done, it will generate a string of verification code, which you should paste back to the cell below and press enter.\n",
76 | "\n",
77 | "まず、下のセルを走らせて(`Run`をクリック)ください。提示されてリンクに従い、各自で\"verification code\"をコピー&ペーストしてください。各自のgmail accountからアクセスすることが可能になります。"
78 | ]
79 | },
80 | {
81 | "cell_type": "code",
82 | "execution_count": 0,
83 | "metadata": {
84 | "colab": {},
85 | "colab_type": "code",
86 | "id": "ABh4hMt288yg"
87 | },
88 | "outputs": [],
89 | "source": [
90 | "auth.authenticate_user()"
91 | ]
92 | },
93 | {
94 | "cell_type": "markdown",
95 | "metadata": {
96 | "colab_type": "text",
97 | "id": "BPoHP2a8_eni"
98 | },
99 | "source": [
100 | "We'll also set the project details."
101 | ]
102 | },
103 | {
104 | "cell_type": "code",
105 | "execution_count": 0,
106 | "metadata": {
107 | "colab": {},
108 | "colab_type": "code",
109 | "id": "P0fdtVMa_di9"
110 | },
111 | "outputs": [],
112 | "source": [
113 | "project_id='datathonjapan2019'\n",
114 | "os.environ[\"GOOGLE_CLOUD_PROJECT\"]=project_id"
115 | ]
116 | },
117 | {
118 | "cell_type": "markdown",
119 | "metadata": {
120 | "colab_type": "text",
121 | "id": "loOu79n7P-j9"
122 | },
123 | "source": [
124 | "To make our lives easier, finally we'll install and import a set of helper functions from the `datathon2` package."
125 | ]
126 | },
127 | {
128 | "cell_type": "code",
129 | "execution_count": 0,
130 | "metadata": {
131 | "colab": {},
132 | "colab_type": "code",
133 | "id": "8TgC_2rDP-_R"
134 | },
135 | "outputs": [],
136 | "source": [
137 | "!pip install datathon2"
138 | ]
139 | },
140 | {
141 | "cell_type": "code",
142 | "execution_count": 0,
143 | "metadata": {
144 | "colab": {},
145 | "colab_type": "code",
146 | "id": "a7IeYb4TQUMQ"
147 | },
148 | "outputs": [],
149 | "source": [
150 | "import datathon2 as dtn"
151 | ]
152 | },
153 | {
154 | "cell_type": "markdown",
155 | "metadata": {},
156 | "source": [
157 | "ここまでが準備段階です。以下、データの抽出とテーブル作成に移ります。"
158 | ]
159 | },
160 | {
161 | "cell_type": "markdown",
162 | "metadata": {
163 | "colab_type": "text",
164 | "id": "5bHZALFP9VN1"
165 | },
166 | "source": [
167 | "# Load data from the `patient` table\n",
168 | "\n",
169 | "Now we can start exploring the data. We'll begin by running a simple query on the database to load all columns of the `patient` table to a Pandas DataFrame. The query is written in SQL, a common language for extracting data from databases. The structure of an SQL query is:\n",
170 | "\n",
171 | "以下が、SQLを使ってデータを抽出(query)する基本的な構文です。`SELECT`で列を、`FROM`でテーブルを指定します。オプションで`WHERE`でより細かな条件を指定できます。\n",
172 | "\n",
173 | "```sql\n",
174 | "SELECT \n",
175 | "FROM \n",
176 | "WHERE \n",
177 | "```\n",
178 | "\n",
179 | "`*` is a wildcard that indicates all columns\n",
180 | "\n",
181 | "`*`を用いることで、全ての列を抽出できます。"
182 | ]
183 | },
184 | {
185 | "cell_type": "code",
186 | "execution_count": 0,
187 | "metadata": {
188 | "colab": {},
189 | "colab_type": "code",
190 | "id": "RE-UZAPG_rHq"
191 | },
192 | "outputs": [],
193 | "source": [
194 | "query = \"\"\"\n",
195 | "SELECT *\n",
196 | "FROM `physionet-data.eicu_crd_demo.patient`\n",
197 | "\"\"\"\n",
198 | "\n",
199 | "patient = dtn.run_query(query,project_id)"
200 | ]
201 | },
202 | {
203 | "cell_type": "markdown",
204 | "metadata": {
205 | "colab_type": "text",
206 | "id": "YbnkcCZxBkdK"
207 | },
208 | "source": [
209 | "We have now assigned the output to our query to a variable called `patient`. Let's use the `head` method to view the first few rows of our data."
210 | ]
211 | },
212 | {
213 | "cell_type": "code",
214 | "execution_count": 0,
215 | "metadata": {
216 | "colab": {},
217 | "colab_type": "code",
218 | "id": "GZph0FPDASEs"
219 | },
220 | "outputs": [],
221 | "source": [
222 | "# view the top few rows of the patient data\n",
223 | "patient.head()"
224 | ]
225 | },
226 | {
227 | "cell_type": "markdown",
228 | "metadata": {
229 | "colab_type": "text",
230 | "id": "TlxaXLevC_Rz"
231 | },
232 | "source": [
233 | "## Questions\n",
234 | "\n",
235 | "- What does `patientunitstayid` represent? (hint, see: http://eicu-crd.mit.edu/eicutables/patient/) \n",
236 | "\n",
237 | "(`patientunitstayid`列の意味は?)\n",
238 | "\n",
239 | "- What does `patienthealthsystemstayid` represent?\n",
240 | "\n",
241 | "(`patienthealthsystemstayid`列の意味は?)\n",
242 | "\n",
243 | "- What does `uniquepid` represent?\n",
244 | "\n",
245 | "(`uniquepid`列の意味は?)"
246 | ]
247 | },
248 | {
249 | "cell_type": "code",
250 | "execution_count": 0,
251 | "metadata": {
252 | "colab": {},
253 | "colab_type": "code",
254 | "id": "2rLY0WyCBzp9"
255 | },
256 | "outputs": [],
257 | "source": [
258 | "# select a limited number of columns to view\n",
259 | "columns = ['uniquepid', 'patientunitstayid','gender','age','unitdischargestatus']\n",
260 | "patient[columns].head()"
261 | ]
262 | },
263 | {
264 | "cell_type": "markdown",
265 | "metadata": {
266 | "colab_type": "text",
267 | "id": "FSdS2hS4EWtb"
268 | },
269 | "source": [
270 | "- Try running the following query, which lists unique values in the age column. What do you notice?\n",
271 | "\n",
272 | "- `age`の分布をヒストグラムで描いてみましょう。予想通りですか?"
273 | ]
274 | },
275 | {
276 | "cell_type": "code",
277 | "execution_count": 0,
278 | "metadata": {
279 | "colab": {},
280 | "colab_type": "code",
281 | "id": "0Aom69ftDxBN"
282 | },
283 | "outputs": [],
284 | "source": [
285 | "# what are the unique values for age?\n",
286 | "age_col = 'age'\n",
287 | "patient[age_col].sort_values().unique()"
288 | ]
289 | },
290 | {
291 | "cell_type": "markdown",
292 | "metadata": {
293 | "colab_type": "text",
294 | "id": "Y_qJL94jE0k8"
295 | },
296 | "source": [
297 | "- Try plotting a histogram of ages using the command in the cell below. What happens? Why?"
298 | ]
299 | },
300 | {
301 | "cell_type": "code",
302 | "execution_count": 0,
303 | "metadata": {
304 | "colab": {},
305 | "colab_type": "code",
306 | "id": "1zad3Gr4D4LE"
307 | },
308 | "outputs": [],
309 | "source": [
310 | "# try plotting a histogram of ages\n",
311 | "patient[age_col].plot(kind='hist', bins=15)"
312 | ]
313 | },
314 | {
315 | "cell_type": "markdown",
316 | "metadata": {
317 | "colab_type": "text",
318 | "id": "xIdwVEEPF25H"
319 | },
320 | "source": [
321 | "Let's create a new column named `age_num`, then try again."
322 | ]
323 | },
324 | {
325 | "cell_type": "code",
326 | "execution_count": 0,
327 | "metadata": {
328 | "colab": {},
329 | "colab_type": "code",
330 | "id": "-rwc-28oFF6R"
331 | },
332 | "outputs": [],
333 | "source": [
334 | "# create a column containing numerical ages\n",
335 | "# If ‘coerce’, then invalid parsing will be set as NaN\n",
336 | "agenum_col = 'age_num'\n",
337 | "patient[agenum_col] = pd.to_numeric(patient[age_col], errors='coerce')\n",
338 | "patient[agenum_col].sort_values().unique()"
339 | ]
340 | },
341 | {
342 | "cell_type": "code",
343 | "execution_count": 0,
344 | "metadata": {
345 | "colab": {},
346 | "colab_type": "code",
347 | "id": "uTFMqqWqFMjG"
348 | },
349 | "outputs": [],
350 | "source": [
351 | "patient[agenum_col].plot(kind='hist', bins=15)"
352 | ]
353 | },
354 | {
355 | "cell_type": "markdown",
356 | "metadata": {
357 | "colab_type": "text",
358 | "id": "FrbR8rV3GlR1"
359 | },
360 | "source": [
361 | "## Questions\n",
362 | "\n",
363 | "- Use the `mean()` method to find the average age. Why do we expect this to be lower than the true mean?\n",
364 | "\n",
365 | "(`mean`を用いて年齢の平均を出し、本当の平均値と比較してみましょう。)\n",
366 | "- In the same way that you use `mean()`, you can use `describe()`, `max()`, and `min()`. Look at the admission heights (`admissionheight`) of patients in cm. What issue do you see? How can you deal with this issue?\n",
367 | "\n",
368 | "(同様に`describe()`, `max()`, `min()`を使ってみて下さい。どこか問題がありませんか?)"
369 | ]
370 | },
371 | {
372 | "cell_type": "code",
373 | "execution_count": 0,
374 | "metadata": {
375 | "colab": {},
376 | "colab_type": "code",
377 | "id": "TPps13DZG6Ac"
378 | },
379 | "outputs": [],
380 | "source": [
381 | "adheight_col = 'admissionheight'\n",
382 | "patient[adheight_col].describe()"
383 | ]
384 | },
385 | {
386 | "cell_type": "code",
387 | "execution_count": 0,
388 | "metadata": {
389 | "colab": {},
390 | "colab_type": "code",
391 | "id": "9jhV9xQoGRJq"
392 | },
393 | "outputs": [],
394 | "source": [
395 | "# set threshold\n",
396 | "adheight_col = 'admissionheight'\n",
397 | "patient[patient[adheight_col] < 10] = None"
398 | ]
399 | }
400 | ],
401 | "metadata": {
402 | "colab": {
403 | "collapsed_sections": [],
404 | "include_colab_link": true,
405 | "name": "01-explore-patient-table",
406 | "provenance": [],
407 | "version": "0.3.2"
408 | },
409 | "kernelspec": {
410 | "display_name": "Python 3",
411 | "language": "python",
412 | "name": "python3"
413 | },
414 | "language_info": {
415 | "codemirror_mode": {
416 | "name": "ipython",
417 | "version": 3
418 | },
419 | "file_extension": ".py",
420 | "mimetype": "text/x-python",
421 | "name": "python",
422 | "nbconvert_exporter": "python",
423 | "pygments_lexer": "ipython3",
424 | "version": "3.6.5"
425 | }
426 | },
427 | "nbformat": 4,
428 | "nbformat_minor": 1
429 | }
430 |
--------------------------------------------------------------------------------
/eicu_python/02_severity_of_illness_satoshi.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {
6 | "colab_type": "text",
7 | "id": "view-in-github"
8 | },
9 | "source": [
10 | "
"
11 | ]
12 | },
13 | {
14 | "cell_type": "markdown",
15 | "metadata": {
16 | "colab_type": "text",
17 | "id": "y4AOVdliM8gm"
18 | },
19 | "source": [
20 | "# eICU Collaborative Research Database\n",
21 | "\n",
22 | "# Notebook 2: Severity of illness\n",
23 | "\n",
24 | "This notebook introduces high level admission details relating to a single patient stay, using the following tables:\n",
25 | "\n",
26 | "ここでは、以下のテーブルを用いて患者のより詳細な入院情報を抽出します。\n",
27 | "\n",
28 | "- patient\n",
29 | "- admissiondx\n",
30 | "- apacheapsvar\n",
31 | "- apachepredvar\n",
32 | "- apachepatientresult\n",
33 | "\n"
34 | ]
35 | },
36 | {
37 | "cell_type": "markdown",
38 | "metadata": {
39 | "colab_type": "text",
40 | "id": "e0lUnIkYOyv4"
41 | },
42 | "source": [
43 | "## Load libraries and connect to the database\n",
44 | "\n",
45 | "Notebook 1と同様、必要なlibraryとデータベースへのアクセスを行ってください。"
46 | ]
47 | },
48 | {
49 | "cell_type": "code",
50 | "execution_count": null,
51 | "metadata": {
52 | "colab": {},
53 | "colab_type": "code",
54 | "id": "SJ6l1i3fOL4j"
55 | },
56 | "outputs": [],
57 | "source": [
58 | "# Import libraries\n",
59 | "import numpy as np\n",
60 | "import os\n",
61 | "import pandas as pd\n",
62 | "import matplotlib.pyplot as plt\n",
63 | "import matplotlib.patches as patches\n",
64 | "import matplotlib.path as path\n",
65 | "\n",
66 | "# Make pandas dataframes prettier\n",
67 | "from IPython.display import display, HTML\n",
68 | "\n",
69 | "# Access data using Google BigQuery.\n",
70 | "from google.colab import auth\n",
71 | "from google.cloud import bigquery"
72 | ]
73 | },
74 | {
75 | "cell_type": "code",
76 | "execution_count": null,
77 | "metadata": {
78 | "colab": {},
79 | "colab_type": "code",
80 | "id": "TE4JYS8aO-69"
81 | },
82 | "outputs": [],
83 | "source": [
84 | "# authenticate\n",
85 | "auth.authenticate_user()"
86 | ]
87 | },
88 | {
89 | "cell_type": "code",
90 | "execution_count": null,
91 | "metadata": {
92 | "colab": {},
93 | "colab_type": "code",
94 | "id": "oVavf-ujPOAv"
95 | },
96 | "outputs": [],
97 | "source": [
98 | "# Set up environment variables\n",
99 | "project_id='datathonjapan2019'\n",
100 | "os.environ[\"GOOGLE_CLOUD_PROJECT\"]=project_id"
101 | ]
102 | },
103 | {
104 | "cell_type": "markdown",
105 | "metadata": {
106 | "colab_type": "text",
107 | "id": "OEuFlzpIT3rT"
108 | },
109 | "source": [
110 | "To make our lives easier, we'll also install and import a set of helper functions from the `datathon2` package.\n",
111 | "\n"
112 | ]
113 | },
114 | {
115 | "cell_type": "code",
116 | "execution_count": null,
117 | "metadata": {
118 | "colab": {},
119 | "colab_type": "code",
120 | "id": "aBc7PA0KSIFM"
121 | },
122 | "outputs": [],
123 | "source": [
124 | "!pip install datathon2"
125 | ]
126 | },
127 | {
128 | "cell_type": "code",
129 | "execution_count": null,
130 | "metadata": {
131 | "colab": {},
132 | "colab_type": "code",
133 | "id": "k2wcZtxVUHJy"
134 | },
135 | "outputs": [],
136 | "source": [
137 | "import datathon2 as dtn"
138 | ]
139 | },
140 | {
141 | "cell_type": "markdown",
142 | "metadata": {
143 | "colab_type": "text",
144 | "id": "a1CAI3GjQYE0"
145 | },
146 | "source": [
147 | "## Selecting a single patient stay¶\n",
148 | "\n",
149 | "As we have seen, the patient table includes general information about the patient admissions (for example, demographics, admission and discharge details). See: http://eicu-crd.mit.edu/eicutables/patient/\n",
150 | "\n",
151 | "まずは、`patient`テーブルからのqueryを行ってみます。\n",
152 | "\n",
153 | "## Questions\n",
154 | "\n",
155 | "Use your knowledge from the previous notebook and the online documentation (http://eicu-crd.mit.edu/) to answer the following questions:\n",
156 | "\n",
157 | "- Which column in the patient table is distinct for each stay in the ICU (similar to `icustay_id` in MIMIC-III)?\n",
158 | "- Which column is unique for each patient (similar to `subject_id` in MIMIC-III)?\n",
159 | "\n",
160 | "注:MIMIC-IIIでは、患者一人一人が異なるIDを持っていますが(`subject_id`)、同人物が複数回ICUに入室した場合それぞれ異なる`icustay_id`が割り振られます。"
161 | ]
162 | },
163 | {
164 | "cell_type": "code",
165 | "execution_count": null,
166 | "metadata": {
167 | "colab": {},
168 | "colab_type": "code",
169 | "id": "R6huFICkSQAd"
170 | },
171 | "outputs": [],
172 | "source": [
173 | "# view distinct ids\n",
174 | "query = \"\"\"\n",
175 | "SELECT DISTINCT(patientunitstayid)\n",
176 | "FROM `physionet-data.eicu_crd_demo.patient`\n",
177 | "\"\"\"\n",
178 | "\n",
179 | "dtn.run_query(query,project_id)"
180 | ]
181 | },
182 | {
183 | "cell_type": "code",
184 | "execution_count": null,
185 | "metadata": {
186 | "colab": {},
187 | "colab_type": "code",
188 | "id": "lfeQwFlvRly7"
189 | },
190 | "outputs": [],
191 | "source": [
192 | "# select a single ICU stay(適当に一つIDを選んでください)\n",
193 | "patientunitstayid = "
194 | ]
195 | },
196 | {
197 | "cell_type": "code",
198 | "execution_count": null,
199 | "metadata": {
200 | "colab": {},
201 | "colab_type": "code",
202 | "id": "yEBIFRBqRo4y"
203 | },
204 | "outputs": [],
205 | "source": [
206 | "# set the where clause to select the stay of interest\n",
207 | "query = \"\"\"\n",
208 | "SELECT *\n",
209 | "FROM `physionet-data.eicu_crd_demo.patient`\n",
210 | "WHERE patientunitstayid = {}\n",
211 | "\"\"\".format(patientunitstayid)\n",
212 | "\n",
213 | "patient = dtn.run_query(query,project_id)"
214 | ]
215 | },
216 | {
217 | "cell_type": "code",
218 | "execution_count": null,
219 | "metadata": {
220 | "colab": {},
221 | "colab_type": "code",
222 | "id": "LjIL2XR6TAyp"
223 | },
224 | "outputs": [],
225 | "source": [
226 | "patient"
227 | ]
228 | },
229 | {
230 | "cell_type": "markdown",
231 | "metadata": {
232 | "colab_type": "text",
233 | "id": "QSbKYqF0TQ1n"
234 | },
235 | "source": [
236 | "## Questions\n",
237 | "\n",
238 | "- Which type of unit was the patient admitted to? Hint: Try `patient['unittype']` or `patient.unittype`(どこへ入室したか)\n",
239 | "- What year was the patient discharged from the ICU? Hint: You can view the table columns with `patient.columns`(いつ退室したか)\n",
240 | "- What was the status of the patient upon discharge from the unit?(退室時の状態は?)"
241 | ]
242 | },
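243 | {
244 | "cell_type": "markdown",
245 | "metadata": {},
246 | "source": [
247 | "A possible answer sketch for the first and last questions; for the discharge year, scanning `patient.columns` for year-related fields is a reasonable starting point."
248 | ]
249 | },
250 | {
251 | "cell_type": "code",
252 | "execution_count": null,
253 | "metadata": {},
254 | "outputs": [],
255 | "source": [
256 | "# sketch: unit type and discharge status, then look for year columns\n",
257 | "print(patient[['unittype', 'unitdischargestatus']])\n",
258 | "patient.columns[patient.columns.str.contains('year', case=False)]"
259 | ]
260 | },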
243 | {
244 | "cell_type": "markdown",
245 | "metadata": {
246 | "colab_type": "text",
247 | "id": "izaH0XwwUxDD"
248 | },
249 | "source": [
250 | "## The admissiondx table\n",
251 | "\n",
252 | "The `admissiondx` table contains the primary diagnosis for admission to the ICU according to the APACHE scoring criteria. For more detail, see: http://eicu-crd.mit.edu/eicutables/admissiondx/\n",
253 | "\n",
254 | "`admissiondx`テーブルからは、ICU入室時の診断に関する情報が抽出できます。"
255 | ]
256 | },
257 | {
258 | "cell_type": "code",
259 | "execution_count": null,
260 | "metadata": {
261 | "colab": {},
262 | "colab_type": "code",
263 | "id": "dlj3UCDTTEjj"
264 | },
265 | "outputs": [],
266 | "source": [
267 | "# set the where clause to select the stay of interest\n",
268 | "# `WHERE`で条件の絞り込み。\n",
269 | "query = \"\"\"\n",
270 | "SELECT *\n",
271 | "FROM `physionet-data.eicu_crd_demo.admissiondx`\n",
272 | "WHERE patientunitstayid = {}\n",
273 | "\"\"\".format(patientunitstayid)\n",
274 | "\n",
275 | "admissiondx = dtn.run_query(query,project_id)"
276 | ]
277 | },
278 | {
279 | "cell_type": "code",
280 | "execution_count": null,
281 | "metadata": {
282 | "colab": {},
283 | "colab_type": "code",
284 | "id": "3wdEHFLJVMKm"
285 | },
286 | "outputs": [],
287 | "source": [
288 | "# View the columns in this data\n",
289 | "admissiondx.columns"
290 | ]
291 | },
292 | {
293 | "cell_type": "code",
294 | "execution_count": null,
295 | "metadata": {
296 | "colab": {},
297 | "colab_type": "code",
298 | "id": "tbOA44lAVNLr"
299 | },
300 | "outputs": [],
301 | "source": [
302 | "# View the data\n",
303 | "admissiondx.head()"
304 | ]
305 | },
306 | {
307 | "cell_type": "code",
308 | "execution_count": null,
309 | "metadata": {
310 | "colab": {},
311 | "colab_type": "code",
312 | "id": "Hc0y4ueOVWOk"
313 | },
314 | "outputs": [],
315 | "source": [
316 | "# Set the display options to avoid truncating the text\n",
317 | "# `admitdxpath`の表示方法を変えます。\n",
318 | "pd.set_option('display.max_colwidth', -1)\n",
319 | "admissiondx.admitdxpath"
320 | ]
321 | },
322 | {
323 | "cell_type": "markdown",
324 | "metadata": {
325 | "colab_type": "text",
326 | "id": "mSb_BrgvWDdD"
327 | },
328 | "source": [
329 | "## Questions\n",
330 | "\n",
331 | "- What was the primary reason for admission?(入室理由は?)\n",
332 | "- How soon after admission to the ICU was the diagnoses recorded in eCareManager? Hint: The `offset` columns indicate the time in minutes after admission to the ICU. (入室後いつ診断されたか?)"
333 | ]
334 | },
335 | {
336 | "cell_type": "markdown",
337 | "metadata": {
338 | "colab_type": "text",
339 | "id": "rd3Tw6_kWwlS"
340 | },
341 | "source": [
342 | "## The apacheapsvar table\n",
343 | "\n",
344 | "The apacheapsvar table contains the variables used to calculate the Acute Physiology Score (APS) III for patients. APS-III is an established method of summarizing patient severity of illness on admission to the ICU, taking the \"worst\" observations for a patient in a 24 hour period.\n",
345 | "\n",
346 | "次に、APS-III(重症度スコア)に関するデータを抽出するため、`apacheapsvar`テーブルを参照します。\n",
347 | "\n",
348 | "The score is part of the Acute Physiology Age Chronic Health Evaluation (APACHE) system of equations for predicting outcomes for ICU patients. See: http://eicu-crd.mit.edu/eicutables/apacheApsVar/"
349 | ]
350 | },
351 | {
352 | "cell_type": "code",
353 | "execution_count": null,
354 | "metadata": {
355 | "colab": {},
356 | "colab_type": "code",
357 | "id": "fXOzR5XWVdNa"
358 | },
359 | "outputs": [],
360 | "source": [
361 | "# set the where clause to select the stay of interest\n",
362 | "query = \"\"\"\n",
363 | "SELECT *\n",
364 | "FROM `physionet-data.eicu_crd_demo.apacheapsvar`\n",
365 | "WHERE patientunitstayid = {}\n",
366 | "\"\"\".format(patientunitstayid)\n",
367 | "\n",
368 | "apacheapsvar = dtn.run_query(query,project_id)"
369 | ]
370 | },
371 | {
372 | "cell_type": "code",
373 | "execution_count": null,
374 | "metadata": {
375 | "colab": {},
376 | "colab_type": "code",
377 | "id": "mL_lVORdXDIg"
378 | },
379 | "outputs": [],
380 | "source": [
381 | "apacheapsvar.head()"
382 | ]
383 | },
384 | {
385 | "cell_type": "markdown",
386 | "metadata": {
387 | "colab_type": "text",
388 | "id": "8x_Z8q4jXH7D"
389 | },
390 | "source": [
391 | "## Questions\n",
392 | "\n",
393 | "- What was the 'worst' heart rate recorded for the patient during the scoring period?(重症度スコアリング中の`最悪`の心拍数は?)\n",
394 | "- Was the patient oriented and able to converse normally on the day of admission? (hint: the verbal element refers to the Glasgow Coma Scale).(見当識障害は?)"
395 | ]
396 | },
397 | {
398 | "cell_type": "markdown",
399 | "metadata": {
400 | "colab_type": "text",
401 | "id": "XplJvhIYX432"
402 | },
403 | "source": [
404 | "# apachepredvar table\n",
405 | "\n",
406 | "The apachepredvar table provides variables underlying the APACHE predictions. Acute Physiology Age Chronic Health Evaluation (APACHE) consists of a groups of equations used for predicting outcomes in critically ill patients. See: http://eicu-crd.mit.edu/eicutables/apachePredVar/\n",
407 | "\n",
408 | "このテーブルは、APACHEスコアについての元情報も含んでいます。"
409 | ]
410 | },
411 | {
412 | "cell_type": "code",
413 | "execution_count": null,
414 | "metadata": {
415 | "colab": {},
416 | "colab_type": "code",
417 | "id": "iAIFESy9XFhC"
418 | },
419 | "outputs": [],
420 | "source": [
421 | "# set the where clause to select the stay of interest\n",
422 | "query = \"\"\"\n",
423 | "SELECT *\n",
424 | "FROM `physionet-data.eicu_crd_demo.apachepredvar`\n",
425 | "WHERE patientunitstayid = {}\n",
426 | "\"\"\".format(patientunitstayid)\n",
427 | "\n",
428 | "apachepredvar = dtn.run_query(query,project_id)"
429 | ]
430 | },
431 | {
432 | "cell_type": "code",
433 | "execution_count": null,
434 | "metadata": {
435 | "colab": {},
436 | "colab_type": "code",
437 | "id": "LAu7G72cYEY1"
438 | },
439 | "outputs": [],
440 | "source": [
441 | "apachepredvar.columns"
442 | ]
443 | },
444 | {
445 | "cell_type": "markdown",
446 | "metadata": {
447 | "colab_type": "text",
448 | "id": "IEaS6L9OY0vJ"
449 | },
450 | "source": [
451 | "## Questions\n",
452 | "\n",
453 | "- Was the patient ventilated during (APACHE) day 1 of their stay?(APACHEスコアリング1日目の機械換気?)\n",
454 | "- Is the patient recorded as having diabetes?(糖尿病の既往?)"
455 | ]
456 | },
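457 | {
458 | "cell_type": "markdown",
459 | "metadata": {},
460 | "source": [
461 | "A possible starting point: `diabetes` is a documented apachepredvar column, and a case-insensitive search can surface whichever ventilation-related columns the table carries."
462 | ]
463 | },
464 | {
465 | "cell_type": "code",
466 | "execution_count": null,
467 | "metadata": {},
468 | "outputs": [],
469 | "source": [
470 | "# sketch: find ventilation-related columns, then check diabetes\n",
471 | "vent_cols = apachepredvar.columns[apachepredvar.columns.str.contains('vent', case=False)]\n",
472 | "print(apachepredvar[vent_cols])\n",
473 | "print(apachepredvar['diabetes'])"
474 | ]
475 | },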
457 | {
458 | "cell_type": "markdown",
459 | "metadata": {
460 | "colab_type": "text",
461 | "id": "nrTEkjxqZD2l"
462 | },
463 | "source": [
464 | "# `apachepatientresult` table\n",
465 | "\n",
466 | "The `apachepatientresult` table provides predictions made by the APACHE score (versions IV and IVa), including probability of mortality, length of stay, and ventilation days. See: http://eicu-crd.mit.edu/eicutables/apachePatientResult/\n",
467 | "\n",
468 | "`apachepatientresult`テーブルでは、APACHEスコアを用いた死亡率やICU滞在日数、挿管日数に関する予測値をみることができます。"
469 | ]
470 | },
471 | {
472 | "cell_type": "code",
473 | "execution_count": null,
474 | "metadata": {
475 | "colab": {},
476 | "colab_type": "code",
477 | "id": "M2RCJNBgZOJ2"
478 | },
479 | "outputs": [],
480 | "source": [
481 | "# set the where clause to select the stay of interest\n",
482 | "query = \"\"\"\n",
483 | "SELECT *\n",
484 | "FROM `physionet-data.eicu_crd_demo.apachepatientresult`\n",
485 | "WHERE patientunitstayid = {}\n",
486 | "\"\"\".format(patientunitstayid)\n",
487 | "\n",
488 | "apachepatientresult = dtn.run_query(query,project_id)"
489 | ]
490 | },
491 | {
492 | "cell_type": "code",
493 | "execution_count": null,
494 | "metadata": {
495 | "colab": {},
496 | "colab_type": "code",
497 | "id": "4whVaOP1Za8f"
498 | },
499 | "outputs": [],
500 | "source": [
501 | "apachepatientresult"
502 | ]
503 | },
504 | {
505 | "cell_type": "markdown",
506 | "metadata": {
507 | "colab_type": "text",
508 | "id": "5YO_GQcNZUWR"
509 | },
510 | "source": [
511 | "## Questions\n",
512 | "\n",
513 | "- What versions of the APACHE score are computed?(APACHEスコアのバージョンは?)\n",
514 | "- How many days during the stay was the patient ventilated?(挿管日数は?)\n",
515 | "- How long was the patient predicted to stay in hospital?(病院滞在日数は?)\n",
516 | "- Was this prediction close to the truth?(正確性?)"
517 | ]
518 | },
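519 | {
520 | "cell_type": "markdown",
521 | "metadata": {},
522 | "source": [
523 | "A sketch of one approach: select the version, ventilation, and length-of-stay columns with a case-insensitive match, so we can compare each prediction against what actually happened."
524 | ]
525 | },
526 | {
527 | "cell_type": "code",
528 | "execution_count": null,
529 | "metadata": {},
530 | "outputs": [],
531 | "source": [
532 | "# sketch: predicted vs actual outcomes for this stay\n",
533 | "cols = apachepatientresult.columns.str.contains('version|ventdays|hospitallos', case=False)\n",
534 | "apachepatientresult.loc[:, cols]"
535 | ]
536 | },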
519 | {
520 | "cell_type": "code",
521 | "execution_count": null,
522 | "metadata": {},
523 | "outputs": [],
524 | "source": []
525 | }
526 | ],
527 | "metadata": {
528 | "colab": {
529 | "collapsed_sections": [],
530 | "include_colab_link": true,
531 | "name": "02-severity-of-illness",
532 | "provenance": [],
533 | "version": "0.3.2"
534 | },
535 | "kernelspec": {
536 | "display_name": "Python 3",
537 | "language": "python",
538 | "name": "python3"
539 | },
540 | "language_info": {
541 | "codemirror_mode": {
542 | "name": "ipython",
543 | "version": 3
544 | },
545 | "file_extension": ".py",
546 | "mimetype": "text/x-python",
547 | "name": "python",
548 | "nbconvert_exporter": "python",
549 | "pygments_lexer": "ipython3",
550 | "version": "3.6.5"
551 | }
552 | },
553 | "nbformat": 4,
554 | "nbformat_minor": 1
555 | }
556 |
--------------------------------------------------------------------------------
/eicu_python/02_severity_of_illness.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "nbformat": 4,
3 | "nbformat_minor": 0,
4 | "metadata": {
5 | "colab": {
6 | "name": "02-severity-of-illness",
7 | "version": "0.3.2",
8 | "provenance": [],
9 | "collapsed_sections": [],
10 | "include_colab_link": true
11 | },
12 | "kernelspec": {
13 | "name": "python3",
14 | "display_name": "Python 3"
15 | }
16 | },
17 | "cells": [
18 | {
19 | "cell_type": "markdown",
20 | "metadata": {
21 | "id": "view-in-github",
22 | "colab_type": "text"
23 | },
24 | "source": [
25 | "
"
26 | ]
27 | },
28 | {
29 | "metadata": {
30 | "id": "y4AOVdliM8gm",
31 | "colab_type": "text"
32 | },
33 | "cell_type": "markdown",
34 | "source": [
35 | "# eICU Collaborative Research Database\n",
36 | "\n",
37 | "# Notebook 2: Severity of illness\n",
38 | "\n",
39 | "This notebook introduces high level admission details relating to a single patient stay, using the following tables:\n",
40 | "\n",
41 | "- patient\n",
42 | "- admissiondx\n",
43 | "- apacheapsvar\n",
44 | "- apachepredvar\n",
45 | "- apachepatientresult\n",
46 | "\n"
47 | ]
48 | },
49 | {
50 | "metadata": {
51 | "id": "e0lUnIkYOyv4",
52 | "colab_type": "text"
53 | },
54 | "cell_type": "markdown",
55 | "source": [
56 | "## Load libraries and connect to the database"
57 | ]
58 | },
59 | {
60 | "metadata": {
61 | "id": "SJ6l1i3fOL4j",
62 | "colab_type": "code",
63 | "colab": {}
64 | },
65 | "cell_type": "code",
66 | "source": [
67 | "# Import libraries\n",
68 | "import numpy as np\n",
69 | "import os\n",
70 | "import pandas as pd\n",
71 | "import matplotlib.pyplot as plt\n",
72 | "import matplotlib.patches as patches\n",
73 | "import matplotlib.path as path\n",
74 | "\n",
75 | "# Make pandas dataframes prettier\n",
76 | "from IPython.display import display, HTML\n",
77 | "\n",
78 | "# Access data using Google BigQuery.\n",
79 | "from google.colab import auth\n",
80 | "from google.cloud import bigquery"
81 | ],
82 | "execution_count": 0,
83 | "outputs": []
84 | },
85 | {
86 | "metadata": {
87 | "id": "TE4JYS8aO-69",
88 | "colab_type": "code",
89 | "colab": {}
90 | },
91 | "cell_type": "code",
92 | "source": [
93 | "# authenticate\n",
94 | "auth.authenticate_user()"
95 | ],
96 | "execution_count": 0,
97 | "outputs": []
98 | },
99 | {
100 | "metadata": {
101 | "id": "oVavf-ujPOAv",
102 | "colab_type": "code",
103 | "colab": {}
104 | },
105 | "cell_type": "code",
106 | "source": [
107 | "# Set up environment variables\n",
108 | "project_id='datathonjapan2019'\n",
109 | "os.environ[\"GOOGLE_CLOUD_PROJECT\"]=project_id"
110 | ],
111 | "execution_count": 0,
112 | "outputs": []
113 | },
114 | {
115 | "metadata": {
116 | "id": "OEuFlzpIT3rT",
117 | "colab_type": "text"
118 | },
119 | "cell_type": "markdown",
120 | "source": [
121 | "To make our lives easier, we'll also install and import a set of helper functions from the `datathon2` package.\n",
122 | "\n"
123 | ]
124 | },
125 | {
126 | "metadata": {
127 | "id": "aBc7PA0KSIFM",
128 | "colab_type": "code",
129 | "colab": {}
130 | },
131 | "cell_type": "code",
132 | "source": [
133 | "!pip install datathon2"
134 | ],
135 | "execution_count": 0,
136 | "outputs": []
137 | },
138 | {
139 | "metadata": {
140 | "id": "k2wcZtxVUHJy",
141 | "colab_type": "code",
142 | "colab": {}
143 | },
144 | "cell_type": "code",
145 | "source": [
146 | "import datathon2 as dtn"
147 | ],
148 | "execution_count": 0,
149 | "outputs": []
150 | },
151 | {
152 | "metadata": {
153 | "id": "a1CAI3GjQYE0",
154 | "colab_type": "text"
155 | },
156 | "cell_type": "markdown",
157 | "source": [
158 | "## Selecting a single patient stay¶\n",
159 | "\n",
160 | "As we have seen, the patient table includes general information about the patient admissions (for example, demographics, admission and discharge details). See: http://eicu-crd.mit.edu/eicutables/patient/\n",
161 | "\n",
162 | "## Questions\n",
163 | "\n",
164 | "Use your knowledge from the previous notebook and the online documentation (http://eicu-crd.mit.edu/) to answer the following questions:\n",
165 | "\n",
166 | "- Which column in the patient table is distinct for each stay in the ICU (similar to `icustay_id` in MIMIC-III)?\n",
167 | "- Which column is unique for each patient (similar to `subject_id` in MIMIC-III)?"
168 | ]
169 | },
170 | {
171 | "metadata": {
172 | "id": "R6huFICkSQAd",
173 | "colab_type": "code",
174 | "colab": {}
175 | },
176 | "cell_type": "code",
177 | "source": [
178 | "# view distinct ids\n",
179 | "query = \"\"\"\n",
180 | "SELECT DISTINCT(patientunitstayid)\n",
181 | "FROM `physionet-data.eicu_crd_demo.patient`\n",
182 | "\"\"\"\n",
183 | "\n",
184 | "dtn.run_query(query,project_id)\n"
185 | ],
186 | "execution_count": 0,
187 | "outputs": []
188 | },
189 | {
190 | "metadata": {
191 | "id": "lfeQwFlvRly7",
192 | "colab_type": "code",
193 | "colab": {}
194 | },
195 | "cell_type": "code",
196 | "source": [
197 | "# select a single ICU stay\n",
198 | "patientunitstayid = "
199 | ],
200 | "execution_count": 0,
201 | "outputs": []
202 | },
203 | {
204 | "metadata": {
205 | "id": "yEBIFRBqRo4y",
206 | "colab_type": "code",
207 | "colab": {}
208 | },
209 | "cell_type": "code",
210 | "source": [
211 | "# set the where clause to select the stay of interest\n",
212 | "query = \"\"\"\n",
213 | "SELECT *\n",
214 | "FROM `physionet-data.eicu_crd_demo.patient`\n",
215 | "WHERE patientunitstayid = {}\n",
216 | "\"\"\".format(patientunitstayid)\n",
217 | "\n",
218 | "patient = dtn.run_query(query,project_id)"
219 | ],
220 | "execution_count": 0,
221 | "outputs": []
222 | },
223 | {
224 | "metadata": {
225 | "id": "LjIL2XR6TAyp",
226 | "colab_type": "code",
227 | "colab": {}
228 | },
229 | "cell_type": "code",
230 | "source": [
231 | "patient"
232 | ],
233 | "execution_count": 0,
234 | "outputs": []
235 | },
236 | {
237 | "metadata": {
238 | "id": "QSbKYqF0TQ1n",
239 | "colab_type": "text"
240 | },
241 | "cell_type": "markdown",
242 | "source": [
243 | "## Questions\n",
244 | "\n",
245 | "- Which type of unit was the patient admitted to? Hint: Try `patient['unittype']` or `patient.unittype`\n",
246 | "- What year was the patient discharged from the ICU? Hint: You can view the table columns with `patient.columns`\n",
247 | "- What was the status of the patient upon discharge from the unit?"
248 | ]
249 | },
250 | {
251 | "metadata": {
252 | "id": "izaH0XwwUxDD",
253 | "colab_type": "text"
254 | },
255 | "cell_type": "markdown",
256 | "source": [
257 | "## The admissiondx table\n",
258 | "\n",
259 | "The `admissiondx` table contains the primary diagnosis for admission to the ICU according to the APACHE scoring criteria. For more detail, see: http://eicu-crd.mit.edu/eicutables/admissiondx/"
260 | ]
261 | },
262 | {
263 | "metadata": {
264 | "id": "dlj3UCDTTEjj",
265 | "colab_type": "code",
266 | "colab": {}
267 | },
268 | "cell_type": "code",
269 | "source": [
270 | "# set the where clause to select the stay of interest\n",
271 | "query = \"\"\"\n",
272 | "SELECT *\n",
273 | "FROM `physionet-data.eicu_crd_demo.admissiondx`\n",
274 | "WHERE patientunitstayid = {}\n",
275 | "\"\"\".format(patientunitstayid)\n",
276 | "\n",
277 | "admissiondx = dtn.run_query(query,project_id)"
278 | ],
279 | "execution_count": 0,
280 | "outputs": []
281 | },
282 | {
283 | "metadata": {
284 | "id": "3wdEHFLJVMKm",
285 | "colab_type": "code",
286 | "colab": {}
287 | },
288 | "cell_type": "code",
289 | "source": [
290 | "# View the columns in this data\n",
291 | "admissiondx.columns"
292 | ],
293 | "execution_count": 0,
294 | "outputs": []
295 | },
296 | {
297 | "metadata": {
298 | "id": "tbOA44lAVNLr",
299 | "colab_type": "code",
300 | "colab": {}
301 | },
302 | "cell_type": "code",
303 | "source": [
304 | "# View the data\n",
305 | "admissiondx.head()"
306 | ],
307 | "execution_count": 0,
308 | "outputs": []
309 | },
310 | {
311 | "metadata": {
312 | "id": "Hc0y4ueOVWOk",
313 | "colab_type": "code",
314 | "colab": {}
315 | },
316 | "cell_type": "code",
317 | "source": [
318 | "# Set the display options to avoid truncating the text\n",
319 | "pd.set_option('display.max_colwidth', -1)\n",
320 | "admissiondx.admitdxpath"
321 | ],
322 | "execution_count": 0,
323 | "outputs": []
324 | },
325 | {
326 | "metadata": {
327 | "id": "mSb_BrgvWDdD",
328 | "colab_type": "text"
329 | },
330 | "cell_type": "markdown",
331 | "source": [
332 | "## Questions\n",
333 | "\n",
334 | "- What was the primary reason for admission?\n",
335 | "- How soon after admission to the ICU was the diagnoses recorded in eCareManager? Hint: The `offset` columns indicate the time in minutes after admission to the ICU. "
336 | ]
337 | },
338 | {
339 | "metadata": {
340 | "id": "rd3Tw6_kWwlS",
341 | "colab_type": "text"
342 | },
343 | "cell_type": "markdown",
344 | "source": [
345 | "## The apacheapsvar table\n",
346 | "\n",
347 | "The apacheapsvar table contains the variables used to calculate the Acute Physiology Score (APS) III for patients. APS-III is an established method of summarizing patient severity of illness on admission to the ICU, taking the \"worst\" observations for a patient in a 24 hour period.\n",
348 | "\n",
349 | "The score is part of the Acute Physiology Age Chronic Health Evaluation (APACHE) system of equations for predicting outcomes for ICU patients. See: http://eicu-crd.mit.edu/eicutables/apacheApsVar/"
350 | ]
351 | },
352 | {
353 | "metadata": {
354 | "id": "fXOzR5XWVdNa",
355 | "colab_type": "code",
356 | "colab": {}
357 | },
358 | "cell_type": "code",
359 | "source": [
360 | "# set the where clause to select the stay of interest\n",
361 | "query = \"\"\"\n",
362 | "SELECT *\n",
363 | "FROM `physionet-data.eicu_crd_demo.apacheapsvar`\n",
364 | "WHERE patientunitstayid = {}\n",
365 | "\"\"\".format(patientunitstayid)\n",
366 | "\n",
367 | "apacheapsvar = dtn.run_query(query,project_id)"
368 | ],
369 | "execution_count": 0,
370 | "outputs": []
371 | },
372 | {
373 | "metadata": {
374 | "id": "mL_lVORdXDIg",
375 | "colab_type": "code",
376 | "colab": {}
377 | },
378 | "cell_type": "code",
379 | "source": [
380 | "apacheapsvar.head()"
381 | ],
382 | "execution_count": 0,
383 | "outputs": []
384 | },
385 | {
386 | "metadata": {
387 | "id": "8x_Z8q4jXH7D",
388 | "colab_type": "text"
389 | },
390 | "cell_type": "markdown",
391 | "source": [
392 | "## Questions\n",
393 | "\n",
394 | "- What was the 'worst' heart rate recorded for the patient during the scoring period?\n",
395 | "- Was the patient oriented and able to converse normally on the day of admission? (hint: the verbal element refers to the Glasgow Coma Scale)."
396 | ]
397 | },
398 | {
399 | "metadata": {
400 | "id": "XplJvhIYX432",
401 | "colab_type": "text"
402 | },
403 | "cell_type": "markdown",
404 | "source": [
405 | "# apachepredvar table\n",
406 | "\n",
407 | "The apachepredvar table provides variables underlying the APACHE predictions. Acute Physiology Age Chronic Health Evaluation (APACHE) consists of a groups of equations used for predicting outcomes in critically ill patients. See: http://eicu-crd.mit.edu/eicutables/apachePredVar/"
408 | ]
409 | },
410 | {
411 | "metadata": {
412 | "id": "iAIFESy9XFhC",
413 | "colab_type": "code",
414 | "colab": {}
415 | },
416 | "cell_type": "code",
417 | "source": [
418 | "# set the where clause to select the stay of interest\n",
419 | "query = \"\"\"\n",
420 | "SELECT *\n",
421 | "FROM `physionet-data.eicu_crd_demo.apachepredvar`\n",
422 | "WHERE patientunitstayid = {}\n",
423 | "\"\"\".format(patientunitstayid)\n",
424 | "\n",
425 | "apachepredvar = dtn.run_query(query,project_id)"
426 | ],
427 | "execution_count": 0,
428 | "outputs": []
429 | },
430 | {
431 | "metadata": {
432 | "id": "LAu7G72cYEY1",
433 | "colab_type": "code",
434 | "colab": {}
435 | },
436 | "cell_type": "code",
437 | "source": [
438 | "apachepredvar.columns"
439 | ],
440 | "execution_count": 0,
441 | "outputs": []
442 | },
443 | {
444 | "metadata": {
445 | "id": "IEaS6L9OY0vJ",
446 | "colab_type": "text"
447 | },
448 | "cell_type": "markdown",
449 | "source": [
450 | "## Questions\n",
451 | "\n",
452 | "- Was the patient ventilated during (APACHE) day 1 of their stay?\n",
453 | "- Is the patient recorded as having diabetes?"
454 | ]
455 | },
456 | {
457 | "metadata": {
458 | "id": "nrTEkjxqZD2l",
459 | "colab_type": "text"
460 | },
461 | "cell_type": "markdown",
462 | "source": [
463 | "# `apachepatientresult` table\n",
464 | "\n",
465 | "The `apachepatientresult` table provides predictions made by the APACHE score (versions IV and IVa), including probability of mortality, length of stay, and ventilation days. See: http://eicu-crd.mit.edu/eicutables/apachePatientResult/"
466 | ]
467 | },
468 | {
469 | "metadata": {
470 | "id": "M2RCJNBgZOJ2",
471 | "colab_type": "code",
472 | "colab": {}
473 | },
474 | "cell_type": "code",
475 | "source": [
476 | "# set the where clause to select the stay of interest\n",
477 | "query = \"\"\"\n",
478 | "SELECT *\n",
479 | "FROM `physionet-data.eicu_crd_demo.apachepatientresult`\n",
480 | "WHERE patientunitstayid = {}\n",
481 | "\"\"\".format(patientunitstayid)\n",
482 | "\n",
483 | "apachepatientresult = dtn.run_query(query,project_id)"
484 | ],
485 | "execution_count": 0,
486 | "outputs": []
487 | },
488 | {
489 | "metadata": {
490 | "id": "4whVaOP1Za8f",
491 | "colab_type": "code",
492 | "colab": {}
493 | },
494 | "cell_type": "code",
495 | "source": [
496 | "apachepatientresult"
497 | ],
498 | "execution_count": 0,
499 | "outputs": []
500 | },
501 | {
502 | "metadata": {
503 | "id": "5YO_GQcNZUWR",
504 | "colab_type": "text"
505 | },
506 | "cell_type": "markdown",
507 | "source": [
508 | "## Questions\n",
509 | "\n",
510 | "- What versions of the APACHE score are computed?\n",
511 | "- How many days during the stay was the patient ventilated?\n",
512 | "- How long was the patient predicted to stay in hospital?\n",
513 | "- Was this prediction close to the truth?"
514 | ]
515 | }
516 | ]
517 | }
--------------------------------------------------------------------------------
/eicu_python/04_prediction.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "nbformat": 4,
3 | "nbformat_minor": 0,
4 | "metadata": {
5 | "colab": {
6 | "name": "04-prediction",
7 | "version": "0.3.2",
8 | "provenance": [],
9 | "collapsed_sections": [],
10 | "include_colab_link": true
11 | },
12 | "kernelspec": {
13 | "name": "python3",
14 | "display_name": "Python 3"
15 | }
16 | },
17 | "cells": [
18 | {
19 | "cell_type": "markdown",
20 | "metadata": {
21 | "id": "view-in-github",
22 | "colab_type": "text"
23 | },
24 | "source": [
25 | "
"
26 | ]
27 | },
28 | {
29 | "metadata": {
30 | "id": "T3wdKZCPklNq",
31 | "colab_type": "text"
32 | },
33 | "cell_type": "markdown",
34 | "source": [
35 | "# eICU Collaborative Research Database\n",
36 | "\n",
37 | "# Notebook 4: Prediction\n",
38 | "\n",
39 | "This notebook explores how a decision trees can be trained to predict in-hospital mortality of patients.\n"
40 | ]
41 | },
42 | {
43 | "metadata": {
44 | "id": "rG3HrM7GkwCH",
45 | "colab_type": "text"
46 | },
47 | "cell_type": "markdown",
48 | "source": [
49 | "## Load libraries and connect to the database"
50 | ]
51 | },
52 | {
53 | "metadata": {
54 | "id": "s-MoFA6NkkbZ",
55 | "colab_type": "code",
56 | "colab": {}
57 | },
58 | "cell_type": "code",
59 | "source": [
60 | "# Import libraries\n",
61 | "import os\n",
62 | "import numpy as np\n",
63 | "import pandas as pd\n",
64 | "import matplotlib.pyplot as plt\n",
65 | "\n",
66 | "# model building\n",
67 | "from sklearn import ensemble, impute, metrics, preprocessing, tree\n",
68 | "from sklearn.model_selection import cross_val_score, train_test_split\n",
69 | "from sklearn.pipeline import Pipeline\n",
70 | "\n",
71 | "# Make pandas dataframes prettier\n",
72 | "from IPython.display import display, HTML, Image\n",
73 | "plt.rcParams.update({'font.size': 20})\n",
74 | "%matplotlib inline\n",
75 | "plt.style.use('ggplot')\n",
76 | "\n",
77 | "# Access data using Google BigQuery.\n",
78 | "from google.colab import auth\n",
79 | "from google.cloud import bigquery"
80 | ],
81 | "execution_count": 0,
82 | "outputs": []
83 | },
84 | {
85 | "metadata": {
86 | "id": "jyBV_Q9DkyD3",
87 | "colab_type": "code",
88 | "colab": {}
89 | },
90 | "cell_type": "code",
91 | "source": [
92 | "# authenticate\n",
93 | "auth.authenticate_user()"
94 | ],
95 | "execution_count": 0,
96 | "outputs": []
97 | },
98 | {
99 | "metadata": {
100 | "id": "cF1udJKhkzYq",
101 | "colab_type": "code",
102 | "colab": {}
103 | },
104 | "cell_type": "code",
105 | "source": [
106 | "# Set up environment variables\n",
107 | "project_id='datathonjapan2019'\n",
108 | "os.environ[\"GOOGLE_CLOUD_PROJECT\"]=project_id"
109 | ],
110 | "execution_count": 0,
111 | "outputs": []
112 | },
113 | {
114 | "metadata": {
115 | "id": "xGurBAQIUDTt",
116 | "colab_type": "text"
117 | },
118 | "cell_type": "markdown",
119 | "source": [
120 | "To make our lives easier, we'll also install and import a set of helper functions from the `datathon2` package. We will be using the following functions from the package:\n",
121 | "- `plot_model_pred_2d`: to visualize our data, helping to display a class split assigned by a tree vs the true class.\n",
122 | "- `run_query()`: to run an SQL query against our BigQuery database and assign the results to a dataframe. \n"
123 | ]
124 | },
125 | {
126 | "metadata": {
127 | "id": "GDEewAlvk0oT",
128 | "colab_type": "code",
129 | "colab": {}
130 | },
131 | "cell_type": "code",
132 | "source": [
133 | "!pip install datathon2"
134 | ],
135 | "execution_count": 0,
136 | "outputs": []
137 | },
138 | {
139 | "metadata": {
140 | "id": "JM6O5GPAUI89",
141 | "colab_type": "code",
142 | "colab": {}
143 | },
144 | "cell_type": "code",
145 | "source": [
146 | "import datathon2 as dtn\n",
147 | "import pydotplus\n",
148 | "from tableone import TableOne"
149 | ],
150 | "execution_count": 0,
151 | "outputs": []
152 | },
153 | {
154 | "metadata": {
155 | "id": "hq_09Hh-y17k",
156 | "colab_type": "text"
157 | },
158 | "cell_type": "markdown",
159 | "source": [
160 | "In this notebook we'll be looking at tree models, so we'll now install a package for visualizing these models."
161 | ]
162 | },
163 | {
164 | "metadata": {
165 | "id": "jBMOwgwszGOw",
166 | "colab_type": "code",
167 | "colab": {}
168 | },
169 | "cell_type": "code",
170 | "source": [
171 | "!apt-get install graphviz -y"
172 | ],
173 | "execution_count": 0,
174 | "outputs": []
175 | },
176 | {
177 | "metadata": {
178 | "id": "LgcRCqxCk3HC",
179 | "colab_type": "text"
180 | },
181 | "cell_type": "markdown",
182 | "source": [
183 | "## Load the patient cohort\n",
184 | "\n",
185 | "Let's extract a cohort of patients admitted to the ICU from the emergency department. We link demographics data from the `patient` table to severity of illness score data in the `apachepatientresult` table. We exclude readmissions and neurological patients to help create a population suitable for our demonstration."
186 | ]
187 | },
188 | {
189 | "metadata": {
190 | "id": "ReCl7-aek1-k",
191 | "colab_type": "code",
192 | "colab": {}
193 | },
194 | "cell_type": "code",
195 | "source": [
196 | "# Link the patient, apachepatientresult, and apacheapsvar tables on patientunitstayid\n",
197 | "# using an inner join.\n",
198 | "query = \"\"\"\n",
199 | "SELECT p.unitadmitsource, p.gender, p.age, p.unittype, p.unitstaytype, \n",
200 | " a.actualhospitalmortality, a.acutePhysiologyScore, a.apacheScore\n",
201 | "FROM `physionet-data.eicu_crd_demo.patient` p\n",
202 | "INNER JOIN `physionet-data.eicu_crd_demo.apachepatientresult` a\n",
203 | "ON p.patientunitstayid = a.patientunitstayid\n",
204 | "WHERE a.apacheversion LIKE 'IVa'\n",
205 | "AND LOWER(p.unitadmitsource) LIKE \"%emergency%\"\n",
206 | "AND LOWER(p.unitstaytype) LIKE \"admit%\"\n",
207 | "AND LOWER(p.unittype) NOT LIKE \"%neuro%\";\n",
208 | "\"\"\"\n",
209 | "\n",
210 | "cohort = dtn.run_query(query,project_id)"
211 | ],
212 | "execution_count": 0,
213 | "outputs": []
214 | },
215 | {
216 | "metadata": {
217 | "id": "yxLctVBpk9sO",
218 | "colab_type": "code",
219 | "colab": {}
220 | },
221 | "cell_type": "code",
222 | "source": [
223 | "cohort.head()"
224 | ],
225 | "execution_count": 0,
226 | "outputs": []
227 | },
228 | {
229 | "metadata": {
230 | "id": "NPlwRV2buYb1",
231 | "colab_type": "text"
232 | },
233 | "cell_type": "markdown",
234 | "source": [
235 | "## Preparing the data for analysis\n",
236 | "\n",
237 | "Before continuing, we want to review our data, paying attention to factors such as:\n",
238 | "- data types (for example, are values recorded as characters or numerical values?) \n",
239 | "- missing data\n",
240 | "- distribution of values"
241 | ]
242 | },
243 | {
244 | "metadata": {
245 | "id": "v3OJ4LDvueKu",
246 | "colab_type": "code",
247 | "colab": {}
248 | },
249 | "cell_type": "code",
250 | "source": [
251 | "# dataset info\n",
252 | "print(cohort.info())"
253 | ],
254 | "execution_count": 0,
255 | "outputs": []
256 | },
257 | {
258 | "metadata": {
259 | "id": "s4wQ6o_RvLph",
260 | "colab_type": "code",
261 | "colab": {}
262 | },
263 | "cell_type": "code",
264 | "source": [
265 | "# Encode the categorical data\n",
266 | "encoder = preprocessing.LabelEncoder()\n",
267 | "cohort['gender_code'] = encoder.fit_transform(cohort['gender'])\n",
268 | "cohort['actualhospitalmortality_code'] = encoder.fit_transform(cohort['actualhospitalmortality'])\n"
269 | ],
270 | "execution_count": 0,
271 | "outputs": []
272 | },
273 | {
274 | "metadata": {
275 | "id": "_1LYcNUdjQA5",
276 | "colab_type": "text"
277 | },
278 | "cell_type": "markdown",
279 | "source": [
280 | "In the eICU Collaborative Research Database, ages >89 years have been removed to comply with data sharing regulations. We will need to decide how to handle these ages. For simplicity, we will assign an age of 91.5 years to these patients."
281 | ]
282 | },
283 | {
284 | "metadata": {
285 | "id": "4ogi_ns-ylnP",
286 | "colab_type": "code",
287 | "colab": {}
288 | },
289 | "cell_type": "code",
290 | "source": [
291 | "# Handle the deidentified ages\n",
292 | "cohort['age'] = pd.to_numeric(cohort['age'], downcast='integer', errors='coerce')\n",
293 | "cohort['age'] = cohort['age'].fillna(value=91.5)"
294 | ],
295 | "execution_count": 0,
296 | "outputs": []
297 | },
298 | {
299 | "metadata": {
300 | "id": "77M0QJQ5wcPQ",
301 | "colab_type": "code",
302 | "colab": {}
303 | },
304 | "cell_type": "code",
305 | "source": [
306 | "# Preview the encoded data\n",
307 | "cohort[['gender','gender_code']].head()"
308 | ],
309 | "execution_count": 0,
310 | "outputs": []
311 | },
312 | {
313 | "metadata": {
314 | "id": "GqvwTNPN3KZz",
315 | "colab_type": "code",
316 | "colab": {}
317 | },
318 | "cell_type": "code",
319 | "source": [
320 | "# Check the outcome variable\n",
321 | "cohort['actualhospitalmortality_code'].unique()"
322 | ],
323 | "execution_count": 0,
324 | "outputs": []
325 | },
326 | {
327 | "metadata": {
328 | "id": "OdGX1qWdkTgY",
329 | "colab_type": "text"
330 | },
331 | "cell_type": "markdown",
332 | "source": [
333 | "Now let's use the [tableone package](https://doi.org/10.1093/jamiaopen/ooy012\n",
334 | ") to review our dataset."
335 | ]
336 | },
337 | {
338 | "metadata": {
339 | "id": "gIIsthy1WK3i",
340 | "colab_type": "code",
341 | "colab": {}
342 | },
343 | "cell_type": "code",
344 | "source": [
345 | "# View summary statistics\n",
346 | "pd.set_option('display.height', 500)\n",
347 | "pd.set_option('display.max_rows', 500)\n",
348 | "TableOne(cohort,groupby='actualhospitalmortality')"
349 | ],
350 | "execution_count": 0,
351 | "outputs": []
352 | },
353 | {
354 | "metadata": {
355 | "id": "IGtKlTG1gvRf",
356 | "colab_type": "text"
357 | },
358 | "cell_type": "markdown",
359 | "source": [
360 | "From these summary statistics, we can see that the average age is higher in the group of patients who do not survive. What other differences do you see?"
361 | ]
362 | },
363 | {
364 | "metadata": {
365 | "id": "ze7y5J4Ioz8u",
366 | "colab_type": "text"
367 | },
368 | "cell_type": "markdown",
369 | "source": [
370 | "## Creating our train and test sets\n",
371 | "\n",
372 | "We only focus on two variables for our analysis, age and acute physiology score. Limiting ourselves to two variables will make it easier to visualize our models."
373 | ]
374 | },
375 | {
376 | "metadata": {
377 | "id": "i5zXkn_AlDJW",
378 | "colab_type": "code",
379 | "colab": {}
380 | },
381 | "cell_type": "code",
382 | "source": [
383 | "features = ['age','acutePhysiologyScore']\n",
384 | "outcome = 'actualhospitalmortality_code'\n",
385 | "\n",
386 | "X = cohort[features]\n",
387 | "y = cohort[outcome]"
388 | ],
389 | "execution_count": 0,
390 | "outputs": []
391 | },
392 | {
393 | "metadata": {
394 | "id": "IHhIgDUwocmA",
395 | "colab_type": "code",
396 | "colab": {}
397 | },
398 | "cell_type": "code",
399 | "source": [
400 | "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)"
401 | ],
402 | "execution_count": 0,
403 | "outputs": []
404 | },
405 | {
406 | "metadata": {
407 | "id": "NvQWkuY6nkZ8",
408 | "colab_type": "code",
409 | "colab": {}
410 | },
411 | "cell_type": "code",
412 | "source": [
413 | "# Review the number of cases in each set\n",
414 | "print(\"Train data: {}\".format(len(X_train)))\n",
415 | "print(\"Test data: {}\".format(len(X_test)))"
416 | ],
417 | "execution_count": 0,
418 | "outputs": []
419 | },
420 | {
421 | "metadata": {
422 | "id": "b2waK5qBqanC",
423 | "colab_type": "text"
424 | },
425 | "cell_type": "markdown",
426 | "source": [
427 | "## Decision trees\n",
428 | "\n",
429 | "Let's build the simplest tree model we can think of: a classification tree with only one split. Decision trees of this form are commonly referred to under the umbrella term Classification and Regression Trees (CART) [1]. \n",
430 | "\n",
431 | "While we will only be looking at classification here, regression isn't too different. After grouping the data (which is essentially what a decision tree does), classification involves assigning all members of the group to the majority class of that group during training. Regression is the same, except you would assign the average value, not the majority. \n",
432 | "\n",
433 | "In the case of a decision tree with one split, often called a \"stump\", the model will partition the data into two groups, and assign classes for those two groups based on majority vote. There are many parameters available for the DecisionTreeClassifier class; by specifying max_depth=1 we will build a decision tree with only one split - i.e. of depth 1.\n",
434 | "\n",
435 | "[1] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth, Belmont, CA, 1984."
436 | ]
437 | },
438 | {
439 | "metadata": {
440 | "id": "RlG3N3OYBqAm",
441 | "colab_type": "code",
442 | "colab": {}
443 | },
444 | "cell_type": "code",
445 | "source": [
446 | "# specify max_depth=1 so we train a stump, i.e. a tree with only 1 split\n",
447 | "mdl = tree.DecisionTreeClassifier(max_depth=1)\n",
448 | "\n",
449 | "# fit the model to the data - trying to predict y from X\n",
450 | "mdl = mdl.fit(X_train,y_train)"
451 | ],
452 | "execution_count": 0,
453 | "outputs": []
454 | },
455 | {
456 | "metadata": {
457 | "id": "8RlioUw8B_0O",
458 | "colab_type": "text"
459 | },
460 | "cell_type": "markdown",
461 | "source": [
462 | "Our model is so simple that we can look at the full decision tree."
463 | ]
464 | },
465 | {
466 | "metadata": {
467 | "id": "G2t9Nz8pBqEb",
468 | "colab_type": "code",
469 | "colab": {}
470 | },
471 | "cell_type": "code",
472 | "source": [
473 | "graph = dtn.create_graph(mdl,feature_names=features)\n",
474 | "Image(graph.create_png())"
475 | ],
476 | "execution_count": 0,
477 | "outputs": []
478 | },
479 | {
480 | "metadata": {
481 | "id": "E-iPwWWKCGY9",
482 | "colab_type": "text"
483 | },
484 | "cell_type": "markdown",
485 | "source": [
486 | "Here we see three nodes: a node at the top, a node in the lower left, and a node in the lower right.\n",
487 | "\n",
488 | "The top node is the root of the tree: it contains all the data. Let's read this node bottom to top:\n",
489 | "- `value = [384, 44]`: Current class balance. There are 384 observations of class 0 and 44 observations of class 1.\n",
490 | "- `samples = 428`: Number of samples assessed at this node.\n",
491 | "- `gini = 0.184`: Gini impurity, a measure of \"impurity\". The higher the value, the bigger the mix of classes. A 50/50 split of two classes would result in an index of 0.5.\n",
492 | "- `acutePhysiologyScore <=78.5`: Decision rule learned by the node. In this case, patients with a score of <= 78.5 are moved into the left node and >78.5 to the right. "
493 | ]
494 | },
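495 | {
496 | "metadata": {},
497 | "cell_type": "markdown",
498 | "source": [
499 | "As a quick sanity check, we can reproduce the gini impurity quoted at the root node from its class counts (a sketch; the counts are read off the tree above)."
500 | ]
501 | },
502 | {
503 | "metadata": {},
504 | "cell_type": "code",
505 | "source": [
506 | "# gini impurity at the root: 1 - sum of squared class proportions\n",
507 | "p = np.array([384, 44]) / 428\n",
508 | "print(round(1 - np.sum(p**2), 3))  # 0.184"
509 | ],
510 | "execution_count": 0,
511 | "outputs": []
512 | },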
495 | {
496 | "metadata": {
497 | "id": "KS0UcZqUeJKz",
498 | "colab_type": "text"
499 | },
500 | "cell_type": "markdown",
501 | "source": [
502 | "The gini impurity is actually used by the algorithm to determine a split. The model evaluates every feature (in our case, age and score) at every possible split (46, 47, 48..) to find the point with the lowest gini impurity in two resulting nodes. \n",
503 | "\n",
504 | "The approach is referred to as \"greedy\" because we are choosing the optimal split given our current state. Let's take a closer look at our decision boundary."
505 | ]
506 | },
507 | {
508 | "metadata": {
509 | "id": "uXl22sNTtpHa",
510 | "colab_type": "code",
511 | "colab": {}
512 | },
513 | "cell_type": "code",
514 | "source": [
515 | "# look at the regions in a 2d plot\n",
516 | "# based on scikit-learn tutorial plot_iris.html\n",
517 | "plt.figure(figsize=[10,8])\n",
518 | "dtn.plot_model_pred_2d(mdl, X_train, y_train, \n",
519 | " title=\"Decision tree (depth 1)\")"
520 | ],
521 | "execution_count": 0,
522 | "outputs": []
523 | },
524 | {
525 | "metadata": {
526 | "id": "25zSX-inCNOJ",
527 | "colab_type": "text"
528 | },
529 | "cell_type": "markdown",
530 | "source": [
531 | "In this plot we can see the decision boundary on the y-axis, separating the predicted classes. The true classes are indicated at each point. Where the background and point colours are mismatched, there has been misclassification. Of course we are using a very simple model. Let's see what happens when we increase the depth."
532 | ]
533 | },
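534 | {
535 | "metadata": {},
536 | "cell_type": "markdown",
537 | "source": [
538 | "To put a number on the misclassification, we can also score the stump on the training set (a quick sketch)."
539 | ]
540 | },
541 | {
542 | "metadata": {},
543 | "cell_type": "code",
544 | "source": [
545 | "# fraction of training observations the depth-1 tree classifies correctly\n",
546 | "print('training accuracy: {:.3f}'.format(mdl.score(X_train, y_train)))"
547 | ],
548 | "execution_count": 0,
549 | "outputs": []
550 | },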
534 | {
535 | "metadata": {
536 | "id": "ZuO62CL3CSGm",
537 | "colab_type": "code",
538 | "colab": {}
539 | },
540 | "cell_type": "code",
541 | "source": [
542 | "mdl = tree.DecisionTreeClassifier(max_depth=5)\n",
543 | "mdl = mdl.fit(X_train,y_train)"
544 | ],
545 | "execution_count": 0,
546 | "outputs": []
547 | },
548 | {
549 | "metadata": {
550 | "id": "A88Vi83LCSJ6",
551 | "colab_type": "code",
552 | "colab": {}
553 | },
554 | "cell_type": "code",
555 | "source": [
556 | "plt.figure(figsize=[10,8])\n",
557 | "dtn.plot_model_pred_2d(mdl, X_train, y_train,\n",
558 | " title=\"Decision tree (depth 5)\")"
559 | ],
560 | "execution_count": 0,
561 | "outputs": []
562 | },
563 | {
564 | "metadata": {
565 | "id": "B88XlKDtCYmn",
566 | "colab_type": "text"
567 | },
568 | "cell_type": "markdown",
569 | "source": [
570 | "Now our tree is more complicated! We can see a few vertical boundaries as well as the horizontal one from before. Some of these we may like, but some appear unnatural. Let's look at the tree itself."
571 | ]
572 | },
573 | {
574 | "metadata": {
575 | "id": "V1VLrOJJCcWo",
576 | "colab_type": "code",
577 | "colab": {}
578 | },
579 | "cell_type": "code",
580 | "source": [
581 | "graph = dtn.create_graph(mdl,feature_names=features)\n",
582 | "Image(graph.create_png())"
583 | ],
584 | "execution_count": 0,
585 | "outputs": []
586 | },
587 | {
588 | "metadata": {
589 | "id": "Ton_EnvFqHIO",
590 | "colab_type": "text"
591 | },
592 | "cell_type": "markdown",
593 | "source": [
594 | "Looking at the tree, we can see that there are some very specific rules. Consider our patient aged 65 years with an acute physiology score of 87. From the top of the tree, we would work our way down:\n",
595 | "\n",
596 | "- acutePhysiologyScore <= 78.5? No.\n",
597 | "- acutePhysiologyScore <= 106.5? Yes.\n",
598 | "- age <= 75.5? Yes\n",
599 | "- age <= 66. Yes.\n",
600 | "- age <= 62.5? No. \n",
601 | "\n",
602 | "This leads us to our single node with a gini impurity of 0. Having an entire rule based upon this one observation seems silly, but it is perfectly logical as at the moment. The only objective the algorithm cares about is minimizing the gini impurity. \n",
603 | "\n",
604 | "We are at risk of overfitting our data! This is where \"pruning\" comes in."
605 | ]
606 | },
607 | {
608 | "metadata": {
609 | "id": "VvsNIjCDDIo_",
610 | "colab_type": "code",
611 | "colab": {}
612 | },
613 | "cell_type": "code",
614 | "source": [
615 | "# let's prune the model and look again\n",
616 | "mdl = dtn.prune(mdl, min_samples_leaf = 10)\n",
617 | "graph = dtn.create_graph(mdl,feature_names=features)\n",
618 | "Image(graph.create_png()) "
619 | ],
620 | "execution_count": 0,
621 | "outputs": []
622 | },
623 | {
624 | "metadata": {
625 | "id": "8pRzzV2VvdxP",
626 | "colab_type": "text"
627 | },
628 | "cell_type": "markdown",
629 | "source": [
630 | "Above, we can see that our second tree is (1) smaller in depth, and (2) never splits a node with <= 10 samples. We can look at the decision surface for this tree:"
631 | ]
632 | },
633 | {
634 | "metadata": {
635 | "id": "5LyGDz-Cr-mU",
636 | "colab_type": "code",
637 | "colab": {}
638 | },
639 | "cell_type": "code",
640 | "source": [
641 | "plt.figure(figsize=[10,8])\n",
642 | "dtn.plot_model_pred_2d(mdl, X_train, y_train, title=\"Pruned decision tree\")"
643 | ],
644 | "execution_count": 0,
645 | "outputs": []
646 | },
647 | {
648 | "metadata": {
649 | "id": "xAnqmD_Dv_dh",
650 | "colab_type": "text"
651 | },
652 | "cell_type": "markdown",
653 | "source": [
654 | "Our pruned decision tree has a much more intuitive boundary, but does make some errors. We have reduced our performance in an effort to simplify the tree. This is the classic machine learning problem of trading off complexity with error.\n",
655 | "\n",
656 | "Note that, in order to do this, we \"invented\" the minimum samples per leaf node of 10. Why 10? Why not 5? Why not 20? The answer is: it depends on the dataset. Heuristically choosing these parameters can be time consuming, and we will see later on how gradient boosting elegantly handles this task."
657 | ]
658 | },
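659 | {
660 | "metadata": {},
661 | "cell_type": "markdown",
662 | "source": [
663 | "One common way to choose such a parameter is cross-validation. The sketch below is an illustration only; it uses scikit-learn's pre-pruning `min_samples_leaf` argument rather than the post-pruning helper above."
664 | ]
665 | },
666 | {
667 | "metadata": {},
668 | "cell_type": "code",
669 | "source": [
670 | "# sketch: score a few values of min_samples_leaf with 5-fold cross-validation\n",
671 | "for leaf in [5, 10, 20]:\n",
672 | "    clf = tree.DecisionTreeClassifier(max_depth=5, min_samples_leaf=leaf)\n",
673 | "    scores = cross_val_score(clf, X_train, y_train, cv=5)\n",
674 | "    print('min_samples_leaf={}: {:.3f} +/- {:.3f}'.format(leaf, scores.mean(), scores.std()))"
675 | ],
676 | "execution_count": 0,
677 | "outputs": []
678 | },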
659 | {
660 | "metadata": {
661 | "id": "2EFINpj-wD7H",
662 | "colab_type": "text"
663 | },
664 | "cell_type": "markdown",
665 | "source": [
666 | "## Decision trees have high \"variance\"\n",
667 | "\n",
668 | "Before we move on to boosting, it will be useful to demonstrate how decision trees have high \"variance\". In this context, variance refers to a property of some models to have a wide range of performance given random samples of data. Let's take a look at randomly slicing the data we have too see what that means."
669 | ]
670 | },
671 | {
672 | "metadata": {
673 | "id": "JT7fuuj6vjKB",
674 | "colab_type": "code",
675 | "colab": {}
676 | },
677 | "cell_type": "code",
678 | "source": [
679 | "np.random.seed(123)\n",
680 | "\n",
681 | "fig = plt.figure(figsize=[12,3])\n",
682 | "\n",
683 | "for i in range(3):\n",
684 | " ax = fig.add_subplot(1,3,i+1)\n",
685 | "\n",
686 | " # generate indices in a random order\n",
687 | " idx = np.random.permutation(X_train.shape[0])\n",
688 | " \n",
689 | " # only use the first 50\n",
690 | " idx = idx[:50]\n",
691 | " X_temp = X_train.iloc[idx]\n",
692 | " y_temp = y_train.values[idx]\n",
693 | " \n",
694 | " # initialize the model\n",
695 | " mdl = tree.DecisionTreeClassifier(max_depth=5)\n",
696 | " \n",
697 | " # train the model using the dataset\n",
698 | " mdl = mdl.fit(X_temp, y_temp)\n",
699 | " txt = 'Random sample {}'.format(i)\n",
700 | " dtn.plot_model_pred_2d(mdl, X_temp, y_temp, title=txt)"
701 | ],
702 | "execution_count": 0,
703 | "outputs": []
704 | },
705 | {
706 | "metadata": {
707 | "id": "j6VTIDr-yRRZ",
708 | "colab_type": "text"
709 | },
710 | "cell_type": "markdown",
711 | "source": [
712 | "Above we can see that we are using random subsets of data, and as a result, our decision boundary can change quite a bit. As you could guess, we actually don't want a model that randomly works well and randomly works poorly, so you may wonder why this is useful. \n",
713 | "\n",
714 | "The trick is that by combining many of instances of \"high variance\" classifiers (decision trees), we can end up with a single classifier with low variance. There is an old joke: two farmers and a statistician go hunting. They see a deer: the first farmer shoots, and misses to the left. The next farmer shoots, and misses to the right. The statistician yells \"We got it!!\".\n",
715 | "\n",
716 | "While it doesn't quite hold in real life, it turns out that this principle does hold for decision trees. Combining them in the right way ends up building powerful models."
717 | ]
718 | },
719 | {
720 | "metadata": {
721 | "id": "iWnKvx6myf9Z",
722 | "colab_type": "text"
723 | },
724 | "cell_type": "markdown",
725 | "source": [
726 | "## Boosting\n",
727 | "\n",
728 | "The premise of boosting is the combination of many weak learners to form a single \"strong\" learner. In a nutshell, boosting involves building a models iteratively. At each step we focus on the data on which we performed poorly. \n",
729 | "\n",
730 | "In our context, we'll use decision trees, so the first step would be to build a tree using the data. Next, we'd look at the data that we misclassified, and re-weight the data so that we really wanted to classify those observations correctly, at a cost of maybe getting some of the other data wrong this time. Let's see how this works in practice."
731 | ]
732 | },
733 | {
734 | "metadata": {
735 | "id": "YJWxu0bTwRzD",
736 | "colab_type": "code",
737 | "colab": {}
738 | },
739 | "cell_type": "code",
740 | "source": [
741 | "# build the model\n",
742 | "clf = tree.DecisionTreeClassifier(max_depth=1)\n",
743 | "mdl = ensemble.AdaBoostClassifier(base_estimator=clf,n_estimators=6)\n",
744 | "mdl = mdl.fit(X_train,y_train)\n",
745 | "\n",
746 | "# plot each individual decision tree\n",
747 | "fig = plt.figure(figsize=[12,6])\n",
748 | "for i, estimator in enumerate(mdl.estimators_):\n",
749 | " ax = fig.add_subplot(2,3,i+1)\n",
750 | " txt = 'Tree {}'.format(i+1)\n",
751 | " dtn.plot_model_pred_2d(estimator, X_train, y_train, title=txt)"
752 | ],
753 | "execution_count": 0,
754 | "outputs": []
755 | },
756 | {
757 | "metadata": {
758 | "id": "5zNfvDjTzh2U",
759 | "colab_type": "text"
760 | },
761 | "cell_type": "markdown",
762 | "source": [
763 | "Looking at our example above, we can see that the first iteration builds the exact same simple decision tree as we had seen earlier. This makes sense. It is using the entire dataset with no special weighting. \n",
764 | "\n",
765 | "In the next iteration we can see the model shift. It misclassified several observations in class 1, and now these are the most important observations. Consequently, it picks the boundary that, while prioritizing correctly classifies these observations, still tries to best classify the rest of the data too. \n",
766 | "\n",
767 | "The iteration process continues, until the model is apparently creating boundaries to capture just one or two observations (see, for example, Tree 6 on the bottom right). \n",
768 | "\n",
769 | "One important point is that each tree is weighted by its global error. So, for example, Tree 6 would carry less weight in the final model. It is clear that we wouldn't want Tree 6 to carry the same importance as Tree 1, when Tree 1 is doing so much better overall. It turns out that weighting each tree by the inverse of its error is a pretty good way to do this.\n",
770 | "\n",
771 | "Let's look at final model's decision surface.\n"
772 | ]
773 | },
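774 | {
775 | "metadata": {},
776 | "cell_type": "markdown",
777 | "source": [
778 | "We can inspect these weights directly: scikit-learn exposes the per-tree weights and training errors on the fitted ensemble."
779 | ]
780 | },
781 | {
782 | "metadata": {},
783 | "cell_type": "code",
784 | "source": [
785 | "# weight and weighted training error of each boosting iteration\n",
786 | "print(mdl.estimator_weights_)\n",
787 | "print(mdl.estimator_errors_)"
788 | ],
789 | "execution_count": 0,
790 | "outputs": []
791 | },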
774 | {
775 | "metadata": {
776 | "id": "3pVG5ytfzp_B",
777 | "colab_type": "code",
778 | "colab": {}
779 | },
780 | "cell_type": "code",
781 | "source": [
782 | "# plot the final prediction\n",
783 | "plt.figure(figsize=[9,5])\n",
784 | "txt = 'Boosted tree (final decision surface)'\n",
785 | "dtn.plot_model_pred_2d(mdl, X_train, y_train, title=txt)"
786 | ],
787 | "execution_count": 0,
788 | "outputs": []
789 | },
790 | {
791 | "metadata": {
792 | "id": "YRGRFjRgz26h",
793 | "colab_type": "text"
794 | },
795 | "cell_type": "markdown",
796 | "source": [
797 | "And that's AdaBoost! There are a few tricks we have glossed over here, but you understand the general principle. Now we'll move on to a different approach. With boosting, we iteratively changed the dataset to have new trees focus on the \"difficult\" observations. The next approach we discuss is similar as it also involves using changed versions of our dataset to build new trees."
798 | ]
799 | },
800 | {
801 | "metadata": {
802 | "id": "EFNDNsIpfP7j",
803 | "colab_type": "text"
804 | },
805 | "cell_type": "markdown",
806 | "source": [
807 | "## Bagging\n",
808 | "\n",
809 | "Bootstrap aggregation, or \"Bagging\", is another form of *ensemble learning* where we aim to build a single good model by combining many models together. With AdaBoost, we modified the data to focus on hard to classify observations. We can imagine this as a form of resampling the data for each new tree. For example, say we have three observations: A, B, and C, `[A, B, C]`. If we correctly classify observations `[A, B]`, but incorrectly classify `C`, then AdaBoost involves building a new tree that focuses on `C`. Equivalently, we could say AdaBoost builds a new tree using the dataset `[A, B, C, C, C]`, where we have *intentionally* repeated observation `C` 3 times so that the algorithm thinks it is 3 times as important as the other observations. Makes sense?\n",
810 | "\n",
811 | "Bagging involves the same approach, except we don't selectively choose which observations to focus on, but rather we *randomly select subsets of data each time*. As you can see, while this is a similar process to AdaBoost, the concept is quite different. Whereas before we aimed to iteratively improve our overall model with new trees, we now build trees on what we hope are independent datasets.\n",
812 | "\n",
813 | "Let's take a step back, and think about a practical example. Say we wanted a good model of heart disease. If we saw researchers build a model from a dataset of patients from their hospital, we would be happy. If they then acquired a new dataset from new patients, and built a new model, we'd be inclined to feel that the combination of the two models would be better than any one individually. This exact scenario is what bagging aims to replicate, except instead of actually going out and collecting new datasets, we instead use bootstrapping to create new sets of data from our current dataset. If you are unfamiliar with bootstrapping, you can treat it as \"magic\" for now (and if you are familiar with the bootstrap, you already know that it is magic).\n",
814 | "\n",
815 | "Let's take a look at a simple bootstrap model."
816 | ]
817 | },
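818 | {
819 | "metadata": {
820 | "id": "bootstrapSketch",
821 | "colab_type": "text"
822 | },
823 | "cell_type": "markdown",
824 | "source": [
825 | "Before fitting the bagged model, here is a tiny sketch of what bootstrap resampling looks like (an illustration only, on the toy dataset `[A, B, C]` from above): we draw observations *with replacement*, so some observations repeat and others are left out."
826 | ]
827 | },
828 | {
829 | "metadata": {
830 | "id": "bootstrapSketchCode",
831 | "colab_type": "code",
832 | "colab": {}
833 | },
834 | "cell_type": "code",
835 | "source": [
836 | "# illustration: three bootstrap samples of the toy dataset [A, B, C];\n",
837 | "# sampling is with replacement, so repeats and omissions are expected\n",
838 | "np.random.seed(42)\n",
839 | "for i in range(3):\n",
840 | " sample = np.random.choice(['A', 'B', 'C'], size=3, replace=True)\n",
841 | " print('Bootstrap sample {}: {}'.format(i+1, list(sample)))"
842 | ],
843 | "execution_count": 0,
844 | "outputs": []
845 | },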
818 | {
819 | "metadata": {
820 | "id": "JrXAspvrzv8x",
821 | "colab_type": "code",
822 | "colab": {}
823 | },
824 | "cell_type": "code",
825 | "source": [
826 | "np.random.seed(321)\n",
827 | "clf = tree.DecisionTreeClassifier(max_depth=5)\n",
828 | "mdl = ensemble.BaggingClassifier(base_estimator=clf, n_estimators=6)\n",
829 | "mdl = mdl.fit(X_train, y_train)\n",
830 | "\n",
831 | "fig = plt.figure(figsize=[12,6])\n",
832 | "for i, estimator in enumerate(mdl.estimators_): \n",
833 | " ax = fig.add_subplot(2,3,i+1)\n",
834 | " txt = 'Tree {}'.format(i+1)\n",
835 | " dtn.plot_model_pred_2d(estimator, X_train, y_train, \n",
836 | " title=txt)"
837 | ],
838 | "execution_count": 0,
839 | "outputs": []
840 | },
841 | {
842 | "metadata": {
843 | "id": "s3kKUPORfW9F",
844 | "colab_type": "text"
845 | },
846 | "cell_type": "markdown",
847 | "source": [
848 | "We can see that each individual tree is quite variable. This is a result of using a random set of data to train the classifier."
849 | ]
850 | },
851 | {
852 | "metadata": {
853 | "id": "w_D7_-0HfVMy",
854 | "colab_type": "code",
855 | "colab": {}
856 | },
857 | "cell_type": "code",
858 | "source": [
859 | "# plot the final prediction\n",
860 | "plt.figure(figsize=[8,5])\n",
861 | "txt = 'Bagged tree (final decision surface)'\n",
862 | "dtn.plot_model_pred_2d(mdl, X_train, y_train, title=txt)"
863 | ],
864 | "execution_count": 0,
865 | "outputs": []
866 | },
867 | {
868 | "metadata": {
869 | "id": "AOFnG0r6faLS",
870 | "colab_type": "text"
871 | },
872 | "cell_type": "markdown",
873 | "source": [
874 | "Not bad! Of course, since this is a simple dataset, we are not seeing that many dramatic changes between different models. Don't worry, we'll quantitatively evaluate them later. \n",
875 | "\n",
876 | "Next up, a minor addition creates one of the most popular models in machine learning."
877 | ]
878 | },
879 | {
880 | "metadata": {
881 | "id": "aiqrVfYtfcYk",
882 | "colab_type": "text"
883 | },
884 | "cell_type": "markdown",
885 | "source": [
886 | "## Random Forest\n",
887 | "\n",
888 | "In the previous example, we used bagging to randomly resample our data to generate \"new\" datasets. The Random Forest takes this one step further: instead of just resampling our data, we also select only a fraction of the features to include. \n",
889 | "\n",
890 | "It turns out that this subselection tends to improve the performance of our models. The odds of an individual being very good or very bad is higher (i.e. the variance of the trees is increased), and this ends up giving us a final model with better overall performance (lower bias).\n",
891 | "\n",
892 | "Let's train the model."
893 | ]
894 | },
895 | {
896 | "metadata": {
897 | "id": "u27LS36_fglG",
898 | "colab_type": "code",
899 | "colab": {}
900 | },
901 | "cell_type": "code",
902 | "source": [
903 | "np.random.seed(321)\n",
904 | "mdl = ensemble.RandomForestClassifier(max_depth=5, n_estimators=6, max_features=1)\n",
905 | "mdl = mdl.fit(X_train,y_train)\n",
906 | "\n",
907 | "fig = plt.figure(figsize=[12,6])\n",
908 | "for i, estimator in enumerate(mdl.estimators_): \n",
909 | " ax = fig.add_subplot(2,3,i+1)\n",
910 | " txt = 'Tree {}'.format(i+1)\n",
911 | " dtn.plot_model_pred_2d(estimator, X_train, y_train, title=txt)"
912 | ],
913 | "execution_count": 0,
914 | "outputs": []
915 | },
916 | {
917 | "metadata": {
918 | "id": "5aG0PI8lruGN",
919 | "colab_type": "code",
920 | "colab": {}
921 | },
922 | "cell_type": "code",
923 | "source": [
924 | "plt.figure(figsize=[9,5])\n",
925 | "txt = 'Random forest (final decision surface)'\n",
926 | "dtn.plot_model_pred_2d(mdl, X_train, y_train, title=txt)"
927 | ],
928 | "execution_count": 0,
929 | "outputs": []
930 | },
931 | {
932 | "metadata": {
933 | "id": "2KmJuztXfjzm",
934 | "colab_type": "text"
935 | },
936 | "cell_type": "markdown",
937 | "source": [
938 | "Again, the visualization doesn't *really* show us the power of Random Forests, but we'll quantitatively evaluate them soon enough.\n",
939 | "\n",
940 | "Last, and not least, we move on to gradient boosting."
941 | ]
942 | },
943 | {
944 | "metadata": {
945 | "id": "LTP8zFIofl2v",
946 | "colab_type": "text"
947 | },
948 | "cell_type": "markdown",
949 | "source": [
950 | "## Gradient Boosting\n",
951 | "\n",
952 | "Gradient boosting, our last topic, elegantly combines concepts from the previous methods. As a \"boosting\" method, gradient boosting involves iteratively building trees, aiming to improve upon misclassifications of the previous tree. Gradient boosting also borrows the concept of sub-sampling the variables (just like Random Forests), which can help to prevent overfitting.\n",
953 | "\n",
954 | "While it is hard to express in this non-technical tutorial, the biggest innovation in gradient boosting is that it provides a unifying mathematical framework for boosting models. The approach explicitly casts the problem of building a tree as an optimization problem, defining mathematical functions for how well a tree is performing (which we had before) *and* how complex a tree is. In this light, one can actually treat AdaBoost as a \"special case\" of gradient boosting, where the loss function is chosen to be the exponential loss.\n",
955 | "\n",
956 | "Let's build a gradient boosting model."
957 | ]
958 | },
959 | {
960 | "metadata": {
961 | "id": "L_QVZ9oNfnqk",
962 | "colab_type": "code",
963 | "colab": {}
964 | },
965 | "cell_type": "code",
966 | "source": [
967 | "np.random.seed(321)\n",
968 | "mdl = ensemble.GradientBoostingClassifier(n_estimators=10)\n",
969 | "mdl = mdl.fit(X_train, y_train)\n",
970 | "\n",
971 | "plt.figure(figsize=[9,5])\n",
972 | "txt = 'Gradient boosted tree (final decision surface)'\n",
973 | "dtn.plot_model_pred_2d(mdl, X_train, y_train, title=txt)"
974 | ],
975 | "execution_count": 0,
976 | "outputs": []
977 | },
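978 | {
979 | "metadata": {
980 | "id": "gbExpLossAside",
981 | "colab_type": "text"
982 | },
983 | "cell_type": "markdown",
984 | "source": [
985 | "The AdaBoost connection can be seen in scikit-learn's API: `GradientBoostingClassifier` accepts `loss='exponential'`, under which gradient boosting recovers the AdaBoost algorithm (the default, `'deviance'`, is the logistic loss). A quick sketch, reusing the training data from above:"
986 | ]
987 | },
988 | {
989 | "metadata": {
990 | "id": "gbExpLossAsideCode",
991 | "colab_type": "code",
992 | "colab": {}
993 | },
994 | "cell_type": "code",
995 | "source": [
996 | "# sketch: gradient boosting with the exponential loss, the special\n",
997 | "# case under which gradient boosting recovers AdaBoost\n",
998 | "np.random.seed(321)\n",
999 | "mdl_exp = ensemble.GradientBoostingClassifier(n_estimators=10, loss='exponential')\n",
1000 | "mdl_exp = mdl_exp.fit(X_train, y_train)\n",
1001 | "\n",
1002 | "plt.figure(figsize=[9,5])\n",
1003 | "txt = 'Gradient boosting, exponential loss (decision surface)'\n",
1004 | "dtn.plot_model_pred_2d(mdl_exp, X_train, y_train, title=txt)"
1005 | ],
1006 | "execution_count": 0,
1007 | "outputs": []
1008 | },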
978 | {
979 | "metadata": {
980 | "id": "tcCzP4gAsd7L",
981 | "colab_type": "text"
982 | },
983 | "cell_type": "markdown",
984 | "source": [
985 | "## Comparing model performance\n",
986 | "\n",
987 | "We've now learned the basics of the various tree methods and have visualized most of them. Let's finish by comparing the performance of our models on our held-out test data. Our goal, remember, is to predict whether or not a patient will survive their hospital stay using the patient's age and acute physiology score computed on the first day of their ICU stay."
988 | ]
989 | },
990 | {
991 | "metadata": {
992 | "id": "tQST4TQAtHmU",
993 | "colab_type": "code",
994 | "colab": {}
995 | },
996 | "cell_type": "code",
997 | "source": [
998 | "clf = dict()\n",
999 | "clf['Decision Tree'] = tree.DecisionTreeClassifier(criterion='entropy', splitter='best').fit(X_train,y_train)\n",
1000 | "clf['Gradient Boosting'] = ensemble.GradientBoostingClassifier(n_estimators=10).fit(X_train, y_train)\n",
1001 | "clf['Random Forest'] = ensemble.RandomForestClassifier(n_estimators=10).fit(X_train, y_train)\n",
1002 | "clf['Bagging'] = ensemble.BaggingClassifier(n_estimators=10).fit(X_train, y_train)\n",
1003 | "clf['AdaBoost'] = ensemble.AdaBoostClassifier(n_estimators=10).fit(X_train, y_train)\n",
1004 | "\n",
1005 | "fig = plt.figure(figsize=[10,10])\n",
1006 | "\n",
1007 | "print('AUROC\\tModel')\n",
1008 | "for i, curr_mdl in enumerate(clf): \n",
1009 | " yhat = clf[curr_mdl].predict_proba(X_test)[:,1]\n",
1010 | " score = metrics.roc_auc_score(y_test, yhat)\n",
1011 | " print('{:0.3f}\\t{}'.format(score, curr_mdl))\n",
1012 | " ax = fig.add_subplot(3,2,i+1)\n",
1013 | " dtn. plot_model_pred_2d(clf[curr_mdl], X_test, y_test, title=curr_mdl)\n",
1014 | " "
1015 | ],
1016 | "execution_count": 0,
1017 | "outputs": []
1018 | },
1019 | {
1020 | "metadata": {
1021 | "id": "osr6iM6ltLAP",
1022 | "colab_type": "text"
1023 | },
1024 | "cell_type": "markdown",
1025 | "source": [
1026 | "Here we can see that quantitatively, gradient boosting has produced the highest discrimination among all the models (~0.91). You'll see that some of the models appear to have simpler decision surfaces, which tends to result in improved generalization on a held-out test set (though not always!).\n",
1027 | "\n",
1028 | "To make appropriate comparisons, we should calculate 95% confidence intervals on these performance estimates. This can be done a number of ways. A simple but effective approach is to use bootstrapping, a resampling technique. In bootstrapping, we generate multiple datasets from the test set (allowing the same data point to be sampled multiple times). Using these datasets, we can then estimate the confidence intervals."
1029 | ]
1030 | },
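1031 | {
1032 | "metadata": {
1033 | "id": "bootstrapCISketch",
1034 | "colab_type": "text"
1035 | },
1036 | "cell_type": "markdown",
1037 | "source": [
1038 | "As a rough sketch of the idea (illustrative only; the number of resamples, 200, is an arbitrary choice), we can bootstrap a 95% confidence interval for the AUROC of the gradient boosting model:"
1039 | ]
1040 | },
1041 | {
1042 | "metadata": {
1043 | "id": "bootstrapCISketchCode",
1044 | "colab_type": "code",
1045 | "colab": {}
1046 | },
1047 | "cell_type": "code",
1048 | "source": [
1049 | "# rough sketch: bootstrap 95% confidence interval for the AUROC of\n",
1050 | "# the gradient boosting model, resampling the test set with replacement\n",
1051 | "np.random.seed(321)\n",
1052 | "yhat = clf['Gradient Boosting'].predict_proba(X_test)[:,1]\n",
1053 | "y_arr = y_test.values\n",
1054 | "aucs = []\n",
1055 | "for _ in range(200):\n",
1056 | " idx = np.random.randint(0, len(y_arr), len(y_arr))\n",
1057 | " if len(np.unique(y_arr[idx])) < 2:\n",
1058 | "  continue # AUROC is undefined if a resample has only one class\n",
1059 | " aucs.append(metrics.roc_auc_score(y_arr[idx], yhat[idx]))\n",
1060 | "ci = np.percentile(aucs, [2.5, 97.5])\n",
1061 | "print('AUROC 95% CI: {:0.3f} - {:0.3f}'.format(ci[0], ci[1]))"
1062 | ],
1063 | "execution_count": 0,
1064 | "outputs": []
1065 | },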
1031 | {
1032 | "metadata": {
1033 | "id": "kABSe8ZmudSH",
1034 | "colab_type": "text"
1035 | },
1036 | "cell_type": "markdown",
1037 | "source": [
1038 | "## Further reading"
1039 | ]
1040 | },
1041 | {
1042 | "metadata": {
1043 | "id": "bfFZSe0vue86",
1044 | "colab_type": "code",
1045 | "colab": {}
1046 | },
1047 | "cell_type": "code",
1048 | "source": [
1049 | ""
1050 | ],
1051 | "execution_count": 0,
1052 | "outputs": []
1053 | }
1054 | ]
1055 | }
--------------------------------------------------------------------------------
/eicu_python/04_prediction_satoshi.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {
6 | "id": "view-in-github",
7 | "colab_type": "text"
8 | },
9 | "source": [
10 | "
"
11 | ]
12 | },
13 | {
14 | "cell_type": "markdown",
15 | "metadata": {
16 | "colab_type": "text",
17 | "id": "T3wdKZCPklNq"
18 | },
19 | "source": [
20 | "# eICU Collaborative Research Database\n",
21 | "\n",
22 | "# Notebook 4: Prediction\n",
23 | "\n",
24 | "This notebook explores how a decision trees can be trained to predict in-hospital mortality of patients.\n",
25 | "\n",
26 | "このノートブックでは、decision treeを用いて院内死亡を予測するモデルを作ります。"
27 | ]
28 | },
29 | {
30 | "cell_type": "markdown",
31 | "metadata": {
32 | "colab_type": "text",
33 | "id": "rG3HrM7GkwCH"
34 | },
35 | "source": [
36 | "## Load libraries and connect to the database"
37 | ]
38 | },
39 | {
40 | "cell_type": "code",
41 | "execution_count": 0,
42 | "metadata": {
43 | "colab": {},
44 | "colab_type": "code",
45 | "id": "s-MoFA6NkkbZ"
46 | },
47 | "outputs": [],
48 | "source": [
49 | "# Import libraries\n",
50 | "import os\n",
51 | "import numpy as np\n",
52 | "import pandas as pd\n",
53 | "import matplotlib.pyplot as plt\n",
54 | "\n",
55 | "# model building\n",
56 | "from sklearn import ensemble, impute, metrics, preprocessing, tree\n",
57 | "from sklearn.model_selection import cross_val_score, train_test_split\n",
58 | "from sklearn.pipeline import Pipeline\n",
59 | "\n",
60 | "# Make pandas dataframes prettier\n",
61 | "from IPython.display import display, HTML, Image\n",
62 | "plt.rcParams.update({'font.size': 20})\n",
63 | "%matplotlib inline\n",
64 | "plt.style.use('ggplot')\n",
65 | "\n",
66 | "# Access data using Google BigQuery.\n",
67 | "from google.colab import auth\n",
68 | "from google.cloud import bigquery"
69 | ]
70 | },
71 | {
72 | "cell_type": "code",
73 | "execution_count": 0,
74 | "metadata": {
75 | "colab": {},
76 | "colab_type": "code",
77 | "id": "jyBV_Q9DkyD3"
78 | },
79 | "outputs": [],
80 | "source": [
81 | "# authenticate\n",
82 | "auth.authenticate_user()"
83 | ]
84 | },
85 | {
86 | "cell_type": "code",
87 | "execution_count": 0,
88 | "metadata": {
89 | "colab": {},
90 | "colab_type": "code",
91 | "id": "cF1udJKhkzYq"
92 | },
93 | "outputs": [],
94 | "source": [
95 | "# Set up environment variables\n",
96 | "project_id='datathonjapan2019'\n",
97 | "os.environ[\"GOOGLE_CLOUD_PROJECT\"]=project_id"
98 | ]
99 | },
100 | {
101 | "cell_type": "markdown",
102 | "metadata": {
103 | "colab_type": "text",
104 | "id": "xGurBAQIUDTt"
105 | },
106 | "source": [
107 | "To make our lives easier, we'll also install and import a set of helper functions from the `datathon2` package. We will be using the following functions from the package:\n",
108 | "\n",
109 | "`datathon2`パッケージの中にある以下のfunctionを使いましょう。\n",
110 | "- `plot_model_pred_2d`: to visualize our data, helping to display a class split assigned by a tree vs the true class.\n",
111 | "\n",
112 | "(作ったtreeと、それによるクラス分けを図示してくれるfunctionです)\n",
113 | "- `run_query()`: to run an SQL query against our BigQuery database and assign the results to a dataframe. \n",
114 | "\n",
115 | "(BigQuery上でSQLを行い、データを抽出するためのfunctionです)\n"
116 | ]
117 | },
118 | {
119 | "cell_type": "code",
120 | "execution_count": 0,
121 | "metadata": {
122 | "colab": {},
123 | "colab_type": "code",
124 | "id": "GDEewAlvk0oT"
125 | },
126 | "outputs": [],
127 | "source": [
128 | "!pip install datathon2"
129 | ]
130 | },
131 | {
132 | "cell_type": "code",
133 | "execution_count": 0,
134 | "metadata": {
135 | "colab": {},
136 | "colab_type": "code",
137 | "id": "JM6O5GPAUI89"
138 | },
139 | "outputs": [],
140 | "source": [
141 | "import datathon2 as dtn\n",
142 | "import pydotplus\n",
143 | "from tableone import TableOne"
144 | ]
145 | },
146 | {
147 | "cell_type": "markdown",
148 | "metadata": {
149 | "colab_type": "text",
150 | "id": "hq_09Hh-y17k"
151 | },
152 | "source": [
153 | "In this notebook we'll be looking at tree models, so we'll now install a package for visualizing these models."
154 | ]
155 | },
156 | {
157 | "cell_type": "code",
158 | "execution_count": 0,
159 | "metadata": {
160 | "colab": {},
161 | "colab_type": "code",
162 | "id": "jBMOwgwszGOw"
163 | },
164 | "outputs": [],
165 | "source": [
166 | "!apt-get install graphviz -y"
167 | ]
168 | },
169 | {
170 | "cell_type": "markdown",
171 | "metadata": {
172 | "colab_type": "text",
173 | "id": "LgcRCqxCk3HC"
174 | },
175 | "source": [
176 | "## Load the patient cohort\n",
177 | "\n",
178 | "Let's extract a cohort of patients admitted to the ICU from the emergency department. We link demographics data from the `patient` table to severity of illness score data in the `apachepatientresult` table. We exclude readmissions and neurological patients to help create a population suitable for our demonstration.\n",
179 | "\n",
180 | "`patient`テーブルから患者背景を、`apachepatientresult`テーブルから重症度に関する情報を抽出し、一つにまとめます。"
181 | ]
182 | },
183 | {
184 | "cell_type": "code",
185 | "execution_count": 0,
186 | "metadata": {
187 | "colab": {},
188 | "colab_type": "code",
189 | "id": "ReCl7-aek1-k"
190 | },
191 | "outputs": [],
192 | "source": [
193 | "# Link the patient, apachepatientresult, and apacheapsvar tables on patientunitstayid\n",
194 | "# using an inner join.\n",
195 | "query = \"\"\"\n",
196 | "SELECT p.unitadmitsource, p.gender, p.age, p.unittype, p.unitstaytype, \n",
197 | " a.actualhospitalmortality, a.acutePhysiologyScore, a.apacheScore\n",
198 | "FROM `physionet-data.eicu_crd_demo.patient` p\n",
199 | "INNER JOIN `physionet-data.eicu_crd_demo.apachepatientresult` a\n",
200 | "ON p.patientunitstayid = a.patientunitstayid\n",
201 | "WHERE a.apacheversion LIKE 'IVa'\n",
202 | "AND LOWER(p.unitadmitsource) LIKE \"%emergency%\"\n",
203 | "AND LOWER(p.unitstaytype) LIKE \"admit%\"\n",
204 | "AND LOWER(p.unittype) NOT LIKE \"%neuro%\";\n",
205 | "\"\"\"\n",
206 | "\n",
207 | "cohort = dtn.run_query(query,project_id)"
208 | ]
209 | },
210 | {
211 | "cell_type": "code",
212 | "execution_count": 0,
213 | "metadata": {
214 | "colab": {},
215 | "colab_type": "code",
216 | "id": "yxLctVBpk9sO"
217 | },
218 | "outputs": [],
219 | "source": [
220 | "cohort.head()"
221 | ]
222 | },
223 | {
224 | "cell_type": "markdown",
225 | "metadata": {
226 | "colab_type": "text",
227 | "id": "NPlwRV2buYb1"
228 | },
229 | "source": [
230 | "## Preparing the data for analysis\n",
231 | "\n",
232 | "Before continuing, we want to review our data, paying attention to factors such as:\n",
233 | "- data types (for example, are values recorded as characters or numerical values?) (データの型)\n",
234 | "- missing data(データの欠落)\n",
235 | "- distribution of values(データの分布)"
236 | ]
237 | },
238 | {
239 | "cell_type": "code",
240 | "execution_count": 0,
241 | "metadata": {
242 | "colab": {},
243 | "colab_type": "code",
244 | "id": "v3OJ4LDvueKu"
245 | },
246 | "outputs": [],
247 | "source": [
248 | "# dataset info\n",
249 | "print(cohort.info())"
250 | ]
251 | },
252 | {
253 | "cell_type": "code",
254 | "execution_count": 0,
255 | "metadata": {
256 | "colab": {},
257 | "colab_type": "code",
258 | "id": "s4wQ6o_RvLph"
259 | },
260 | "outputs": [],
261 | "source": [
262 | "# Encode the categorical data\n",
263 | "encoder = preprocessing.LabelEncoder()\n",
264 | "cohort['gender_code'] = encoder.fit_transform(cohort['gender'])\n",
265 | "cohort['actualhospitalmortality_code'] = encoder.fit_transform(cohort['actualhospitalmortality'])\n"
266 | ]
267 | },
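268 | {
269 | "cell_type": "markdown",
270 | "metadata": {
271 | "colab_type": "text",
272 | "id": "encoderClassesAside"
273 | },
274 | "source": [
275 | "As a quick check (an illustrative aside), we can see which label was mapped to which integer code. `LabelEncoder` assigns codes in sorted label order, and the encoder above was last fitted on `actualhospitalmortality`."
276 | ]
277 | },
278 | {
279 | "cell_type": "code",
280 | "execution_count": 0,
281 | "metadata": {
282 | "colab": {},
283 | "colab_type": "code",
284 | "id": "encoderClassesAsideCode"
285 | },
286 | "outputs": [],
287 | "source": [
288 | "# quick check: the position of each label in classes_ is its integer code\n",
289 | "# (the encoder was last fitted on actualhospitalmortality)\n",
290 | "for code, label in enumerate(encoder.classes_):\n",
291 | " print('{} -> {}'.format(label, code))"
292 | ]
293 | },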
268 | {
269 | "cell_type": "code",
270 | "execution_count": 0,
271 | "metadata": {
272 | "colab": {},
273 | "colab_type": "code",
274 | "id": "4ogi_ns-ylnP"
275 | },
276 | "outputs": [],
277 | "source": [
278 | "# Handle the deidentified ages\n",
279 | "cohort['age'] = pd.to_numeric(cohort['age'], downcast='integer', errors='coerce')\n",
280 | "cohort['age'] = cohort['age'].fillna(value=91.5)"
281 | ]
282 | },
283 | {
284 | "cell_type": "code",
285 | "execution_count": 0,
286 | "metadata": {
287 | "colab": {},
288 | "colab_type": "code",
289 | "id": "77M0QJQ5wcPQ"
290 | },
291 | "outputs": [],
292 | "source": [
293 | "# Preview the encoded data\n",
294 | "cohort[['gender','gender_code']].head()"
295 | ]
296 | },
297 | {
298 | "cell_type": "code",
299 | "execution_count": 0,
300 | "metadata": {
301 | "colab": {},
302 | "colab_type": "code",
303 | "id": "GqvwTNPN3KZz"
304 | },
305 | "outputs": [],
306 | "source": [
307 | "# Check the outcome variable\n",
308 | "cohort['actualhospitalmortality_code'].unique()"
309 | ]
310 | },
311 | {
312 | "cell_type": "code",
313 | "execution_count": 0,
314 | "metadata": {
315 | "colab": {},
316 | "colab_type": "code",
317 | "id": "gIIsthy1WK3i"
318 | },
319 | "outputs": [],
320 | "source": [
321 | "# View summary statistics\n",
322 | "pd.set_option('display.height', 500)\n",
323 | "pd.set_option('display.max_rows', 500)\n",
324 | "TableOne(cohort,groupby='actualhospitalmortality')"
325 | ]
326 | },
327 | {
328 | "cell_type": "markdown",
329 | "metadata": {
330 | "colab_type": "text",
331 | "id": "IGtKlTG1gvRf"
332 | },
333 | "source": [
334 | "From these summary statistics, we can see that the average age is higher in the group of patients who do not survive. What other differences do you see?\n",
335 | "\n",
336 | "生存群・死亡群でどのような差を認めますか?"
337 | ]
338 | },
339 | {
340 | "cell_type": "markdown",
341 | "metadata": {
342 | "colab_type": "text",
343 | "id": "ze7y5J4Ioz8u"
344 | },
345 | "source": [
346 | "## Creating our train and test sets\n",
347 | "\n",
348 | "We only focus on two variables for our analysis, age and acute physiology score. Limiting ourselves to two variables will make it easier to visualize our models.\n",
349 | "\n",
350 | "今回は、年齢と重症度スコアの二つに注目してモデルを作成します。抽出したデータを、トレーニングセット(モデルを作るためのデータセット)とテストセット(モデルのパフォーマンスを測るためのデータセット)に分けて行います。"
351 | ]
352 | },
353 | {
354 | "cell_type": "code",
355 | "execution_count": 0,
356 | "metadata": {
357 | "colab": {},
358 | "colab_type": "code",
359 | "id": "i5zXkn_AlDJW"
360 | },
361 | "outputs": [],
362 | "source": [
363 | "features = ['age','acutePhysiologyScore']\n",
364 | "outcome = 'actualhospitalmortality_code'\n",
365 | "\n",
366 | "X = cohort[features]\n",
367 | "y = cohort[outcome]"
368 | ]
369 | },
370 | {
371 | "cell_type": "code",
372 | "execution_count": 0,
373 | "metadata": {
374 | "colab": {},
375 | "colab_type": "code",
376 | "id": "IHhIgDUwocmA"
377 | },
378 | "outputs": [],
379 | "source": [
380 | "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)"
381 | ]
382 | },
383 | {
384 | "cell_type": "code",
385 | "execution_count": 0,
386 | "metadata": {
387 | "colab": {},
388 | "colab_type": "code",
389 | "id": "NvQWkuY6nkZ8"
390 | },
391 | "outputs": [],
392 | "source": [
393 | "# Review the number of cases in each set\n",
394 | "print(\"Train data: {}\".format(len(X_train)))\n",
395 | "print(\"Test data: {}\".format(len(X_test)))"
396 | ]
397 | },
398 | {
399 | "cell_type": "markdown",
400 | "metadata": {
401 | "colab_type": "text",
402 | "id": "b2waK5qBqanC"
403 | },
404 | "source": [
405 | "## Decision trees\n",
406 | "\n",
407 | "Let's build the simplest tree model we can think of: a classification tree with only one split. Decision trees of this form are commonly referred to under the umbrella term Classification and Regression Trees (CART) [1]. \n",
408 | "\n",
409 | "Treeによるモデルの作成をCARTと総称し、今回は生存・死亡という2つのアウトカムへの分類(classification)を目的としたdesicion treeを用います。\n",
410 | "\n",
411 | "While we will only be looking at classification here, regression isn't too different. After grouping the data (which is essentially what a decision tree does), classification involves assigning all members of the group to the majority class of that group during training. Regression is the same, except you would assign the average value, not the majority.\n",
412 | "\n",
413 | "クラス分けを主とするdesicion treeでは、あるルールで分け(split)グループ化した後、その多数が属するクラスへとそのグループを割り当てます。連続変数をアウトカムとするregression treeも、平均値を用いること以外は基本的に同じです。\n",
414 | "\n",
415 | "In the case of a decision tree with one split, often called a \"stump\", the model will partition the data into two groups, and assign classes for those two groups based on majority vote. There are many parameters available for the DecisionTreeClassifier class; by specifying max_depth=1 we will build a decision tree with only one split - i.e. of depth 1.\n",
416 | "\n",
417 | "一回のみsplitした場合(stump)、データは二群に分けられ、それぞれの群は多数が属するクラスへと割り当てられます(モデルを使った予測)。以下のように`max_depth=1`と指定することで、一回のみのsplitとなります。\n",
418 | "\n",
419 | "[1] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth, Belmont, CA, 1984."
420 | ]
421 | },
422 | {
423 | "cell_type": "code",
424 | "execution_count": 0,
425 | "metadata": {
426 | "colab": {},
427 | "colab_type": "code",
428 | "id": "RlG3N3OYBqAm"
429 | },
430 | "outputs": [],
431 | "source": [
432 | "# specify max_depth=1 so we train a stump, i.e. a tree with only 1 split\n",
433 | "mdl = tree.DecisionTreeClassifier(max_depth=1)\n",
434 | "\n",
435 | "# fit the model to the data - trying to predict y from X\n",
436 | "mdl = mdl.fit(X_train,y_train)"
437 | ]
438 | },
439 | {
440 | "cell_type": "markdown",
441 | "metadata": {
442 | "colab_type": "text",
443 | "id": "8RlioUw8B_0O"
444 | },
445 | "source": [
446 | "Our model is so simple that we can look at the full decision tree."
447 | ]
448 | },
449 | {
450 | "cell_type": "code",
451 | "execution_count": 0,
452 | "metadata": {
453 | "colab": {},
454 | "colab_type": "code",
455 | "id": "G2t9Nz8pBqEb"
456 | },
457 | "outputs": [],
458 | "source": [
459 | "graph = dtn.create_graph(mdl,feature_names=features)\n",
460 | "Image(graph.create_png())"
461 | ]
462 | },
463 | {
464 | "cell_type": "markdown",
465 | "metadata": {
466 | "colab_type": "text",
467 | "id": "E-iPwWWKCGY9"
468 | },
469 | "source": [
470 | "Here we see three nodes: a node at the top, a node in the lower left, and a node in the lower right.\n",
471 | "\n",
472 | "図が今回のdecision treeの概要です。三つのnode(四角に囲まれた部分)に注目していきます。\n",
473 | "\n",
474 | "The top node is the root of the tree: it contains all the data. Let's read this node bottom to top:\n",
475 | "\n",
476 | "まず一番上にあるnodeは、全てのデータを含んでいます。\n",
477 | "- `value = [384, 44]`: Current class balance. There are 384 observations of class 0 and 44 observations of class 1.(実際のクラス:生存 or 死亡)\n",
478 | "- `samples = 428`: Number of samples assessed at this node.(サンプル合計)\n",
479 | "- `gini = 0.184`: Gini impurity, a measure of \"impurity\". The higher the value, the bigger the mix of classes. A 50/50 split of two classes would result in an index of 0.5.(\"Gini\"は、\"impurity\"(不純・混入)の指標です。Gini impurityが低いということは、クラス分けが成功していることを意味します。クラスが半々の場合、Giniは0.5となります)\n",
480 | "- `acutePhysiologyScore <=78.5`: Decision rule learned by the node. In this case, patients with a score of <= 78.5 are moved into the left node and >78.5 to the right. (Splitのルールです。今回は重症度スコアが78.5以下であれば左のnodeへ分けられます)"
481 | ]
482 | },
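483 | {
484 | "cell_type": "markdown",
485 | "metadata": {
486 | "colab_type": "text",
487 | "id": "giniByHand"
488 | },
489 | "source": [
490 | "As a quick check, we can reproduce the root node's gini impurity by hand: for two classes with proportions p0 and p1, the gini impurity is 1 - p0^2 - p1^2."
491 | ]
492 | },
493 | {
494 | "cell_type": "code",
495 | "execution_count": 0,
496 | "metadata": {
497 | "colab": {},
498 | "colab_type": "code",
499 | "id": "giniByHandCode"
500 | },
501 | "outputs": [],
502 | "source": [
503 | "# quick check: gini = 1 - p0**2 - p1**2 for the root node,\n",
504 | "# which holds 384 observations of class 0 and 44 of class 1\n",
505 | "p0 = 384 / 428\n",
506 | "p1 = 44 / 428\n",
507 | "print('gini = {:0.3f}'.format(1 - p0**2 - p1**2))"
508 | ]
509 | },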
483 | {
484 | "cell_type": "markdown",
485 | "metadata": {
486 | "colab_type": "text",
487 | "id": "KS0UcZqUeJKz"
488 | },
489 | "source": [
490 | "The gini impurity is actually used by the algorithm to determine a split. The model evaluates every feature (in our case, age and score) at every possible split (46, 47, 48..) to find the point with the lowest gini impurity in two resulting nodes.\n",
491 | "\n",
492 | "Decision treeではGiniを指標にsplitのアルゴリズムが決定されます。すなわち、split後の二つのGini impurityが最も低くなるように、モデルの予測因子(年齢と重症度スコア)とそのカットオフポイントが選ばれます。\n",
493 | "\n",
494 | "The approach is referred to as \"greedy\" because we are choosing the optimal split given our current state. Let's take a closer look at our decision boundary.\n",
495 | "\n",
496 | "このようなsplitの選択方法を\"greedy\"と呼びます。"
497 | ]
498 | },
499 | {
500 | "cell_type": "code",
501 | "execution_count": 0,
502 | "metadata": {
503 | "colab": {},
504 | "colab_type": "code",
505 | "id": "uXl22sNTtpHa"
506 | },
507 | "outputs": [],
508 | "source": [
509 | "# look at the regions in a 2d plot\n",
510 | "# based on scikit-learn tutorial plot_iris.html\n",
511 | "plt.figure(figsize=[10,8])\n",
512 | "dtn.plot_model_pred_2d(mdl, X_train, y_train, \n",
513 | " title=\"Decision tree (depth 1)\")"
514 | ]
515 | },
516 | {
517 | "cell_type": "markdown",
518 | "metadata": {
519 | "colab_type": "text",
520 | "id": "25zSX-inCNOJ"
521 | },
522 | "source": [
523 | "In this plot we can see the decision boundary on the y-axis, separating the predicted classes. The true classes are indicated at each point. Where the background and point colours are mismatched, there has been misclassification. Of course we are using a very simple model. Let's see what happens when we increase the depth.\n",
524 | "\n",
525 | "y軸に境界線(重症度スコア=78.5)があり、このモデルではその線を元にクラス分け(生存の予測 or 死亡の予測)を行っています。それぞれの点の色が本当のクラス(生存 or 死亡)を表しており、背景の色と点の色が異なる場合、クラス分けの不一致(misclassification)を示します。これより、depthを増やすことで複雑なモデルにしていきます。"
526 | ]
527 | },
528 | {
529 | "cell_type": "code",
530 | "execution_count": 0,
531 | "metadata": {
532 | "colab": {},
533 | "colab_type": "code",
534 | "id": "ZuO62CL3CSGm"
535 | },
536 | "outputs": [],
537 | "source": [
538 | "mdl = tree.DecisionTreeClassifier(max_depth=5)\n",
539 | "mdl = mdl.fit(X_train,y_train)"
540 | ]
541 | },
542 | {
543 | "cell_type": "code",
544 | "execution_count": 0,
545 | "metadata": {
546 | "colab": {},
547 | "colab_type": "code",
548 | "id": "A88Vi83LCSJ6"
549 | },
550 | "outputs": [],
551 | "source": [
552 | "plt.figure(figsize=[10,8])\n",
553 | "dtn.plot_model_pred_2d(mdl, X_train, y_train,\n",
554 | " title=\"Decision tree (depth 5)\")"
555 | ]
556 | },
557 | {
558 | "cell_type": "markdown",
559 | "metadata": {
560 | "colab_type": "text",
561 | "id": "B88XlKDtCYmn"
562 | },
563 | "source": [
564 | "Now our tree is more complicated! We can see a few vertical boundaries as well as the horizontal one from before. Some of these we may like, but some appear unnatural. Let's look at the tree itself.\n",
565 | "\n",
566 | "このモデルでは、x軸・y軸ともにsplitに使われているのがわかります。"
567 | ]
568 | },
569 | {
570 | "cell_type": "code",
571 | "execution_count": 0,
572 | "metadata": {
573 | "colab": {},
574 | "colab_type": "code",
575 | "id": "V1VLrOJJCcWo"
576 | },
577 | "outputs": [],
578 | "source": [
579 | "graph = dtn.create_graph(mdl,feature_names=features)\n",
580 | "Image(graph.create_png())"
581 | ]
582 | },
583 | {
584 | "cell_type": "markdown",
585 | "metadata": {
586 | "colab_type": "text",
587 | "id": "Ton_EnvFqHIO"
588 | },
589 | "source": [
590 | "Looking at the tree, we can see that there are some very specific rules. Consider our patient aged 65 years with an acute physiology score of 87. From the top of the tree, we would work our way down:\n",
591 | "\n",
592 | "重症度スコアが87である65歳の患者を例に考えてみましょう。\n",
593 | "\n",
594 | "- acutePhysiologyScore <= 78.5? No.\n",
595 | "- acutePhysiologyScore <= 106.5? Yes.\n",
596 | "- age <= 75.5? Yes\n",
597 | "- age <= 66. Yes.\n",
598 | "- age <= 62.5? No. \n",
599 | "\n",
600 | "This leads us to our single node with a gini impurity of 0. Having an entire rule based upon this one observation seems silly, but it is perfectly logical as at the moment. The only objective the algorithm cares about is minimizing the gini impurity.\n",
601 | "\n",
602 | "最終的にはGini impurityが0(100%生存 or 100%死亡)のnodeに行き着きます。臨床現場を考えるとこのように完璧なクラス分けは不可能に思われますが、このアルゴリズムではGiniのみを根拠にsplitするとこのようになります。\n",
603 | "\n",
604 | "We are at risk of overfitting our data! This is where \"pruning\" comes in.\n",
605 | "\n",
606 | "これがレクチャーで説明した\"overfit\"です。このoverfitを回避するため、\"pruning(枝を切り取る)\"を行います。"
607 | ]
608 | },
609 | {
610 | "cell_type": "code",
611 | "execution_count": 0,
612 | "metadata": {
613 | "colab": {},
614 | "colab_type": "code",
615 | "id": "VvsNIjCDDIo_"
616 | },
617 | "outputs": [],
618 | "source": [
619 | "# let's prune the model and look again\n",
620 | "mdl = dtn.prune(mdl, min_samples_leaf = 10)\n",
621 | "graph = dtn.create_graph(mdl,feature_names=features)\n",
622 | "Image(graph.create_png()) "
623 | ]
624 | },
625 | {
626 | "cell_type": "markdown",
627 | "metadata": {
628 | "colab_type": "text",
629 | "id": "8pRzzV2VvdxP"
630 | },
631 | "source": [
632 | "Above, we can see that our second tree is (1) smaller in depth, and (2) never splits a node with <= 10 samples. We can look at the decision surface for this tree:\n",
633 | "\n",
634 | "Treeのdepthは小さくなり、一つのnodeに10サンプル以下にはならないように設定されています。"
635 | ]
636 | },
637 | {
638 | "cell_type": "code",
639 | "execution_count": 0,
640 | "metadata": {
641 | "colab": {},
642 | "colab_type": "code",
643 | "id": "5LyGDz-Cr-mU"
644 | },
645 | "outputs": [],
646 | "source": [
647 | "plt.figure(figsize=[10,8])\n",
648 | "dtn.plot_model_pred_2d(mdl, X_train, y_train, title=\"Pruned decision tree\")"
649 | ]
650 | },
651 | {
652 | "cell_type": "markdown",
653 | "metadata": {
654 | "colab_type": "text",
655 | "id": "xAnqmD_Dv_dh"
656 | },
657 | "source": [
658 | "Our pruned decision tree has a much more intuitive boundary, but does make some errors. We have reduced our performance in an effort to simplify the tree. This is the classic machine learning problem of trading off complexity with error.\n",
659 | "\n",
660 | "このようにpruningされたtreeは、臨床家としては直感的に理解しやすいモデルのことが多いですが、その代わりに予測のエラーを引き起こしてしまいます。モデルの複雑性とエラーは、トレードオフの関係にあります。\n",
661 | "\n",
662 | "Note that, in order to do this, we \"invented\" the minimum samples per leaf node of 10. Why 10? Why not 5? Why not 20? The answer is: it depends on the dataset. Heuristically choosing these parameters can be time consuming, and we will see later on how gradient boosting elegantly handles this task.\n",
663 | "\n",
664 | "今回はnodeの最小サンプル数を10に設定しましたが、どのように決めればよいのでしょうか。この問題に関しては、\"Boosting\"で説明します。"
665 | ]
666 | },
667 | {
668 | "cell_type": "markdown",
669 | "metadata": {
670 | "colab_type": "text",
671 | "id": "2EFINpj-wD7H"
672 | },
673 | "source": [
674 | "## Decision trees have high \"variance\"\n",
675 | "\n",
676 | "Before we move on to boosting, it will be useful to demonstrate how decision trees have high \"variance\". In this context, variance refers to a property of some models to have a wide range of performance given random samples of data. Let's take a look at randomly slicing the data we have too see what that means.\n",
677 | "\n",
678 | "Desicion treeでは、variance(分散)が大きくなる傾向にあります。すなわち、モデルの元となるトレーニングセットのサンプルによって、毎回そのモデルが大きく変わってきます。以下に、データをランダムに抜き取り、モデルを複数作ってこのvarianceについて説明します。"
679 | ]
680 | },
681 | {
682 | "cell_type": "code",
683 | "execution_count": 0,
684 | "metadata": {
685 | "colab": {},
686 | "colab_type": "code",
687 | "id": "JT7fuuj6vjKB"
688 | },
689 | "outputs": [],
690 | "source": [
691 | "np.random.seed(123)\n",
692 | "\n",
693 | "fig = plt.figure(figsize=[12,3])\n",
694 | "\n",
695 | "for i in range(3):\n",
696 | " ax = fig.add_subplot(1,3,i+1)\n",
697 | "\n",
698 | " # generate indices in a random order\n",
699 | " idx = np.random.permutation(X_train.shape[0])\n",
700 | " \n",
701 | " # only use the first 50\n",
702 | " idx = idx[:50]\n",
703 | " X_temp = X_train.iloc[idx]\n",
704 | " y_temp = y_train.values[idx]\n",
705 | " \n",
706 | " # initialize the model\n",
707 | " mdl = tree.DecisionTreeClassifier(max_depth=5)\n",
708 | " \n",
709 | " # train the model using the dataset\n",
710 | " mdl = mdl.fit(X_temp, y_temp)\n",
711 | " txt = 'Random sample {}'.format(i)\n",
712 | " dtn.plot_model_pred_2d(mdl, X_temp, y_temp, title=txt)"
713 | ]
714 | },
715 | {
716 | "cell_type": "markdown",
717 | "metadata": {
718 | "colab_type": "text",
719 | "id": "j6VTIDr-yRRZ"
720 | },
721 | "source": [
722 | "Above we can see that we are using random subsets of data, and as a result, our decision boundary can change quite a bit. As you could guess, we actually don't want a model that randomly works well and randomly works poorly, so you may wonder why this is useful. \n",
723 | "\n",
724 | "このように、データをランダムに抽出しモデルを作成した際、それぞれのモデルの境界線は毎回大きく異なっています。しかし実際には、時に良く時に悪いモデルは必要ではありません。\n",
725 | "\n",
726 | "The trick is that by combining many of instances of \"high variance\" classifiers (decision trees), we can end up with a single classifier with low variance. There is an old joke: two farmers and a statistician go hunting. They see a deer: the first farmer shoots, and misses to the left. The next farmer shoots, and misses to the right. The statistician yells \"We got it!!\".\n",
727 | "\n",
728 | "しかし、このように大きなvariance(分散)を持つdecision treeを組み合わせることで、小さなvarianceをもつ一つのモデルを作り上げることができます。このようなジョークがあります。二人のハンターが鹿を狙っています。一人が撃つと鹿は左に避けました。もう一人が撃つと今度は右に避けました。それを見た統計家は、鹿の逃げる方向を予測できる!としたのです。\n",
729 | "\n",
730 | "While it doesn't quite hold in real life, it turns out that this principle does hold for decision trees. Combining them in the right way ends up building powerful models.\n",
731 | "\n",
732 | "現実世界では奇妙ですが、decision treeの世界ではこの原則が成り立ちます。すなわち、幾つもモデルを組み合わせることでとても良いモデルができあがります。"
733 | ]
734 | },
735 | {
736 | "cell_type": "markdown",
737 | "metadata": {
738 | "colab_type": "text",
739 | "id": "iWnKvx6myf9Z"
740 | },
741 | "source": [
742 | "## Boosting\n",
743 | "\n",
744 | "The premise of boosting is the combination of many weak learners to form a single \"strong\" learner. In a nutshell, boosting involves building a models iteratively, and at each step we focus on the data we performed poorly on. In our context, we'll use decision trees, so the first step would be to build a tree using the data. Next, we'd look at the data that we misclassified, and re-weight the data so that we really wanted to classify those observations correctly, at a cost of maybe getting some of the other data wrong this time. Let's see how this works in practice.\n",
745 | "\n",
746 | "\"Boosting\"とは、モデルを何回も反復して作成し、毎回そのエラーに注目することです。ここではまず初めに、データを用いて一つのtreeを作成します。次に、間違って分類したデータに注目し、それらのデータにweight(重み)を置くことで、それらのデータの分類を正確にしようと試みます(それによって他のデータのmisclassificationが起こる可能性はあります)。"
747 | ]
748 | },
749 | {
750 | "cell_type": "code",
751 | "execution_count": 0,
752 | "metadata": {
753 | "colab": {},
754 | "colab_type": "code",
755 | "id": "YJWxu0bTwRzD"
756 | },
757 | "outputs": [],
758 | "source": [
759 | "# build the model\n",
760 | "clf = tree.DecisionTreeClassifier(max_depth=1)\n",
761 | "mdl = ensemble.AdaBoostClassifier(base_estimator=clf,n_estimators=6)\n",
762 | "mdl = mdl.fit(X_train,y_train)\n",
763 | "\n",
764 | "# plot each individual decision tree\n",
765 | "fig = plt.figure(figsize=[12,6])\n",
766 | "for i, estimator in enumerate(mdl.estimators_):\n",
767 | " ax = fig.add_subplot(2,3,i+1)\n",
768 | " txt = 'Tree {}'.format(i+1)\n",
769 | " dtn.plot_model_pred_2d(estimator, X_train, y_train, title=txt)"
770 | ]
771 | },
772 | {
773 | "cell_type": "markdown",
774 | "metadata": {
775 | "colab_type": "text",
776 | "id": "5zNfvDjTzh2U"
777 | },
778 | "source": [
779 | "Looking at the above, we can see that the first iteration builds the exact same simple decision tree as we had seen earlier. This makes sense. It's using the entire dataset with no special weighting. \n",
780 | "\n",
781 | "初めのモデルはweightを置いていないため、前述のモデルと同じになっています。\n",
782 | "\n",
783 | "In the next iteration we can see the model shift. It misclassified several observations in class 1, and now these are the most important observations. Consequently, it picks the boundary that, while prioritizing correctly classifies these observations, still tries to best classify the rest of the data too. \n",
784 | "\n",
785 | "次のモデルでは、初めのモデルで間違って分類したサンプルを重要視し、かつ残りのデータもできる限り正確に分類できる境界線を引いています。\n",
786 | "\n",
787 | "The iteration process continues, until the model is apparently creating boundaries to capture just one or two observations (see, for example, Tree 6 on the bottom right). \n",
788 | "\n",
789 | "この操作を繰り返し(iteration)、境界線が分けることのできるサンプルが少なくなるまで行います。\n",
790 | "\n",
791 | "One important point is that each tree is weighted by its global error. So, for example, Tree 6 would carry less weight in the final model. It is clear that we wouldn't want Tree 6 to carry the same importance as Tree 1, when Tree 1 is doing so much better overall. It turns out that weighting each tree by the inverse of its error is a pretty good way to do this.\n",
792 | "\n",
793 | "重要な点は、それぞれのtreeが全体のエラーによって違う重みを置かれていることです。例えば、tree 1はtree 6よりも全体のエラーが少ないので重要視されています。\n",
794 | "\n",
795 | "Let's look at final model's decision surface.\n"
796 | ]
797 | },
798 | {
799 | "cell_type": "code",
800 | "execution_count": 0,
801 | "metadata": {
802 | "colab": {},
803 | "colab_type": "code",
804 | "id": "3pVG5ytfzp_B"
805 | },
806 | "outputs": [],
807 | "source": [
808 | "# plot the final prediction\n",
809 | "plt.figure(figsize=[9,5])\n",
810 | "txt = 'Boosted tree (final decision surface)'\n",
811 | "dtn.plot_model_pred_2d(mdl, X_train, y_train, title=txt)"
812 | ]
813 | },
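814 | {
815 | "cell_type": "markdown",
816 | "metadata": {
817 | "colab_type": "text",
818 | "id": "adaWeightsAside2"
819 | },
820 | "source": [
821 | "We can also peek at the tree weights directly. As a quick sketch (illustrative, using only the fitted `mdl` from above), scikit-learn exposes each tree's training error and voting weight through the `estimator_errors_` and `estimator_weights_` attributes of a fitted `AdaBoostClassifier`."
822 | ]
823 | },
824 | {
825 | "cell_type": "code",
826 | "execution_count": 0,
827 | "metadata": {
828 | "colab": {},
829 | "colab_type": "code",
830 | "id": "adaWeightsAside2Code"
831 | },
832 | "outputs": [],
833 | "source": [
834 | "# quick sketch: print the error and voting weight of each stump;\n",
835 | "# trees with higher error should receive lower weight in the final vote\n",
836 | "for i, (e, w) in enumerate(zip(mdl.estimator_errors_, mdl.estimator_weights_)):\n",
837 | " print('Tree {}: error = {:0.3f}, weight = {:0.3f}'.format(i+1, e, w))"
838 | ]
839 | },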
814 | {
815 | "cell_type": "markdown",
816 | "metadata": {
817 | "colab_type": "text",
818 | "id": "YRGRFjRgz26h"
819 | },
820 | "source": [
821 | "And that's AdaBoost! There are a few tricks we have glossed over here, but you understand the general principle. Now we'll move on to a different approach. With boosting, we iteratively changed the dataset to have new trees focus on the \"difficult\" observations. The next approach we discuss is similar as it also involves using changed versions of our dataset to build new trees.\n",
822 | "\n",
823 | "この一連の流れを\"AdaBoost(Adaptive Boosting)\"と呼びます。"
824 | ]
825 | },
826 | {
827 | "cell_type": "markdown",
828 | "metadata": {
829 | "colab_type": "text",
830 | "id": "EFNDNsIpfP7j"
831 | },
832 | "source": [
833 | "## Bagging\n",
834 | "\n",
835 | "Bootstrap aggregation, or \"Bagging\", is another form of *ensemble learning* where we aim to build a single good model by combining many models together. With AdaBoost, we modified the data to focus on hard to classify observations. We can imagine this as a form of resampling the data for each new tree. For example, say we have three observations: A, B, and C, `[A, B, C]`. If we correctly classify observations `[A, B]`, but incorrectly classify `C`, then AdaBoost involves building a new tree that focuses on `C`. Equivalently, we could say AdaBoost builds a new tree using the dataset `[A, B, C, C, C]`, where we have *intentionally* repeated observation `C` 3 times so that the algorithm thinks it is 3 times as important as the other observations. Makes sense?\n",
836 | "\n",
837 | "\"Boostrap aggegation\"や\"Bagging\"と呼ばれるこの手法も、一つのモデルを作り上げるため多くのモデルを組み合わせる方法です。しかしBoostapでは、トレーニングセットから毎回データを取り直す(resampling)してそれぞれのモデルを作ります。例えば、[A,B,C]という三つのサンプルがあり、Cを間違って分類したとします。AdaBoostの新しいtreeではこのCに重みを与えてモデルを作り直します。これは、Cというサンプルを意図的に何回もデータセットに組み込み(ex. [A,B,C,C,C])、Cを他のサンプルよりも重要視するような新しいモデルを作っていると言うこともできます。\n",
838 | "\n",
839 | "Bagging involves the exact same approach, except we don't selectively choose which observations to focus on, but rather we *randomly select subsets of data each time*. As you can see, while this is a similar process to AdaBoost, the concept is quite different. Whereas before we aimed to iteratively improve our overall model with new trees, we now build trees on what we hope are independent datasets.\n",
840 | "\n",
841 | "Baggingでは意図的なサンプルの抽出(C)は行わず、毎回ランダムにデータセットを作ります。すなわち、それぞれ独立したデータセットを元にtreeを作っています。\n",
842 | "\n",
843 | "Let's take a step back, and think about a practical example. Say we wanted a good model of heart disease. If we saw researchers build a model from a dataset of patients from their hospital, we would be happy. If they then acquired a new dataset from new patients, and built a new model, we'd be inclined to feel that the combination of the two models would be better than any one individually. This exact scenario is what bagging aims to replicate, except instead of actually going out and collecting new datasets, we instead use bootstrapping to create new sets of data from our current dataset. If you are unfamiliar with bootstrapping, you can treat it as \"magic\" for now (and if you are familiar with the bootstrap, you already know that it is magic).\n",
844 | "\n",
845 | "ある病院のデータを元に心疾患を予測するモデルを作ります。そして、その病院の違ったデータを元にもう一つのモデルを作ります。この二つモデルを組み合わせた場合、ここのモデルよりも良いモデルができる気がしませんか?基本的にはこのような考え方ですが、ただこのboostrappingでは、現在のデータセットから新しいデータセットを作り出す点に注意してください。\n",
846 | "\n",
847 | "Let's take a look at a simple bootstrap model."
848 | ]
849 | },
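850 | {
851 | "cell_type": "markdown",
852 | "metadata": {
853 | "colab_type": "text",
854 | "id": "bootstrapSketch2"
855 | },
856 | "source": [
857 | "Before fitting the bagged model, here is a tiny sketch of what bootstrap resampling looks like (an illustration only, on the toy dataset `[A, B, C]` from above): we draw observations *with replacement*, so some observations repeat and others are left out."
858 | ]
859 | },
860 | {
861 | "cell_type": "code",
862 | "execution_count": 0,
863 | "metadata": {
864 | "colab": {},
865 | "colab_type": "code",
866 | "id": "bootstrapSketch2Code"
867 | },
868 | "outputs": [],
869 | "source": [
870 | "# illustration: three bootstrap samples of the toy dataset [A, B, C];\n",
871 | "# sampling is with replacement, so repeats and omissions are expected\n",
872 | "np.random.seed(42)\n",
873 | "for i in range(3):\n",
874 | " sample = np.random.choice(['A', 'B', 'C'], size=3, replace=True)\n",
875 | " print('Bootstrap sample {}: {}'.format(i+1, list(sample)))"
876 | ]
877 | },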
850 | {
851 | "cell_type": "code",
852 | "execution_count": 0,
853 | "metadata": {
854 | "colab": {},
855 | "colab_type": "code",
856 | "id": "JrXAspvrzv8x"
857 | },
858 | "outputs": [],
859 | "source": [
860 | "np.random.seed(321)\n",
861 | "clf = tree.DecisionTreeClassifier(max_depth=5)\n",
862 | "mdl = ensemble.BaggingClassifier(base_estimator=clf, n_estimators=6)\n",
863 | "mdl = mdl.fit(X_train, y_train)\n",
864 | "\n",
865 | "fig = plt.figure(figsize=[12,6])\n",
866 | "for i, estimator in enumerate(mdl.estimators_): \n",
867 | " ax = fig.add_subplot(2,3,i+1)\n",
868 | " txt = 'Tree {}'.format(i+1)\n",
869 | " dtn.plot_model_pred_2d(estimator, X_train, y_train, \n",
870 | " title=txt)"
871 | ]
872 | },
873 | {
874 | "cell_type": "markdown",
875 | "metadata": {
876 | "colab_type": "text",
877 | "id": "s3kKUPORfW9F"
878 | },
879 | "source": [
880 | "We can see that each individual tree is quite variable - this is a result of using a random set of data to train the classifier.\n",
881 | "\n",
882 | "個々のtreeは大きく異なっています。これは、モデルを作りに際し毎回ランダムにデータセットを作っているからです。"
883 | ]
884 | },
885 | {
886 | "cell_type": "code",
887 | "execution_count": 0,
888 | "metadata": {
889 | "colab": {},
890 | "colab_type": "code",
891 | "id": "w_D7_-0HfVMy"
892 | },
893 | "outputs": [],
894 | "source": [
895 | "# plot the final prediction\n",
896 | "plt.figure(figsize=[8,5])\n",
897 | "txt = 'Bagged tree (final decision surface)'\n",
898 | "dtn.plot_model_pred_2d(mdl, X_train, y_train, title=txt)"
899 | ]
900 | },
901 | {
902 | "cell_type": "markdown",
903 | "metadata": {
904 | "colab_type": "text",
905 | "id": "AOFnG0r6faLS"
906 | },
907 | "source": [
908 | "Not bad! Of course, since this is a simple dataset, we are not seeing that many dramatic changes between different models. Don't worry, we'll quantitatively evaluate them later.\n",
909 | "\n",
910 | "Next up, a minor addition creates one of the most popular models in machine learning."
911 | ]
912 | },
913 | {
914 | "cell_type": "markdown",
915 | "metadata": {
916 | "colab_type": "text",
917 | "id": "aiqrVfYtfcYk"
918 | },
919 | "source": [
920 | "## Random Forest\n",
921 | "\n",
922 | "Above, we used bagging to randomly resample our data to generate \"new\" datasets to build trees from. The Random Forest takes this one step further: instead of just resampling our data, we also select only a fraction of the features to include. It turns out that this subselection tends to improve the performance of our models. The odds of an individual being very good or very bad is higher (i.e. the variance of the trees is increased), and this ends up giving us a final model with better overall performance (lower bias).\n",
923 | "\n",
924 | "Random forestでは、baggingを更に進化させたもので、単にデータ全体をresamplingせずに、ランダムに選んだ一部の予測因子のみを用いてtreeを何度も作り、最後にそれらを組み合わせます。\n",
925 | "\n",
926 | "Let's train the model now."
927 | ]
928 | },
929 | {
930 | "cell_type": "code",
931 | "execution_count": 0,
932 | "metadata": {
933 | "colab": {},
934 | "colab_type": "code",
935 | "id": "u27LS36_fglG"
936 | },
937 | "outputs": [],
938 | "source": [
939 | "np.random.seed(321)\n",
940 | "mdl = ensemble.RandomForestClassifier(max_depth=5, n_estimators=6, max_features=1)\n",
941 | "mdl = mdl.fit(X_train,y_train)\n",
942 | "\n",
943 | "fig = plt.figure(figsize=[12,6])\n",
944 | "for i, estimator in enumerate(mdl.estimators_): \n",
945 | " ax = fig.add_subplot(2,3,i+1)\n",
946 | " txt = 'Tree {}'.format(i+1)\n",
947 | " dtn.plot_model_pred_2d(estimator, X_train, y_train, title=txt)"
948 | ]
949 | },
950 | {
951 | "cell_type": "code",
952 | "execution_count": 0,
953 | "metadata": {
954 | "colab": {},
955 | "colab_type": "code",
956 | "id": "5aG0PI8lruGN"
957 | },
958 | "outputs": [],
959 | "source": [
960 | "plt.figure(figsize=[9,5])\n",
961 | "txt = 'Random forest (final decision surface)'\n",
962 | "dtn.plot_model_pred_2d(mdl, X_train, y_train, title=txt)"
963 | ]
964 | },
965 | {
966 | "cell_type": "markdown",
967 | "metadata": {
968 | "colab_type": "text",
969 | "id": "2KmJuztXfjzm"
970 | },
971 | "source": [
972 | "Again, the visualization doesn't *really* show us the power of Random Forests, but we'll quantitatively evaluate them soon enough.\n",
973 | "\n",
974 | "Last, and not least, we move on to gradient boosting."
975 | ]
976 | },
977 | {
978 | "cell_type": "markdown",
979 | "metadata": {
980 | "colab_type": "text",
981 | "id": "LTP8zFIofl2v"
982 | },
983 | "source": [
984 | "## Gradient Boosting\n",
985 | "\n",
986 | "Gradient boosting (GB) is our last topic - and elegantly combines concepts from the previous methods. \n",
987 | "As a \"boosting\" method, GB involves iteratively building trees, aiming to improve upon misclassifications of the previous tree. GB also borrows the concept of sub sampling the number of columns (as was done in Random Forests), which tends to prevent overfitting.\n",
988 | "\n",
989 | "このgradient boosting (GB)は、misclassificationを減らすため前のtreeを考慮するというboostingの概念と、ある予測因子のサンプリングを行うというrandom forestの概念を組み合わせたものです。\n",
990 | "\n",
991 | "While it is hard to express in this non-technical tutorial, the biggest innovation in GB is that it provides a unifying mathematical framework for boosting models.\n",
992 | "GB explicitly casts the problem of building a tree as an optimization problem, defining mathematical functions for how well a tree is performing (which we had before) *and* how complex a tree is. In this light, one can actually treat AdaBoost as a \"special case\" of GB, where the loss function is chosen to be the exponential loss.\n",
993 | "\n",
994 | "詳細は割愛しますが、簡単に表現するならばGBはboosting modelを数学的に捉えることです。Treeがどの程度良いものでどの程度複雑であるかを、数学的functionを置いて考えることができます。その意味では、AdaBoostもloss function(test errorを考えるfunctionの一つ)を用いたGBの一部と考えることができます。\n",
995 | "\n",
996 | "Let's build a GB model."
997 | ]
998 | },
999 | {
1000 | "cell_type": "code",
1001 | "execution_count": 0,
1002 | "metadata": {
1003 | "colab": {},
1004 | "colab_type": "code",
1005 | "id": "L_QVZ9oNfnqk"
1006 | },
1007 | "outputs": [],
1008 | "source": [
1009 | "np.random.seed(321)\n",
1010 | "mdl = ensemble.GradientBoostingClassifier(n_estimators=10)\n",
1011 | "mdl = mdl.fit(X_train, y_train)\n",
1012 | "\n",
1013 | "plt.figure(figsize=[9,5])\n",
1014 | "txt = 'Gradient boosted tree (final decision surface)'\n",
1015 | "dtn.plot_model_pred_2d(mdl, X_train, y_train, title=txt)"
1016 | ]
1017 | },
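1018 | {
1019 | "cell_type": "markdown",
1020 | "metadata": {
1021 | "colab_type": "text",
1022 | "id": "gbExpLossAside2"
1023 | },
1024 | "source": [
1025 | "The AdaBoost connection can be seen in scikit-learn's API: `GradientBoostingClassifier` accepts `loss='exponential'`, under which gradient boosting recovers the AdaBoost algorithm (the default, `'deviance'`, is the logistic loss). A quick sketch, reusing the training data from above:"
1026 | ]
1027 | },
1028 | {
1029 | "cell_type": "code",
1030 | "execution_count": 0,
1031 | "metadata": {
1032 | "colab": {},
1033 | "colab_type": "code",
1034 | "id": "gbExpLossAside2Code"
1035 | },
1036 | "outputs": [],
1037 | "source": [
1038 | "# sketch: gradient boosting with the exponential loss, the special\n",
1039 | "# case under which gradient boosting recovers AdaBoost\n",
1040 | "np.random.seed(321)\n",
1041 | "mdl_exp = ensemble.GradientBoostingClassifier(n_estimators=10, loss='exponential')\n",
1042 | "mdl_exp = mdl_exp.fit(X_train, y_train)\n",
1043 | "\n",
1044 | "plt.figure(figsize=[9,5])\n",
1045 | "txt = 'Gradient boosting, exponential loss (decision surface)'\n",
1046 | "dtn.plot_model_pred_2d(mdl_exp, X_train, y_train, title=txt)"
1047 | ]
1048 | },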
1018 | {
1019 | "cell_type": "markdown",
1020 | "metadata": {
1021 | "colab_type": "text",
1022 | "id": "tcCzP4gAsd7L"
1023 | },
1024 | "source": [
1025 | "## Comparing model performance\n",
1026 | "\n",
1027 | "We've now learned the basics of the various tree methods and have visualized most of them. Let's finish by comparing the performance of our models on our held-out test data. Our goal, remember, is to predict whether or not a patient will survive their hospital stay using the patient's age and acute physiology score computed on the first day of their ICU stay.\n",
1028 | "\n",
1029 | "最後に、これまで説明してきた様々なモデルのパフォーマンスを、テストセットを用いて比べてみましょう。"
1030 | ]
1031 | },
1032 | {
1033 | "cell_type": "code",
1034 | "execution_count": 0,
1035 | "metadata": {
1036 | "colab": {},
1037 | "colab_type": "code",
1038 | "id": "tQST4TQAtHmU"
1039 | },
1040 | "outputs": [],
1041 | "source": [
1042 | "clf = dict()\n",
1043 | "clf['Decision Tree'] = tree.DecisionTreeClassifier(criterion='entropy', splitter='best').fit(X_train,y_train)\n",
1044 | "clf['Gradient Boosting'] = ensemble.GradientBoostingClassifier(n_estimators=10).fit(X_train, y_train)\n",
1045 | "clf['Random Forest'] = ensemble.RandomForestClassifier(n_estimators=10).fit(X_train, y_train)\n",
1046 | "clf['Bagging'] = ensemble.BaggingClassifier(n_estimators=10).fit(X_train, y_train)\n",
1047 | "clf['AdaBoost'] = ensemble.AdaBoostClassifier(n_estimators=10).fit(X_train, y_train)\n",
1048 | "\n",
1049 | "fig = plt.figure(figsize=[10,10])\n",
1050 | "\n",
1051 | "print('AUROC\\tModel')\n",
1052 | "for i, curr_mdl in enumerate(clf): \n",
1053 | " yhat = clf[curr_mdl].predict_proba(X_test)[:,1]\n",
1054 | " score = metrics.roc_auc_score(y_test, yhat)\n",
1055 | " print('{:0.3f}\\t{}'.format(score, curr_mdl))\n",
1056 | " ax = fig.add_subplot(3,2,i+1)\n",
1057 | " dtn. plot_model_pred_2d(clf[curr_mdl], X_test, y_test, title=curr_mdl)\n",
1058 | " "
1059 | ]
1060 | },
1061 | {
1062 | "cell_type": "markdown",
1063 | "metadata": {
1064 | "colab_type": "text",
1065 | "id": "osr6iM6ltLAP"
1066 | },
1067 | "source": [
1068 | "Here we can see that quantitatively, Gradient Boosting has produced the highest discrimination among all the models (~0.91). You'll see that some of the models appear to have simpler decision surfaces, which tends to result in improved generalization on a held-out test set (though not always!).\n",
1069 | "\n",
1070 | "定量的には、Gradient Boostingが最も高いdiscrimination(アウトカムカテゴリーを正確に区別できるかの指標で、ROC曲線下面積などが用いられる)を示しています。常にではありませんが、単純なモデルはテストセットへの一般化が有利にある傾向があります。\n",
1071 | "\n",
1072 | "To make appropriate comparisons, we should calculate 95% confidence intervals on these performance estimates. This can be done a number of ways; the easiest is to bootstrap the calculation.\n",
1073 | "\n",
1074 | "適切に比較するためには、それぞれの95%信頼区間を計算する必要があります。95%信頼区間を求める最も簡単な方法はbootstrapです。"
1075 | ]
1076 | },
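1077 | {
1078 | "cell_type": "markdown",
1079 | "metadata": {
1080 | "colab_type": "text",
1081 | "id": "bootstrapCISketch2"
1082 | },
1083 | "source": [
1084 | "As a rough sketch of the idea (illustrative only; the number of resamples, 200, is an arbitrary choice), we can bootstrap a 95% confidence interval for the AUROC of the gradient boosting model:"
1085 | ]
1086 | },
1087 | {
1088 | "cell_type": "code",
1089 | "execution_count": 0,
1090 | "metadata": {
1091 | "colab": {},
1092 | "colab_type": "code",
1093 | "id": "bootstrapCISketch2Code"
1094 | },
1095 | "outputs": [],
1096 | "source": [
1097 | "# rough sketch: bootstrap 95% confidence interval for the AUROC of\n",
1098 | "# the gradient boosting model, resampling the test set with replacement\n",
1099 | "np.random.seed(321)\n",
1100 | "yhat = clf['Gradient Boosting'].predict_proba(X_test)[:,1]\n",
1101 | "y_arr = y_test.values\n",
1102 | "aucs = []\n",
1103 | "for _ in range(200):\n",
1104 | " idx = np.random.randint(0, len(y_arr), len(y_arr))\n",
1105 | " if len(np.unique(y_arr[idx])) < 2:\n",
1106 | "  continue # AUROC is undefined if a resample has only one class\n",
1107 | " aucs.append(metrics.roc_auc_score(y_arr[idx], yhat[idx]))\n",
1108 | "ci = np.percentile(aucs, [2.5, 97.5])\n",
1109 | "print('AUROC 95% CI: {:0.3f} - {:0.3f}'.format(ci[0], ci[1]))"
1110 | ]
1111 | },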
1077 | {
1078 | "cell_type": "markdown",
1079 | "metadata": {
1080 | "colab_type": "text",
1081 | "id": "kABSe8ZmudSH"
1082 | },
1083 | "source": [
1084 | "## Further reading"
1085 | ]
1086 | },
1087 | {
1088 | "cell_type": "code",
1089 | "execution_count": 0,
1090 | "metadata": {
1091 | "colab": {},
1092 | "colab_type": "code",
1093 | "id": "bfFZSe0vue86"
1094 | },
1095 | "outputs": [],
1096 | "source": []
1097 | }
1098 | ],
1099 | "metadata": {
1100 | "colab": {
1101 | "collapsed_sections": [],
1102 | "name": "04-prediction",
1103 | "provenance": [],
1104 | "version": "0.3.2"
1105 | },
1106 | "kernelspec": {
1107 | "display_name": "Python 3",
1108 | "language": "python",
1109 | "name": "python3"
1110 | },
1111 | "language_info": {
1112 | "codemirror_mode": {
1113 | "name": "ipython",
1114 | "version": 3
1115 | },
1116 | "file_extension": ".py",
1117 | "mimetype": "text/x-python",
1118 | "name": "python",
1119 | "nbconvert_exporter": "python",
1120 | "pygments_lexer": "ipython3",
1121 | "version": "3.6.5"
1122 | }
1123 | },
1124 | "nbformat": 4,
1125 | "nbformat_minor": 1
1126 | }
1127 |
--------------------------------------------------------------------------------