├── data ├── README.md ├── PhysionetChallenge2012-set-a.csv.gz └── PhysionetChallenge2012-set-b.csv.gz ├── .gitignore ├── README.md ├── LICENSE └── prepare-data.ipynb /data/README.md: -------------------------------------------------------------------------------- 1 | For convenience, the already preprocessed data files are made available here. 2 | -------------------------------------------------------------------------------- /data/PhysionetChallenge2012-set-a.csv.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alistairewj/challenge2012/HEAD/data/PhysionetChallenge2012-set-a.csv.gz -------------------------------------------------------------------------------- /data/PhysionetChallenge2012-set-b.csv.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alistairewj/challenge2012/HEAD/data/PhysionetChallenge2012-set-b.csv.gz -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | *.egg-info/ 24 | .installed.cfg 25 | *.egg 26 | MANIFEST 27 | 28 | # PyInstaller 29 | # Usually these files are written by a python script from a template 30 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 31 | *.manifest 32 | *.spec 33 | 34 | # Installer logs 35 | pip-log.txt 36 | pip-delete-this-directory.txt 37 | 38 | # Unit test / coverage reports 39 | htmlcov/ 40 | .tox/ 41 | .coverage 42 | .coverage.* 43 | .cache 44 | nosetests.xml 45 | coverage.xml 46 | *.cover 47 | .hypothesis/ 48 | .pytest_cache/ 49 | 50 | # Translations 51 | *.mo 52 | *.pot 53 | 54 | # Django stuff: 55 | *.log 56 | local_settings.py 57 | db.sqlite3 58 | 59 | # Flask stuff: 60 | instance/ 61 | .webassets-cache 62 | 63 | # Scrapy stuff: 64 | .scrapy 65 | 66 | # Sphinx documentation 67 | docs/_build/ 68 | 69 | # PyBuilder 70 | target/ 71 | 72 | # Jupyter Notebook 73 | .ipynb_checkpoints 74 | 75 | # pyenv 76 | .python-version 77 | 78 | # celery beat schedule file 79 | celerybeat-schedule 80 | 81 | # SageMath parsed files 82 | *.sage.py 83 | 84 | # Environments 85 | .env 86 | .venv 87 | env/ 88 | venv/ 89 | ENV/ 90 | env.bak/ 91 | venv.bak/ 92 | 93 | # Spyder project settings 94 | .spyderproject 95 | .spyproject 96 | 97 | # Rope project settings 98 | .ropeproject 99 | 100 | # mkdocs documentation 101 | /site 102 | 103 | # mypy 104 | .mypy_cache/ 105 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # challenge2012 2 | Python code parsing data from PhysioNet Challenge 2012. 3 | 4 | The PhysioNet 2012 challenge, [described here](https://physionet.org/challenge/2012/), focused on encouraging development of mortality prediction models for intensive care unit patients. Time-stamped observations for a number of measurements were made available, with the goal to predict in-hospital mortality. 5 | 6 | A big part of any machine learning project is data preprocessing. 
The notebook prepare-data.ipynb transforms the original PhysioNet 2012 data to have one row per patient. This is done by applying a number of simple feature extraction steps to each measurement time-series independently, e.g. extracting the first heart rate, first blood pressure, and so on. The code also pre-processes the data to remove outliers based on physiologic limits (you cannot have a negative blood pressure, you cannot have an oxygen saturation above 100%, and so on).
7 | 
8 | The logic used here was applied in our conference paper: [Patient specific predictions in the intensive care unit using a Bayesian ensemble](https://ieeexplore.ieee.org/abstract/document/6420377). Our entry had the highest Score 1 in the competition, where Score 1 was defined as the minimum of sensitivity and positive predictive value.
9 | 
10 | The code was originally written in MATLAB, but with any luck this Python code is a faithful reproduction.
11 | 
12 | ## Downloading the data
13 | 
14 | For convenience, the already preprocessed datasets are made available in the [data subfolder](/data).
15 | 
16 | If you use this data in your research or otherwise, please do acknowledge the organizers of the PhysioNet 2012 Challenge:
17 | 
18 | ```
19 | @article{silva2012predicting,
20 |   title={Predicting in-hospital mortality of icu patients: The physionet/computing in cardiology challenge 2012},
21 |   author={Silva, Ikaro and Moody, George and Scott, Daniel J and Celi, Leo A and Mark, Roger G},
22 |   journal={Computing in cardiology},
23 |   volume={39},
24 |   pages={245},
25 |   year={2012},
26 |   publisher={NIH Public Access}
27 | }
28 | ```
29 | 
30 | We would also appreciate acknowledgement:
31 | 
32 | ```
33 | @inproceedings{johnson2012patient,
34 |   title={Patient specific predictions in the intensive care unit using a Bayesian ensemble},
35 |   author={Johnson, Alistair EW and Dunkley, Nic and Mayaud, Louis and Tsanas, Athanasios and Kramer, Andrew A and Clifford, Gari D},
36 |   booktitle={Computing in Cardiology (CinC), 2012},
37 |   pages={249--252},
38 |   year={2012},
39 |   organization={IEEE}
40 | }
41 | ```
42 | 
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | Apache License
2 | Version 2.0, January 2004
3 | http://www.apache.org/licenses/
4 | 
5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
6 | 
7 | 1. Definitions.
8 | 
9 | "License" shall mean the terms and conditions for use, reproduction,
10 | and distribution as defined by Sections 1 through 9 of this document.
11 | 
12 | "Licensor" shall mean the copyright owner or entity authorized by
13 | the copyright owner that is granting the License.
14 | 
15 | "Legal Entity" shall mean the union of the acting entity and all
16 | other entities that control, are controlled by, or are under common
17 | control with that entity. For the purposes of this definition,
18 | "control" means (i) the power, direct or indirect, to cause the
19 | direction or management of such entity, whether by contract or
20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the
21 | outstanding shares, or (iii) beneficial ownership of such entity.
22 | 
23 | "You" (or "Your") shall mean an individual or Legal Entity
24 | exercising permissions granted by this License.
25 | 
26 | "Source" form shall mean the preferred form for making modifications,
27 | including but not limited to software source code, documentation
28 | source, and configuration files. 
29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. 
If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. 
Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright [yyyy] [name of copyright owner] 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 
202 | -------------------------------------------------------------------------------- /prepare-data.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "1. Download data from URL to disk\n", 8 | "2. Parse data using pre-defined rules (mostly removing outliers)\n", 9 | "3. Output data as CSV." 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": null, 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "!wget https://physionet.org/challenge/2012/set-a.zip\n", 19 | "!wget https://physionet.org/challenge/2012/set-b.zip\n", 20 | "\n", 21 | "!wget https://physionet.org/challenge/2012/Outcomes-a.txt\n", 22 | "!wget https://physionet.org/challenge/2012/Outcomes-b.txt" 23 | ] 24 | }, 25 | { 26 | "cell_type": "code", 27 | "execution_count": null, 28 | "metadata": {}, 29 | "outputs": [], 30 | "source": [ 31 | "!unzip -u set-a.zip\n", 32 | "!unzip -u set-b.zip" 33 | ] 34 | }, 35 | { 36 | "cell_type": "code", 37 | "execution_count": null, 38 | "metadata": {}, 39 | "outputs": [], 40 | "source": [ 41 | "import pandas as pd\n", 42 | "import numpy as np\n", 43 | "import os\n", 44 | "\n", 45 | "# pick a set\n", 46 | "dataset = 'set-a'" 47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": null, 52 | "metadata": { 53 | "scrolled": true 54 | }, 55 | "outputs": [], 56 | "source": [ 57 | "# load all files into list of lists\n", 58 | "txt_all = list()\n", 59 | "for f in os.listdir(dataset):\n", 60 | " with open(os.path.join(dataset, f), 'r') as fp:\n", 61 | " txt = fp.readlines()\n", 62 | " \n", 63 | " # get recordid to add as a column\n", 64 | " recordid = txt[1].rstrip('\\n').split(',')[-1]\n", 65 | " txt = [t.rstrip('\\n').split(',') + [int(recordid)] for t in txt]\n", 66 | " txt_all.extend(txt[1:])\n", 67 | " \n", 68 | " \n", 69 | "# convert to pandas dataframe\n", 70 | "df = pd.DataFrame(txt_all, columns=['time', 'parameter', 'value', 'recordid'])\n", 71 | "\n", 72 | "# extract static variables into a separate dataframe\n", 73 | "df_static = df.loc[df['time'] == '00:00', :].copy()\n", 74 | "\n", 75 | "# retain only one of the 6 static vars:\n", 76 | "static_vars = ['RecordID', 'Age', 'Gender', 'Height', 'ICUType', 'Weight']\n", 77 | "df_static = df_static.loc[df['parameter'].isin(static_vars)]\n", 78 | "\n", 79 | "# remove these from original df\n", 80 | "idxDrop = df_static.index\n", 81 | "df = df.loc[~df.index.isin(idxDrop), :]\n", 82 | "\n", 83 | "# to ensure there are no duplicates, group by recordid/parameter and take the last value\n", 84 | "# last will be chosen as last row in the loaded file\n", 85 | "# there was 1 row in set-b which had 2 weights (70.4, 70.8) and thus required this step\n", 86 | "df_static = df_static.groupby(['recordid', 'parameter'])[['value']].last()\n", 87 | "df_static.reset_index(inplace=True)\n", 88 | "\n", 89 | "# pivot on parameter so there is one column per parameter\n", 90 | "df_static = df_static.pivot(index='recordid', columns='parameter', values='value')\n", 91 | "\n", 92 | "# some conversions on columns for convenience\n", 93 | "df['value'] = pd.to_numeric(df['value'], errors='raise')\n", 94 | "df['time'] = df['time'].map(lambda x: int(x.split(':')[0])*60 + int(x.split(':')[1]))\n", 95 | "\n", 96 | "df.head()" 97 | ] 98 | }, 99 | { 100 | "cell_type": "code", 101 | "execution_count": null, 102 | "metadata": {}, 103 | "outputs": [], 104 | "source": [ 105 | "features = {'Albumin': 'Serum 
Albumin (g/dL)',\n", 106 | " 'ALP': 'Alkaline phosphatase (IU/L)',\n", 107 | " 'ALT': 'Alanine transaminase (IU/L)',\n", 108 | " 'AST': 'Aspartate transaminase (IU/L)',\n", 109 | " 'Bilirubin': 'Bilirubin (mg/dL)',\n", 110 | " 'BUN': 'Blood urea nitrogen (mg/dL)',\n", 111 | " 'Cholesterol': 'Cholesterol (mg/dL)',\n", 112 | " 'Creatinine': 'Serum creatinine (mg/dL)',\n", 113 | " 'DiasABP': 'Invasive diastolic arterial blood pressure (mmHg)',\n", 114 | " 'FiO2': 'Fractional inspired O2 (0-1)',\n", 115 | " 'GCS': 'Glasgow Coma Score (3-15)',\n", 116 | " 'Glucose': 'Serum glucose (mg/dL)',\n", 117 | " 'HCO3': 'Serum bicarbonate (mmol/L)',\n", 118 | " 'HCT': 'Hematocrit (%)',\n", 119 | " 'HR': 'Heart rate (bpm)',\n", 120 | " 'K': 'Serum potassium (mEq/L)',\n", 121 | " 'Lactate': 'Lactate (mmol/L)',\n", 122 | " 'Mg': 'Serum magnesium (mmol/L)',\n", 123 | " 'MAP': 'Invasive mean arterial blood pressure (mmHg)',\n", 124 | " 'MechVent': 'Mechanical ventilation respiration (0:false or 1:true)',\n", 125 | " 'Na': 'Serum sodium (mEq/L)',\n", 126 | " 'NIDiasABP': 'Non-invasive diastolic arterial blood pressure (mmHg)',\n", 127 | " 'NIMAP': 'Non-invasive mean arterial blood pressure (mmHg)',\n", 128 | " 'NISysABP': 'Non-invasive systolic arterial blood pressure (mmHg)',\n", 129 | " 'PaCO2': 'partial pressure of arterial CO2 (mmHg)',\n", 130 | " 'PaO2': 'Partial pressure of arterial O2 (mmHg)',\n", 131 | " 'pH': 'Arterial pH (0-14)',\n", 132 | " 'Platelets': 'Platelets (cells/nL)',\n", 133 | " 'RespRate': 'Respiration rate (bpm)',\n", 134 | " 'SaO2': 'O2 saturation in hemoglobin (%)',\n", 135 | " 'SysABP': 'Invasive systolic arterial blood pressure (mmHg)',\n", 136 | " 'Temp': 'Temperature (°C)',\n", 137 | " 'TroponinI': 'Troponin-I (μg/L)',\n", 138 | " 'TroponinT': 'Troponin-T (μg/L)',\n", 139 | " 'Urine': 'Urine output (mL)',\n", 140 | " 'WBC': 'White blood cell count (cells/nL)',\n", 141 | " 'Weight': 'Weight (kg)'}" 142 | ] 143 | }, 144 | { 145 | "cell_type": "code", 146 | "execution_count": null, 147 | "metadata": {}, 148 | "outputs": [], 149 | "source": [ 150 | "# convert static into numeric\n", 151 | "for c in df_static.columns:\n", 152 | " df_static[c] = pd.to_numeric(df_static[c])\n", 153 | " \n", 154 | "# preprocess\n", 155 | "for c in df_static.columns:\n", 156 | " x = df_static[c]\n", 157 | " if c == 'Age':\n", 158 | " # replace anon ages with 91.4\n", 159 | " idx = x > 130\n", 160 | " df_static.loc[idx, c] = 91.4\n", 161 | " elif c == 'Gender':\n", 162 | " idx = x < 0\n", 163 | " df_static.loc[idx, c] = np.nan\n", 164 | " elif c == 'Height':\n", 165 | " idx = x < 0\n", 166 | " df_static.loc[idx, c] = np.nan\n", 167 | " \n", 168 | " # fix incorrectly recorded heights\n", 169 | " \n", 170 | " # 1.8 -> 180\n", 171 | " idx = x < 10\n", 172 | " df_static.loc[idx, c] = df_static.loc[idx, c] * 100\n", 173 | " \n", 174 | " # 18 -> 180\n", 175 | " idx = x < 25\n", 176 | " df_static.loc[idx, c] = df_static.loc[idx, c] * 10\n", 177 | " \n", 178 | " # 81.8 -> 180 (inch -> cm)\n", 179 | " idx = x < 100\n", 180 | " df_static.loc[idx, c] = df_static.loc[idx, c] * 2.2\n", 181 | " \n", 182 | " # 1800 -> 180\n", 183 | " idx = x > 1000\n", 184 | " df_static.loc[idx, c] = df_static.loc[idx, c] * 0.1\n", 185 | " \n", 186 | " # 400 -> 157\n", 187 | " idx = x > 250\n", 188 | " df_static.loc[idx, c] = df_static.loc[idx, c] * 0.3937\n", 189 | " \n", 190 | " elif c == 'Weight':\n", 191 | " idx = x < 35\n", 192 | " df_static.loc[idx, c] = np.nan\n", 193 | " \n", 194 | " idx = x > 299\n", 195 | " df_static.loc[idx, c] 
= np.nan"
196 |    ]
197 |   },
198 |   {
199 |    "cell_type": "code",
200 |    "execution_count": null,
201 |    "metadata": {},
202 |    "outputs": [],
203 |    "source": [
204 |     "def delete_value(df, c, value=0):\n",
205 |     "    idx = df['parameter'] == c\n",
206 |     "    idx = idx & (df['value'] == value)\n",
207 |     "    \n",
208 |     "    df.loc[idx, 'value'] = np.nan\n",
209 |     "    return df\n",
210 |     "\n",
211 |     "def replace_value(df, c, value=np.nan, below=None, above=None):\n",
212 |     "    idx = df['parameter'] == c\n",
213 |     "    \n",
214 |     "    if below is not None:\n",
215 |     "        idx = idx & (df['value'] < below)\n",
216 |     "    \n",
217 |     "    if above is not None:\n",
218 |     "        idx = idx & (df['value'] > above)\n",
219 |     "    \n",
220 |     "    \n",
221 |     "    if callable(value):\n",
222 |     "        # value replacement is a function of the input\n",
223 |     "        df.loc[idx, 'value'] = df.loc[idx, 'value'].apply(value)\n",
224 |     "    else:\n",
225 |     "        df.loc[idx, 'value'] = value\n",
226 |     "    \n",
227 |     "    return df"
228 |    ]
229 |   },
230 |   {
231 |    "cell_type": "markdown",
232 |    "metadata": {},
233 |    "source": [
234 |     "Apply dynamic data rules."
235 |    ]
236 |   },
237 |   {
238 |    "cell_type": "code",
239 |    "execution_count": null,
240 |    "metadata": {},
241 |    "outputs": [],
242 |    "source": [
243 |     "df = delete_value(df, 'DiasABP', -1)\n",
244 |     "df = replace_value(df, 'DiasABP', value=np.nan, below=1)\n",
245 |     "df = replace_value(df, 'DiasABP', value=np.nan, above=200)\n",
246 |     "df = replace_value(df, 'SysABP', value=np.nan, below=1)\n",
247 |     "df = replace_value(df, 'MAP', value=np.nan, below=1)\n",
248 |     "\n",
249 |     "df = replace_value(df, 'NIDiasABP', value=np.nan, below=1)\n",
250 |     "df = replace_value(df, 'NISysABP', value=np.nan, below=1)\n",
251 |     "df = replace_value(df, 'NIMAP', value=np.nan, below=1)\n",
252 |     "\n",
253 |     "df = replace_value(df, 'HR', value=np.nan, below=1)\n",
254 |     "df = replace_value(df, 'HR', value=np.nan, above=299)\n",
255 |     "\n",
256 |     "df = replace_value(df, 'PaCO2', value=np.nan, below=1)\n",
257 |     "df = replace_value(df, 'PaCO2', value=lambda x: x*10, below=10)\n",
258 |     "\n",
259 |     "df = replace_value(df, 'PaO2', value=np.nan, below=1)\n",
260 |     "df = replace_value(df, 'PaO2', value=lambda x: x*10, below=20)\n",
261 |     "\n",
262 |     "# the order of these steps matters\n",
263 |     "df = replace_value(df, 'pH', value=lambda x: x*10, below=0.8, above=0.65)\n",
264 |     "df = replace_value(df, 'pH', value=lambda x: x*0.1, below=80, above=65)\n",
265 |     "df = replace_value(df, 'pH', value=lambda x: x*0.01, below=800, above=650)\n",
266 |     "df = replace_value(df, 'pH', value=np.nan, below=6.5)\n",
267 |     "df = replace_value(df, 'pH', value=np.nan, above=8.0)\n",
268 |     "\n",
269 |     "# fix temperature units: undo what looks like an accidental F->C conversion on implausibly low values, then convert Fahrenheit readings to Celsius\n",
270 |     "df = replace_value(df, 'Temp', value=lambda x: x*9/5+32, below=10, above=1)\n",
271 |     "df = replace_value(df, 'Temp', value=lambda x: (x-32)*5/9, below=113, above=95)\n",
272 |     "\n",
273 |     "df = replace_value(df, 'Temp', value=np.nan, below=25)\n",
274 |     "df = replace_value(df, 'Temp', value=np.nan, above=45)\n",
275 |     "\n",
276 |     "df = replace_value(df, 'RespRate', value=np.nan, below=1)\n",
277 |     "df = replace_value(df, 'WBC', value=np.nan, below=1)\n",
278 |     "\n",
279 |     "df = replace_value(df, 'Weight', value=np.nan, below=35)\n",
280 |     "df = replace_value(df, 'Weight', value=np.nan, above=299)"
281 |    ]
282 |   },
283 |   {
284 |    "cell_type": "markdown",
285 |    "metadata": {},
286 |    "source": [
287 |     "Create a design matrix X."
288 | ] 289 | }, 290 | { 291 | "cell_type": "code", 292 | "execution_count": null, 293 | "metadata": { 294 | "scrolled": true 295 | }, 296 | "outputs": [], 297 | "source": [ 298 | "# Initialize a dataframe with df_static\n", 299 | "X = df_static.copy()\n", 300 | "\n", 301 | "X.drop('RecordID', axis=1, inplace=True)\n", 302 | "\n", 303 | "# MICU is ICUType==3, and is used as the reference category\n", 304 | "X['CCU'] = (X['ICUType'] == 1).astype(int)\n", 305 | "X['CSRU'] = (X['ICUType'] == 2).astype(int)\n", 306 | "X['SICU'] = (X['ICUType'] == 4).astype(int)\n", 307 | "X.drop('ICUType', axis=1, inplace=True)\n", 308 | "\n", 309 | "# For the following features we extract: first, last, lowest, highest, median\n", 310 | "feats = ['DiasABP', 'GCS', 'Glucose', 'HR', 'MAP',\n", 311 | "'NIDiasABP', 'NIMAP', 'NISysABP', \n", 312 | "'RespRate', 'SaO2', 'Temp', ]\n", 313 | "\n", 314 | "idx = df['parameter'].isin(feats)\n", 315 | "df_tmp = df.loc[idx, :].copy()\n", 316 | "df_tmp = df_tmp.groupby(['recordid', 'parameter'])['value']\n", 317 | "\n", 318 | "for agg in ['first', 'last', 'lowest', 'highest', 'median']:\n", 319 | " if agg == 'first':\n", 320 | " X_add = df_tmp.first()\n", 321 | " elif agg == 'last':\n", 322 | " X_add = df_tmp.last()\n", 323 | " elif agg == 'lowest':\n", 324 | " X_add = df_tmp.min()\n", 325 | " elif agg == 'highest':\n", 326 | " X_add = df_tmp.max()\n", 327 | " elif agg == 'median':\n", 328 | " X_add = df_tmp.median()\n", 329 | " else:\n", 330 | " print('Unrecognized aggregation {}. Skipping.'.format(agg))\n", 331 | " \n", 332 | " X_add = X_add.reset_index()\n", 333 | " X_add = X_add.pivot(index='recordid', columns='parameter', values='value')\n", 334 | " X_add.columns = [x + '_' + agg for x in X_add.columns]\n", 335 | "\n", 336 | " X = X.merge(X_add, how='left', left_index=True, right_index=True)\n", 337 | "\n", 338 | "\n", 339 | "# For the following features we extract: first, last\n", 340 | "feats = ['Albumin', 'ALP', 'ALT', 'AST', 'Bilirubin', 'BUN', 'Cholesterol',\n", 341 | "'Creatinine', 'FiO2', 'HCO3', 'HCT', 'K', 'Lactate', 'Mg', 'Na',\n", 342 | "'PaCO2', 'PaO2', 'pH', 'Platelets', 'SysABP', 'TroponinI', 'TroponinT',\n", 343 | "'WBC', 'Weight']\n", 344 | "\n", 345 | "\n", 346 | "idx = df['parameter'].isin(feats)\n", 347 | "df_tmp = df.loc[idx, :].copy()\n", 348 | "df_tmp = df_tmp.groupby(['recordid', 'parameter'])['value']\n", 349 | "\n", 350 | "for agg in ['first', 'last']:\n", 351 | " if agg == 'first':\n", 352 | " X_add = df_tmp.first()\n", 353 | " elif agg == 'last':\n", 354 | " X_add = df_tmp.last()\n", 355 | " elif agg == 'lowest':\n", 356 | " X_add = df_tmp.min()\n", 357 | " elif agg == 'highest':\n", 358 | " X_add = df_tmp.max()\n", 359 | " elif agg == 'median':\n", 360 | " X_add = df_tmp.median()\n", 361 | " else:\n", 362 | " print('Unrecognized aggregation {}. 
Skipping.'.format(agg))\n",
363 |     "    \n",
364 |     "    X_add = X_add.reset_index()\n",
365 |     "    X_add = X_add.pivot(index='recordid', columns='parameter', values='value')\n",
366 |     "    X_add.columns = [x + '_' + agg for x in X_add.columns]\n",
367 |     "\n",
368 |     "    X = X.merge(X_add, how='left', left_index=True, right_index=True)\n",
369 |     "\n",
370 |     "# For the following features we extract custom data\n",
371 |     "idx = df['parameter'] == 'MechVent'\n",
372 |     "df_tmp = df.loc[idx, :].copy().groupby('recordid')\n",
373 |     "\n",
374 |     "X0 = df_tmp[['time']].min()\n",
375 |     "X0.columns = ['MechVentStartTime']\n",
376 |     "\n",
377 |     "X1 = df_tmp[['time']].max()\n",
378 |     "X1.columns = ['MechVentEndTime']\n",
379 |     "\n",
380 |     "X_add = X0.merge(X1, how='inner',\n",
381 |     "                 left_index=True, right_index=True)\n",
382 |     "X_add['MechVentDuration'] = X_add['MechVentEndTime'] - X_add['MechVentStartTime']\n",
383 |     "\n",
384 |     "X_add['MechVentLast8Hour'] = (X_add['MechVentEndTime'] >= 2400).astype(int)  # last charted ventilation falls in the final 8 hours of the 48-hour (2880-minute) window\n",
385 |     "X_add.drop('MechVentEndTime', axis=1, inplace=True)\n",
386 |     "\n",
387 |     "X = X.merge(X_add, how='left', left_index=True, right_index=True)\n",
388 |     "\n",
389 |     "# Urine output\n",
390 |     "idx = df['parameter'] == 'Urine'\n",
391 |     "df_tmp = df.loc[idx, :].copy().groupby('recordid')\n",
392 |     "\n",
393 |     "X_add = df_tmp[['value']].sum()\n",
394 |     "X_add.columns = ['UrineOutputSum']\n",
395 |     "\n",
396 |     "X = X.merge(X_add, how='left', left_index=True, right_index=True)\n",
397 |     "\n",
398 |     "print(X.shape)\n",
399 |     "X.head()"
400 |    ]
401 |   },
402 |   {
403 |    "cell_type": "code",
404 |    "execution_count": null,
405 |    "metadata": {},
406 |    "outputs": [],
407 |    "source": [
408 |     "# load in outcomes\n",
409 |     "if dataset == 'set-a':\n",
410 |     "    y = pd.read_csv('Outcomes-a.txt')\n",
411 |     "elif dataset == 'set-b':\n",
412 |     "    y = pd.read_csv('Outcomes-b.txt')\n",
413 |     "    \n",
414 |     "y.set_index('RecordID', inplace=True)\n",
415 |     "y.index.name = 'recordid'\n",
416 |     "X = y.merge(X, how='inner', left_index=True, right_index=True)\n",
417 |     "X.head()"
418 |    ]
419 |   },
420 |   {
421 |    "cell_type": "code",
422 |    "execution_count": null,
423 |    "metadata": {},
424 |    "outputs": [],
425 |    "source": [
426 |     "# output to file\n",
427 |     "X.to_csv('PhysionetChallenge2012-{}.csv.gz'.format(dataset),\n",
428 |     "         sep=',', index=True, compression='gzip')"
429 |    ]
430 |   }
431 |  ],
432 |  "metadata": {
433 |   "kernelspec": {
434 |    "display_name": "tree-tutorial",
435 |    "language": "python",
436 |    "name": "tree-tutorial"
437 |   },
438 |   "language_info": {
439 |    "codemirror_mode": {
440 |     "name": "ipython",
441 |     "version": 3
442 |    },
443 |    "file_extension": ".py",
444 |    "mimetype": "text/x-python",
445 |    "name": "python",
446 |    "nbconvert_exporter": "python",
447 |    "pygments_lexer": "ipython3",
448 |    "version": "3.7.0"
449 |   }
450 |  },
451 |  "nbformat": 4,
452 |  "nbformat_minor": 2
453 | }
--------------------------------------------------------------------------------
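For downstream use, the gzipped CSVs written by prepare-data.ipynb (and mirrored in the data/ folder) can be read straight back into pandas. A minimal sketch, assuming the file names produced above and the In-hospital_death outcome column carried over from Outcomes-a.txt / Outcomes-b.txt:

```python
import pandas as pd

# one row per patient; the outcome columns from Outcomes-*.txt were merged in by the notebook
df_a = pd.read_csv('data/PhysionetChallenge2012-set-a.csv.gz', index_col='recordid')
df_b = pd.read_csv('data/PhysionetChallenge2012-set-b.csv.gz', index_col='recordid')

print(df_a.shape, df_b.shape)

# crude in-hospital mortality rate in set-a (column name assumed from the challenge outcome files)
print(df_a['In-hospital_death'].mean())
```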