├── images
    ├── r2.png
    ├── RMSE.png
    ├── RMSE_compare.png
    ├── compare_models.png
    ├── rms_comparison.png
    └── feature_importance.png
├── LICENSE.txt
├── requirements.txt
├── environment.yml
└── README.md


/images/r2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/daniel-codes/hospital-los-predictor/HEAD/images/r2.png

--------------------------------------------------------------------------------
/images/RMSE.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/daniel-codes/hospital-los-predictor/HEAD/images/RMSE.png

--------------------------------------------------------------------------------
/images/RMSE_compare.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/daniel-codes/hospital-los-predictor/HEAD/images/RMSE_compare.png

--------------------------------------------------------------------------------
/images/compare_models.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/daniel-codes/hospital-los-predictor/HEAD/images/compare_models.png

--------------------------------------------------------------------------------
/images/rms_comparison.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/daniel-codes/hospital-los-predictor/HEAD/images/rms_comparison.png

--------------------------------------------------------------------------------
/images/feature_importance.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/daniel-codes/hospital-los-predictor/HEAD/images/feature_importance.png

--------------------------------------------------------------------------------
/LICENSE.txt:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2018 Daniel Cummings

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
# This file may be used to create an environment using:
# $ conda create --name <env> --file <this file>
# platform: win-64
appdirs=1.4.3=py36_0
asn1crypto=0.24.0=py36_0
attrs=18.1.0=py36_0
automat=0.7.0=py36_0
backcall=0.1.0=py36_0
blas=1.0=mkl
bleach=2.1.3=py36_0
boost=1.66.0=py36_vc14_1
boost-cpp=1.66.0=vc14_1
brewer2mpl=1.4.1=py_3
bzip2=1.0.6=vc14_1
ca-certificates=2018.03.07=0
certifi=2018.1.18=py36_0
cffi=1.11.5=py36h945400d_0
colorama=0.3.9=py36h029ae33_0
constantly=15.1.0=py36_0
cryptography=2.2.2=py36hfa6e2cd_0
curl=7.60.0=vc14_0
cycler=0.10.0=py_1
decorator=4.3.0=py36_0
entrypoints=0.2.3=py36_2
expat=2.2.5=vc14_0
freetype=2.7=vc14_1
freexl=1.0.2=vc14_2
gdal=2.2.4=py36h2fc6367_1
geojson=2.3.0=py_0
geos=3.6.2=he025d50_2
geotiff=1.4.2=vc14_1
ggplot=0.11.5=py_3
hdf4=4.2.13=vc14_0
hdf5=1.10.1=vc14_2
html5lib=1.0.1=py36_0
hyperlink=18.0.0=py36_0
icc_rt=2017.0.4=h97af966_0
icu=58.2=vc14_0
idna=2.7=py36_0
incremental=17.5.0=py36he5b1da3_0
intel-openmp=2018.0.3=0
ipykernel=4.8.2=py36_0
ipython=6.4.0=py36_1
ipython_genutils=0.2.0=py36h3c5d0ee_0
jedi=0.12.0=py36_1
jinja2=2.10=py36_0
jpeg=9b=vc14_2
jsonschema=2.6.0=py36h7636477_0
jupyter_client=5.2.3=py36_0
jupyter_core=4.4.0=py36_0
kealib=1.4.7=vc14_4
kiwisolver=1.0.1=py36_1
krb5=1.14.6=vc14_0
libgdal=2.2.4=vc14_5
libiconv=1.14=vc14_4
libnetcdf=4.6.1=vc14_2
libpng=1.6.34=vc14_0
libpq=9.6.3=vc14_0
libsodium=1.0.16=vc14_0
libspatialite=4.3.0a=vc14_19
libssh2=1.8.0=hc4dcbb0_2
libtiff=4.0.9=vc14_0
libxml2=2.9.5=vc14_1
m2w64-gcc-libgfortran=5.3.0=6
m2w64-gcc-libs=5.3.0=7
m2w64-gcc-libs-core=5.3.0=7
m2w64-gmp=6.1.0=2
m2w64-libwinpthread-git=5.0.0.4634.697f757=2
markupsafe=1.0=py36hfa6e2cd_1
matplotlib=1.5.3=np113py36_8
mistune=0.8.3=py36hfa6e2cd_1
mkl=2018.0.3=1
mkl_fft=1.0.2=py36hb217b18_0
mkl_random=1.0.1=py36h77b88f5_1
msys2-conda-epoch=20160418=1
nbconvert=5.3.1=py36_0
nbformat=4.4.0=py36h3a5bc1b_0
notebook=5.6.0=py36_0
numpy=1.13.3=py36h5c71026_4
numpy-base=1.14.5=py36h5c71026_4
oauthlib=2.0.1=py36_0
openjpeg=2.3.0=vc14_2
openssl=1.0.2o=vc14_0
pandas=0.22.0=py36_1
pandoc=2.2.1=h1a437c5_0
pandocfilters=1.4.2=py36_1
parso=0.2.1=py36_0
patsy=0.5.0=py36_0
pickleshare=0.7.4=py36h9de030f_0
pip=9.0.1=py36h226ae91_4
proj4=4.9.3=vc14_5
prometheus_client=0.2.0=py36_0
prompt_toolkit=1.0.15=py36h60b8f86_0
pyasn1=0.4.3=py36_0
pyasn1-modules=0.2.2=py36_0
pycparser=2.18=py36_1
pygments=2.2.0=py36hb010967_0
pyopenssl=18.0.0=py36_0
pyparsing=2.2.0=py_1
pyqt=4.11.4=py36_2
pysocks=1.6.8=py36_1
python=3.6.4=h6538335_1
python-dateutil=2.7.3=py36_0
pytz=2018.5=py36_0
pywin32=223=py36hfa6e2cd_1
pywinpty=0.5.4=py36_0
pyzmq=17.0.0=py36hfa6e2cd_3
qt=4.8.7=vc14_7
requests=2.13.0=py36_0
requests-oauthlib=1.0.0=py36_0
scikit-learn=0.19.1=py36h53aea1b_0
scipy=1.1.0=py36h672f292_0
seaborn=0.9.0=py36_0
send2trash=1.5.0=py36_0
service_identity=17.0.0=py36_0
setuptools=38.4.0=py36_0
simplegeneric=0.8.1=py36_2
sip=4.18=py36_1
six=1.11.0=py36_1
sqlite=3.20.1=vc14_2
statsmodels=0.9.0=py36h452e1ab_0
terminado=0.8.1=py36_1
testpath=0.3.1=py36h2698cfe_0
tornado=5.0.2=py36hfa6e2cd_0
traitlets=4.3.2=py36h096827d_0
tweepy=3.6.0=py36_0
twisted=18.4.0=py36hfa6e2cd_0
vc=14=0
vincent=0.4.4=py_1
vs2015_runtime=15.5.2=3
wcwidth=0.1.7=py36h3d5aa90_0
webencodings=0.5.1=py36_1
wheel=0.30.0=py36h6c3ec14_1
win_inet_pton=1.0.1=py36_1
wincertstore=0.2=py36h7fe50ca_0
winpty=0.4.3=4
xerces-c=3.2.0=vc14_0
xz=5.2.4=h2fa13f4_1
zeromq=4.2.5=vc14_2
zlib=1.2.11=vc14_0
zope=1.0=py36_0
zope.interface=4.5.0=py36hfa6e2cd_0

--------------------------------------------------------------------------------
/environment.yml:
--------------------------------------------------------------------------------
name: py3
channels:
  - conda-forge
  - anaconda
  - defaults
dependencies:
  - appdirs=1.4.3=py36_0
  - asn1crypto=0.24.0=py36_0
  - attrs=18.1.0=py36_0
  - automat=0.7.0=py36_0
  - backcall=0.1.0=py36_0
  - blas=1.0=mkl
  - bleach=2.1.3=py36_0
  - ca-certificates=2018.03.07=0
  - cffi=1.11.5=py36h945400d_0
  - colorama=0.3.9=py36h029ae33_0
  - constantly=15.1.0=py36_0
  - cryptography=2.2.2=py36hfa6e2cd_0
  - decorator=4.3.0=py36_0
  - entrypoints=0.2.3=py36_2
  - html5lib=1.0.1=py36_0
  - hyperlink=18.0.0=py36_0
  - icc_rt=2017.0.4=h97af966_0
  - idna=2.7=py36_0
  - incremental=17.5.0=py36he5b1da3_0
  - intel-openmp=2018.0.3=0
  - ipykernel=4.8.2=py36_0
  - ipython=6.4.0=py36_1
  - ipython_genutils=0.2.0=py36h3c5d0ee_0
  - jedi=0.12.0=py36_1
  - jinja2=2.10=py36_0
  - jsonschema=2.6.0=py36h7636477_0
  - jupyter_client=5.2.3=py36_0
  - jupyter_core=4.4.0=py36_0
  - markupsafe=1.0=py36hfa6e2cd_1
  - mistune=0.8.3=py36hfa6e2cd_1
  - mkl=2018.0.3=1
  - mkl_fft=1.0.2=py36hb217b18_0
  - mkl_random=1.0.1=py36h77b88f5_1
  - nbconvert=5.3.1=py36_0
  - nbformat=4.4.0=py36h3a5bc1b_0
  - notebook=5.6.0=py36_0
  - numpy-base=1.14.5=py36h5c71026_4
  - pandoc=2.2.1=h1a437c5_0
  - pandocfilters=1.4.2=py36_1
  - parso=0.2.1=py36_0
  - patsy=0.5.0=py36_0
  - pickleshare=0.7.4=py36h9de030f_0
  - prometheus_client=0.2.0=py36_0
  - prompt_toolkit=1.0.15=py36h60b8f86_0
  - pyasn1=0.4.3=py36_0
  - pyasn1-modules=0.2.2=py36_0
  - pycparser=2.18=py36_1
  - pygments=2.2.0=py36hb010967_0
  - pyopenssl=18.0.0=py36_0
  - python-dateutil=2.7.3=py36_0
  - pytz=2018.5=py36_0
  - pywin32=223=py36hfa6e2cd_1
  - pywinpty=0.5.4=py36_0
  - pyzmq=17.0.0=py36hfa6e2cd_3
  - scikit-learn=0.19.1=py36h53aea1b_0
  - scipy=1.1.0=py36h672f292_0
  - seaborn=0.9.0=py36_0
  - send2trash=1.5.0=py36_0
  - service_identity=17.0.0=py36_0
  - simplegeneric=0.8.1=py36_2
  - statsmodels=0.9.0=py36h452e1ab_0
  - terminado=0.8.1=py36_1
  - testpath=0.3.1=py36h2698cfe_0
  - tornado=5.0.2=py36hfa6e2cd_0
  - traitlets=4.3.2=py36h096827d_0
  - twisted=18.4.0=py36hfa6e2cd_0
  - wcwidth=0.1.7=py36h3d5aa90_0
  - webencodings=0.5.1=py36_1
  - winpty=0.4.3=4
  - zope=1.0=py36_0
  - zope.interface=4.5.0=py36hfa6e2cd_0
  - boost=1.66.0=py36_vc14_1
  - boost-cpp=1.66.0=vc14_1
  - brewer2mpl=1.4.1=py_3
  - bzip2=1.0.6=vc14_1
  - curl=7.60.0=vc14_0
  - cycler=0.10.0=py_1
  - expat=2.2.5=vc14_0
  - freetype=2.7=vc14_1
  - freexl=1.0.2=vc14_2
  - gdal=2.2.4=py36h2fc6367_1
  - geojson=2.3.0=py_0
  - geos=3.6.2=he025d50_2
  - geotiff=1.4.2=vc14_1
  - ggplot=0.11.5=py_3
  - hdf4=4.2.13=vc14_0
  - hdf5=1.10.1=vc14_2
  - icu=58.2=vc14_0
  - jpeg=9b=vc14_2
  - kealib=1.4.7=vc14_4
  - kiwisolver=1.0.1=py36_1
  - krb5=1.14.6=vc14_0
  - libgdal=2.2.4=vc14_5
  - libiconv=1.14=vc14_4
  - libnetcdf=4.6.1=vc14_2
  - libpng=1.6.34=vc14_0
  - libpq=9.6.3=vc14_0
  - libsodium=1.0.16=vc14_0
  - libspatialite=4.3.0a=vc14_19
  - libssh2=1.8.0=hc4dcbb0_2
  - libtiff=4.0.9=vc14_0
  - libxml2=2.9.5=vc14_1
  - matplotlib=1.5.3=np113py36_8
  - oauthlib=2.0.1=py36_0
  - openjpeg=2.3.0=vc14_2
  - openssl=1.0.2o=vc14_0
  - pandas=0.22.0=py36_1
  - proj4=4.9.3=vc14_5
  - pyparsing=2.2.0=py_1
  - pyqt=4.11.4=py36_2
  - pysocks=1.6.8=py36_1
  - qt=4.8.7=vc14_7
  - requests=2.13.0=py36_0
  - requests-oauthlib=1.0.0=py36_0
  - sip=4.18=py36_1
  - six=1.11.0=py36_1
  - sqlite=3.20.1=vc14_2
  - tweepy=3.6.0=py36_0
  - vc=14=0
  - vincent=0.4.4=py_1
  - win_inet_pton=1.0.1=py36_1
  - xerces-c=3.2.0=vc14_0
  - xz=5.2.4=h2fa13f4_1
  - zeromq=4.2.5=vc14_2
  - zlib=1.2.11=vc14_0
  - certifi=2018.1.18=py36_0
  - m2w64-gcc-libgfortran=5.3.0=6
  - m2w64-gcc-libs=5.3.0=7
  - m2w64-gcc-libs-core=5.3.0=7
  - m2w64-gmp=6.1.0=2
  - m2w64-libwinpthread-git=5.0.0.4634.697f757=2
  - msys2-conda-epoch=20160418=1
  - numpy=1.13.3=py36h5c71026_4
  - pip=9.0.1=py36h226ae91_4
  - python=3.6.4=h6538335_1
  - setuptools=38.4.0=py36_0
  - vs2015_runtime=15.5.2=3
  - wheel=0.30.0=py36h6c3ec14_1
  - wincertstore=0.2=py36h7fe50ca_0
prefix: C:\Users\djcummin\anaconda3\envs\py3

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Predicting hospital length-of-stay at time of admission

**Medium Story**: https://medium.com/@daniel.j.cummings/predicting-hospital-length-of-stay-at-time-of-admission-55dfdfe69598

## Project Overview

Predictive analytics is an increasingly important tool in the healthcare field since modern machine learning (ML) methods can use large amounts of available data to predict individual outcomes for patients. For example, ML predictions can help healthcare providers determine likelihoods of disease, aid in diagnosis, recommend treatment, and predict future wellness. For this project, I chose to focus on a more logistical metric of healthcare, hospital length-of-stay (LOS).
LOS is defined as the time between hospital admission and discharge, measured in days.

**The goal of this project is to create a model that predicts the length-of-stay for each patient at time of admission.** The project makes use of the [MIMIC](https://mimic.physionet.org/) database: "MIMIC is an openly available dataset developed by the MIT Lab for Computational Physiology, comprising de-identified health data associated with ~40,000 critical care patients. It includes demographics, vital signs, laboratory tests, medications, and more."

## Summary of Results

I fit five different regression models (from the scikit-learn library) using default settings and compared their r-squared (R2) scores. The GradientBoostingRegressor took the win with an R2 score of ~37% on the test set, so I decided to focus on refining this particular ensemble model. The root mean squared error (RMSE) was used to compare the prediction model against the industry-standard average and median LOS metrics. The gradient boosting model's RMSE is better by more than 24% (percent difference) than that of the constant average or median models.

Another way I looked at the model was to plot the proportion of accurate predictions in the test set versus an allowed margin of error. Other studies qualify a LOS prediction as correct if it falls within a certain margin of error. It follows that as the margin of error allowance increases, so should the proportion of accurate predictions for all models. The gradient boosting prediction model performs better than the constant models (average and median LOS) across the margin of error range up to 50%.

## Getting Started

Cloning the git repository and installing the provided packages will help you get a copy of the project up and running on your local machine. The analysis for this project was performed using Jupyter Notebook (.ipynb) and the packages were managed using the Anaconda platform.
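
Before the setup commands, a brief aside on the evaluation summarized under Summary of Results. The following is a minimal sketch, not the notebook's code: it uses synthetic right-skewed LOS values in place of MIMIC data, and the helper names `rmse`, `percent_difference`, and `within_margin` are hypothetical. It assumes that "percent difference" means the absolute difference divided by the mean of the two values, and that a prediction counts as accurate when it lies within a relative margin of the actual LOS; the project may define either differently.

```python
import math
import random
import statistics


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def percent_difference(x, y):
    """Assumed definition: |x - y| divided by the mean of x and y, in percent."""
    return abs(x - y) / ((x + y) / 2) * 100


def within_margin(actual, predicted, margin):
    """Fraction of predictions within +/- margin (0.25 = 25%) of the actual value."""
    hits = sum(abs(p - a) <= margin * a for a, p in zip(actual, predicted))
    return hits / len(actual)


# Synthetic, right-skewed LOS values in days (NOT MIMIC data).
rng = random.Random(0)
train = [rng.lognormvariate(1.8, 0.7) for _ in range(700)]
test = [rng.lognormvariate(1.8, 0.7) for _ in range(300)]

# Constant baselines: every admission gets the training mean or median.
mean_pred = [statistics.mean(train)] * len(test)
median_pred = [statistics.median(train)] * len(test)

# Hypothetical stand-in for a fitted regressor: the true value plus noise.
model_pred = [max(0.0, a + rng.gauss(0, 2)) for a in test]

for name, pred in [("mean", mean_pred), ("median", median_pred), ("model", model_pred)]:
    print(f"{name:>6}: RMSE={rmse(test, pred):.2f}  "
          f"within 25%={within_margin(test, pred, 0.25):.2f}")

diff = percent_difference(rmse(test, model_pred), rmse(test, mean_pred))
print(f"model vs mean RMSE: {diff:.1f}% difference")
```

Sweeping `margin` from 0 to 0.5 and plotting `within_margin` for each predictor would give the kind of accuracy-versus-margin curve described in the Summary of Results.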

```
git clone https://github.com/daniel-codes/hospital-los-predictor
conda create --name <env> --file /path/to/requirements.txt
```

File Description:
* hospital_los_prediction.ipynb - Jupyter Notebook for this project including data exploration, feature engineering, and prediction modeling
* requirements.txt - conda package list (win-64 builds) used to perform this analysis; it is in conda's `name=version=build` format, so it is installed with conda rather than pip
* environment.yml - Anaconda environment file (alternative to requirements.txt, used with `conda env create -f environment.yml`)

\*note: The MIMIC database has special access requirements, so I am not able to post the source dataset in this repo. If you are granted access, as of 12/15/2018, the filenames will be listed exactly as:
* ADMISSIONS.csv.gz - Admissions information including admission time, discharge time, ethnicity, religion, insurance, and admission type
* PATIENTS.csv.gz - Patient-specific info such as gender and de-identified date of birth
* DIAGNOSES_ICD.csv.gz - ICD-9 diagnoses for each hospital admission
* ICUSTAYS.csv.gz - Intensive Care Unit (ICU) ward information for each hospital admission

## Authors

- **Daniel Cummings** - [daniel-codes](https://github.com/daniel-codes)

## License

This project is licensed under the MIT License - see the [LICENSE.txt](LICENSE.txt) file for details

## Acknowledgments

MIMIC-III, a freely accessible critical care database. Johnson AEW, Pollard TJ, Shen L, Lehman L, Feng M, Ghassemi M, Moody B, Szolovits P, Celi LA, and Mark RG. Scientific Data (2016). DOI: 10.1038/sdata.2016.35.
Available from: http://www.nature.com/articles/sdata201635

I found these resources particularly helpful for this project:
- https://towardsdatascience.com/running-random-forests-inspect-the-feature-importances-with-this-code-2b00dd72b92e
- https://matplotlib.org/examples/api/barchart_demo.html
- https://stackoverflow.com/questions/46168450/replace-specific-range-of-values-in-data-frame-pandas
- https://en.wikipedia.org/wiki/Root-mean-square_deviation
- https://en.wikipedia.org/wiki/Coefficient_of_determination
- https://www.theanalysisfactor.com/assessing-the-fit-of-regression-models/
- https://www.healthcatalyst.com/success_stories/reducing-length-of-stay-in-hospital
- http://bok.ahima.org/Pages/Long%20Term%20Care%20Guidelines%20TOC/Practice%20Guidelines/Reporting

--------------------------------------------------------------------------------