├── README.md
└── _config.yml


/README.md:
--------------------------------------------------------------------------------
# Missing Data, Data Imputation
> In statistics, missing data, or missing values, occur when no data value is stored for the variable in an observation ([Wiki](https://en.wikipedia.org/wiki/Missing_data))

- **Overview**
  - A free book on data imputation by the author of the [mice](https://stefvanbuuren.name/mice/) package:
    [Flexible Imputation of Missing Data](https://stefvanbuuren.name/fimd/) (2018) *Stef van Buuren*
  - [Missing-data imputation](http://www.stat.columbia.edu/~gelman/arm/missing.pdf) (Ch. 25) of [Data Analysis Using Regression and Multilevel/Hierarchical Models](http://www.stat.columbia.edu/~gelman/arm/) (2006) *Andrew Gelman, Jennifer Hill*
  - [Missing Data: Our View of the State of the Art](https://stat.ethz.ch/education/semesters/ss2013/ams/paper/missing_data_1.pdf) (2002) *Joseph L. Schafer, John W. Graham*
  - [A Review of Methods for Missing Data](https://galton.uchicago.edu/~eichler/stat24600/Admin/MissingDataReview.pdf) (2001) *Therese D. Pigott*

### Types of missing data ([Wiki](https://en.wikipedia.org/wiki/Missing_data#Types))
- Missing completely at random (MCAR)
- Missing at random (MAR)
- Missingness that depends on unobserved variables
- Missingness that depends on the missing value itself

### Discarding data
- **Listwise deletion** Complete-case analysis
  - Samples (rows) are removed from a dataset if they have missing values. Probably the simplest and most popular approach. Often done automatically by many ML packages
  - When many variables have missing values, the number of samples left after deletion can be too small
  - May lead to biased estimates.
    Also, a smaller sample size increases standard errors
- **Available-case analysis** Complete-variables analysis
  - Excluding variables from the data if their missing-value rate is higher than some threshold

### Imputation ([Wiki](https://en.wikipedia.org/wiki/Imputation_(statistics)))
> Whenever a single imputation strategy is used, the standard errors of estimates tend to be too low. The intuition here is that we have substantial uncertainty about the missing values, but by choosing a single imputation we in essence pretend that we know the true value with certainty ([Data Analysis Using Regression and Multilevel/Hierarchical Models](http://www.stat.columbia.edu/~gelman/arm/missing.pdf))

- **Mean/Mode Replacement**
  - Replaces missing values with the variable's (column's) mean, mode or median
  - Distorts the probability distribution of the imputed variable
  - Distorts relationships between variables
- **LOCF** Last Observation Carried Forward imputation ([Wiki](https://en.wikipedia.org/wiki/Analysis_of_clinical_trials#Last_observation_carried_forward))
  - In time-series data the last observed value before a missing one is "carried forward" to fill in the blank points
  - [Does analysis using “last observation carried forward” introduce bias in dementia research?]() *Frank J.
    Molnar, Brian Hutton, Dean Fergusson*
- **Indicator variables**
  - An extra category that marks a variable's value as missing
  - An extra binary indicator variable used together with a variable that includes missing data (works with continuous data too)
- **Regression Imputation**
  - A regression model is fitted for a variable with missing values, then used to predict the missing entries
  - Deterministic regression imputation uses the model's prediction directly as the imputed value
  - Stochastic regression imputation adds a random error term to the prediction
  - Article about the regression imputation method with examples:
    [Regression Imputation (Stochastic vs. Deterministic & R Example)](https://statisticsglobe.com/regression-imputation-stochastic-vs-deterministic/) *Statistics Globe*
- **Iterative Regression Imputation**
  - When multiple variables have missing values, IRI imputes them iteratively: the non-missing variables are used first to impute the first incomplete variable, then the imputed variable is used together with the non-missing predictors to impute the second one, and so on
- **SRMI** Sequential Regression Multivariate Imputation
  - [A Multivariate Technique for Multiply Imputing Missing Values Using a Sequence of Regression Models](https://pdfs.semanticscholar.org/13b3/0e35b9a54dad07094cfe4f50d40ff15d8370.pdf) (2001) *Trivellore E. Raghunathan, James M. Lepkowski, John Van Hoewyk, Peter Solenberger*
- **Cold-deck Imputation**
  - Imputes missing data using previously collected datasets
- **Hot-deck Imputation**
  - For each sample with a missing value, find a similar complete-case sample in the same dataset and use it for imputation
  - Uses a scoring function to measure similarity
  - [A Review of Hot Deck Imputation for Survey Non-response](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3130338/) (2010) *Rebecca R. Andridge, Roderick J. A.
    Little*
- **KNN** k-Nearest Neighbors
- **Model-based Imputation**
  - When something is known about why data are missing, it's possible to model the missingness directly
- **Multiple Imputation** ([Wiki](https://en.wikipedia.org/wiki/Imputation_(statistics)#Multiple_imputation))
  - Imputed values are drawn multiple times from some distribution. Each resulting realization of the imputed data is then analysed, and the results from all realizations are aggregated to estimate uncertainty.
  - [Multiple Imputation for Nonresponse in Surveys](https://www.onlinelibrary.wiley.com/doi/pdf/10.1002/9780470316696.fmatter) (1987) *Donald B. Rubin*
  - [Analyzing Incomplete Political Science Data: An Alternative Algorithm for Multiple Imputation] (2001) *Gary King, James Honaker, Anne Joseph, Kenneth Scheve*
- **MICE** Multivariate Imputation by Chained Equations ([Homepage](https://stefvanbuuren.name/mice/), [Code](https://github.com/stefvanbuuren/mice), [CRAN](https://cran.r-project.org/web/packages/mice/))
  - [A Multivariate Technique for Multiply Imputing Missing Values Using a Sequence of Regression Models](https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.405.4540) (2001) *Trivellore E. Raghunathan, James M. Lepkowski, John Van Hoewyk, Peter Solenberger*
  - [Multiple imputation of discrete and continuous data by fully conditional specification](https://stefvanbuuren.name/publications/MI%20by%20FCS%20-%20SMMR%202007.pdf) (2007) *Stef van Buuren*
  - [Multiple imputation by chained equations: what is it and how does it work?](https://onlinelibrary.wiley.com/doi/epdf/10.1002/mpr.329) (2011) *Melissa J. Azur, Elizabeth A. Stuart, Constantine Frangakis, Philip J.
    Leaf*
  - [mice: Multivariate Imputation by Chained Equations in R](https://www.jstatsoft.org/article/view/v045i03) (2011) *Stef van Buuren, Karin Groothuis-Oudshoorn*
- **MissForest** ([Code](https://github.com/stekhoven/missForest), [CRAN](https://cran.r-project.org/web/packages/missForest/))
  - [MissForest - nonparametric missing value imputation for mixed-type data](https://stat.ethz.ch/Manuscripts/buhlmann/missforest-advacc.pdf) (2011) *Daniel J. Stekhoven, Peter Bühlmann*
- **Optimal Transport** ([Code](https://github.com/BorisMuzellec/MissingDataOT))
  - [Missing Data Imputation using Optimal Transport](https://arxiv.org/abs/2002.03860) (2020) *Boris Muzellec, Julie Josse, Claire Boyer, Marco Cuturi*
- **Autoencoders**
  - [Missing data imputation in the electronic health record using deeply learned autoencoders](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5144587/pdf/nihms831925.pdf) *Brett K. Beaulieu-Jones, Jason H. Moore*
  - [Multiple Imputation for Biomedical Data using Monte Carlo Dropout Autoencoders](https://arxiv.org/pdf/2005.06173.pdf) (2020) *Kristian Miok, Dong Nguyen-Doan, Marko Robnik-Šikonja, Daniela Zaharie*
- **GAIN** Missing data imputation with GANs
  - [GAIN: Missing Data Imputation using Generative Adversarial Nets](https://arxiv.org/pdf/1806.02920.pdf) (2018) *Jinsung Yoon, James Jordon, Mihaela van der Schaar*

### Time series imputation
- **RNNs**
  - [Modeling Missing Data in Clinical Time Series with RNNs](http://proceedings.mlr.press/v56/Lipton16.pdf) (2016) *Zachary C. Lipton, David C. Kale, Randall Wetzel*
  - [Estimating Missing Data in Temporal Data Streams Using Multi-directional Recurrent Neural Networks](https://arxiv.org/abs/1711.08742) (2017) *Jinsung Yoon, William R.
    Zame, Mihaela van der Schaar*
  - [BRITS: Bidirectional Recurrent Imputation for Time Series](https://papers.nips.cc/paper/7911-brits-bidirectional-recurrent-imputation-for-time-series.pdf) (2018) *Wei Cao, Dong Wang, Jian Li, Hao Zhou, Yitan Li, Lei Li*
- **GPs**
  - [GP-VAE: Deep Probabilistic Time Series Imputation](https://arxiv.org/pdf/1907.04155.pdf) (2019) *Vincent Fortuin, Dmitry Baranchuk, Gunnar Rätsch, Stephan Mandt*

### Other methods, packages
- **MIDAS** Multiple Imputation with Denoising Autoencoders ([Code](https://github.com/Oracen/MIDAS), [Paper](https://arxiv.org/pdf/1705.02737.pdf))
- **Impute.jl** ([Code](https://github.com/invenia/Impute.jl))

--------------------------------------------------------------------------------
/_config.yml:
--------------------------------------------------------------------------------
remote_theme: mlpapers/minimal

--------------------------------------------------------------------------------