├── README.md
└── _config.yml


/README.md:
--------------------------------------------------------------------------------
 1 | # Missing Data, Data Imputation
 2 | > In statistics, missing data, or missing values, occur when no data value is stored for the variable in an observation ([Wiki](https://en.wikipedia.org/wiki/Missing_data))
 3 | 
 4 | - **Overview**
 5 |   - A free book on data imputation by the author of the [mice](https://stefvanbuuren.name/mice/) package:  
 6 |     [Flexible Imputation of Missing Data](https://stefvanbuuren.name/fimd/) (2018) *Stef van Buuren*
 7 |   - [Missing-data imputation](http://www.stat.columbia.edu/~gelman/arm/missing.pdf) (Ch. 25) of [Data Analysis Using Regression and Multilevel/Hierarchical Models](http://www.stat.columbia.edu/~gelman/arm/) (2006) *Andrew Gelman, Jennifer Hill*
 8 |   - [Missing Data: Our View of the State of the Art](https://stat.ethz.ch/education/semesters/ss2013/ams/paper/missing_data_1.pdf) (2002) *Joseph L. Schafer, John W. Graham*
 9 |   - [A Review of Methods for Missing Data](https://galton.uchicago.edu/~eichler/stat24600/Admin/MissingDataReview.pdf) (2001) *Therese D. Pigott*
10 | 
11 | ### Types of missing data ([Wiki](https://en.wikipedia.org/wiki/Missing_data#Types))
12 |   - Missing completely at random (MCAR)
13 |   - Missing at random (MAR)
14 |   - Missingness that depends on unobserved variables (a form of MNAR, missing not at random)
15 |   - Missingness that depends on the missing value itself (also MNAR)
16 | 
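The practical difference between these mechanisms can be seen in a small simulation. The sketch below is illustrative only: the drop probabilities, sample size, and the `simulate` helper are assumptions made for this example, not taken from any of the references. It removes values of `y` under three mechanisms and compares the mean of the remaining (complete-case) values with the mean of the full data.

```python
import random
import statistics

def simulate(n=20000, seed=0):
    """Generate y = x + noise and delete y values under three mechanisms."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]   # always observed
    y = [xi + rng.gauss(0, 1) for xi in x]    # subject to missingness

    # MCAR: every y is dropped with the same probability (0.3).
    mcar = [yi for yi in y if rng.random() > 0.3]
    # MAR: the drop probability depends only on the observed x.
    mar = [yi for xi, yi in zip(x, y)
           if rng.random() > (0.6 if xi > 0 else 0.1)]
    # MNAR: the drop probability depends on the value of y itself.
    mnar = [yi for yi in y if rng.random() > (0.6 if yi > 0 else 0.1)]
    return y, mcar, mar, mnar

y, mcar, mar, mnar = simulate()
for name, kept in [("full", y), ("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]:
    print(f"{name:5s} n={len(kept):6d} mean={statistics.fmean(kept):+.3f}")
```

In this setup the MCAR mean stays close to the full-data mean, while the MAR and MNAR means are pulled downward because large values are dropped more often. Under MAR the bias can in principle be corrected using the observed `x`; under MNAR it cannot be corrected without modelling the missingness itself.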
17 | ### Discarding data
18 | - **Listwise deletion** Complete-case analysis
19 |   - Samples (rows) with any missing value are removed from the dataset. Probably the simplest and most common approach, and the default behavior of many statistical and ML packages
20 |   - When many variables have missing values, too few samples may remain after deletion
21 |   - May lead to biased estimates unless the data are missing completely at random. A smaller sample size also increases standard errors
22 | - **Available-case analysis** Complete-variables analysis
23 |   - Excluding variables from the data if their missing-value rate is higher than some threshold
24 | 
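Both ways of discarding data can be sketched in a few lines of plain Python. The toy dataset, its column names, and the helper functions below are hypothetical and exist only for illustration; `None` marks a missing value.

```python
# Toy dataset with hypothetical columns; None marks a missing value.
rows = [
    {"age": 34,   "income": 52000, "score": 0.7},
    {"age": None, "income": 48000, "score": 0.4},
    {"age": 29,   "income": None,  "score": None},
    {"age": 41,   "income": 61000, "score": None},
    {"age": 37,   "income": 58000, "score": 0.9},
]

def listwise_delete(rows):
    """Keep only complete cases: rows without any missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]

def missing_rate(rows, column):
    """Fraction of rows in which `column` is missing."""
    return sum(r[column] is None for r in rows) / len(rows)

def drop_sparse_columns(rows, threshold):
    """Drop columns whose missing rate is higher than `threshold`."""
    keep = [c for c in rows[0] if missing_rate(rows, c) <= threshold]
    return [{c: r[c] for c in keep} for r in rows]

print(len(listwise_delete(rows)))     # 2 complete rows remain
reduced = drop_sparse_columns(rows, threshold=0.3)
print(list(reduced[0]))               # ['age', 'income']
print(len(listwise_delete(reduced)))  # 3 complete rows remain
```

Dropping the sparsest column first leaves more complete rows for the analysis, at the cost of losing that variable entirely.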
25 | ### Imputation ([Wiki](https://en.wikipedia.org/wiki/Imputation_(statistics)))
26 | > Whenever a single imputation strategy is used, the standard errors of estimates tend to be too low. The intuition here is that we have substantial uncertainty about the missing values, but by choosing a single imputation we in essence pretend that we know the true value with certainty ([Data Analysis Using Regression and Multilevel/Hierarchical Models](http://www.stat.columbia.edu/~gelman/arm/missing.pdf))
27 | 
28 | - **Mean/Mode Replacement**
29 |   - Replaces missing values with the mean, median, or mode of the variable (column)
30 |   - Distorts the distribution of the imputed variable, for example by shrinking its variance
31 |   - Distorts relationships between variables
32 | - **LOCF** Last Observation Carried Forward imputation ([Wiki](https://en.wikipedia.org/wiki/Analysis_of_clinical_trials#Last_observation_carried_forward))
33 |   - In time-series data the last observed value before a missing one is "carried forward" to fill in the blank points
34 |   - [Does analysis using “last observation carried forward” introduce bias in dementia research?]() *Frank J. Molnar, Brian Hutton, Dean Fergusson*
35 | - **Indicator variables**
36 |   - Extra category that indicates missingness of a variable
37 |   - Extra binary indicator variable used together with a variable that includes missing data (works with continuous data too)
38 | - **Regression Imputation**
39 |   - A regression model is fitted to the observed values of a variable with missing data, then used to predict the missing values
40 |   - Deterministic regression imputation uses the original prediction of the regression model to impute missing values
41 |   - Stochastic regression imputation adds random error
42 |   - Article about the regression imputation method with examples:  
43 |     [Regression Imputation (Stochastic vs. Deterministic & R Example)](https://statisticsglobe.com/regression-imputation-stochastic-vs-deterministic/) *Statistics Globe*
44 | - **Iterative Regression Imputation**
45 |   - When multiple variables have missing values, IRI imputes them iteratively: using non-missing variables first to impute the first missing variable, then using the imputed variable together with non-missing predictors to predict missing values of the second one, etc
46 | - **SRMI** Sequential Regression Multivariate Imputation
47 |   - [A Multivariate Technique for Multiply Imputing Missing Values Using a Sequence of Regression Models](https://pdfs.semanticscholar.org/13b3/0e35b9a54dad07094cfe4f50d40ff15d8370.pdf) (2001) *Trivellore E. Raghunathan, James M. Lepkowski, John Van Hoewyk, Peter Solenberger*
48 | - **Cold-deck Imputation**
49 |   - Impute missing data using previously collected datasets
50 | - **Hot-deck Imputation**
51 |   - For each sample with a missing value, find a similar complete-case sample in the same dataset and use it for imputation
52 |   - Uses a scoring function to measure similarity
53 |   - [A Review of Hot Deck Imputation for Survey Non-response](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3130338/) (2010) *Rebecca R. Andridge, Roderick J. A. Little*
54 | - **KNN** k-Nearest Neighbors
55 | - **Model-based Imputation**
56 |   - When something is known about why missing data exist, it's possible to directly model the missingness
57 | - **Multiple Imputation** ([Wiki](https://en.wikipedia.org/wiki/Imputation_(statistics)#Multiple_imputation))
58 |   - Imputed values are drawn multiple times from some distribution, producing several completed datasets. Each completed dataset is analysed separately, and the results are aggregated so that the final estimates reflect the uncertainty caused by the missing data
59 |   - [Multiple Imputation for Nonresponse in Surveys](https://www.onlinelibrary.wiley.com/doi/pdf/10.1002/9780470316696.fmatter) (1987) *Donald B. Rubin*
60 |   - [Analyzing Incomplete Political Science Data: An Alternative Algorithm for Multiple Imputation] (2001) *Gary King, James Honaker, Anne Joseph, Kenneth Scheve*
61 | - **MICE** Multivariate Imputation by Chained Equations ([Homepage](https://stefvanbuuren.name/mice/), [Code](https://github.com/stefvanbuuren/mice), [CRAN](https://cran.r-project.org/web/packages/mice/))
62 |   - [A Multivariate Technique for Multiply Imputing Missing Values Using a Sequence of Regression Models](https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.405.4540) (2001) *Trivellore E. Raghunathan, James M. Lepkowski, John Van Hoewyk, Peter Solenberger*
63 |   - [Multiple imputation of discrete and continuous data by fully conditional specification](https://stefvanbuuren.name/publications/MI%20by%20FCS%20-%20SMMR%202007.pdf) (2007) *Stef van Buuren*
64 |   - [Multiple imputation by chained equations: what is it and how does it work?](https://onlinelibrary.wiley.com/doi/epdf/10.1002/mpr.329) (2011) *Melissa J. Azur, Elizabeth A. Stuart, Constantine Frangakis, Philip J. Leaf*
65 |   - [mice: Multivariate Imputation by Chained Equations in R](https://www.jstatsoft.org/article/view/v045i03) (2011) *Stef van Buuren, Karin Groothuis-Oudshoorn*
66 | - **MissForest** ([Code](https://github.com/stekhoven/missForest), [CRAN](https://cran.r-project.org/web/packages/missForest/))
67 |   - [MissForest - nonparametric missing value imputation for mixed-type data](https://stat.ethz.ch/Manuscripts/buhlmann/missforest-advacc.pdf) (2011) *Daniel J. Stekhoven, Peter Bühlmann*
68 | - **Optimal Transport** ([Code](https://github.com/BorisMuzellec/MissingDataOT))
69 |   - [Missing Data Imputation using Optimal Transport](https://arxiv.org/abs/2002.03860) (2020) *Boris Muzellec, Julie Josse, Claire Boyer, Marco Cuturi*
70 | - **Autoencoders**
71 |   - [Missing data imputation in the electronic health record using deeply learned autoencoders](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5144587/pdf/nihms831925.pdf) *Brett K. Beaulieu-Jones, Jason H. Moore*
72 |   - [Multiple Imputation for Biomedical Data using Monte Carlo Dropout Autoencoders](https://arxiv.org/pdf/2005.06173.pdf) (2020) *Kristian Miok, Dong Nguyen-Doan, Marko Robnik-Šikonja, Daniela Zaharie*
73 | - **GAIN** Missing data imputation with GANs
74 |   - [GAIN: Missing Data Imputation using Generative Adversarial Nets](https://arxiv.org/pdf/1806.02920.pdf) (2018) *Jinsung Yoon, James Jordon, Mihaela van der Schaar*
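To make the difference between mean replacement, deterministic regression imputation, and stochastic regression imputation concrete, here is a minimal sketch using only the Python standard library. The simulated data, the single-predictor linear model, and the helper names `fit_line` and `impute` are assumptions made for this illustration; real analyses would normally rely on a dedicated package such as mice.

```python
import random
import statistics

def fit_line(xs, ys):
    """Least-squares fit of y = a + b * x; returns (a, b, residual_sd)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    sd = (sum(r * r for r in residuals) / (len(xs) - 2)) ** 0.5
    return a, b, sd

def impute(xs, ys, method, rng):
    """Fill None entries of ys using the observed (x, y) pairs."""
    observed = [(x, y) for x, y in zip(xs, ys) if y is not None]
    ox, oy = zip(*observed)
    mean_y = statistics.fmean(oy)
    a, b, sd = fit_line(ox, oy)
    filled = []
    for x, y in zip(xs, ys):
        if y is not None:
            filled.append(y)
        elif method == "mean":
            filled.append(mean_y)
        elif method == "deterministic":
            filled.append(a + b * x)                     # prediction only
        elif method == "stochastic":
            filled.append(a + b * x + rng.gauss(0, sd))  # prediction + noise
        else:
            raise ValueError(f"unknown method: {method}")
    return filled

rng = random.Random(1)
xs = [rng.gauss(0, 1) for _ in range(5000)]
true_ys = [2 + 1.5 * x + rng.gauss(0, 1) for x in xs]
# Remove about 40% of y completely at random.
ys = [y if rng.random() > 0.4 else None for y in true_ys]

variances = {}
for method in ("mean", "deterministic", "stochastic"):
    filled = impute(xs, ys, method, random.Random(2))
    variances[method] = statistics.pvariance(filled)
    print(f"{method:13s} variance of completed y: {variances[method]:.2f}")
print(f"{'true':13s} variance of y: {statistics.pvariance(true_ys):.2f}")
```

In this setup mean replacement shrinks the variance the most, deterministic regression shrinks it less, and adding random error brings the variance of the completed variable back close to that of the original data.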
75 | 
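The multiple imputation workflow (impute several times, analyse each completed dataset, pool the results) can be sketched as follows. This is a simplified illustration under stated assumptions: a single normally distributed variable with values missing completely at random, a bootstrap of the observed values to approximate parameter uncertainty, and hypothetical helper names `impute_once` and `pool`. The pooling step follows Rubin's rules: the total variance is the average within-imputation variance plus `(1 + 1/m)` times the between-imputation variance.

```python
import random
import statistics

def impute_once(values, rng):
    """One stochastic imputation of None entries under a normal model.

    The observed values are bootstrapped first, so that uncertainty about
    the model parameters carries over into the imputed values.
    """
    observed = [v for v in values if v is not None]
    boot = [rng.choice(observed) for _ in observed]
    mu, sd = statistics.fmean(boot), statistics.stdev(boot)
    return [v if v is not None else rng.gauss(mu, sd) for v in values]

def pool(estimates, variances):
    """Pool per-dataset estimates and squared standard errors (Rubin's rules)."""
    m = len(estimates)
    q_bar = statistics.fmean(estimates)         # pooled point estimate
    within = statistics.fmean(variances)        # average within-imputation variance
    between = statistics.variance(estimates)    # between-imputation variance
    total = within + (1 + 1 / m) * between
    return q_bar, total ** 0.5

rng = random.Random(0)
data = [rng.gauss(10, 2) for _ in range(500)]
data = [v if rng.random() > 0.3 else None for v in data]  # about 30% missing

estimates, variances = [], []
for _ in range(20):                              # m = 20 completed datasets
    completed = impute_once(data, rng)
    estimates.append(statistics.fmean(completed))
    variances.append(statistics.variance(completed) / len(completed))
mi_mean, mi_se = pool(estimates, variances)

# Naive standard error from a single completed dataset, for comparison.
single = impute_once(data, random.Random(1))
single_se = (statistics.variance(single) / len(single)) ** 0.5

print(f"pooled mean {mi_mean:.2f}, pooled SE {mi_se:.3f}, single SE {single_se:.3f}")
```

The pooled standard error comes out larger than the one computed from a single completed dataset, which treats imputed values as if they had been observed.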
76 | ### Time series imputation
77 | - **RNNs**
78 |   - [Modeling Missing Data in Clinical Time Series with RNNs](http://proceedings.mlr.press/v56/Lipton16.pdf) (2016) *Zachary C. Lipton, David C. Kale, Randall Wetzel*
79 |   - [Estimating Missing Data in Temporal Data Streams Using Multi-directional Recurrent Neural Networks](https://arxiv.org/abs/1711.08742) (2017) *Jinsung Yoon, William R. Zame, Mihaela van der Schaar*
80 |   - [BRITS: Bidirectional Recurrent Imputation for Time Series](https://papers.nips.cc/paper/7911-brits-bidirectional-recurrent-imputation-for-time-series.pdf) (2018) *Wei Cao, Dong Wang, Jian Li, Hao Zhou, Yitan Li, Lei Li*
81 | - **GPs**
82 |   - [GP-VAE: Deep Probabilistic Time Series Imputation](https://arxiv.org/pdf/1907.04155.pdf) (2019) *Vincent Fortuin, Dmitry Baranchuk, Gunnar Rätsch, Stephan Mandt*
83 | 
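As a simple baseline to compare with the learned models above, LOCF (listed under Imputation) fits in a few lines. The sketch below assumes a series stored as a Python list with `None` for missing points; the function names are hypothetical. A missingness mask is returned separately, in the spirit of the indicator-variable approach, so that a downstream model can tell observed values from carried-forward ones.

```python
def locf(series):
    """Replace each None with the most recent observed value.

    Leading gaps stay None because there is nothing to carry forward.
    """
    filled, last = [], None
    for value in series:
        if value is not None:
            last = value
        filled.append(last)
    return filled

def missing_mask(series):
    """Return 1 where the original value was missing, 0 where it was observed."""
    return [1 if value is None else 0 for value in series]

readings = [None, 1.0, None, None, 2.5, None]
print(locf(readings))          # [None, 1.0, 1.0, 1.0, 2.5, 2.5]
print(missing_mask(readings))  # [1, 0, 1, 1, 0, 1]
```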
84 | ### Other methods, packages
85 | - **MIDAS** Multiple Imputation with Denoising Autoencoders ([Code](https://github.com/Oracen/MIDAS), [Paper](https://arxiv.org/pdf/1705.02737.pdf))
86 | - **Impute.jl** ([Code](https://github.com/invenia/Impute.jl))
87 | 


--------------------------------------------------------------------------------
/_config.yml:
--------------------------------------------------------------------------------
1 | remote_theme: mlpapers/minimal
2 | 


--------------------------------------------------------------------------------