The Ability of Different Imputation Methods to Capture Complex Dependencies in High Dimensions

Humera Razzak (humera.Razzak@stat.uni-muenchen.de)
Department of Statistics, LMU Munich
Christian Heumann (chris@stat.uni-muenchen.de)
Department of Statistics, LMU Munich

ABSTRACT

Multiple imputation (MI) is a method for handling missing data. Various competing algorithms are available in the R environment for imputing missing values in categorical and continuous variables. When the amount of missing information is high, the sample size is large and the dependency structures among categorical variables are complex, the utility of these R packages is somewhat limited. A computationally expedient, fully Bayesian joint modeling (JM) approach known as “Dirichlet process mixtures of multinomial distributions” (DPMD) automatically models complex dependencies among variables, but it is limited to categorical variables. We propose a simple, easy-to-implement combining algorithm that imputes continuous variables using various algorithms and uses the JM approach to capture complex dependency structures among categorical variables. We review, describe and evaluate software packages commonly available in R and compare their results with those of the proposed MI method on an artificial data set. The results suggest that the MI approach combining the JM approach with various algorithms based on generalized linear models outperforms those algorithms applied on their own.

Keywords: Survey data; Multiple imputation; Complex dependencies; Hybrid; Dirichlet process prior distributions; R project.
