Multiple Imputation for Missing Laboratory Data: An Example from Infectious Disease Epidemiology
Purpose
To present multiple imputation (MI) as an appropriate method to address missing values for a laboratory parameter (serum albumin) in an epidemiologic study.
Methods
We analyzed a data set of patients who were hospitalized for invasive group A streptococcal (GAS) infections. Age was the exposure of interest, and the outcome was hospital mortality. Several variables, including serum albumin, were considered potential confounders. Of the 201 records, 91 (45%) had missing values for serum albumin. The MI procedure in SAS was used to generate 20 imputations of serum albumin by a Markov chain Monte Carlo (MCMC) approach. Logistic regression was then performed on each of the 20 completed data sets, and the results were combined by using the MIANALYZE procedure.
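The combining step can be illustrated with Rubin's rules, the standard way to pool estimates across imputed data sets. The sketch below is in Python rather than SAS and is not the authors' code. The function names (`pool_rubin`, `odds_ratio_ci`) and the input values are hypothetical, chosen only for illustration, and the interval uses a normal approximation where a pooled analysis would typically use a t reference distribution.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool per-imputation estimates (e.g. log odds ratios) with Rubin's rules.

    estimates: one point estimate per imputed data set
    variances: the squared standard error of each estimate
    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    q_bar = mean(estimates)        # pooled point estimate
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return q_bar, total


def odds_ratio_ci(log_or, total_var, z=1.96):
    """Back-transform a pooled log odds ratio to an OR with an approximate 95% CI."""
    se = math.sqrt(total_var)
    return math.exp(log_or), math.exp(log_or - z * se), math.exp(log_or + z * se)


# Hypothetical log odds ratios and variances from five imputed data sets
# (illustrative values only, not taken from the study).
example_estimates = [1.05, 1.20, 1.10, 1.15, 1.08]
example_variances = [0.21, 0.22, 0.20, 0.23, 0.21]

pooled, total_var = pool_rubin(example_estimates, example_variances)
print(odds_ratio_ci(pooled, total_var))
```

The total variance adds the between-imputation spread to the average within-imputation variance, which is how uncertainty about the imputed albumin values carries into the final confidence interval.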
Results
Age (≥55 years vs. 0–54 years) was not a statistically significant risk factor for hospital mortality in the complete-case analysis (n=110): adjusted odds ratio (OR)=2.43 (95% confidence interval [CI]: 0.79–7.53). Age was a significant risk factor in the analysis of the multiply imputed data (n=201): adjusted OR=3.08 (95% CI: 1.22–7.78).
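As a rough check on these figures, the standard error of each log odds ratio can be recovered from its reported interval. This minimal sketch assumes the intervals are Wald-type (symmetric on the log scale) and uses a normal reference distribution, so the resulting z statistics are approximate. The helper name `wald_from_ci` is hypothetical.

```python
import math

Z_95 = 1.959964  # two-sided 95% normal quantile


def wald_from_ci(odds_ratio, ci_low, ci_high):
    """Recover the log OR, its standard error, and a z statistic from a 95% CI.

    Assumes the interval is symmetric on the log scale (Wald-type).
    """
    log_or = math.log(odds_ratio)
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * Z_95)
    return log_or, se, log_or / se


# Values as reported in the Results.
complete_case = wald_from_ci(2.43, 0.79, 7.53)
imputed = wald_from_ci(3.08, 1.22, 7.78)

for label, (log_or, se, z) in [("complete-case", complete_case), ("imputed", imputed)]:
    print(f"{label}: log OR = {log_or:.3f}, SE = {se:.3f}, z = {z:.2f}")
```

Under these assumptions the imputed analysis has both a larger point estimate and a smaller standard error, consistent with the use of all 201 records rather than 110.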
Conclusions
Epidemiologists frequently encounter data sets that contain missing values. Traditional missing-data techniques, such as complete-case analysis, may lead to biased results. We have demonstrated the use of a novel technique, MI, to account for missing data.
Key Words: Streptococcus pyogenes, Serum Albumin, Missing Data, Multiple Imputation, Markov Chains, Monte Carlo Methods
Selected Abbreviations and Acronyms: CI, confidence interval; EM, expectation-maximization; GAS, group A streptococcal; MAR, missing at random; MCMC, Markov chain Monte Carlo; MI, multiple imputation; OR, odds ratio
PII: S1047-2797(09)00285-3
doi:10.1016/j.annepidem.2009.08.002
© 2009 Elsevier Inc. All rights reserved.
