Journal of Big Data (Feb 2023)

Multicollinearity applied stepwise stochastic imputation: a large dataset imputation through correlation-based regression

  • Benjamin D. Leiby,
  • Darryl K. Ahner

DOI
https://doi.org/10.1186/s40537-023-00698-4
Journal volume & issue
Vol. 10, no. 1
pp. 1 – 20

Abstract

This paper presents a stochastic imputation approach for large datasets that uses a correlation-based selection methodology, intended for cases where preferred commercial packages fail to iterate because of numerical problems. A variable range-based guard rail modification is proposed that improves the convergence rate of data elements while increasing confidence in the plausibility of the imputations. The work is motivated by a large country conflict dataset in which missingness is well above the commonly used 20% threshold. The Multicollinearity Applied Stepwise Stochastic imputation methodology (MASS-impute) capitalizes on correlation between variables within the dataset and uses model residuals to estimate unknown values. Examination of the methodology provides insight into choosing linear or nonlinear modeling terms. Tailorable tolerances exploit residual information to fit each data element. The evaluation considers computation time, model fit, and a comparison of known values with the values that replace them through imputation. Overall, the methodology provides usable and defensible results in imputing missing elements of a country conflict dataset.
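
The general idea described in the abstract (select predictors by correlation, fit a regression, add residual-based noise, and keep imputations inside the variable's observed range) can be illustrated with a minimal sketch. This is not the authors' MASS-impute implementation. The function name `impute_column`, the parameter `n_predictors`, the use of ordinary least squares, the resampling of empirical residuals, the clipping to the observed minimum and maximum as a stand-in for the guard rail, and the restriction to fully observed predictor columns are all assumptions made for illustration.

```python
import numpy as np

def impute_column(X, target, n_predictors=2, rng=None):
    """Illustrative stochastic regression imputation of one column.

    Hypothetical helper, not the published MASS-impute algorithm.
    """
    rng = np.random.default_rng() if rng is None else rng
    X = np.asarray(X, dtype=float)
    y = X[:, target]
    missing = np.isnan(y)
    if not missing.any():
        return X.copy()
    obs = ~missing

    # Assumption: only fully observed columns are eligible as predictors.
    candidates = [j for j in range(X.shape[1])
                  if j != target and not np.isnan(X[:, j]).any()]

    # Rank candidates by absolute correlation with the target (observed rows).
    corr = [abs(np.corrcoef(X[obs, j], y[obs])[0, 1]) for j in candidates]
    order = np.argsort(corr)[::-1][:n_predictors]
    chosen = [candidates[i] for i in order]

    # Fit ordinary least squares with an intercept on the observed rows.
    A_obs = np.column_stack([np.ones(obs.sum()), X[obs][:, chosen]])
    beta, *_ = np.linalg.lstsq(A_obs, y[obs], rcond=None)
    residuals = y[obs] - A_obs @ beta

    # Stochastic step (assumption): prediction plus a resampled residual.
    A_miss = np.column_stack([np.ones(missing.sum()), X[missing][:, chosen]])
    noise = rng.choice(residuals, size=missing.sum(), replace=True)
    draws = A_miss @ beta + noise

    # Guard rail (assumption): clip draws to the observed range of the variable.
    out = X.copy()
    out[missing, target] = np.clip(draws, y[obs].min(), y[obs].max())
    return out
```

Calling `impute_column(X, target=2)` on a two-dimensional array whose third column contains `NaN` entries returns a copy with those entries filled and every observed value left unchanged. A full procedure would also need to handle predictors that are themselves incomplete and to iterate across columns, which this sketch omits.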

Keywords