Статистика України (Dec 2019)

New Trends in Evidence-based Statistics: Data Imputation Problems

  • N. V. Kovtun,
  • A.-N. Ya. Fataliieva

DOI
https://doi.org/10.31767/su.4(87)2019.04.01
Journal volume & issue
Vol. 87, no. 4
pp. 4–13

Abstract


The main reasons for omissions are: (1) exclusion of a subject from the study due to non-compliance with study requirements; (2) occurrence of an adverse event; (3) a missing result; (4) lack of registration; (5) researchers' acts of omission and/or commission.

The following limits for data gaps can be defined: (1) less than 5% of omissions are insignificant and do not affect the research results; (2) data losses of 20% and more call the integrity of research results into question. The higher the share of missing data, the less reliable the conclusions are, and the more difficult it is to prove treatment efficiency. Consequently, missing data is a potential source of bias when analyzing data. Exclusion of subjects can affect the comparability of groups and subgroups, which leads to bias in the estimates.

There are different ways to deal with missing data. The simplest is to exclude the subject from the calculations, but the consequences of this approach are a reduction in sample size, reduced validity of statistical inferences, and a change of confidence intervals (e.g. narrowing resulting from underestimation of variances). Hence, when dealing with missing data it is important to identify the nature of the omission, which can be missing completely at random (MCAR), missing at random (MAR) or missing not at random (MNAR). This determines the appropriate method of processing data with missing values: exclusion, filling, weighting or modeling. All these methods give different results with different volumes and natures of omissions.

We attempted to evaluate the results of different imputation methods by using a sample in which different proportions of missing data were simulated. Thus, with 10% of MCAR omissions, the parameter estimates and p-values for two factors resulting from the application of the first group of methods were close to the results from the complete data. The mean square errors obtained with the method of the absolute average and the method of filling blank spaces with successive selection were closest to the benchmark; all other methods overestimated this value. The coefficient of determination was closest to that of the initial data when the method of filling blank spaces with successive selection was applied.

With 25% of MCAR data missing, the treatment-group factor became insignificant when filling with the absolute and conditional averages was applied. The lowest estimate of the coefficient of determination was obtained with filling with absolute average values, and the overestimation was smallest with filling blank spaces by successive selection; with the other approaches the changes were minimal. Thus, the parameter estimates and p-values produced by the available-case analysis were closest to the results of the regression on the complete data.

With 50% of MCAR data missing, pre-treatment weight became insignificant when complete-case analysis was applied, and the treatment-group factor became insignificant when filling blank spaces with successive selection was applied. The most accurate estimate for the pre-treatment weight variable was obtained with the method of the conditional average. Still, the method of filling with the absolute average can be singled out: its results were the closest to those of the initial data.
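A minimal sketch of the kind of comparison described above, not the authors' code: it simulates a dataset, deletes values completely at random (MCAR) at 10%, 25% and 50%, applies complete-case analysis, unconditional-mean filling and a conditional-mean (regression-based) fill, and compares coefficient estimates, mean square error and the coefficient of determination with the complete-data fit. The data-generating model and the variable names (pre_weight, group, post_weight) are assumptions for illustration only.

```python
# Hedged illustration: compare simple imputation strategies under MCAR.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
pre_weight = rng.normal(80, 10, n)            # hypothetical pre-treatment weight
group = rng.integers(0, 2, n)                 # hypothetical treatment-group indicator
post_weight = 5 + 0.9 * pre_weight - 3 * group + rng.normal(0, 5, n)
full = pd.DataFrame({"pre_weight": pre_weight, "group": group,
                     "post_weight": post_weight})

def fit(df):
    """OLS of post_weight on pre_weight and group; returns params, MSE, R^2."""
    X = sm.add_constant(df[["pre_weight", "group"]])
    res = sm.OLS(df["post_weight"], X).fit()
    return res.params, res.mse_resid, res.rsquared

for rate in (0.10, 0.25, 0.50):
    data = full.copy()
    # MCAR: missingness in pre_weight is independent of all variables
    data.loc[rng.random(n) < rate, "pre_weight"] = np.nan

    complete_case = data.dropna()                                   # listwise deletion
    mean_fill = data.fillna({"pre_weight": data["pre_weight"].mean()})

    # Conditional mean: predict missing pre_weight from the observed variables
    obs = data.dropna()
    aux = sm.OLS(obs["pre_weight"],
                 sm.add_constant(obs[["group", "post_weight"]])).fit()
    pred = aux.predict(sm.add_constant(data[["group", "post_weight"]]))
    cond_fill = data.copy()
    cond_fill["pre_weight"] = cond_fill["pre_weight"].fillna(pred)

    print(f"--- {int(rate * 100)}% MCAR ---")
    for name, df in [("complete data", full), ("complete-case", complete_case),
                     ("mean fill", mean_fill), ("conditional mean", cond_fill)]:
        params, mse, r2 = fit(df)
        print(f"{name:>16}: b_pre={params['pre_weight']:.3f} "
              f"b_group={params['group']:.3f} MSE={mse:.2f} R2={r2:.3f}")
```

Running the sketch shows how the coefficient on group and the R² drift away from the complete-data benchmark as the share of deleted values grows, which is the pattern the abstract describes qualitatively.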
According to the results of imputing data with 10% and 50% of MAR values missing, the changes in the parameter estimates for the intercept and the two factors were minimal with every method. It was with the methods of multiple imputation that the mean square error and the coefficient of determination were closest to the results obtained from the complete data.

This study identifies the weaknesses and the strengths of different methods of data imputation and shows how effective one method is relative to another at different shares of missing information. The results clearly establish that the imputation process cannot follow a "one-size-fits-all" approach: the imputation problem should be solved on a case-by-case basis by analyzing the existing database, taking into account not only the characteristics of the data and the volume of omissions, but also the expected contribution of a particular study.
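A hedged sketch of the multiple-imputation step for the MAR scenario, not the authors' procedure: values are deleted with a probability that depends on an observed variable, several completed datasets are generated with scikit-learn's chained-equations imputer (IterativeImputer), the regression is fitted on each, and only the point-estimate part of Rubin's rules (averaging the estimates) is shown. The MAR mechanism, variable names and number of imputations are assumptions for illustration.

```python
# Hedged illustration: multiple imputation under an assumed MAR mechanism.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(1)
n = 500
pre_weight = rng.normal(80, 10, n)
group = rng.integers(0, 2, n)
post_weight = 5 + 0.9 * pre_weight - 3 * group + rng.normal(0, 5, n)
full = pd.DataFrame({"pre_weight": pre_weight, "group": group,
                     "post_weight": post_weight})

mar = full.copy()
# MAR: the chance that pre_weight is missing depends on the observed group factor
p_miss = np.where(mar["group"] == 1, 0.40, 0.10)
mar.loc[rng.random(n) < p_miss, "pre_weight"] = np.nan

def fit(df):
    """OLS of post_weight on pre_weight and group; returns the parameter estimates."""
    X = sm.add_constant(df[["pre_weight", "group"]])
    return sm.OLS(df["post_weight"], X).fit().params

m = 5                                   # number of imputed datasets
estimates = []
for i in range(m):
    imputer = IterativeImputer(sample_posterior=True, random_state=i)
    completed = pd.DataFrame(imputer.fit_transform(mar), columns=mar.columns)
    estimates.append(fit(completed))

# Pool the point estimates by averaging across imputations (Rubin's rules, point part)
pooled = pd.concat(estimates, axis=1).mean(axis=1)
print("Complete-data estimates:\n", fit(full))
print("Pooled MI estimates:\n", pooled)
```

The pooled estimates can then be compared against the complete-data fit in the same way as the single-imputation methods above; a full application of Rubin's rules would also combine the within- and between-imputation variances.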

Keywords