BMC Bioinformatics (Apr 2021)

An efficient ensemble method for missing value imputation in microarray gene expression data

  • Xinshan Zhu,
  • Jiayu Wang,
  • Biao Sun,
  • Chao Ren,
  • Ting Yang,
  • Jie Ding

DOI
https://doi.org/10.1186/s12859-021-04109-4
Journal volume & issue
Vol. 22, no. 1
pp. 1 – 25

Abstract

Background: Genomic data analysis is widely used to study disease genes and drug targets. However, missing values in genomic datasets pose a significant problem that severely hinders the use of these data. Current imputation methods based on a single learner often exploit only part of the known information in the data, which degrades imputation performance.

Results: In this study, multiple single imputation methods are combined into one imputation method by ensemble learning. In the ensemble method, bootstrap sampling is applied to obtain predictions of the missing values from each component method, and these predictions are weighted and summed to produce the final prediction. The optimal weights are learned from known gene data by minimizing a cost function of the imputation error, and their expression is derived in closed form. Additionally, the performance of the ensemble method is investigated analytically in terms of the sum of squared regression errors. The proposed method is evaluated on several typical genomic datasets and compared with state-of-the-art imputation methods at different noise levels, sample sizes and data missing rates. Experimental results show that the proposed method achieves improved imputation performance in terms of accuracy, robustness and generalization.

Conclusion: The ensemble method achieves superior imputation performance because it makes more efficient use of known data for missing data imputation, by integrating diverse imputation methods and learning the integration weights in a data-driven way.
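
The weighted combination described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the cost function, so an ordinary least-squares cost with a small ridge term is assumed here, the bootstrap sampling step is omitted for brevity, and the two component imputers (row mean and column mean) are simple stand-ins. All function and parameter names are hypothetical.

```python
import numpy as np

def row_mean_impute(X):
    """Stand-in component: fill each NaN with the mean of its row (gene)."""
    filled = X.copy()
    rows, cols = np.where(np.isnan(X))
    filled[rows, cols] = np.nanmean(X, axis=1)[rows]
    return filled

def col_mean_impute(X):
    """Stand-in component: fill each NaN with the mean of its column (sample)."""
    filled = X.copy()
    rows, cols = np.where(np.isnan(X))
    filled[rows, cols] = np.nanmean(X, axis=0)[cols]
    return filled

def learn_weights(X, imputers, holdout_frac=0.1, ridge=1e-8, seed=0):
    """Learn combination weights from known entries (assumed least-squares cost).

    A random subset of known entries is hidden, each component predicts it,
    and the weights solve (P^T P + ridge * I) w = P^T y in closed form.
    """
    rng = np.random.default_rng(seed)
    known = np.argwhere(~np.isnan(X))
    n_hold = max(len(imputers), int(holdout_frac * len(known)))
    picked = known[rng.choice(len(known), size=n_hold, replace=False)]
    r, c = picked[:, 0], picked[:, 1]
    y = X[r, c]                      # true values of the hidden entries
    X_masked = X.copy()
    X_masked[r, c] = np.nan          # hide them from the components
    # One column of predictions per component method.
    P = np.column_stack([imp(X_masked)[r, c] for imp in imputers])
    A = P.T @ P + ridge * np.eye(P.shape[1])
    return np.linalg.solve(A, P.T @ y)

def ensemble_impute(X, imputers, **kwargs):
    """Fill missing entries with the weighted sum of component predictions."""
    w = learn_weights(X, imputers, **kwargs)
    preds = np.stack([imp(X) for imp in imputers], axis=-1)
    filled = X.copy()
    missing = np.isnan(X)
    filled[missing] = (preds @ w)[missing]
    return filled, w
```

Calling `ensemble_impute(X, [row_mean_impute, col_mean_impute])` on a genes-by-samples matrix with `NaN` for missing entries returns the completed matrix and the learned weights. Known entries are left unchanged.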

Keywords