An efficient ensemble method for missing value imputation in microarray gene expression data

Xinshan Zhu; Jiayu Wang; Biao Sun; Chao Ren; Ting Yang; Jie Ding

doi:10.1186/s12859-021-04109-4

BMC Bioinformatics (Apr 2021)

Affiliations

Xinshan Zhu: School of Electrical and Information Engineering, Tianjin University
Jiayu Wang: School of Electrical and Information Engineering, Tianjin University
Biao Sun: School of Electrical and Information Engineering, Tianjin University
Chao Ren: School of Electrical and Information Engineering, Tianjin University
Ting Yang: School of Electrical and Information Engineering, Tianjin University
Jie Ding: China Institute of FTZ Supply Chain, Shanghai Maritime University

DOI: https://doi.org/10.1186/s12859-021-04109-4
Journal volume & issue: Vol. 22, no. 1
pp. 1 – 25

Abstract

Background: Genomic data analysis is widely used to study disease genes and drug targets. However, missing values in genomic datasets pose a significant problem that severely hinders the use of these data. Current imputation methods based on a single learner often exploit only part of the known genomic information, which degrades imputation performance.

Results: In this study, multiple single imputation methods are combined into one imputation method by ensemble learning. In the ensemble method, bootstrap sampling is applied when each component method predicts the missing values, and these predictions are weighted and summed to produce the final prediction. The optimal weights are learned from known gene data by minimizing a cost function of the imputation error, and their expression is derived in closed form. Additionally, the performance of the ensemble method is investigated analytically in terms of the sum of squared regression errors. The proposed method is evaluated on several typical genomic datasets and compared with state-of-the-art imputation methods at different noise levels, sample sizes and data missing rates. Experimental results show that the proposed method achieves improved imputation accuracy, robustness and generalization.

Conclusion: The ensemble method achieves superior imputation performance because it uses known data more efficiently for missing data imputation, by integrating diverse imputation methods and learning the integration weights in a data-driven way.
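The weighted combination of component imputers described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the two component imputers (row mean and column mean), the function names, the held-out fraction, and the unconstrained least-squares objective are all assumptions made here, since the abstract does not specify the component methods or the exact cost function. Bootstrap sampling is omitted for brevity.

```python
import numpy as np

def row_mean_impute(X):
    """Replace NaNs with the mean of the observed values in the same row (gene)."""
    return np.where(np.isnan(X), np.nanmean(X, axis=1, keepdims=True), X)

def col_mean_impute(X):
    """Replace NaNs with the mean of the observed values in the same column (sample)."""
    return np.where(np.isnan(X), np.nanmean(X, axis=0, keepdims=True), X)

def learn_weights(X, methods, holdout_frac=0.1, seed=0):
    """Learn combination weights from known entries (hypothetical helper).

    A fraction of the observed entries is hidden, every component method
    predicts them, and the weights minimizing the squared prediction error
    are obtained by ordinary least squares (assumed cost function).
    """
    rng = np.random.default_rng(seed)
    known = np.argwhere(~np.isnan(X))
    n_hold = min(len(known), max(len(methods), int(holdout_frac * len(known))))
    picked = known[rng.choice(len(known), size=n_hold, replace=False)]
    rows, cols = picked[:, 0], picked[:, 1]

    # Hide the selected known entries so they can serve as training targets.
    X_hidden = X.copy()
    X_hidden[rows, cols] = np.nan

    # One column of predictions per component method.
    P = np.column_stack([m(X_hidden)[rows, cols] for m in methods])
    y = X[rows, cols]

    # Least-squares solution of P w = y.
    w, *_ = np.linalg.lstsq(P, y, rcond=None)
    return w

def ensemble_impute(X, methods, weights):
    """Fill missing entries with the weighted sum of component predictions."""
    combined = sum(w * m(X) for w, m in zip(weights, methods))
    return np.where(np.isnan(X), combined, X)
```

Given a genes-by-samples matrix `X` with `np.nan` marking missing entries, one would call `w = learn_weights(X, [row_mean_impute, col_mean_impute])` and then `ensemble_impute(X, [row_mean_impute, col_mean_impute], w)`. Observed entries are left unchanged; only the missing positions receive the combined prediction.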

Published in BMC Bioinformatics

ISSN: 1471-2105 (Online)
Publisher: BMC
Country of publisher: United Kingdom
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics; Science: Biology (General)
Website: http://www.biomedcentral.com/bmcbioinformatics/
