A novel meta-analysis based on data augmentation and elastic data shared lasso regularization for gene expression

Hai-Hui Huang; Hao Rao; Rui Miao; Yong Liang

doi:10.1186/s12859-022-04887-5

BMC Bioinformatics (Aug 2022)

Affiliations

Hai-Hui Huang: Provincial Demonstration Software Institute, Shaoguan University
Hao Rao: Provincial Demonstration Software Institute, Shaoguan University
Rui Miao: Faculty of Information Technology, Macau University of Science and Technology
Yong Liang: The Peng Cheng Laboratory

DOI: https://doi.org/10.1186/s12859-022-04887-5
Journal volume & issue: Vol. 23, no. S10, pp. 1–24

Abstract

Background: Gene expression analysis can provide useful information for analyzing complex biological mechanisms. However, many reported findings are unrepeatable due to small sample sizes relative to a large number of genes and the low signal-to-noise ratios of most gene expression datasets.

Results: Meta-analysis of multi-data sets is an efficient method for tackling the above problem. To improve the performance of meta-analysis, we propose a novel meta-analysis framework. It consists of two parts: (1) a novel data augmentation strategy. Various cross-platform normalization methods exist, which can preserve original biological information of gene expression datasets from different angles and add different “perturbations” to the dataset. Using such perturbation, we provide a feasible means for gene expression data augmentation; (2) elastic data shared lasso (DSL-$L_2$). The DSL-$L_2$ method spans the continuum between individual models for each dataset and one model for all datasets. It also overcomes the shortcomings of the data shared lasso method when dealing with highly correlated features. Comprehensive simulation experiment results show that the proposed method has high prediction and gene selection performance. We then apply the proposed method to non-small cell lung cancer (NSCLC) blood gene expression data in order to identify key tumor-related genes. The outcomes of our experiment indicate that the method could be used for identifying a set of robust disease-related gene signatures that may be used for NSCLC early diagnosis or prognosis or even targeting.

Conclusion: We propose a novel and effective meta-analysis method for biological research, extrapolating and integrating information from multiple gene expression datasets.
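The abstract describes DSL-$L_2$ as spanning the continuum between one model per dataset and one model for all datasets. The sketch below illustrates one common way such a model can be set up. It is not the authors' implementation, and the abstract does not give the exact penalty. It assumes that each dataset g has coefficients beta + delta_g, that the shared and dataset-specific parts are stacked into one augmented design matrix, and that an elastic-net-style penalty (L1 plus L2) is applied uniformly to the stacked coefficients. The function names, the weights `r`, the toy data, and the small coordinate-descent solver are all hypothetical choices made for illustration.

```python
def augment_design(X, groups, r):
    """Build Z = [X | X_1/r_1 | ... | X_G/r_G], zero outside a row's own group block.

    Assumption: r[g] > 0 weights the dataset-specific deviations of group g.
    """
    p = len(X[0])
    labels = sorted(set(groups))
    Z = []
    for row, g in zip(X, groups):
        z = list(row) + [0.0] * (p * len(labels))
        start = p * (1 + labels.index(g))
        for j, v in enumerate(row):
            z[start + j] = v / r[g]
        Z.append(z)
    return Z, labels


def soft_threshold(a, t):
    """Soft-thresholding operator used by the L1 part of the penalty."""
    if a > t:
        return a - t
    if a < -t:
        return a + t
    return 0.0


def objective(Z, y, theta, lam, alpha):
    """(1/2n) * RSS + lam * (alpha * L1 + (1 - alpha)/2 * squared L2)."""
    n = len(y)
    rss = sum((yi - sum(z * t for z, t in zip(zi, theta))) ** 2
              for zi, yi in zip(Z, y))
    l1 = sum(abs(t) for t in theta)
    l2 = sum(t * t for t in theta)
    return rss / (2 * n) + lam * (alpha * l1 + 0.5 * (1 - alpha) * l2)


def elastic_net_cd(Z, y, lam, alpha, n_iter=200):
    """Plain cyclic coordinate descent for the objective above (illustrative only)."""
    n, m = len(y), len(Z[0])
    theta = [0.0] * m
    resid = list(y)
    col_sq = [sum(Z[i][j] ** 2 for i in range(n)) / n for j in range(m)]
    for _ in range(n_iter):
        for j in range(m):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the partial residual (column j's effect added back).
            rho = sum(Z[i][j] * (resid[i] + Z[i][j] * theta[j]) for i in range(n)) / n
            new = soft_threshold(rho, lam * alpha) / (col_sq[j] + lam * (1 - alpha))
            delta = new - theta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= Z[i][j] * delta
                theta[j] = new
    return theta


# Toy data (made up): two datasets "A" and "B", two features each.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0],
     [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
groups = ["A", "A", "A", "B", "B", "B"]
y = [1.0, 0.0, 1.0, 1.0, 2.0, 3.0]
r = {"A": 0.5, "B": 0.5}  # hypothetical weights

Z, labels = augment_design(X, groups, r)
theta = elastic_net_cd(Z, y, lam=0.1, alpha=0.5)
p = len(X[0])
beta_shared = theta[:p]
# Undo the 1/r scaling to recover each dataset's deviation from the shared part.
deltas = {g: [t / r[g] for t in theta[p * (k + 1): p * (k + 2)]]
          for k, g in enumerate(labels)}
```

In this parameterization, larger `r[g]` makes dataset-specific deviations more expensive and pushes the fit toward a single pooled model, while smaller `r[g]` lets each dataset follow its own coefficients. The L2 term is what is generally expected to help when features are highly correlated, which is the shortcoming of the plain data shared lasso that the abstract mentions.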

Published in BMC Bioinformatics

ISSN: 1471-2105 (Online)
Publisher: BMC
Country of publisher: United Kingdom
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics; Science: Biology (General)
Website: http://www.biomedcentral.com/bmcbioinformatics/
