BMC Bioinformatics (Aug 2022)

A novel meta-analysis based on data augmentation and elastic data shared lasso regularization for gene expression

  • Hai-Hui Huang,
  • Hao Rao,
  • Rui Miao,
  • Yong Liang

DOI
https://doi.org/10.1186/s12859-022-04887-5
Journal volume & issue
Vol. 23, no. S10
pp. 1 – 24

Abstract


Background: Gene expression analysis can provide useful information for analyzing complex biological mechanisms. However, many reported findings are not reproducible, because sample sizes are small relative to the large number of genes and most gene expression datasets have low signal-to-noise ratios.

Results: Meta-analysis of multiple datasets is an efficient way to tackle this problem. To improve the performance of meta-analysis, we propose a novel meta-analysis framework consisting of two parts. (1) A novel data augmentation strategy. Various cross-platform normalization methods exist; each preserves the original biological information of gene expression datasets from a different angle and adds a different “perturbation” to the dataset. Using such perturbations, we provide a feasible means of gene expression data augmentation. (2) Elastic data shared lasso (DSL-$L_2$). The DSL-$L_2$ method spans the continuum between individual models for each dataset and one model for all datasets. It also overcomes the shortcomings of the data shared lasso method when dealing with highly correlated features. Comprehensive simulation experiments show that the proposed method achieves high prediction and gene selection performance. We then apply the proposed method to non-small cell lung cancer (NSCLC) blood gene expression data to identify key tumor-related genes. The results indicate that the method could be used to identify a set of robust disease-related gene signatures that may be used for NSCLC early diagnosis, prognosis, or even targeting.

Conclusion: We propose a novel and effective meta-analysis method for biological research that extrapolates and integrates information from multiple gene expression datasets.
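
The "continuum between individual models for each dataset and one model for all datasets" can be illustrated with the augmented design matrix commonly used for the data shared lasso. The sketch below is not taken from the paper: it assumes each dataset g has coefficients β + Δ_g, with a hypothetical per-dataset penalty weight r_g on Δ_g, and the function name `shared_design` is invented for illustration. Under that assumption, larger r_g pushes the fit toward one pooled model and smaller r_g toward individual models.

```python
import numpy as np

def shared_design(X_list, r_list):
    """Stack per-dataset matrices into one augmented design matrix.

    Assumed (illustrative) parameterization: dataset g uses coefficients
    beta + delta_g. Dividing the delta_g block by r_g means that a plain
    L1 penalty on the augmented coefficients corresponds to a penalty of
    r_g * |delta_g| on the original scale.
    """
    n_groups = len(X_list)
    rows = []
    for g, (X, r) in enumerate(zip(X_list, r_list)):
        # First block: shared coefficients beta, seen by every dataset.
        blocks = [X]
        # One block per dataset: non-zero only for the dataset's own rows.
        for h in range(n_groups):
            blocks.append(X / r if h == g else np.zeros_like(X, dtype=float))
        rows.append(np.hstack(blocks))
    return np.vstack(rows)

# Toy example: two datasets measured on the same 3 genes.
X1 = np.ones((2, 3))
X2 = 2.0 * np.ones((4, 3))
Z = shared_design([X1, X2], r_list=[1.0, 2.0])
print(Z.shape)  # rows: 2 + 4 samples; columns: 3 genes * (1 shared + 2 dataset blocks)
```

Fitting any elastic-net solver on `Z` with the stacked responses (for example scikit-learn's `ElasticNet`) would give one rough analogue of adding an L2 term to the data shared lasso. The exact DSL-$L_2$ objective is defined in the full paper and may differ, in particular in how the L2 term is weighted across datasets.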

Keywords