IEEE Access (Jan 2020)

ItLnc-BXE: A Bagging-XGBoost-Ensemble Method With Comprehensive Sequence Features for Identification of Plant lncRNAs

  • Guangyan Zhang,
  • Ziru Liu,
  • Jichen Dai,
  • Zilan Yu,
  • Shuai Liu,
  • Wen Zhang

DOI
https://doi.org/10.1109/ACCESS.2020.2985114
Journal volume & issue
Vol. 8
pp. 68811 – 68819

Abstract

Because long non-coding RNAs (lncRNAs) are involved in a wide range of cellular and developmental processes, an increasing number of methods have been proposed for distinguishing lncRNAs from coding RNAs. However, most existing methods are designed for lncRNAs in animal systems, and only a few focus on plant lncRNA identification. Plant lncRNAs have characteristics distinct from those of animal lncRNAs, so a computational method for accurate and robust identification of plant lncRNAs is desirable. Herein, we present ItLnc-BXE, a plant lncRNA identification method that combines comprehensive sequence features with an ensemble learning strategy. First, a diverse set of sequence features is collected and filtered by feature selection to represent transcripts. Then, several base learners are trained and combined into a single meta-learner by ensemble learning, yielding an ItLnc-BXE model. ItLnc-BXE models are evaluated on datasets from six plant species; the results show that ItLnc-BXE outperforms other state-of-the-art plant lncRNA identification methods, achieving better and more robust performance (AUC > 95.91%). We also perform cross-species lncRNA identification experiments, and the results indicate that dicot-based and monocot-based models can accurately identify lncRNAs in lower plant species, such as mosses and algae. Source code and supplementary data are available at https://github.com/BioMedicalBigDataMiningLab/ItLnc-BXE.
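
The two-stage idea in the abstract (represent each transcript by sequence features, then combine several base learners trained on resampled data) can be illustrated with a minimal, standard-library-only sketch. Everything below is an assumption made for illustration and is not the authors' implementation: k-mer frequencies stand in for the paper's comprehensive feature set, a nearest-centroid classifier stands in for XGBoost, majority voting stands in for the paper's meta-learner, and the names `kmer_frequencies`, `CentroidLearner`, `bagging_fit`, and `bagging_predict` are hypothetical.

```python
import random
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=2):
    """Normalized k-mer frequencies over ACGU (one simple sequence feature)."""
    seq = seq.upper().replace("T", "U")
    kmers = ["".join(p) for p in product("ACGU", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers) or 1  # avoid division by zero
    return [counts[m] / total for m in kmers]


class CentroidLearner:
    """Stand-in base learner (nearest class centroid), NOT XGBoost."""

    def fit(self, X, y):
        self.centroids = {}
        for label in sorted(set(y)):
            rows = [x for x, lab in zip(X, y) if lab == label]
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict(self, x):
        def sq_dist(label):
            return sum((a - b) ** 2 for a, b in zip(x, self.centroids[label]))
        return min(self.centroids, key=sq_dist)


def bagging_fit(X, y, n_learners=5, seed=0):
    """Train each base learner on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(X)
    learners = []
    for _ in range(n_learners):
        idx = [rng.randrange(n) for _ in range(n)]
        learners.append(CentroidLearner().fit([X[i] for i in idx],
                                              [y[i] for i in idx]))
    return learners


def bagging_predict(learners, x):
    """Combine base learners by majority vote (an assumed aggregation rule)."""
    votes = Counter(learner.predict(x) for learner in learners)
    return votes.most_common(1)[0][0]


# Toy usage with made-up sequences and labels (1 = lncRNA, 0 = coding).
sequences = ["AUAUAUAUAU", "UAUAUAUAUA", "AUAUAUUAUA",
             "GCGCGCGCGC", "CGCGCGCGCG", "GCGCGCCGCG"]
labels = [1, 1, 1, 0, 0, 0]
features = [kmer_frequencies(s) for s in sequences]
ensemble = bagging_fit(features, labels)
prediction = bagging_predict(ensemble, kmer_frequencies("AUAUAUAUUA"))
```

In a realistic setting the stand-in learner would be replaced by a gradient-boosted tree model and the feature vector would be pruned by a feature selection step before training, as the abstract describes.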

Keywords