Biomolecules (Jun 2024)

mRCat: A Novel CatBoost Predictor for the Binary Classification of mRNA Subcellular Localization by Fusing Large Language Model Representation and Sequence Features

  • Xiao Wang,
  • Lixiang Yang,
  • Rong Wang

DOI
https://doi.org/10.3390/biom14070767
Journal volume & issue
Vol. 14, no. 7
p. 767

Abstract

The subcellular localization of messenger RNAs (mRNAs) is tightly linked to gene regulation and protein synthesis, and it offers insights into disease diagnosis and drug development in biomedicine. Several computational methods have been proposed to predict the subcellular localization of mRNAs, but their prediction accuracy remains limited. In this study, we propose mRCat, a predictor based on the gradient boosting tree algorithm, to predict whether an mRNA is localized in the nucleus or in the cytoplasm. The predictor first uses large language models to extract hidden information from the sequences and then integrates traditional sequence features, so that the two feature sets jointly characterize each mRNA sequence. Finally, it employs CatBoost as the base classifier to predict the subcellular localization of mRNAs. On an independent test set, mRCat achieved an accuracy of 0.761, an F1 score of 0.710, an MCC of 0.511, and an AUROC of 0.751. These results indicate that the method is more accurate and robust than other state-of-the-art methods. It is expected to offer useful insights for biomolecular research.
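
The feature fusion step described in the abstract can be illustrated with a minimal sketch. The abstract does not say which traditional sequence features or which language model are used, so everything here is an assumption for illustration only: normalized k-mer frequencies stand in for the sequence features, a short hard-coded vector stands in for a language model embedding, and the helper names `kmer_frequencies` and `fuse_features` are hypothetical.

```python
from itertools import product


def kmer_frequencies(seq, k=3):
    """Normalized k-mer frequencies over the alphabet ACGU, in a fixed order.

    Assumption: k-mer frequencies are only a stand-in for the paper's
    unspecified "traditional sequence features".
    """
    seq = seq.upper().replace("T", "U")  # accept DNA-style input as well
    kmers = ["".join(p) for p in product("ACGU", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows with ambiguous bases such as N
            counts[window] += 1
            total += 1
    return [counts[km] / total if total else 0.0 for km in kmers]


def fuse_features(embedding, seq, k=3):
    """Concatenate an embedding vector with the k-mer frequency vector."""
    return list(embedding) + kmer_frequencies(seq, k)


# Placeholder embedding (hypothetical values); a real pipeline would obtain
# this vector from a pretrained sequence language model.
embedding = [0.12, -0.40, 0.88, 0.05]
fused = fuse_features(embedding, "AUGGCUAUGGCC", k=2)
print(len(fused))  # 4 embedding values + 16 dinucleotide frequencies = 20
```

In a pipeline of the kind the abstract describes, one fused vector per mRNA would form a row of the training matrix for a gradient boosting classifier such as CatBoost, with the nucleus or cytoplasm label as the binary target.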

Keywords