IEEE Access (Jan 2023)

A New Approach Based on Feature Selection of Light Gradient Boosting Machine and Transformer to Predict circRNA-Disease Associations

  • Chen Ma,
  • Yuhong Chi,
  • Donglai Hao,
  • Xiongfei Ji

DOI
https://doi.org/10.1109/ACCESS.2023.3275967
Journal volume & issue
Vol. 11
pp. 47187 – 47201

Abstract

Circular RNA (circRNA) is a type of single-stranded RNA with a closed circular structure. Recent studies have shown that circRNA is structurally more stable than its linear counterparts. It has become a biological marker in medicine and plays a crucial role in disease prediction. However, traditional biological experiments are often time-consuming and laborious, so more researchers are turning to computational approaches to predict circRNA-disease associations more rapidly and reliably. In this paper, we propose a novel method (LGFRCDA) for predicting circRNA-disease associations, based on feature selection with the Light Gradient Boosting Machine (LightGBM) and a self-attention neural network, the Transformer. First, the histogram-based decision tree algorithm in LightGBM discretizes the continuous floating-point circRNA-disease features into integer-valued histogram bins. While traversing samples, the difference between histograms is used to optimize the calculation, greatly improving the construction speed. Then a leaf-wise algorithm is employed to find the node with the maximum split gain, resulting in the final feature vector. Finally, these features are sorted in order of importance and fed into the Transformer for information fusion and prediction. Our study demonstrates that, after feature processing and dimension reduction, LGFRCDA achieved an AUC (Area Under the receiver operating characteristic Curve) of 95.44%, which is 3.11% higher than the latest algorithms on the same dataset. We also searched the published literature to corroborate the predicted results. Of the top 15 circRNA-disease pairs predicted by the LGFRCDA model, 13 were confirmed by existing literature. These results indicate that the proposed model is suitable for predicting circRNA-disease associations and can provide reliable candidates for biological experiments.
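
The histogram step mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and not LightGBM's internal code: the equal-width binning, the function names (`bin_feature`, `build_histogram`, `subtract_histograms`), and the toy values are all assumptions made for illustration. It shows only the general idea that a sibling node's histogram can be obtained by subtracting one child's histogram from the parent's, instead of traversing that sibling's samples again.

```python
from typing import List


def bin_feature(values: List[float], n_bins: int) -> List[int]:
    """Map continuous values to integer bin indices (equal-width bins, an assumption)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        width = 1.0  # all values identical: everything falls in bin 0
    # Clamp so the maximum value lands in the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def build_histogram(bins: List[int], gradients: List[float], n_bins: int) -> List[list]:
    """Accumulate [gradient sum, sample count] for each bin."""
    hist = [[0.0, 0] for _ in range(n_bins)]
    for b, g in zip(bins, gradients):
        hist[b][0] += g
        hist[b][1] += 1
    return hist


def subtract_histograms(parent: List[list], child: List[list]) -> List[list]:
    """Histogram of the sibling node: parent minus the already-built child."""
    return [[pg - cg, pc - cc] for (pg, pc), (cg, cc) in zip(parent, child)]


# Toy data (hypothetical): one feature column and per-sample gradients.
values = [0.1, 0.4, 0.35, 0.8, 0.95, 0.6]
gradients = [0.5, -0.25, 0.25, -0.5, 1.0, -1.0]
n_bins = 4

bins = bin_feature(values, n_bins)
parent_hist = build_histogram(bins, gradients, n_bins)

# Suppose a split sends samples 0..2 to the left child and 3..5 to the right.
left_hist = build_histogram(bins[:3], gradients[:3], n_bins)

# Right child by subtraction, without traversing its samples.
right_by_subtraction = subtract_histograms(parent_hist, left_hist)

# Direct construction, used here only to check the subtraction result.
right_direct = build_histogram(bins[3:], gradients[3:], n_bins)
```

In this sketch only the smaller child needs a pass over its samples; the other child costs one subtraction per bin, which is the kind of saving the abstract attributes to the difference between histograms.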

Keywords