IEEE Access (Jan 2024)
Prediction of Protein-Coding Small ORFs Based on Supervised Contrastive Learning
Abstract
Predicting protein-coding small open reading frames (sORFs) is a fundamental task in bioinformatics, because coding sORFs play an indispensable role in biological processes. A variety of tools have been developed for this task. However, existing approaches neglect the concordance among similar sORFs and the discordance among dissimilar ones, and they require explicit extraction of sORF features to distinguish coding from non-coding sORFs, which limits their generality and increases their complexity. To address these limitations, we design CLsORFs, an end-to-end framework based on supervised contrastive learning. The framework automatically exploits both the local features and the global logical correlation features of sORFs, and captures the similarities and differences between sORFs through supervised contrastive learning. We comprehensively evaluate CLsORFs on multiple datasets from different species. The experimental results show that CLsORFs performs well in multi-species prediction and outperforms other state-of-the-art methods in protein-coding sORF prediction. In cross-species prediction, CLsORFs improves the Matthews correlation coefficient (MCC), especially on eukaryotic datasets. Moreover, CLsORFs performs well on datasets from other species, which supports both the effectiveness of contrastive learning for protein-coding sORF prediction and the generalization ability of the model.
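To illustrate how supervised contrastive learning can pull together sORFs of the same class and push apart sORFs of different classes, the following is a minimal NumPy sketch of a commonly used supervised contrastive loss. It is not necessarily the exact loss used in CLsORFs; the function name `supervised_contrastive_loss`, the temperature value, and the toy two-dimensional embeddings are assumptions made for illustration only.

```python
import numpy as np

def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Illustrative supervised contrastive loss over one batch.

    embeddings: (n, d) array of sORF representations (hypothetical encoder output).
    labels:     (n,) array of class labels, e.g. 1 = coding, 0 = non-coding.
    """
    z = np.asarray(embeddings, dtype=float)
    y = np.asarray(labels)
    n = len(y)

    # L2-normalise so that the dot product is a cosine similarity.
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = (z @ z.T) / temperature

    # Exclude each sample's similarity with itself.
    not_self = ~np.eye(n, dtype=bool)
    sim = np.where(not_self, sim, -np.inf)

    # Positives: other samples in the batch that share the anchor's label.
    positives = (y[:, None] == y[None, :]) & not_self
    n_pos = positives.sum(axis=1)
    has_pos = n_pos > 0
    if not has_pos.any():
        raise ValueError("batch needs at least two samples with the same label")

    # Numerically stable log-softmax over all other samples in the batch.
    row_max = sim.max(axis=1, keepdims=True)
    log_denominator = np.log(np.exp(sim - row_max).sum(axis=1, keepdims=True))
    log_prob = sim - row_max - log_denominator

    # Average log-probability of the positives, per anchor that has any.
    pos_log_prob = np.where(positives, log_prob, 0.0).sum(axis=1)
    return float(-(pos_log_prob[has_pos] / n_pos[has_pos]).mean())

# Toy embeddings: the first two vectors are close, as are the last two.
toy_embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
loss_clustered = supervised_contrastive_loss(toy_embeddings, [1, 1, 0, 0])
loss_mixed = supervised_contrastive_loss(toy_embeddings, [1, 0, 1, 0])
```

In this toy setting the loss is lower when nearby embeddings share a label (`loss_clustered`) than when they do not (`loss_mixed`), which is the behaviour that training on such an objective encourages.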
Keywords