PUStackNGly: Positive-Unlabeled and Stacking Learning for N-Linked Glycosylation Site Prediction

Alhasan Alkuhlani; Walaa Gad; Mohamed Roushdy; Abdel-Badeeh M. Salem

doi:10.1109/ACCESS.2022.3146395

IEEE Access (Jan 2022)

Affiliations

Alhasan Alkuhlani: Faculty of Computer and Information Technology, Sana’a University, Sana’a, Yemen
Walaa Gad: Faculty of Computer and Information Science, Ain Shams University, Cairo, Egypt
Mohamed Roushdy: Faculty of Computers and Information Technology, Future University in Egypt, New Cairo, Egypt
Abdel-Badeeh M. Salem: Faculty of Computer and Information Science, Ain Shams University, Cairo, Egypt

DOI: https://doi.org/10.1109/ACCESS.2022.3146395
Journal volume & issue: Vol. 10, pp. 12702–12713

Abstract

N-linked glycosylation is one of the most common protein post-translational modifications (PTMs) in humans, in which a glycan is attached to an asparagine (N) residue of the protein. It is involved in most biological processes and is associated with various human diseases such as diabetes, cancer, coronavirus, influenza, and Alzheimer’s. Accordingly, identifying N-linked glycosylation sites is beneficial to understanding the system and mechanism of glycosylation. Because identifying glycosylation sites experimentally is challenging, machine learning has become important for predicting them. This paper proposes a novel N-linked glycosylation predictor based on bagging positive-unlabeled (PU) learning and stacking ensemble machine learning (PUStackNGly). In the proposed PUStackNGly, comprehensive sequence-based and structure-based features are extracted using different feature extraction descriptors. Then, ensemble-based feature selection is employed to select the most significant and stable features. The ensemble bagging PU learning selects the reliable negative samples from the unlabeled samples using four supervised learning methods (support vector machines, random forest, logistic regression, and XGBoost). Then, stacking ensemble learning is applied using four base classifiers: logistic regression, artificial neural networks, random forest, and support vector machines. The experimental results show that PUStackNGly achieves promising predictive performance compared to supervised learning methods. Furthermore, the proposed PUStackNGly outperforms the existing N-linked glycosylation prediction tools on an independent dataset with 95.11% accuracy, 100% recall, 80.7% precision, 89.32% F1 score, 96.93% AUC, and 0.87 MCC.
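
As a quick consistency check on the reported metrics, the F1 score can be recomputed from the precision and recall given in the abstract, since F1 is their harmonic mean. The short sketch below is illustrative only and is not taken from the paper; the function name `f1_score` is a hypothetical helper introduced here.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both inputs are zero
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract for the independent dataset.
precision = 0.807
recall = 1.0

f1 = f1_score(precision, recall)
print(f"F1 = {f1 * 100:.2f}%")  # prints "F1 = 89.32%"
```

The recomputed value agrees with the 89.32% F1 score stated in the abstract.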

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
