Graph-Based Semi-Supervised Learning with Bipartite Graph for Large-Scale Data and Prediction of Unseen Data

Mohammad Alemi; Alireza Bosaghzadeh; Fadi Dornaika

doi:10.3390/info15100591

Information (Sep 2024)

Affiliations

Mohammad Alemi: Department of Computer Engineering, Shahid Rajaee Teacher Training University, Tehran 16785-163, Iran
Alireza Bosaghzadeh: Department of Computer Engineering, Shahid Rajaee Teacher Training University, Tehran 16785-163, Iran
Fadi Dornaika: Faculty of Computer Engineering, University of the Basque Country, 20018 San Sebastian, Spain

DOI: https://doi.org/10.3390/info15100591
Journal volume & issue: Vol. 15, no. 10, p. 591

Abstract

Graph-based semi-supervised learning (GSSL) has recently attracted considerable attention as an effective approach to data labeling. Despite the progress of current methods, several limitations persist. First, many studies give all samples equal weight and influence, disregarding the potentially greater importance of samples near decision boundaries. Second, detecting outlier-labeled data is crucial, because such samples can significantly affect model performance. Third, existing models often struggle to predict labels for unseen test data, which restricts their utility in practical applications. Finally, most graph-based algorithms rely on affinity matrices that capture pairwise similarities across all data points, which limits their scalability to large-scale databases. In this paper, we propose a novel GSSL algorithm tailored to large-scale databases that uses anchor points to mitigate the challenges posed by large affinity matrices. Our method also increases the influence of nodes near decision boundaries by assigning them different weights according to their importance, and it uses a mapping function from feature space to label space. This mapping function enables direct label prediction for test samples without an iterative learning process. Experimental evaluations on two large datasets (Norb and Covtype) demonstrate that our approach is scalable and outperforms existing GSSL methods.
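To make the anchor idea in the abstract concrete, the following is a minimal sketch of a generic anchor-graph approach, not the authors' algorithm. It assumes Gaussian sample-to-anchor affinities, anchors taken as a subset of the data, and a closed-form regularized least-squares fit of anchor labels. It omits the boundary-aware sample weights and the feature-to-label mapping function described above. All function names, parameters (`s`, `sigma`, `gamma`, `ridge`), and the toy data are illustrative assumptions.

```python
import numpy as np

def anchor_affinities(X, anchors, s=3, sigma=1.0):
    """Sample-to-anchor affinity matrix Z of shape (n, m); each row sums to 1."""
    # Squared Euclidean distances between every sample and every anchor.
    # (For truly large n this would be computed in chunks.)
    d2 = ((X[:, None, :] - anchors[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :s]        # s closest anchors per sample
    rows = np.arange(X.shape[0])[:, None]
    d2_near = d2[rows, nearest]
    # Shift by the row minimum for numerical stability; it cancels on normalization.
    k = np.exp(-(d2_near - d2_near[:, :1]) / (2.0 * sigma ** 2))
    Z = np.zeros_like(d2)
    Z[rows, nearest] = k / k.sum(axis=1, keepdims=True)
    return Z

def fit_anchor_labels(Z, labeled_idx, y_labeled, n_classes, gamma=0.01, ridge=1e-6):
    """Solve for soft labels A (m, c) on the anchors; no n x n matrix is formed."""
    Z_l = Z[labeled_idx]
    Y_l = np.eye(n_classes)[y_labeled]             # one-hot labels
    degree = np.maximum(Z.sum(axis=0), 1e-12)      # anchor degrees
    ZtZ = Z.T @ Z
    # Reduced (m x m) Laplacian of the graph W = Z diag(1/degree) Z^T.
    L_reduced = ZtZ - (ZtZ / degree) @ ZtZ
    lhs = Z_l.T @ Z_l + gamma * L_reduced + ridge * np.eye(Z.shape[1])
    return np.linalg.solve(lhs, Z_l.T @ Y_l)

def predict(X_new, anchors, A, s=3, sigma=1.0):
    """Label unseen samples directly from their anchor affinities."""
    return np.argmax(anchor_affinities(X_new, anchors, s, sigma) @ A, axis=1)

# Toy data (hypothetical): two well-separated 2-D blobs, 5 labels per class.
rng = np.random.default_rng(0)
centers = np.array([[-3.0, 0.0], [3.0, 0.0]])
X = np.vstack([c + 0.5 * rng.standard_normal((200, 2)) for c in centers])
anchors = X[::20]                                  # 20 anchors, 10 per blob
labeled_idx = np.array([0, 1, 2, 3, 4, 200, 201, 202, 203, 204])
y_labeled = np.array([0] * 5 + [1] * 5)

Z = anchor_affinities(X, anchors)
A = fit_anchor_labels(Z, labeled_idx, y_labeled, n_classes=2)

y_new = np.array([0] * 50 + [1] * 50)
X_new = centers[y_new] + 0.5 * rng.standard_normal((100, 2))
pred = predict(X_new, anchors, A)
print("accuracy on unseen samples:", (pred == y_new).mean())
```

In this sketch the only matrices that grow with the number of samples n are the n x m affinity matrix and its labeled rows; the linear system is m x m, where m is the number of anchors. An unseen sample is labeled with one affinity computation and one matrix product, with no further propagation.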

Published in Information

ISSN: 2078-2489 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Technology (General): Industrial engineering. Management engineering: Information technology
Website: http://www.mdpi.com/journal/information/
