Information (Sep 2024)
Graph-Based Semi-Supervised Learning with Bipartite Graph for Large-Scale Data and Prediction of Unseen Data
Abstract
Recently, considerable attention has been directed toward graph-based semi-supervised learning (GSSL) as an effective approach for data labeling. Despite the progress achieved by current methodologies, several limitations persist. Firstly, many studies treat all samples equally in terms of weight and influence, disregarding the potential increased importance of samples near decision boundaries. Secondly, the detection of outlier-labeled data is crucial, as it can significantly impact model performance. Thirdly, existing models often struggle with predicting labels for unseen test data, restricting their utility in practical applications. Lastly, most graph-based algorithms rely on affinity matrices that capture pairwise similarities across all data points, thus limiting their scalability to large-scale databases. In this paper, we propose a novel GSSL algorithm tailored for large-scale databases, leveraging anchor points to mitigate the challenges posed by large affinity matrices. Additionally, our method enhances the influence of nodes near decision boundaries by assigning different weights based on their importance and using a mapping function from feature space to label space. Leveraging this mapping function enables direct label prediction for test samples without requiring iterative learning processes. Experimental evaluations on two extensive datasets (Norb and Covtype) demonstrate that our approach is scalable and outperforms existing GSSL methods in terms of performance metrics.
Keywords