IEEE Access (Jan 2023)

Semi-Supervised Gaussian Processes Active Learning Model for Imbalanced Small Data Based on Tri-Training With Data Enhancement

  • Chenxiao Zhou,
  • Lianying Zou

DOI
https://doi.org/10.1109/ACCESS.2023.3244682
Journal volume & issue
Vol. 11
pp. 17510 – 17524

Abstract

To address imbalanced small-sample datasets that contain only a few labeled samples, a semi-supervised Gaussian process active learning model based on improved tri-training with enhanced data is proposed. First, the labeled samples are balanced and enhanced, and a quantitative evaluation criterion for enhanced data, based on the Jensen-Shannon (JS) distance and the similarity of information entropy between the enhanced data and the original data, is presented to select the best enhanced data. Second, an improved semi-supervised learning method based on tri-training is proposed to find unlabeled samples with high confidence, which increases the certainty of the labeled sample set. To ensure that the three tri-training classifiers are both diverse and robust, random forest is introduced to divide the features of the dataset into three groups of equal contribution, and each classifier is trained on a different combination of two feature groups. Third, to query and classify the most informative unlabeled samples more precisely, an active learning method based on the Gaussian process and the JS distribution range is constructed. Because the unlabeled samples predicted by active learning have high uncertainty, the similarity distribution range of the JS distance is introduced to compare the similarity between unlabeled and labeled samples in the active learning classifier, so that the model can classify more diverse samples. The experimental results show that, compared with several traditional models, the proposed model performs better on artificial datasets and on imbalanced small-size UCI datasets.
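The first step of the abstract, choosing the enhanced data that stays closest to the original data, can be illustrated with a minimal sketch. It is not the authors' implementation: it covers only the JS distance part of the criterion (the information entropy similarity is omitted), it assumes one-dimensional samples compared through equal-width histograms, and the names `histogram`, `js_distance`, `select_closest` and the `bins` parameter are hypothetical.

```python
import math


def histogram(values, bins, lo, hi):
    """Normalized equal-width histogram of `values` over [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        # Degenerate range: every value falls into the first bin.
        idx = 0 if width == 0 else min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return [c / len(values) for c in counts]


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_distance(p, q):
    """Jensen-Shannon distance: square root of the JS divergence (base 2, in [0, 1])."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    js_div = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(js_div, 0.0))


def select_closest(original, candidates, bins=5):
    """Return (name, distance) of the candidate sample closest to `original`.

    Assumption for illustration: a smaller JS distance means a better
    enhanced sample. The paper combines this with an entropy term.
    """
    best_name, best_dist = None, float("inf")
    for name, sample in candidates.items():
        # Shared bin edges so both histograms are comparable.
        lo = min(min(original), min(sample))
        hi = max(max(original), max(sample))
        d = js_distance(histogram(original, bins, lo, hi),
                        histogram(sample, bins, lo, hi))
        if d < best_dist:
            best_name, best_dist = name, d
    return best_name, best_dist
```

With base-2 logarithms the distance is 0 for identical distributions and 1 for distributions with disjoint support, which makes it convenient as a bounded score when ranking candidate enhanced datasets.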

Keywords