Mathematics (Mar 2022)

A Novel Low-Query-Budget Active Learner with Pseudo-Labels for Imbalanced Data

  • Alaa Tharwat,
  • Wolfram Schenck

DOI
https://doi.org/10.3390/math10071068
Journal volume & issue
Vol. 10, no. 7
p. 1068

Abstract

Despite the availability of large amounts of free unlabeled data, collecting sufficient training data for supervised learning models is challenging because labeling is time-consuming and costly. The active learning technique presented here addresses this by querying a small but highly informative set of unlabeled points. This is intended to give good generalization across the instance space and thereby improve classification performance on unseen test data. Most active learners query either the most informative or the most representative points for annotation. The proposed algorithm combines these two criteria in two phases: exploration and exploitation. The exploration phase explores the instance space by visiting new regions at each iteration. The exploitation phase selects highly informative points in uncertain regions. Without any prior knowledge, such as initial training data, these two phases improve the search strategy of the proposed algorithm so that it can explore the minority class space of imbalanced data with a small query budget. Further, some pseudo-labeled points geometrically located in trusted explored regions around the newly labeled points are added to the training data, but with lower weights than the original labeled points. These pseudo-labeled points play several roles in the model, such as (i) increasing the size of the training data and (ii) shrinking the version space by reducing the number of hypotheses that are consistent with the training data. Experiments on synthetic and real datasets with different imbalance ratios and dimensions show that the proposed algorithm has significant advantages over various well-known active learners.
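
The two-phase idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: farthest-first traversal stands in for the exploration phase, a nearest-neighbour distance margin stands in for the uncertainty criterion of the exploitation phase, and a fixed-radius ball stands in for the "trusted explored regions". The names `active_learn`, `radius`, `pseudo_weight`, and `explore_steps` are hypothetical and chosen only for illustration.

```python
import numpy as np

def farthest_point(X, queried, candidates):
    """Exploration stand-in: the candidate farthest from all queried points."""
    if not queried:
        # No initial training data: start at the point closest to the data mean.
        d = np.linalg.norm(X[candidates] - X.mean(axis=0), axis=1)
        return int(candidates[np.argmin(d)])
    d = np.linalg.norm(X[candidates][:, None, :] - X[queried][None, :, :], axis=2)
    return int(candidates[np.argmax(d.min(axis=1))])

def most_uncertain(X, train_idx, train_y, candidates):
    """Exploitation stand-in: the candidate most nearly equidistant from two classes."""
    train_idx, train_y = np.asarray(train_idx), np.asarray(train_y)
    classes = np.unique(train_y)
    if len(classes) < 2:
        return None  # only one class seen so far, so there is no boundary yet
    per_class = []
    for c in classes:
        pts = X[train_idx[train_y == c]]
        d = np.linalg.norm(X[candidates][:, None, :] - pts[None, :, :], axis=2)
        per_class.append(d.min(axis=1))
    dists = np.sort(np.stack(per_class, axis=1), axis=1)
    margin = dists[:, 1] - dists[:, 0]  # small margin = close to a class boundary
    return int(candidates[np.argmin(margin)])

def active_learn(X, oracle, budget, radius=0.3, pseudo_weight=0.5, explore_steps=None):
    """Query `budget` labels; return the queried indices and {index: (label, weight)}."""
    assert budget <= len(X)
    explore_steps = budget // 2 if explore_steps is None else explore_steps
    labels, queried = {}, []
    for t in range(budget):
        seen = set(queried)
        candidates = np.array([i for i in range(len(X)) if i not in seen])
        pick = None
        if t >= explore_steps:
            idx = list(labels)
            pick = most_uncertain(X, idx, [labels[i][0] for i in idx], candidates)
        if pick is None:
            pick = farthest_point(X, queried, candidates)
        y = oracle(pick)
        queried.append(pick)
        labels[pick] = (y, 1.0)  # true labels carry full weight
        # Pseudo-label still-unlabeled neighbours of the new point with a lower weight.
        for j in np.where(np.linalg.norm(X - X[pick], axis=1) <= radius)[0]:
            if int(j) not in labels:
                labels[int(j)] = (y, pseudo_weight)
    return queried, labels

# Toy imbalanced data (an assumption for the demo): 95 majority, 5 minority points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, (95, 2)), rng.normal(4.0, 0.3, (5, 2))])
y_true = np.array([0] * 95 + [1] * 5)
queried, labels = active_learn(X, lambda i: int(y_true[i]), budget=10)
print(len(queried), "queries,", len(labels) - len(queried), "pseudo-labels")
```

The returned weights could be passed to any classifier that accepts per-sample weights, so that pseudo-labeled points influence the model less than queried ones, as the abstract describes.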

Keywords