IEEE Access (Jan 2022)

Extended a Priori Probability (EAPP): A Data-Driven Approach for Machine Learning Binary Classification Tasks

  • Vicent Ortiz Castello,
  • Francisco Javier Perez-Benito,
  • Omar Del Tejo Catala,
  • Ismael Salvador Igual,
  • Rafael Llobet,
  • Juan-Carlos Perez-Cortes

DOI
https://doi.org/10.1109/ACCESS.2022.3221936
Journal volume & issue
Vol. 10
pp. 120074 – 120085

Abstract


The a priori probability of a dataset is commonly used as a baseline when assessing an algorithm’s accuracy on a binary classification task. ZeroR is the simplest algorithm of this kind: it predicts the majority class for every example. However, such an approach has no predictive power and ignores other dataset features that could yield a more demanding baseline. In this paper, we present the Extended A Priori Probability (EAPP), a novel semi-supervised baseline metric for binary classification tasks that considers not only the a priori probability but also possible bias present in the dataset, as well as other features that could make the target classes relatively trivial to separate. The approach is based on the area under the ROC curve (AUC ROC), which is known to be quite insensitive to class imbalance. The procedure involves multiobjective feature extraction and clustering in the input space with autoencoders, followed by a combinatorial weighted assignment of clusters to classes that depends on the distance to the nearest clusters of each class. Class labels are then assigned so as to find the combination that maximizes AUC ROC for each number of clusters considered. To avoid overfitting in the combined feature extraction and clustering method, a cross-validation scheme is applied in each case. EAPP is defined for different numbers of clusters, starting from the inverse of the minority class proportion, which allows a fair comparison among datasets with different degrees of imbalance. A high EAPP usually indicates an easy binary classification task, but when the task is known beforehand to be difficult, it may instead reveal a significant coarse-grained bias in the dataset. This metric provides a baseline beyond the a priori probability for assessing the actual capabilities of binary classification models.
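
To make the contrast between the ZeroR baseline and a cluster-based baseline concrete, the following is a minimal sketch in plain Python. It is not the paper's algorithm: it assumes cluster ids are already given (the autoencoder-based feature extraction and clustering are omitted), it uses a hard cluster-to-class assignment rather than the distance-weighted one described above, and it skips cross-validation. All function names, the toy data, and the choice to round the starting cluster count up are illustrative assumptions.

```python
from itertools import product
from math import ceil


def zero_r_accuracy(labels):
    """Accuracy of ZeroR: always predict the majority class."""
    positives = sum(labels)
    return max(positives, len(labels) - positives) / len(labels)


def auc_roc(labels, scores):
    """AUC ROC as the probability that a random positive outscores a
    random negative, counting ties as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def starting_cluster_count(labels):
    """Inverse of the minority class proportion.
    Rounding up to an integer is an assumption of this sketch."""
    minority = min(sum(labels), len(labels) - sum(labels))
    return ceil(len(labels) / minority)


def best_hard_assignment(labels, cluster_ids):
    """Try every cluster-to-class mapping and keep the one with the
    highest AUC ROC (hard assignment, a simplification)."""
    clusters = sorted(set(cluster_ids))
    best_auc, best_map = 0.0, None
    for classes in product((0, 1), repeat=len(clusters)):
        mapping = dict(zip(clusters, classes))
        scores = [mapping[c] for c in cluster_ids]
        auc = auc_roc(labels, scores)
        if auc > best_auc:
            best_auc, best_map = auc, mapping
    return best_auc, best_map


# Toy data: 8 negatives, 2 positives; cluster ids are made up for illustration.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
cluster_ids = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]

zero_r_acc = zero_r_accuracy(labels)
zero_r_auc = auc_roc(labels, [0] * len(labels))  # constant majority-class score
k_start = starting_cluster_count(labels)
best_auc, best_map = best_hard_assignment(labels, cluster_ids)
```

On this toy data ZeroR reaches an accuracy of 0.8 but an AUC ROC of only 0.5, whereas the best hard assignment of the three hypothetical clusters reaches an AUC ROC of 0.875. This illustrates why a baseline built on AUC ROC and coarse cluster structure can be more demanding than the a priori probability alone.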

Keywords