IEEE Access (Jan 2021)
Adjacent Inputs With Different Labels and Hardness in Supervised Learning
Abstract
An important aspect of designing effective machine learning algorithms is the complexity analysis of classification problems. In this paper, we study the relation between the number of adjacent inputs with different labels and the number of examples required to induce a classification model. To this end, we first quantify adjacent inputs with different labels as a property, using a measure denoted Neighbour Input Variation (NIV). We analyze the relation of NIV to random data and overfitting. We then demonstrate that a threshold of NIV may determine whether a classification model can generalize to unseen data. We also present a case study analyzing threshold neural networks and the required size of the first hidden layer as a function of NIV. Finally, we perform experiments with five popular algorithms, analyzing the relation between NIV and the classification error on problems with few dimensions. We conclude that functions whose similar inputs have different outputs with high probability considerably reduce the generalization capacity of classification algorithms.
Keywords