Complexity curve: a graphical measure of data complexity and classifier performance

Julian Zubek; Dariusz M. Plewczynski

doi:10.7717/peerj-cs.76

PeerJ Computer Science (Aug 2016)

Affiliations

Julian Zubek: Centre of New Technologies, University of Warsaw, Warsaw, Poland
Dariusz M. Plewczynski: Centre of New Technologies, University of Warsaw, Warsaw, Poland

DOI: https://doi.org/10.7717/peerj-cs.76
Journal volume & issue: Vol. 2, p. e76

Abstract

We describe a method for assessing data set complexity based on the estimation of the underlying probability distribution and the Hellinger distance. In contrast to some popular complexity measures, it focuses not on the shape of the decision boundary in a classification task but on the amount of available data with respect to the attribute structure. Complexity is expressed as a graphical plot, which we call the complexity curve. It shows the relative increase of available information as the sample size grows. We examine the properties of the introduced complexity measure theoretically and experimentally and show its relation to the variance component of classification error. We then compare it with popular data complexity measures on 81 diverse data sets and show that it can help explain the performance of specific classifiers on these sets. We also apply our methodology to a panel of simple benchmark data sets, demonstrating how it can be used in practice to gain insights into data characteristics. Moreover, we show that the complexity curve is an effective tool for reducing the size of the training set (data pruning), significantly speeding up the learning process without compromising classification accuracy. The associated code is available for download at: https://github.com/zubekj/complexity_curve (open source Python implementation).
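
The idea behind the curve can be illustrated with a minimal sketch. It is not the authors' implementation (see the repository above for that). It assumes a single discrete attribute, and the function names `hellinger`, `empirical_distribution` and `complexity_curve_sketch`, as well as the parameters `sizes`, `n_draws` and `seed`, are hypothetical and chosen only for illustration. For each subset size, random subsets are drawn, the distribution of the attribute is estimated from each subset, and the mean Hellinger distance to the distribution estimated from the whole data set is recorded.

```python
import numpy as np


def hellinger(p, q):
    """Hellinger distance between two discrete distributions on the same support."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))


def empirical_distribution(values, support):
    """Relative frequency of each value of `support` in `values`."""
    counts = np.array([np.sum(values == v) for v in support], dtype=float)
    return counts / counts.sum()


def complexity_curve_sketch(values, sizes, n_draws=20, seed=0):
    """Mean Hellinger distance between subset and full-data distributions.

    Assumption: one discrete attribute only; this is an illustrative
    simplification, not the procedure from the paper.
    """
    rng = np.random.default_rng(seed)
    values = np.asarray(values)
    support = np.unique(values)
    full = empirical_distribution(values, support)
    curve = []
    for n in sizes:
        distances = []
        for _ in range(n_draws):
            # Draw a random subset of size n without replacement.
            subset = rng.choice(values, size=n, replace=False)
            distances.append(
                hellinger(empirical_distribution(subset, support), full)
            )
        curve.append(float(np.mean(distances)))
    return curve


if __name__ == "__main__":
    data = np.array([0, 0, 1, 1, 2, 2, 2, 2] * 10)
    for size, value in zip([5, 20, 80], complexity_curve_sketch(data, [5, 20, 80])):
        print(size, value)
```

Plotting the returned values against the subset sizes gives a curve that reaches zero when the subset is the whole data set. In this simplified setting, a curve that falls quickly suggests that a small sample already captures most of the attribute's distribution.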

Published in PeerJ Computer Science

ISSN: 2376-5992 (Online)
Publisher: PeerJ Inc.
Country of publisher: United States
LCC subjects: Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://peerj.com/computer-science/
