IEEE Access (Jan 2021)
Kernel Matrix Approximation on Class-Imbalanced Data With an Application to Scientific Simulation
Abstract
Generating low-rank approximations of the kernel matrices that arise in nonlinear machine learning techniques can significantly alleviate memory and computational burdens. A compelling approach centers on finding a concise set of exemplars or landmarks to reduce the number of similarity-measure evaluations from quadratic to linear in the data size. However, a key challenge is balancing the tradeoff between landmark quality and resource consumption. Despite the volume of research in this area, little is known about how landmark selection techniques perform on class-imbalanced data sets, which are becoming increasingly prevalent in many applications. Hence, this paper provides a comprehensive empirical investigation using several real-world imbalanced data sets, including scientific data, evaluating the quality of the approximate low-rank decompositions and examining their influence on the accuracy of downstream tasks. Furthermore, we present a new landmark selection technique called Distance-based Importance Sampling and Clustering (DISC), which computes relative importance scores to improve accuracy-efficiency tradeoffs over existing approaches ranging from probabilistic sampling to clustering methods. The proposed landmark selection method follows a coarse-to-fine strategy to capture the intrinsic structure of complex data sets, allowing us to substantially reduce the computational complexity and memory footprint with minimal loss in accuracy.
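To make the landmark idea concrete, the sketch below shows a standard Nyström-style low-rank approximation of an RBF kernel matrix, where the n x n kernel is never formed and only n*m similarity evaluations are needed for m landmarks. This is a minimal illustration under assumed choices (RBF kernel, k-means centers as a coarse landmark-selection step); it is not the paper's DISC method, and the function names and parameters are hypothetical.

```python
# Illustrative Nystrom-style approximation (not the paper's DISC implementation).
# Landmarks are chosen here with plain k-means; kernel work scales with n*m
# instead of n^2 because the full n x n kernel matrix is never materialized.
import numpy as np
from sklearn.cluster import KMeans

def rbf_kernel(A, B, gamma=1.0):
    # Pairwise RBF similarities between rows of A and rows of B.
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def nystrom_approximation(X, m=50, gamma=1.0, seed=0):
    # Coarse landmark selection: use k-means cluster centers as exemplars.
    landmarks = KMeans(n_clusters=m, n_init=10, random_state=seed).fit(X).cluster_centers_
    C = rbf_kernel(X, landmarks, gamma)          # n x m cross-similarities
    W = rbf_kernel(landmarks, landmarks, gamma)  # m x m landmark kernel
    # K is approximated by C @ pinv(W) @ C.T, stored in factored form.
    return C, np.linalg.pinv(W)

X = np.random.RandomState(0).randn(1000, 10)
C, W_pinv = nystrom_approximation(X, m=50)
K_approx = C @ W_pinv @ C.T  # formed here only to illustrate the factorization
```

The accuracy-efficiency tradeoff discussed in the abstract hinges on how the m landmarks are chosen; DISC replaces the purely cluster-based step above with distance-based importance scores in a coarse-to-fine procedure.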
Keywords