An Extensive Performance Comparison between Feature Reduction and Feature Selection Preprocessing Algorithms on Imbalanced Wide Data

Ismael Ramos-Pérez; José Antonio Barbero-Aparicio; Antonio Canepa-Oneto; Álvar Arnaiz-González; Jesús Maudes-Raedo

doi:10.3390/info15040223

Information (Apr 2024)

Affiliations

Ismael Ramos-Pérez: Department of Computer Engineering, Escuela Politécnica Superior, Universidad de Burgos, Avda. Cantabria s/n, 09006 Burgos, Spain
José Antonio Barbero-Aparicio: Department of Computer Engineering, Escuela Politécnica Superior, Universidad de Burgos, Avda. Cantabria s/n, 09006 Burgos, Spain
Antonio Canepa-Oneto: Department of Computer Engineering, Escuela Politécnica Superior, Universidad de Burgos, Avda. Cantabria s/n, 09006 Burgos, Spain
Álvar Arnaiz-González: Department of Computer Engineering, Escuela Politécnica Superior, Universidad de Burgos, Avda. Cantabria s/n, 09006 Burgos, Spain
Jesús Maudes-Raedo: Department of Computer Engineering, Escuela Politécnica Superior, Universidad de Burgos, Avda. Cantabria s/n, 09006 Burgos, Spain

DOI: https://doi.org/10.3390/info15040223
Journal volume & issue: Vol. 15, no. 4, p. 223

Abstract

The most common preprocessing techniques for datasets with high dimensionality and few instances (wide data) are feature reduction (FR), feature selection (FS), and resampling. This study explores the use of FR and resampling techniques, expanding the limited comparisons between FR and filter FS methods in the existing literature, especially in the context of wide data. We compare the best results from a previous comprehensive study of FS against new experiments conducted using FR methods. Two specific challenges associated with the use of FR are outlined in detail: finding FR methods that are compatible with wide data, and the need for nonlinear approaches to have a reduction estimator in order to process out-of-sample data. The experimental study compares 17 techniques, including supervised, unsupervised, linear, and nonlinear approaches, using 7 resampling strategies and 5 classifiers. The results show which configurations are optimal in terms of performance and computation time. Moreover, the best configuration, namely k Nearest Neighbor (KNN) combined with the Maximal Margin Criterion (MMC) feature reducer and no resampling, is shown to outperform state-of-the-art algorithms.
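
As a rough illustration of the configuration the abstract singles out (an MMC feature reducer followed by KNN, with no resampling), the sketch below may help. It is not the authors' implementation. It assumes one common formulation of MMC, in which the projection is given by the leading eigenvectors of S_b - S_w with scatter matrices weighted by class priors. The function names (`mmc_fit`, `knn_predict`), the synthetic data, and the choices of 2 components and k = 3 are all hypothetical. Because MMC is linear, unseen samples are reduced with the same matrix, so no separate out-of-sample estimator is needed in this case.

```python
import numpy as np

def mmc_fit(X, y, n_components):
    """Assumed MMC formulation: top eigenvectors of S_b - S_w."""
    n, d = X.shape
    mean = X.mean(axis=0)
    S_b = np.zeros((d, d))  # between-class scatter
    S_w = np.zeros((d, d))  # within-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        prior = Xc.shape[0] / n
        diff = (Xc.mean(axis=0) - mean)[:, None]
        S_b += prior * (diff @ diff.T)
        centered = Xc - Xc.mean(axis=0)
        S_w += prior * (centered.T @ centered) / Xc.shape[0]
    # S_b - S_w is symmetric, so eigh applies; no inverse of S_w is needed,
    # which matters for wide data where S_w is singular.
    eigvals, eigvecs = np.linalg.eigh(S_b - S_w)
    order = np.argsort(eigvals)[::-1][:n_components]
    return mean, eigvecs[:, order]

def knn_predict(Z_train, y_train, Z_test, k=3):
    """Plain majority-vote KNN with Euclidean distance (integer labels >= 0)."""
    dists = np.linalg.norm(Z_test[:, None, :] - Z_train[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    votes = y_train[nearest]
    return np.array([np.bincount(row).argmax() for row in votes])

# Hypothetical wide, imbalanced toy data: 38 instances, 200 features.
rng = np.random.default_rng(0)
n_major, n_minor, n_features = 30, 8, 200
X_train = rng.normal(size=(n_major + n_minor, n_features))
y_train = np.array([0] * n_major + [1] * n_minor)
X_train[y_train == 1, :5] += 3.0  # the minority class differs in 5 features

X_test = rng.normal(size=(20, n_features))
y_test = np.array([0] * 10 + [1] * 10)
X_test[y_test == 1, :5] += 3.0

mean, W = mmc_fit(X_train, y_train, n_components=2)
Z_train = (X_train - mean) @ W  # reduce the training data
Z_test = (X_test - mean) @ W    # out-of-sample data reuse the same linear map
y_pred = knn_predict(Z_train, y_train, Z_test, k=3)
print("test accuracy:", (y_pred == y_test).mean())
```

A nonlinear reducer would have no projection matrix to reuse in this way, which is the second challenge the abstract mentions.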

Published in Information

ISSN: 2078-2489 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Technology (General): Industrial engineering. Management engineering: Information technology
Website: http://www.mdpi.com/journal/information/
