Vietnam Journal of Computer Science (Aug 2022)
Distributed Evolutionary Feature Selection for Big Data Processing
Abstract
Feature selection has become a powerful dimensionality reduction strategy and an effective tool for handling high-dimensional data. It aims to reduce the dimension of the feature space, thereby speeding up the learning model and lowering its cost, by selecting the subset of features most relevant to the data mining or machine learning task. The selection of an optimal feature subset is an optimization problem that has been proven to be NP-hard. Metaheuristics are traditionally used to deal with NP-hard problems, since they are well known for solving complex, real-world problems in a reasonable amount of time. The genetic algorithm (GA) is one of the most popular metaheuristics and has proven effective for accurate feature selection. However, over the last few decades, data have grown progressively larger in both the number of instances and the number of features, a paradigm popularly termed Big Data. With this tremendous growth in dataset sizes, most current feature selection algorithms, and GA in particular, become unscalable. To improve the scalability of feature selection algorithms on big data, distributed computing strategies such as the MapReduce model and the Hadoop framework are commonly adopted. In this paper, we first review the most recent works that apply parallel genetic algorithms to large datasets. We then propose a new parallel genetic algorithm based on the coarse-grained parallelization model (island model). The parallelization of the process and the partitioning of the data are performed using the Hadoop framework on an Amazon cluster. The performance and scalability of the proposed method were theoretically and empirically compared with existing feature selection methods on large-scale datasets, and the results confirm its effectiveness.
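To make the coarse-grained (island-model) idea concrete, the following is a minimal illustrative sketch in Python of a GA for feature selection with ring migration between islands. It is not the authors' implementation: the dataset, the k-NN wrapper fitness, and all parameter values (number of islands, population size, migration interval, mutation rate) are placeholder assumptions chosen only for the example; in the distributed setting described in the paper, each island would run on its own worker node.

```python
# Illustrative sketch (not the paper's implementation): an island-model GA
# for feature selection. Each island evolves its own population of binary
# feature masks and periodically sends its best individual to the next
# island in a ring. Fitness is a simple wrapper score (3-fold k-NN accuracy)
# minus a small penalty on the number of selected features.
import random
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)          # placeholder dataset
N_FEATURES = X.shape[1]
N_ISLANDS, POP_SIZE, GENERATIONS, MIGRATION_EVERY = 4, 20, 10, 5

def fitness(mask):
    # Accuracy of a k-NN wrapper on the selected features,
    # penalized by the fraction of features kept.
    if not any(mask):
        return 0.0
    cols = [i for i, bit in enumerate(mask) if bit]
    acc = cross_val_score(KNeighborsClassifier(), X[:, cols], y, cv=3).mean()
    return acc - 0.01 * (len(cols) / N_FEATURES)

def evolve(population):
    # One generation: elitism, selection from the top half,
    # uniform crossover, and bit-flip mutation.
    scored = sorted(population, key=fitness, reverse=True)
    next_pop = scored[:2]
    while len(next_pop) < POP_SIZE:
        p1, p2 = random.sample(scored[:POP_SIZE // 2], 2)
        child = [random.choice(bits) for bits in zip(p1, p2)]
        child = [b ^ (random.random() < 0.02) for b in child]
        next_pop.append(child)
    return next_pop

# Each island is an independent population; in a Hadoop/MapReduce setting
# each island would be evolved by a separate mapper or worker node.
islands = [[[random.randint(0, 1) for _ in range(N_FEATURES)]
            for _ in range(POP_SIZE)] for _ in range(N_ISLANDS)]

for gen in range(GENERATIONS):
    islands = [evolve(pop) for pop in islands]
    if gen % MIGRATION_EVERY == 0:
        # Ring migration: the best individual of each island replaces
        # the worst individual of the next island.
        bests = [max(pop, key=fitness) for pop in islands]
        for i, pop in enumerate(islands):
            pop.sort(key=fitness)
            pop[0] = bests[(i - 1) % N_ISLANDS]

best = max((ind for pop in islands for ind in pop), key=fitness)
print("selected features:", [i for i, b in enumerate(best) if b])
print("fitness:", round(fitness(best), 4))
```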
Keywords