Vietnam Journal of Computer Science (Aug 2022)
Distributed Evolutionary Feature Selection for Big Data Processing
Abstract
Feature selection has become a powerful dimensionality reduction strategy and an effective tool for handling high-dimensional data. It aims to reduce the dimension of the feature space, thereby speeding up the learning model and lowering its cost, by selecting the subset of features most relevant to the data mining or machine learning task. The selection of an optimal feature subset is an optimization problem that has been proven to be NP-hard. Metaheuristics are traditionally used to deal with NP-hard problems, since they are well known for solving complex, real-world problems in a reasonable amount of time. The genetic algorithm (GA) is one of the most popular metaheuristics and has proven effective for accurate feature selection. However, over the last few decades, data have grown progressively larger in both the number of instances and the number of features, a paradigm popularly termed Big Data. With this tremendous growth in dataset sizes, most current feature selection algorithms, and GA in particular, become unscalable. To improve the scalability of feature selection algorithms on big data, distributed computing strategies such as the MapReduce model and the Hadoop framework are commonly adopted. In this paper, we first review the most recent works that apply parallel genetic algorithms to large datasets. We then propose a new parallel genetic algorithm based on the coarse-grained parallelization model (island model). The parallelization of the process and the partitioning of the data are performed using the Hadoop framework on an Amazon cluster. The performance and scalability of the proposed method were theoretically and empirically compared with existing feature selection methods on large-scale datasets, and the results confirm its effectiveness.
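To make the coarse-grained (island-model) idea concrete, the following is a minimal illustrative sketch in Python of a GA for feature selection with ring migration between islands. It is not the authors' implementation: the dataset, the k-NN wrapper fitness, and all parameter values (number of islands, population size, migration interval, mutation rate) are placeholder assumptions chosen only for the example; in the distributed setting described in the paper, each island would run on its own worker node.

```python
# Illustrative sketch (not the paper's implementation): an island-model GA
# for feature selection. Each island evolves its own population of binary
# feature masks and periodically sends its best individual to the next
# island in a ring. Fitness is a simple wrapper score (3-fold k-NN accuracy)
# minus a small penalty on the number of selected features.
import random
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)          # placeholder dataset
N_FEATURES = X.shape[1]
N_ISLANDS, POP_SIZE, GENERATIONS, MIGRATION_EVERY = 4, 20, 10, 5

def fitness(mask):
    # Accuracy of a k-NN wrapper on the selected features,
    # penalized by the fraction of features kept.
    if not any(mask):
        return 0.0
    cols = [i for i, bit in enumerate(mask) if bit]
    acc = cross_val_score(KNeighborsClassifier(), X[:, cols], y, cv=3).mean()
    return acc - 0.01 * (len(cols) / N_FEATURES)

def evolve(population):
    # One generation: elitism, selection from the top half,
    # uniform crossover, and bit-flip mutation.
    scored = sorted(population, key=fitness, reverse=True)
    next_pop = scored[:2]
    while len(next_pop) < POP_SIZE:
        p1, p2 = random.sample(scored[:POP_SIZE // 2], 2)
        child = [random.choice(bits) for bits in zip(p1, p2)]
        child = [b ^ (random.random() < 0.02) for b in child]
        next_pop.append(child)
    return next_pop

# Each island is an independent population; in a Hadoop/MapReduce setting
# each island would be evolved by a separate mapper or worker node.
islands = [[[random.randint(0, 1) for _ in range(N_FEATURES)]
            for _ in range(POP_SIZE)] for _ in range(N_ISLANDS)]

for gen in range(GENERATIONS):
    islands = [evolve(pop) for pop in islands]
    if gen % MIGRATION_EVERY == 0:
        # Ring migration: the best individual of each island replaces
        # the worst individual of the next island.
        bests = [max(pop, key=fitness) for pop in islands]
        for i, pop in enumerate(islands):
            pop.sort(key=fitness)
            pop[0] = bests[(i - 1) % N_ISLANDS]

best = max((ind for pop in islands for ind in pop), key=fitness)
print("selected features:", [i for i, b in enumerate(best) if b])
print("fitness:", round(fitness(best), 4))
```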
Keywords