A Distributed Approach for High-Dimensionality Heterogeneous Data Reduction

Rania Mkhinini Gahar; Olfa Arfaoui; Minyar Sassi Hidri; Nejib Ben Hadj-Alouane

doi:10.1109/ACCESS.2019.2945889

IEEE Access (Jan 2019)

Affiliations

Rania Mkhinini Gahar: OASIS Research Lab, National Engineering School of Tunis, University of Tunis El Manar, Tunis, Tunisia
Olfa Arfaoui: OASIS Research Lab, National Engineering School of Tunis, University of Tunis El Manar, Tunis, Tunisia
Minyar Sassi Hidri: Computer Department, Deanship of Preparatory Year and Supporting Studies, Imam Abdulrahman Bin Faisal University, Dammam, Saudi Arabia
Nejib Ben Hadj-Alouane: OASIS Research Lab, National Engineering School of Tunis, University of Tunis El Manar, Tunis, Tunisia

DOI: https://doi.org/10.1109/ACCESS.2019.2945889
Journal volume & issue: Vol. 7
pp. 151006 – 151022

Abstract

The recent explosion in data size, in both the number of records and the number of attributes, has triggered the development of numerous Big Data analytics methods and parallel data processing algorithms. At the same time, it has driven the use of data Dimensionality Reduction (DR) procedures. Indeed, more is not always better: large amounts of data can sometimes degrade the performance of data analytics applications, and one possible cause is the presence of missing data. Missing values are common and can significantly affect the conclusions drawn from the data. In this work, we propose a new distributed statistical approach for high-dimensionality reduction of heterogeneous data that is based on the MapReduce paradigm, limits the curse of dimensionality, and deals with missing values. To handle missing values, we propose using the Random Forest imputation method. The main purpose is to extract useful information and reduce the search space to facilitate the data exploration process. Several illustrative numeric examples using data from publicly available machine learning repositories are also included. The experimental component of the study shows the efficiency of the proposed analytical approach.
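As a rough illustration of how the MapReduce paradigm applies to missing data, the sketch below counts missing values per attribute across data partitions. It is not the authors' method: the function names, the use of `None` to mark a missing value, and the toy partitions are all assumptions made for this example, and the Random Forest imputation step is not shown.

```python
from collections import Counter
from functools import reduce


def map_partition(rows):
    """Map step: count missing (None) values per column index in one partition."""
    counts = Counter()
    for row in rows:
        for col, value in enumerate(row):
            if value is None:  # assumption: None marks a missing value
                counts[col] += 1
    return counts, len(rows)


def reduce_counts(a, b):
    """Reduce step: merge two (missing-counts, row-count) pairs."""
    return a[0] + b[0], a[1] + b[1]


def missing_ratio(partitions, n_cols):
    """Fraction of missing values per column over all partitions."""
    counts, total = reduce(
        reduce_counts, map(map_partition, partitions), (Counter(), 0)
    )
    return [counts[c] / total if total else 0.0 for c in range(n_cols)]


# Hypothetical heterogeneous data (numeric and categorical) split in two partitions.
partitions = [
    [(1.0, "a", None), (2.0, None, None)],
    [(3.0, "b", 0.5), (None, "c", None)],
]
ratios = missing_ratio(partitions, n_cols=3)
print(ratios)
```

In a real deployment the map step would run on separate workers and only the small per-column counters would be sent to the reducer. The resulting ratios could then guide which attributes to impute and which to drop.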

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
