Two-Step Classification with SVD Preprocessing of Distributed Massive Datasets in Apache Spark

Athanasios Alexopoulos; Georgios Drakopoulos; Andreas Kanavos; Phivos Mylonas; Gerasimos Vonitsanos

doi:10.3390/a13030071

Algorithms (Mar 2020)

Affiliations

Athanasios Alexopoulos: Computer Engineering and Informatics Department, University of Patras, 26504 Patras, Greece
Georgios Drakopoulos: Department of Informatics, Ionian University, 49100 Corfu, Greece
Andreas Kanavos: Computer Engineering and Informatics Department, University of Patras, 26504 Patras, Greece
Phivos Mylonas: Department of Informatics, Ionian University, 49100 Corfu, Greece
Gerasimos Vonitsanos: Computer Engineering and Informatics Department, University of Patras, 26504 Patras, Greece

DOI: https://doi.org/10.3390/a13030071
Journal volume & issue: Vol. 13, no. 3, p. 71

Abstract

At the dawn of the 10V or big data era, there are a considerable number of sources, such as smartphones, IoT devices, social media, smart city sensors, and the health care system, all of which constitute but a small portion of the data lakes feeding the entire big data ecosystem. This 10V data growth poses two primary challenges, namely storing and processing. Concerning the latter, new frameworks have been developed, including distributed platforms such as the Hadoop ecosystem. Classification is a major machine learning task typically executed on distributed platforms, and as a consequence many algorithmic techniques tailored for these platforms have been developed. This article relies extensively, in two ways, on classifiers implemented in MLlib, the main machine learning library for the Hadoop ecosystem. First, a vast number of classifiers is applied to two datasets, namely Higgs and PAMAP. Second, a two-step classification is performed ab ovo on the same datasets. Specifically, the singular value decomposition of the data matrix first determines a set of transformed attributes, which in turn drive the classifiers of MLlib. The twofold purpose of the proposed architecture is to reduce complexity while maintaining a similar if not better level of the metrics of accuracy, recall, and F1. The intuition behind this approach stems from the engineering principle of breaking down complex problems into simpler and more manageable tasks. The experiments based on the same Spark cluster indicate that the proposed architecture outperforms the individual classifiers with respect to both complexity and the abovementioned metrics.
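
The two-step idea described in the abstract can be illustrated with a minimal single-machine sketch. This is not the authors' Spark implementation: it assumes NumPy in place of MLlib, uses a simple nearest-centroid classifier as a stand-in for the MLlib classifiers, centers the data before the decomposition (the abstract does not say whether this is done), and runs on synthetic data instead of Higgs or PAMAP. All function names and parameters below are hypothetical.

```python
import numpy as np

def svd_project(X_train, X_test, k):
    """Step 1: map both splits onto the top-k right singular vectors
    of the (centered) training matrix. Centering is an assumption."""
    mean = X_train.mean(axis=0)
    # Thin SVD: X_c = U @ diag(s) @ Vt, with s sorted in descending order.
    _, s, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
    V_k = Vt[:k].T  # shape (n_features, k)
    return (X_train - mean) @ V_k, (X_test - mean) @ V_k, s

def nearest_centroid_fit(Z, y):
    """Step 2 (stand-in classifier): one centroid per class."""
    classes = np.unique(y)
    centroids = np.stack([Z[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def nearest_centroid_predict(Z, classes, centroids):
    """Assign each row of Z to the class of its closest centroid."""
    dists = np.linalg.norm(Z[:, None, :] - centroids[None, :, :], axis=2)
    return classes[dists.argmin(axis=1)]

def make_data(n_per_class, n_features, seed):
    """Synthetic two-class data: class 1 is shifted in the first two features."""
    rng = np.random.default_rng(seed)
    shift = np.zeros(n_features)
    shift[:2] = 4.0
    X0 = rng.normal(size=(n_per_class, n_features))
    X1 = rng.normal(size=(n_per_class, n_features)) + shift
    X = np.vstack([X0, X1])
    y = np.array([0] * n_per_class + [1] * n_per_class)
    return X, y

X_train, y_train = make_data(200, 10, seed=0)
X_test, y_test = make_data(100, 10, seed=1)

# Step 1: reduce 10 attributes to 3 transformed attributes.
Z_train, Z_test, s = svd_project(X_train, X_test, k=3)

# Step 2: train and evaluate the classifier on the transformed attributes.
classes, centroids = nearest_centroid_fit(Z_train, y_train)
pred = nearest_centroid_predict(Z_test, classes, centroids)
accuracy = float((pred == y_test).mean())
print(Z_train.shape, Z_test.shape, accuracy)
```

The projection is fitted on the training split only and then reused for the test split, so the classifier never sees test data during the decomposition. The number of retained components k is a free parameter in this sketch.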

Published in Algorithms

ISSN: 1999-4893 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Technology (General): Industrial engineering. Management engineering; Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://www.mdpi.com/journal/algorithms