Journal of Big Data (Aug 2019)

Leveraging resource management for efficient performance of Apache Spark

  • Khadija Aziz,
  • Dounia Zaidouni,
  • Mostafa Bellafkih

DOI
https://doi.org/10.1186/s40537-019-0240-1
Journal volume & issue
Vol. 6, no. 1
pp. 1–23

Abstract


Apache Spark is one of the most widely used open-source processing frameworks for big data; it allows large datasets to be processed in parallel across a large number of nodes. Applications built on this framework often rely on resource management systems such as YARN, which grant jobs a specific amount of resources for their execution, while a distributed file system such as HDFS stores the data to be analyzed. This design allows cluster resources to be shared effectively by running jobs on a single-node or multi-node cluster infrastructure. A challenging issue is therefore to manage the resources of these large cluster infrastructures effectively so that distributed data analytics can run in an economically viable way. In this study, we use Spark's Machine Learning library (MLlib) to implement several machine learning algorithms, and we manage the resources (CPU, memory, and disk) in order to assess the performance of Apache Spark. We first present a review of works that focus on resource management and data processing in big data platforms. We then perform a scalability analysis with Spark, examining speedup and processing time, and find that beyond a certain number of nodes in the cluster, adding further nodes no longer improves the speedup or the processing time. Next, we investigate the tuning of resource allocation in Spark and show that better performance does not come simply from allocating all available resources; it depends on how the resource allocation is tuned. We propose new managed parameters and show that they yield a better total processing time than the default parameters used by Spark. Finally, we study the persistence of Resilient Distributed Datasets (RDDs) in Spark using machine learning algorithms and show that one storage level gives the best execution time among all the storage levels tested.
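
To picture the kind of resource-allocation tuning the abstract refers to, the Scala sketch below overrides Spark's default allocation properties on YARN programmatically. This is a minimal sketch: the property values are illustrative assumptions, not the tuned parameters reported in the paper.

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Minimal sketch: overriding Spark's default resource allocation on YARN.
// The values below are illustrative assumptions, not the paper's tuned settings.
val conf = new SparkConf()
  .setAppName("resource-tuning-sketch")
  .set("spark.executor.instances", "6")  // executors requested from YARN
  .set("spark.executor.cores", "4")      // CPU cores per executor
  .set("spark.executor.memory", "8g")    // heap memory per executor
  .set("spark.driver.memory", "4g")      // driver heap (normally set via
                                         // spark-submit, before the JVM starts)

val spark = SparkSession.builder().config(conf).getOrCreate()

In practice these same properties are usually passed to spark-submit as --num-executors, --executor-cores, --executor-memory, and --driver-memory; the study's point is that the right values depend on the workload, not on simply requesting everything available.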
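The RDD-persistence experiment can likewise be pictured with the sketch below, which caches an RDD at an explicit storage level, then triggers an action and times it. The HDFS path and the word-count workload are hypothetical placeholders; the storage level shown is one of the levels Spark's API exposes (MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, and their serialized and replicated variants), which can be swapped in one line to compare execution times as in the study.

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Minimal sketch: timing one RDD storage level.
// The input path and workload are hypothetical placeholders.
val spark = SparkSession.builder().appName("persistence-sketch").getOrCreate()
val sc = spark.sparkContext

val pairs = sc.textFile("hdfs:///data/sample.txt")  // hypothetical dataset
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))

pairs.persist(StorageLevel.MEMORY_AND_DISK)  // storage level under test

val t0 = System.nanoTime()
val total = pairs.reduceByKey(_ + _).count()  // action forces evaluation and caching
println(s"records=$total, elapsed=${(System.nanoTime() - t0) / 1e9} s")

pairs.unpersist()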

Keywords