Comparative Analysis of Skew-Join Strategies for Large-Scale Datasets with MapReduce and Spark

Anh-Cang Phan; Thuong-Cang Phan; Hung-Phi Cao; Thanh-Ngoan Trieu

doi:10.3390/app12136554

Applied Sciences (Jun 2022)

Affiliations

Anh-Cang Phan: Faculty of Information Technology, Vinh Long University of Technology Education, Vinh Long 85110, Vietnam
Thuong-Cang Phan: College of Information and Communication Technology, Can Tho University, Can Tho 94115, Vietnam
Hung-Phi Cao: Faculty of Information Technology, Vinh Long University of Technology Education, Vinh Long 85110, Vietnam
Thanh-Ngoan Trieu: College of Information and Communication Technology, Can Tho University, Can Tho 94115, Vietnam

DOI: https://doi.org/10.3390/app12136554
Journal volume & issue: Vol. 12, no. 13
p. 6554

Abstract

In the era of data deluge, Big Data offers numerous opportunities but also poses significant challenges to conventional data processing and analysis methods. MapReduce has become a prominent parallel and distributed programming model for efficiently handling such massive datasets. One of the most fundamental and widely used operations in MapReduce is the join. Joins become more complex and expensive when the data are skewed, that is, when some join keys appear with much greater frequency than others. The reduce tasks that process these frequent keys finish later than the others, so much of the benefit of parallel computation is lost. Some studies have addressed the skew-join problem, but an adequate and systematic comparison in the Spark environment has not been presented. Existing studies provide only experimental tests, so mathematical models on which skew-join algorithms can be compared are still lacking. This study therefore provides a theoretical and practical basis for evaluating skew-join strategies for large-scale datasets with MapReduce and Spark, both analytically with cost models and practically with experiments. The objectives of the study are, first, to present the implementation of prominent skew-join algorithms in Spark; second, to evaluate the algorithms using cost models and experiments; and third, to show the advantages and disadvantages of each one and to recommend strategies for the better use of skew joins in Spark.
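The skew effect described in the abstract can be illustrated with a small, framework-free sketch. It is not taken from the paper: the modulo partitioner, the integer keys, the function names `reducer_loads` and `salted_reducer_loads`, and the toy dataset are all assumptions made for illustration. The second function uses key salting, a generic mitigation that may or may not be among the algorithms the study evaluates.

```python
from collections import Counter


def reducer_loads(keys, num_reducers):
    """Count the records each reducer receives under modulo partitioning.

    Assumption: integer join keys and a partitioner of the form
    key % num_reducers (a stand-in for a real hash partitioner).
    """
    loads = [0] * num_reducers
    for key in keys:
        loads[key % num_reducers] += 1
    return loads


def salted_reducer_loads(keys, num_reducers, hot_keys):
    """Same count, but spread each hot key over the reducers with a salt.

    Assumption: hot keys are known in advance. Each record of a hot key
    gets a round-robin salt, and the partitioner uses key + salt.
    """
    loads = [0] * num_reducers
    seen = Counter()  # records seen so far per hot key
    for key in keys:
        salt = 0
        if key in hot_keys:
            salt = seen[key] % num_reducers
            seen[key] += 1
        loads[(key + salt) % num_reducers] += 1
    return loads


# Hypothetical skewed input: key 0 appears 900 times, keys 1..100 once each.
skewed_keys = [0] * 900 + list(range(1, 101))

plain = reducer_loads(skewed_keys, 4)
salted = salted_reducer_loads(skewed_keys, 4, hot_keys={0})

print(plain)   # [925, 25, 25, 25]
print(salted)  # [250, 250, 250, 250]
```

Because a join stage finishes only when its slowest reducer finishes, the maximum load (925 versus 250 records here) is a rough proxy for stage time. In an actual salted join, the matching rows of the other relation would also have to be replicated once per salt value, which adds communication cost that this sketch does not model.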

Published in Applied Sciences

ISSN: 2076-3417 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Engineering (General). Civil engineering (General); Science: Biology (General); Science: Physics; Science: Chemistry
Website: http://www.mdpi.com/journal/applsci
