A Distributed Method for Fast Mining Frequent Patterns From Big Data

Peng-Yu Huang; Wan-Shu Cheng; Ju-Chin Chen; Wen-Yu Chung; Young-Lin Chen; Kawuu W. Lin

doi:10.1109/ACCESS.2021.3115514

IEEE Access (Jan 2021)

Affiliations

Peng-Yu Huang: Department of Computer Science and Information Engineering, National Kaohsiung University of Science and Technology, Kaohsiung, Taiwan
Wan-Shu Cheng: Department of Electrical Engineering, National Kaohsiung University of Science and Technology, Kaohsiung, Taiwan
Ju-Chin Chen: Department of Computer Science and Information Engineering, National Kaohsiung University of Science and Technology, Kaohsiung, Taiwan
Wen-Yu Chung: Department of Computer Science and Information Engineering, National Kaohsiung University of Science and Technology, Kaohsiung, Taiwan
Young-Lin Chen: Foxconn Technology Group, Taipei, Taiwan
Kawuu W. Lin: Department of Computer Science and Information Engineering, National Kaohsiung University of Science and Technology, Kaohsiung, Taiwan

DOI: https://doi.org/10.1109/ACCESS.2021.3115514
Journal volume & issue: Vol. 9
pp. 135144 – 135159

Abstract

In recent years, knowledge discovery in databases has provided a powerful capability to discover meaningful and useful information. Frequent pattern mining and association rule mining have been extensively studied for numerous real-life applications. Traditional mining algorithms assume that data are centralized and memory-resident. When these methods are applied to distributed databases, especially in the era of big data, the large data volume, limited bandwidth, and energy constraints make their performance inadequate. Hence, data mining in distributed environments has emerged as an important research area. To improve performance, we propose a set of algorithms based on FP-growth that discover frequent patterns (FPs) and provide fast, scalable service in distributed computing environments, together with a compact data structure that stores items and counts to minimize the data transmitted over the network. To ensure completeness and execution capability, DistEclat and BigFIM were chosen for the experimental comparison. Experiments show that the proposed method is more cost-effective for processing massive datasets and performs well under various experimental conditions. On average, the proposed method required only 33% of the execution time and 45% of the transmission cost of DistEclat. Compared with BigFIM, it required on average 23.3% of the execution time and 14.2% of the transmission cost.
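
As a rough illustration of why transmitting only items and counts keeps network traffic small, the sketch below shows the item-counting step that distributed FP-growth-style miners commonly begin with. It is not the authors' algorithm or data structure: the function names, the dictionary-based summary, and the toy partitions are all assumptions made for this example.

```python
from collections import Counter
from typing import Dict, Iterable, List

Transaction = List[str]


def local_item_counts(partition: Iterable[Transaction]) -> Dict[str, int]:
    """Summarize one node's partition as item -> support count.

    Each item is counted at most once per transaction. Only this small
    summary, not the raw transactions, would be sent over the network
    (an assumption of this sketch).
    """
    counts: Counter = Counter()
    for transaction in partition:
        counts.update(set(transaction))
    return dict(counts)


def merge_counts(summaries: Iterable[Dict[str, int]]) -> Dict[str, int]:
    """Add up the per-node summaries into global support counts."""
    total: Counter = Counter()
    for summary in summaries:
        total.update(summary)
    return dict(total)


def frequent_items(global_counts: Dict[str, int], min_support: int) -> Dict[str, int]:
    """Keep only the items whose global support reaches min_support."""
    return {item: c for item, c in global_counts.items() if c >= min_support}


# Hypothetical data: two nodes, each holding a few transactions.
partitions = [
    [["a", "b", "c"], ["a", "c"]],
    [["a", "d"], ["b", "c"], ["a", "b", "c"]],
]
summaries = [local_item_counts(p) for p in partitions]
global_counts = merge_counts(summaries)
result = frequent_items(global_counts, min_support=3)
print(result)
```

In this sketch each node sends one count per distinct item rather than its transactions, so the message size depends on the number of distinct items and not on the number of transactions.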

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
