IEEE Access (Jan 2022)

Towards Enhancing the Performance of Parallel FP-Growth on Spark

  • Amr Essam,
  • Manal A. Abdel-Fattah,
  • Laila Abdelhamid

DOI
https://doi.org/10.1109/ACCESS.2021.3137789
Journal volume & issue
Vol. 10
pp. 286 – 296

Abstract


Frequent itemset mining (FIM) is a crucial tool for identifying hidden patterns in data. FP-Growth is an FIM algorithm used to find associations. As the data size increases, running FIM algorithms on a single machine becomes impractical because of memory and time consumption. For these reasons, parallel and distributed processing on platforms such as Spark is essential. Parallel frequent pattern (PFP) is the implementation of FP-Growth in Spark. The main problem with PFP is that it does not consider load balancing between cluster units. This research proposes an enhanced balanced parallel frequent pattern (EBPFP) algorithm to improve and balance PFP. EBPFP introduces two ideas. First, a load-balancing strategy between groups ensures that the items are evenly divided between the nodes, so that cluster resources are used more effectively. Second, the improved conditional pattern base (ICPB) method removes infrequent items from the conditional pattern base before local FP-Trees are constructed. The experimental results show that EBPFP outperforms PFP, with reported differences in running time between EBPFP and PFP of 21.56% and 39.72%.
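
The two ideas named in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify the balancing strategy or how the load of an item is estimated, so the greedy heaviest-first assignment below is an assumption, as are the names balance_groups, prune_conditional_pattern_base, and item_loads. It is an illustration under those assumptions, not the authors' implementation, and it runs on a single machine rather than on Spark.

```python
from collections import Counter


def balance_groups(item_loads, num_groups):
    """Assign items to groups so that estimated loads stay close to even.

    item_loads maps each frequent item to a hypothetical load estimate.
    Assumed heuristic: take items from heaviest to lightest and place each
    one in the group whose current total load is smallest.
    """
    groups = [[] for _ in range(num_groups)]
    totals = [0] * num_groups
    # Sort by descending load; break ties by item name for determinism.
    ordered = sorted(item_loads.items(), key=lambda kv: (-kv[1], kv[0]))
    for item, load in ordered:
        target = totals.index(min(totals))  # first group with the lowest load
        groups[target].append(item)
        totals[target] += load
    return groups, totals


def prune_conditional_pattern_base(pattern_base, min_support):
    """Remove infrequent items from a conditional pattern base.

    pattern_base is a list of (prefix_path, count) pairs. An item is kept
    only if its summed count across all prefix paths reaches min_support.
    Paths that become empty after pruning are dropped.
    """
    support = Counter()
    for prefix_path, count in pattern_base:
        for item in prefix_path:
            support[item] += count

    pruned = []
    for prefix_path, count in pattern_base:
        kept = [item for item in prefix_path if support[item] >= min_support]
        if kept:
            pruned.append((kept, count))
    return pruned
```

Pruning before tree construction means that each local FP-Tree is built only from items that can still be frequent in that conditional context, which is the motivation the abstract gives for ICPB.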

Keywords