IEEE Access (Jan 2020)

RCD+: A Partitioning Method for Data Streams Based on Multiple Queries

  • Ruichang Li,
  • Chunkai Wang,
  • Fan Liao,
  • Honglei Zhu

DOI
https://doi.org/10.1109/ACCESS.2020.2980554
Journal volume & issue
Vol. 8
pp. 52517 – 52527

Abstract

Big data stream management systems often must transform a query application into multiple query tasks and partition data streams simultaneously and dynamically based on attribute values or partitioning keys. However, because the partitioning keys may be applied in different orders or with different strategies, data streams are transmitted redundantly and repeatedly at different nodes, which degrades system performance. In addition, as data skewness changes, data stream partitioning remains unbalanced across the processing units within the same node. This paper presents RCD+, a partitioning framework based on Runtime Correlation Discovery. RCD+ implements a full-granularity partitioning strategy that comprises runtime positive correlation partitioning (RPC-partitioning) and clustering partitioning (Clu-partitioning). First, in RPC-partitioning, we introduce a mini-batch scheme to reduce the number of output stream caches and partition data streams using the relevance of partitioning keys. Second, in Clu-partitioning, we re-partition skewed data streams by clustering, both across nodes (inter-node) and within nodes (intra-node). We then construct a routing table to manage partition states and ensure the correctness of multiple query tasks. Finally, we implement the framework on Apache Storm. Experiments with synthetic and real data show that the proposed framework achieves better query performance.

Keywords