RCD+: A Partitioning Method for Data Streams Based on Multiple Queries

Ruichang Li; Chunkai Wang; Fan Liao; Honglei Zhu

doi:10.1109/ACCESS.2020.2980554

IEEE Access (Jan 2020)

Affiliations

Ruichang Li: School of Information and Technology, Henan University of Chinese Medicine, Zhengzhou, China
Chunkai Wang: Post-Doctoral Research Center, China Reinsurance (Group) Corporation, Beijing, China
Fan Liao: School of Information and Technology, Henan University of Chinese Medicine, Zhengzhou, China
Honglei Zhu: School of Information and Technology, Henan University of Chinese Medicine, Zhengzhou, China

DOI: https://doi.org/10.1109/ACCESS.2020.2980554
Journal volume & issue: Vol. 8
pp. 52517 – 52527

Abstract

Big data stream management systems often must transform a query application into multiple query tasks while dynamically partitioning data streams by attribute values or partitioning keys. However, because the tasks apply different partitioning orders or strategies to those keys, data streams are transmitted redundantly and repeatedly at different nodes, which degrades system performance. In addition, as data skewness changes, partitions can remain unbalanced across processing units within the same node. This paper presents RCD+ (Runtime Correlation Discovery), a partitioning framework based on runtime correlation discovery. RCD+ implements a full-granularity partitioning strategy that comprises runtime positive correlation partitioning (RPC-partitioning) and clustering partitioning (Clu-partitioning). First, in RPC-partitioning, we introduce a mini-batch scheme to reduce the number of output stream caches and partition data streams using the relevance of partitioning keys. Second, in Clu-partitioning, we re-partition skewed data streams by clustering, at both the inter-node and intra-node levels. We then construct a routing table to manage partition states and ensure the correctness of multiple query tasks. Finally, we implement the framework on Apache Storm. Experiments with synthetic and real data show that the proposed framework achieves better query performance.

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
