IEEE Access (Jan 2023)
DHSDJArch: An Efficient Design of Distributed Heterogeneous Stream-Disk Join Architecture
Abstract
Heterogeneity is a key characteristic of the live streams produced by complex networks and smart devices. The heterogeneous stream-disk join is a significant research topic in real-time processing applications because it directly affects data analytics. Multiple issues, including stream loss, scalability, disk access cost, and data accuracy, must be considered during the heterogeneous stream-disk join transformation. In this work, we address these issues by introducing a distributed heterogeneous stream-disk join architecture (DHSDJArch) that prevents stream data loss while maintaining a balance between heterogeneous distributed data sources and the accuracy of the stream-disk join. A four-phase distributed architecture is proposed for multi-objective optimization of the transformation of heterogeneous, incomplete streams. To prevent stream loss, a log retention configuration is proposed based on the characteristics of the distributed event streaming platform (DESP). Specifically, two transformations are proposed: the first pre-processes heterogeneous streams, and the second joins the pre-processed stream with distributed disk data by performing real-time disk access while compensating for the differences between the data sources and the streaming application. We conduct a comprehensive experimental study on real datasets to verify the performance of the proposed architecture in terms of accuracy, log retention policy, scaling, stability, and cloud data storage.
Keywords