IEEE Access (Jan 2021)
An Efficient Data Access Approach With Queue and Stack in Optimized Hybrid Join
Abstract
As rapid decision making in business organizations gains popularity, the complexity and adaptability of the extract, transform, and load (ETL) process in near-real-time data warehousing have increased dramatically. The most important task in a near-real-time data warehouse is to feed in new data from different data sources on a near-real-time basis. However, this new data is not in the format of the data warehouse; it must therefore be transformed into the required format by transformation algorithms, which are an essential part of the ETL process. A semi-stream join algorithm is required to implement this transformation, and the HYBRIDJOIN (hybrid join) algorithm has been presented in the literature for this purpose. However, a major design issue with this algorithm is that it uses a single buffer to load the disk partitions, so the algorithm has to wait until the next disk partition overwrites the existing partition in the disk buffer. Because loading a disk partition into the disk buffer is the dominant component of the algorithm's overall processing cost, this leaves the performance of the algorithm sub-optimal. Moreover, existing approaches consider only the oldest join-attribute keys when finding matches with the master data, and maintain only a queue of join-attribute keys. Performance can be improved if the most recent and the oldest join attributes are processed in parallel. This article addresses these limitations of HYBRIDJOIN by presenting two new optimized algorithms: Parallel-Hybrid Join (P-HYBRIDJOIN) and Hybrid Join with Queue and Stack (QaS-HYBRIDJOIN). The proposed algorithms aim to reduce the major processing cost, namely disk I/O, and to increase the number of matching stream tuples. Both algorithms perform significantly better than existing approaches in terms of throughput and number of matching tuples. Performance analysis and the cost model for the proposed algorithms show the best performance with intermittent stream data under limited resources.
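The idea of probing with both the oldest and the most recent join keys can be illustrated with a minimal Python sketch. It is not the authors' QaS-HYBRIDJOIN implementation: the names `PARTITION_SIZE`, `load_partition`, and `join_step` are hypothetical, master data is modelled as an in-memory dictionary with integer keys, and a single double-ended queue stands in for the queue (oldest key at the front) and the stack (newest key at the back).

```python
from collections import deque

PARTITION_SIZE = 4  # hypothetical number of master keys per disk partition


def load_partition(master, key):
    """Return the slice of master data whose key range contains `key`.

    Stands in for the expensive disk read; assumes integer keys.
    """
    start = (key // PARTITION_SIZE) * PARTITION_SIZE
    return {k: master[k] for k in range(start, start + PARTITION_SIZE) if k in master}


def join_step(pending, master):
    """One iteration: load the partitions selected by the oldest and the
    newest pending keys, then probe every pending key against them."""
    if not pending:
        return []
    oldest, newest = pending[0], pending[-1]  # queue front and stack top
    buffer = load_partition(master, oldest)
    buffer.update(load_partition(master, newest))
    matched = [(k, buffer[k]) for k in pending if k in buffer]
    # Keys without a match stay pending, in arrival order, for a later step.
    remaining = [k for k in pending if k not in buffer]
    pending.clear()
    pending.extend(remaining)
    return matched


master = {k: f"row{k}" for k in range(12)}
pending = deque([1, 9, 2, 6, 10])  # arrival order: 1 is oldest, 10 is newest
result = join_step(pending, master)
```

In this toy run the oldest key selects the partition covering keys 0 to 3 and the newest key selects the partition covering keys 8 to 11, so four of the five pending keys are matched in one step and only key 6 remains pending. The sketch assumes every stream key eventually has a master record; a stream key with no master record would need separate handling.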
Keywords