Distributed Data Strategies to Support Large-Scale Data Analysis Across Geo-Distributed Data Centers

Tamer Z. Emara; Joshua Zhexue Huang

doi:10.1109/ACCESS.2020.3027675

IEEE Access (Jan 2020)

Affiliations

Tamer Z. Emara: National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, Shenzhen, China
Joshua Zhexue Huang: National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, Shenzhen, China

Journal volume & issue: Vol. 8
pp. 178526 – 178538

Abstract

As the volume of data grows rapidly, storing big data in a single data center is no longer feasible. Companies have therefore adopted two scenarios for storing their big data in multiple data centers. In the first scenario, the data are distributed across multiple data centers without replication. In the second scenario, the data are also stored in multiple data centers, but important data are replicated across these data centers to increase data safety and availability. In both scenarios, analyzing big data distributed over multiple data centers is a challenging task. In this paper, we propose two data distribution strategies to support big data analysis across geo-distributed data centers. Both strategies use the recent Random Sample Partition data model to convert big data into sets of random sample data blocks, and then distribute these blocks to multiple data centers either without replication or with replication. In the first strategy, without replication, we randomly select samples of data blocks from multiple data centers and download the selected blocks to one data center for analysis. In the second strategy, with replication of data blocks, we can analyze big data in any data center by randomly selecting a sample from the data blocks replicated there from other data centers. This strategy avoids data transfer between data centers. We demonstrate the performance of the two strategies in big data analysis using simulation results produced on one local data center and four AWS data centers in North America, Asia, and Australia.
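The two strategies described in the abstract can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the function names, the shuffle-and-slice partitioning, the round-robin placement, and the data center labels are all assumptions chosen only to make the idea concrete on an in-memory list.

```python
import random


def make_random_sample_blocks(records, num_blocks, seed=0):
    """Shuffle the records and cut them into num_blocks blocks.

    Assumption: a plain shuffle stands in for the Random Sample Partition
    step, so each block is a random sample of the whole data set.
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    # Striding over the shuffled list gives blocks of near-equal size.
    return [shuffled[i::num_blocks] for i in range(num_blocks)]


def distribute_blocks(num_blocks, centers, replicas=1):
    """Map each center to the block ids it stores (hypothetical placement).

    replicas=1 models distribution without replication; replicas>1 places
    each block in that many distinct centers, round-robin. The distinctness
    only holds when replicas <= len(centers).
    """
    placement = {center: [] for center in centers}
    for block_id in range(num_blocks):
        for r in range(replicas):
            center = centers[(block_id + r) % len(centers)]
            placement[center].append(block_id)
    return placement


def select_blocks_across_centers(placement, per_center, seed=0):
    """Strategy without replication: pick block ids from every center.

    The selected blocks would then be downloaded to one center for analysis.
    """
    rng = random.Random(seed)
    selected = []
    for block_ids in placement.values():
        selected.extend(rng.sample(block_ids, min(per_center, len(block_ids))))
    return selected


def select_blocks_locally(placement, center, k, seed=0):
    """Strategy with replication: pick block ids already stored at one center."""
    rng = random.Random(seed)
    block_ids = placement[center]
    return rng.sample(block_ids, min(k, len(block_ids)))


# Illustrative usage with made-up data and center names.
records = list(range(1000))
blocks = make_random_sample_blocks(records, num_blocks=20)
centers = ["dc-a", "dc-b", "dc-c", "dc-d"]

without_replication = distribute_blocks(len(blocks), centers, replicas=1)
picked = select_blocks_across_centers(without_replication, per_center=2)
sample = [value for block_id in picked for value in blocks[block_id]]
estimated_mean = sum(sample) / len(sample)  # estimate from the block sample

with_replication = distribute_blocks(len(blocks), centers, replicas=2)
local_picked = select_blocks_locally(with_replication, "dc-a", k=4)
```

In this sketch the first selection touches every center, so the chosen blocks must be moved to one place before analysis, while the second selection reads only blocks already stored at a single center.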

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
