Dynamic Deduplication Decision in a Hadoop Distributed File System

Ruay-Shiung Chang; Chih-Shan Liao; Kuo-Zheng Fan; Chia-Ming Wu

doi:10.1155/2014/630380

International Journal of Distributed Sensor Networks (Apr 2014)

DOI: https://doi.org/10.1155/2014/630380
Journal volume & issue: Vol. 10

Abstract

In the big data era, users generate and update data at tremendous speed, from any device, at any time, and anywhere. Coping with these multiform data in real time is a major challenge. The Hadoop Distributed File System (HDFS) is designed to handle such data when building a distributed data center. HDFS keeps duplicate copies of data to increase reliability. However, these duplicates require considerable extra storage space and infrastructure funding. Deduplication can effectively improve the utilization of storage space. In this paper, we propose a dynamic deduplication decision scheme to improve the storage utilization of a data center that uses HDFS as its file system. The proposed system formulates a suitable deduplication strategy that makes full use of the available space when storage devices are limited. The strategy deletes unneeded duplicates to free storage space. Experimental results show that our method efficiently improves the storage utilization of a data center using HDFS.
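
The abstract does not spell out the decision procedure, so the following is only a minimal sketch of what a dynamic decision of this kind might look like, not the authors' algorithm. It assumes a utilization threshold (`high_water`), a per-file access counter (`accesses`), and a floor on the number of copies (`min_replicas`); all of these names and values are hypothetical. When the space in use exceeds the threshold, the sketch removes copies of the least-accessed files first until utilization falls back under the limit.

```python
from dataclasses import dataclass


@dataclass
class FileEntry:
    name: str
    size_gb: float   # size of a single copy
    replicas: int    # current number of copies kept in the cluster
    accesses: int    # hypothetical popularity counter (assumption)


def used_space(files):
    """Total space consumed by all copies of all files."""
    return sum(f.size_gb * f.replicas for f in files)


def plan_replica_reduction(files, capacity_gb, high_water=0.8, min_replicas=1):
    """Return {file name: new replica count} for files whose copies should be trimmed.

    Copies are removed from the least-accessed files first, and only while
    the used space is above high_water * capacity_gb. The threshold and the
    "coldest first" ordering are illustrative assumptions.
    """
    target = capacity_gb * high_water
    used = used_space(files)
    plan = {}
    for f in sorted(files, key=lambda entry: entry.accesses):
        if used <= target:
            break
        replicas = f.replicas
        # Drop one copy at a time, never going below the floor.
        while used > target and replicas > min_replicas:
            replicas -= 1
            used -= f.size_gb
        if replicas != f.replicas:
            plan[f.name] = replicas
    return plan


if __name__ == "__main__":
    cluster = [
        FileEntry("a", 10, 3, 100),
        FileEntry("b", 20, 3, 5),
        FileEntry("c", 5, 3, 50),
    ]
    print(plan_replica_reduction(cluster, capacity_gb=100))
```

In an actual HDFS deployment, a plan like this could be applied by changing each file's replication factor, for example with the `hdfs dfs -setrep` command. How the paper itself selects and removes duplicates is described in the full text, not in this abstract.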

Published in International Journal of Distributed Sensor Networks

ISSN: 1550-1329 (Print); 1550-1477 (Online)
Publisher: Wiley
Country of publisher: United Kingdom
LCC subjects: Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://onlinelibrary.wiley.com/journal/dsn
