IEEE Access (Jan 2022)
FASR: An Efficient Feature-Aware Deduplication Method in Distributed Storage Systems
Abstract
Deduplication improves space utilization by keeping only one copy of duplicate data. In a distributed storage system, however, the overall deduplication ratio is limited by the need to eliminate redundancy across nodes. Traditional deduplication methods usually exploit data similarity and data locality to improve the deduplication ratio, but frequent similarity calculations cause high system overhead. To address this problem, this paper proposes a new Feature-Aware Stateful Routing method (FASR), which aims to reduce system overhead while keeping a high deduplication ratio in a distributed environment. First, we design a feature-aware node selection strategy that chooses similar nodes by extracting data features and data distribution characteristics. This strategy avoids similarity calculations against nodes that are not similar to the data. Then, we present a stateful routing algorithm that determines the target node using super-chunk and handprint techniques while maintaining load balance across the entire distributed system. Finally, the data is deduplicated locally based on a similarity index and a fingerprint cache. Extensive experiments demonstrate that FASR reduces system overhead by up to about 30% while also achieving a higher deduplication ratio.
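As an illustration only, the following Python sketch shows one common way that handprint-based stateful routing of super-chunks can be realized. It is not the FASR algorithm itself: the names `Node`, `handprint`, and `route`, the choice of SHA-1 fingerprints, the handprint size `k`, and the load-discounted scoring rule are all assumptions made for this example, and the feature-aware node pre-selection step is omitted.

```python
import hashlib


def fingerprint(chunk: bytes) -> str:
    """Content fingerprint of one chunk (SHA-1 is an assumption)."""
    return hashlib.sha1(chunk).hexdigest()


def handprint(chunks, k=4):
    """Handprint of a super-chunk: its k smallest distinct fingerprints."""
    return set(sorted({fingerprint(c) for c in chunks})[:k])


class Node:
    """Hypothetical storage node holding routing state."""

    def __init__(self, name):
        self.name = name
        self.index = set()  # representative fingerprints routed here so far
        self.stored = 0     # number of super-chunks routed here (load proxy)


def route(chunks, nodes, k=4):
    """Pick the node that best matches the handprint, discounted by load."""
    hp = handprint(chunks, k)
    avg = sum(n.stored for n in nodes) / len(nodes)

    def score(n):
        overlap = len(hp & n.index)  # similarity to data already on the node
        # Assumed discount: heavily loaded nodes need more overlap to win.
        return overlap / (1 + n.stored / (avg + 1))

    # Ties on score go to the less loaded node.
    best = max(nodes, key=lambda n: (score(n), -n.stored))
    best.index |= hp   # update routing state (this is the "stateful" part)
    best.stored += 1
    return best
```

In this sketch a super-chunk with no overlap anywhere goes to the least loaded node, while a super-chunk that shares fingerprints with earlier data is drawn to the node that already holds that data, where local deduplication can then remove the duplicates.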
Keywords