IEEE Access (Jan 2024)

ACPR: Adaptive Classification Predictive Repair Method for Different Fault Scenarios

  • Ying Song,
  • Peisen Zheng,
  • Yingai Tian,
  • Bo Wang

DOI
https://doi.org/10.1109/ACCESS.2023.3346881
Journal volume & issue
Vol. 12
pp. 4631 – 4641

Abstract


Erasure codes are widely used in large-scale distributed storage systems because of their high storage efficiency and reliability, but they incur extremely high repair penalties when data corruption occurs. Machine learning methods can now accurately predict the next failure time and failure type of machine nodes. Building on this, we propose an Adaptive Classification Predictive Repair method (ACPR) for different fault scenarios. It addresses two problems in existing repair methods: unnecessary repair traffic caused by temporary failures, and the larger number of degraded reads on frequently accessed data, which stays failed for longer. ACPR categorizes failed blocks as high-risk or low-risk based on the failure type of the soon-to-fail (STF) node and the access heat of the STF blocks, and then performs adaptive predictive repair. By quickly repairing high-risk blocks to ensure data availability while delaying the repair of low-risk blocks, ACPR avoids a large amount of unnecessary repair traffic caused by temporary node failures in the cluster. Experiments on Alibaba Cloud Elastic Compute Service (ECS) show that, compared with FastPR and ECPipe, ACPR shortens the repair time per data block by up to 15.2% and 33.5%, respectively, and reduces repair traffic by up to 74.1% and 84.4%, respectively.
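
The classification step described in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact decision rule, so the rule below is an assumption made only for illustration: every block on a node predicted to fail permanently is treated as high-risk, and under a predicted temporary failure only blocks whose access heat reaches a threshold are high-risk. The names `FailureType`, `Block`, `classify_blocks`, and `heat_threshold` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from enum import Enum


class FailureType(Enum):
    """Predicted failure type of the soon-to-fail (STF) node."""
    TEMPORARY = "temporary"
    PERMANENT = "permanent"


@dataclass(frozen=True)
class Block:
    block_id: str
    access_heat: float  # hypothetical metric, e.g. recent accesses per hour


def classify_blocks(blocks, failure_type, heat_threshold):
    """Split STF blocks into (high_risk, low_risk) lists.

    Assumed rule (not taken from the paper): a predicted permanent
    failure makes every block high-risk; under a predicted temporary
    failure only blocks at or above the heat threshold are high-risk.
    """
    high_risk, low_risk = [], []
    for block in blocks:
        is_hot = block.access_heat >= heat_threshold
        if failure_type is FailureType.PERMANENT or is_hot:
            high_risk.append(block)  # repair promptly
        else:
            low_risk.append(block)   # repair can be delayed
    return high_risk, low_risk
```

Under this assumed rule, a cold block on a node with a predicted temporary failure is never repaired ahead of time, which is where the savings in repair traffic would come from if the node recovers on its own.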

Keywords