IEEE Access (Jan 2018)

A Large-Scale Study of I/O Workload’s Impact on Disk Failure

  • Song Wu,
  • Yusheng Yi,
  • Jiang Xiao,
  • Hai Jin,
  • Mao Ye

DOI
https://doi.org/10.1109/ACCESS.2018.2866522
Journal volume & issue
Vol. 6
pp. 47385 – 47396

Abstract

In large-scale data centers, disk failure is the norm rather than the exception. Frequent disk failures noticeably hurt user experience and, in the worst case, make data unavailable. Previous research from both industry and academia has studied the causes of disk failure; however, little is known about the intrinsic relation between failed disks and their I/O workload. In this paper, we collect and investigate about four billion drive hours of I/O traces from over 500 000 disks in Tencent's data centers. We first explore the key characteristics of I/O workload that influence disk reliability. We then present the impact of these I/O workload features on the lifespan of disks and uncover the root causes. Finally, we introduce a new metric to accurately identify the "dangerous" I/O workload that is extremely harmful to disk health. To the best of our knowledge, this research is the first in-depth analysis of the I/O workload's impact on disk reliability, and it opens up a new dimension for I/O scheduling policy in data centers.

Keywords