Machine Learning with Applications (Sep 2022)

Predicting severely imbalanced data disk drive failures with machine learning models

  • Jishan Ahmed,
  • Robert C. Green II

Journal volume & issue
Vol. 9
p. 100361

Abstract

Read online

Datasets related to hard drive failure, particularly BackBlaze Hard Drive Data, have been widely studied in the literature using many statistical, machine learning, and deep learning techniques. These datasets are severely imbalanced due to the presence of a small number of failed drives compared to huge amounts of healthy drives in the operational data centers. It is challenging to mitigate the adverse consequence of the class imbalance due to the presence of bias towards the majority class during learning. SMART (self monitoring analysis and reporting technology) attributes of the disk drives were utilized in the past to design standard classification or regression algorithms. Although few machine learning (ML) models, for instance, tree based methods and ensemble learning algorithms, addressed the failure prediction, the effects of class imbalance were rarely properly considered under the ML framework. This study, based on a review of the state-of-the-art in the area, evaluates current methodologies to identify areas that were either overlooked or lacking, proposes methods for remediating these issues, and performs some baseline experiments to demonstrate the proposed methodologies including data sampling techniques and cost-sensitive learning.

Keywords