Jisuanji kexue (Oct 2022)

Error Log Analysis and System Optimization for Lustre Cluster Storage

  • CHENG Wen, LI Yan, ZENG Ling-fang, WANG Fang, TANG Shi-cheng, YANG Li-ping, FENG Dan, ZENG Wen-jun

DOI
https://doi.org/10.11896/jsjkx.220100134
Journal volume & issue
Vol. 49, no. 10
pp. 1 – 9

Abstract

Error messages from cluster storage systems can help improve the availability and reliability of those systems. Previous research on storage system error analysis has focused on local file systems or on individual parts of a cluster storage system. Long-term, multi-dimensional studies of storage system error messages in practical applications are lacking. As new functional modules are continuously integrated, cluster storage systems grow increasingly complex and produce a steady stream of errors, which poses challenges for researchers and developers. To address these problems, we conduct a comprehensive study of Lustre error logs. Using logs collected over 1,673 consecutive days, we study nearly 2.26 GB of Lustre error logs and analyze the characteristics and problems of Lustre system errors across multiple Lustre versions. We show correlations between errors in different subsystems and study possible influencing factors across Lustre versions. We also summarize the common errors in the Lustre system and present the corresponding solutions. We derive numerous new insights into the Lustre development process and report 14 findings. Finally, we collect new error logs over 333 consecutive days to verify the 14 findings and present several error optimization cases. Experimental results show that these optimizations significantly reduce the number of errors and improve the availability and stability of the system. Our results and suggestions should be useful both for the development of cluster storage systems and for Lustre operation and maintenance.

Keywords