Automatic Monitoring of Large-Scale Computing Infrastructure

Kim Bockjoo; Bourilkov Dimitri

doi:10.1051/epjconf/202429507007

EPJ Web of Conferences (Jan 2024)

Affiliations

Kim Bockjoo: Department of Physics, University of Florida
Bourilkov Dimitri: Department of Physics, University of Florida

DOI: https://doi.org/10.1051/epjconf/202429507007
Journal volume & issue: Vol. 295, p. 07007

Abstract

Modern distributed computing systems produce large amounts of monitoring data. For these systems to operate smoothly, underperforming or failing components must be identified quickly, and preferably automatically, so that system managers can react accordingly. In this contribution, we analyze job and data-transfer monitoring data collected during the operation of the LHC computing infrastructure. The monitoring data is harvested from the Elasticsearch database and converted to formats suitable for further processing. Using various machine learning and deep learning techniques, we develop automatic tools for continuous monitoring of the health of the underlying systems. Our initial implementation is based on publicly available deep learning packages, PyTorch or TensorFlow, running on state-of-the-art GPU systems.
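The harvest, convert and flag workflow outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the field names `site` and `status`, the site labels, and the synthetic job counts are all hypothetical, and a simple median-based outlier score stands in for the machine learning and deep learning models, which the abstract does not specify. Only the nested `hits.hits[]._source` layout follows the standard shape of an Elasticsearch search response.

```python
import statistics
from collections import defaultdict


def flatten_hits(response):
    """Convert an Elasticsearch-style search response into a flat list of records."""
    return [hit["_source"] for hit in response["hits"]["hits"]]


def failure_rate_by_site(rows):
    """Compute the fraction of non-'ok' jobs per site (field names are assumed)."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for row in rows:
        totals[row["site"]] += 1
        if row["status"] != "ok":
            failures[row["site"]] += 1
    return {site: failures[site] / totals[site] for site in totals}


def flag_outliers(rates, threshold=3.5):
    """Flag sites whose failure rate is unusually high.

    Uses the modified z-score (median and median absolute deviation),
    a simple stand-in for the learned models mentioned in the abstract.
    """
    values = list(rates.values())
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        # No spread among typical sites: flag anything above the median.
        return sorted(s for s, v in rates.items() if v > med)
    return sorted(
        s for s, v in rates.items() if 0.6745 * (v - med) / mad > threshold
    )


def make_response(counts):
    """Build a synthetic response from {site: (ok_jobs, failed_jobs)}."""
    hits = []
    for site, (ok, failed) in counts.items():
        hits += [{"_source": {"site": site, "status": "ok"}}] * ok
        hits += [{"_source": {"site": site, "status": "failed"}}] * failed
    return {"hits": {"hits": hits}}


# Synthetic data for illustration only: four healthy sites and one failing site.
sample = make_response(
    {
        "site_a": (95, 5),
        "site_b": (96, 4),
        "site_c": (94, 6),
        "site_d": (97, 3),
        "site_e": (40, 60),
    }
)
rates = failure_rate_by_site(flatten_hits(sample))
flagged = flag_outliers(rates)
print(flagged)
```

In a real deployment the response would come from a query against the monitoring database rather than from `make_response`, and the outlier score would presumably be replaced by a trained model; the flat per-record format produced by `flatten_hits` is the kind of intermediate representation such a model could consume.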

Published in EPJ Web of Conferences

ISSN: 2100-014X (Online)
Publisher: EDP Sciences
Country of publisher: France
LCC subjects: Science: Physics
Website: http://www.epj-conferences.org/
