EPJ Web of Conferences (Jan 2024)

Lifecycle Management, Business Continuity and Disaster Recovery Planning for the LHCb Experiment Control System Infrastructure

  • Cifra Pierfrancesco,
  • Sborzacchi Francesco,
  • Neufeld Niko,
  • Cardoso Luis Granado

DOI
https://doi.org/10.1051/epjconf/202429507028
Journal volume & issue
Vol. 295
p. 07028

Abstract

LHCb (Large Hadron Collider beauty) is one of the four large particle physics experiments at the LHC. It studies differences between particles and antiparticles, as well as very rare decays in the charm and beauty sector of the Standard Model. The Experiment Control System (ECS) is in charge of the configuration, control, and monitoring of the various subdetectors and of all areas of the online system. It is built on top of hundreds of Linux virtual machines (VMs) running on a Red Hat Enterprise Virtualisation cluster. For such a mission-critical project, it is essential to keep the system operational: LHCb’s Data Acquisition cannot run without the ECS, and a failure would likely mean the loss of valuable data. In the event of a disruptive fault, normal operations must be restored as quickly as possible. In addition, VM lifecycle management is a complex task that needs to be simplified, automated, and validated in all of its aspects, with a particular focus on deployment, provisioning, and monitoring. The paper describes LHCb’s approach to this challenge, including the methods, solutions, technology, and architecture adopted. We also discuss the limitations and problems encountered, and we present the results of the tests performed.