EPJ Web of Conferences (Jan 2019)
Evaluating Kubernetes as an orchestrator of the Event Filter computing farm of the Trigger and Data Acquisition system of the ATLAS experiment at the Large Hadron Collider
Abstract
The ATLAS experiment at the LHC relies on a complex and distributed Trigger and Data Acquisition (TDAQ) system to gather and select particle collision data. The Event Filter (EF) component of the TDAQ system is responsible for executing advanced selection algorithms, reducing the data rate to a level suitable for recording to permanent storage. The EF functionality is provided by a computing farm made up of thousands of commodity servers, each executing one or more processes. Moving the EF farm management towards a solution based on software containers is one of the main themes of the ATLAS TDAQ Phase-II upgrades in the area of the online software; it would make it possible to open new possibilities for fault tolerance, reliability and scalability. This paper presents the results of an evaluation of Kubernetes as a possible orchestrator of the ATLAS TDAQ EF computing farm. Kubernetes is a system for advanced management of containerized applications in large clusters. This paper will first highlight some of the technical solutions adopted to run the offline version of today’s EF software in a Docker container. Then it will focus on some scaling performance measurements executed with a cluster of 1000 CPU cores. In particular, this paper will report about the way Kubernetes scales in deploying containers as a function of the cluster size and show how a proper tuning of the Query per Second (QPS) Kubernetes parameter set can improve the scaling of applications in terms of running replicas. Finally, an assessment will be given about the possibility to use Kubernetes as an orchestrator of the EF computing farm in LHC’s Run 4.