EPJ Web of Conferences (Jan 2024)

Distributed Machine Learning Workflow with PanDA and iDDS in LHC ATLAS

  • Guan Wen,
  • Maeno Tadashi,
  • Zhang Rui,
  • Weber Christian,
  • Wenaus Torre,
  • Alekseev Aleksandr,
  • Barreiro Megino Fernando Harald,
  • De Kaushik,
  • Karavakis Edward,
  • Klimentov Alexei,
  • Korchuganova Tatiana,
  • Lin FaHui,
  • Nilsson Paul,
  • Yang Zhaoyu,
  • Zhao Xin

DOI
https://doi.org/10.1051/epjconf/202429504019
Journal volume & issue
Vol. 295
p. 04019

Abstract


Machine Learning (ML) has become an important tool for High Energy Physics analysis. As dataset sizes at the Large Hadron Collider (LHC) increase and, at the same time, search spaces grow ever larger in order to exploit the full physics potential, more and more computing resources are required to process these ML tasks. In addition, complex advanced ML workflows are being developed in which one task may depend on the results of previous tasks. How to make use of the vast distributed CPU/GPU resources in the WLCG for these large, complex ML tasks has become an active research area. In this paper, we present our efforts to enable the execution of distributed ML workflows on the Production and Distributed Analysis (PanDA) system and the intelligent Data Delivery Service (iDDS). First, we describe how PanDA and iDDS handle large-scale ML workflows, including the implementation that processes workloads on diverse and geographically distributed computing resources. Next, we report real-world use cases, such as HyperParameter Optimization, Monte Carlo Toy confidence limit calculation, and Active Learning. Finally, we conclude with future plans.
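The abstract mentions workflows in which one task may depend on the results of previous tasks. As a minimal illustration of this idea (and NOT the PanDA/iDDS API — all names below are hypothetical), such a workflow can be modeled as a directed acyclic graph and scheduled with a topological sort, so each task is dispatched only after its dependencies finish:

```python
# Hypothetical sketch: an ML workflow as a DAG of dependent tasks,
# ordered with Kahn's algorithm. Task names are illustrative only;
# this does not reflect the actual PanDA/iDDS interfaces.
from collections import deque


def topological_order(tasks):
    """Return an execution order where every task runs after its dependencies.

    tasks: dict mapping task name -> list of dependency task names.
    """
    indegree = {t: len(deps) for t, deps in tasks.items()}
    children = {t: [] for t in tasks}
    for t, deps in tasks.items():
        for d in deps:
            children[d].append(t)  # edge: dependency -> dependent task

    ready = deque(t for t, n in indegree.items() if n == 0)
    order = []
    while ready:
        t = ready.popleft()
        order.append(t)
        for c in children[t]:
            indegree[c] -= 1
            if indegree[c] == 0:
                ready.append(c)

    if len(order) != len(tasks):
        raise ValueError("cycle detected in workflow")
    return order


# Example: one iteration of an Active-Learning-style loop unrolled into tasks.
workflow = {
    "simulate": [],
    "train": ["simulate"],
    "evaluate": ["train"],
    "refine_search_space": ["evaluate"],
}
print(topological_order(workflow))
# → ['simulate', 'train', 'evaluate', 'refine_search_space']
```

In a real system the "dispatch" step would submit each ready task to distributed CPU/GPU resources rather than run it locally; the ordering logic is the same.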