EPJ Web of Conferences (Jan 2020)

Exploiting network restricted compute resources with HTCondor: a CMS experiment experience

  • Acosta-Silva Carles,
  • Delgado Peris Antonio,
  • Flix Molina José,
  • Frey Jaime,
  • Hernández José M.,
  • Livny Miron,
  • Pérez-Calero Yzquierdo Antonio,
  • Tannenbaum Todd

DOI
https://doi.org/10.1051/epjconf/202024509007
Journal volume & issue
Vol. 245
p. 09007

Abstract

Read online

In view of the increasing computing needs for the HL-LHC era, the LHC experiments are exploring new ways to access, integrate and use non-Grid compute resources. Accessing and making efficient use of Cloud and High Performance Computing (HPC) resources present a diversity of challenges for the CMS experiment. In particular, network limitations at the compute nodes in HPC centers prevent CMS pilot jobs to connect to its central HTCondor pool in order to receive payload jobs to be executed. To cope with this limitation, new features have been developed in both HTCondor and the CMS resource acquisition and workload management infrastructure. In this novel approach, a bridge node is set up outside the HPC center and the communications between HTCondor daemons are relayed through a shared file system. This conforms the basis of the CMS strategy to enable the exploitation of the Barcelona Supercomputing Center (BSC) resources, the main Spanish HPC site. CMS payloads are claimed by HTCondor condor_startd daemons running at the nearby PIC Tier-1 center and routed to BSC compute nodes through the bridge. This fully enables the connectivity of CMS HTCondor-based central infrastructure to BSC resources via the PIC HTCondor pool. Other challenges include building custom singularity images with CMS software releases, bringing conditions data to payload jobs, and custom data handling between BSC and PIC. This report describes the initial technical prototype, its deployment and tests, and future steps. A key aspect of the technique described in this contribution is that it could be universally employed in similar network-restrictive HPC environments elsewhere.