EPJ Web of Conferences (Jan 2020)
Running HTC and HPC applications opportunistically across private, academic and public clouds
Abstract
The Fusion Science Demonstrator in the European Open Science Cloud for Research Pilot Project aimed to demonstrate that the fusion community can make use of distributed cloud resources. We developed a platform, Prominence, which enables users to transparently exploit idle cloud resources for running scientific workloads. In addition to standard HTC jobs, HPC jobs such as multi-node MPI are supported. All jobs run in containers to ensure that they run reliably anywhere and are reproducible. Cloud infrastructure is invisible to users, as all provisioning, including extensive failure handling, is completely automated. On-premises cloud resources can be utilised, and at times of peak demand workloads can burst onto external clouds. In addition to traditional “cloud-bursting” onto a single cloud, Prominence allows for bursting across many clouds in a hierarchical manner. Job requirements are taken into account, so jobs with special requirements, e.g. high memory or access to GPUs, are sent only to appropriate clouds. Here we describe Prominence, its architecture, and the challenges of using many clouds opportunistically, and report on our experiences with several fusion use cases.
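The idea of hierarchical bursting with requirement matching can be illustrated with a minimal sketch. This is not Prominence's implementation or API: the `Cloud` and `Job` classes, the `tier` field, the `select_cloud` function and the example resource figures are all hypothetical, chosen only to show how a job might be placed on the lowest tier of clouds that can satisfy its requirements.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Cloud:
    """Hypothetical description of one cloud (not a Prominence type)."""
    name: str
    tier: int            # 0 = on-premises; higher tiers are tried later
    free_cpus: int       # idle CPU cores currently available
    max_memory_gb: int   # largest memory a single job can request
    has_gpus: bool = False


@dataclass(frozen=True)
class Job:
    """Hypothetical job requirements."""
    cpus: int
    memory_gb: int
    needs_gpu: bool = False


def satisfies(cloud: Cloud, job: Job) -> bool:
    """Return True if the cloud can meet every requirement of the job."""
    return (
        cloud.free_cpus >= job.cpus
        and cloud.max_memory_gb >= job.memory_gb
        and (cloud.has_gpus or not job.needs_gpu)
    )


def select_cloud(job: Job, clouds: Sequence[Cloud]) -> Optional[Cloud]:
    """Pick the first suitable cloud, trying lower tiers before higher ones."""
    for cloud in sorted(clouds, key=lambda c: c.tier):
        if satisfies(cloud, job):
            return cloud
    return None  # no cloud can currently run this job


# Assumed example inventory; the figures are illustrative only.
CLOUDS = [
    Cloud("public", tier=2, free_cpus=1000, max_memory_gb=512, has_gpus=True),
    Cloud("on-prem", tier=0, free_cpus=8, max_memory_gb=64),
    Cloud("academic", tier=1, free_cpus=64, max_memory_gb=128),
]

if __name__ == "__main__":
    for job in (Job(4, 16), Job(32, 16), Job(4, 16, needs_gpu=True)):
        chosen = select_cloud(job, CLOUDS)
        print(job, "->", chosen.name if chosen else "no suitable cloud")
```

In this sketch a small job stays on-premises, a job too large for the idle on-premises capacity bursts to the next tier, and a GPU job goes only to a cloud that offers GPUs. A real system would also need to handle provisioning failures and changing capacity, which the sketch omits.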