IEEE Access (Jan 2024)

Optimizing Multi-Level Checkpointing for Distributed Deep Learning Workloads on Cloud Spot VM Clusters

  • Yonghyeon Cho
  • Yoochan Kim
  • Kihyun Kim
  • Jinwoo Kim
  • Hong-Yeon Kim
  • Youngjae Kim

DOI: https://doi.org/10.1109/ACCESS.2024.3446770
Journal volume & issue: Vol. 12, pp. 116891–116904

Abstract

Spot Virtual Machines (Spot VMs) offer access to underutilized computing resources at significant discounts, sometimes up to 90% off regular on-demand pricing. For budget-conscious organizations, using clusters of Spot VMs is an effective strategy for training large-scale distributed deep learning (DDL) models. However, the risk of preemption by cloud providers poses a challenge, as it can result in the loss of unsaved data in memory and local storage. To mitigate this risk, one solution involves using networked storage systems for checkpoints, though their low write throughput can slow down training. An alternative approach is to use the memory of a remote, on-demand computing node for temporary checkpoint storage, balancing data protection with training efficiency. In this paper, we propose a novel approach, ACUTE, to optimize temporary checkpointing in the memory of on-demand nodes during DDL training. ACUTE includes three key optimizations: 1) Check-Mem, which reduces memory copying overhead on the training node; 2) Check-Trans, which accelerates checkpoint data transfer through parallel processing; and 3) Check-Pack, which eliminates unnecessary data unpacking and repacking. Implemented using PyTorch’s distributed data-parallel library, ACUTE was evaluated against two other checkpointing schemes on AWS VM instances. Results show that ACUTE reduces makespan delay to nearly zero and achieves, on average, 43.30% faster checkpointing compared to a baseline multi-level checkpointing scheme, without compromising the precision of Deep Neural Network (DNN) models.
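To make the multi-level idea in the abstract concrete, the sketch below illustrates (in plain PyTorch) the general pattern of a fast first-level checkpoint into host memory followed by an asynchronous flush to slower, durable storage, so the training loop is not blocked by the slow write. This is only a minimal illustration, not the authors' ACUTE implementation; the function names snapshot_to_cpu and async_persist, the destination path, and the use of a background thread are assumptions introduced here for clarity.

```python
# Minimal two-level checkpointing sketch (illustrative only, not ACUTE).
# Level 1: copy model state into host (CPU) memory -- fast, brief blocking.
# Level 2: persist that snapshot to durable storage in a background thread,
#          overlapping the slow write with continued training.
import threading

import torch
import torch.nn as nn


def snapshot_to_cpu(model: nn.Module) -> dict:
    """First-level checkpoint: clone parameters into host memory."""
    return {k: v.to("cpu", copy=True) for k, v in model.state_dict().items()}


def async_persist(snapshot: dict, path: str) -> threading.Thread:
    """Second-level checkpoint: write the snapshot to durable storage
    without blocking the caller."""
    t = threading.Thread(target=torch.save, args=(snapshot, path), daemon=True)
    t.start()
    return t


if __name__ == "__main__":
    model = nn.Linear(128, 10)
    snap = snapshot_to_cpu(model)            # cheap in-memory snapshot
    writer = async_persist(snap, "ckpt.pt")  # slow write overlaps training
    # ... training would continue here ...
    writer.join()                            # ensure durability before exit
```

In ACUTE the second level is the memory of a remote on-demand node (with networked storage as a further fallback) rather than a local file, and the paper's Check-Mem, Check-Trans, and Check-Pack optimizations target the copy, transfer, and packing stages of exactly this pipeline.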

Keywords