IEEE Open Journal of the Communications Society (Jan 2024)

A Worst-Case Latency and Age Analysis of Coded Distributed Computing With Unreliable Workers and Periodic Tasks

  • Federico Chiariotti,
  • Beatriz Soret,
  • Petar Popovski

DOI
https://doi.org/10.1109/OJCOMS.2024.3458802
Journal volume & issue
Vol. 5
pp. 5874 – 5889

Abstract


Over the past decade, the deep learning revolution has led to ever-increasing demands for computing power and working memory to support ever larger neural networks. As this coincided with the end of Moore’s law, distributed solutions have emerged as a natural answer. In particular, the novel Coded Distributed Computing (CDC) paradigm exploits results from coding theory to divide large tasks into redundant sets of smaller subtasks that are processed across multiple workers, making the computation more robust to stragglers and malicious worker nodes. Optimizing the use of these distributed computing resources is critical, as excessive redundancy can degrade performance and increase energy consumption. This work considers a CDC system receiving periodic tasks and derives the full distributions of the latency, reliability, and Peak Age of Information (PAoI) under worker diversity and random failures. The CDC system is modeled as a fork-join $D/M/(K, N)/L$ queue, in which only $K$ of the $N$ coded subtasks are necessary to solve the overall task, and each worker can hold up to $L$ subtasks in its queue. Our results are useful for resource optimization, as they show the relationship between system load, redundancy, and latency, as well as the trade-off between latency, reliability, and age performance.
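
The redundancy-latency relationship mentioned in the abstract can be illustrated with a deliberately simplified sketch. It is not the paper's analysis: it assumes identical workers with i.i.d. exponential service times at rate `mu`, no queueing (every worker is idle when the task arrives), and no failures, so a task completes when the fastest `k` of its `n` subtasks finish. The function names `kth_of_n_mean` and `simulate_kth_of_n` are hypothetical and chosen only for this illustration.

```python
import random


def kth_of_n_mean(k, n, mu):
    """Mean of the k-th smallest of n i.i.d. Exp(mu) service times.

    Assumption (not from the paper): homogeneous workers, empty queues,
    no failures. Uses the standard order-statistic result
    E[X_(k)] = (1/mu) * sum_{i=n-k+1}^{n} 1/i.
    """
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    return sum(1.0 / (i * mu) for i in range(n - k + 1, n + 1))


def simulate_kth_of_n(k, n, mu, trials=20000, seed=0):
    """Monte Carlo estimate of the same mean, as a sanity check."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Draw one service time per worker; the task is done once
        # the k-th fastest subtask has completed.
        times = sorted(rng.expovariate(mu) for _ in range(n))
        total += times[k - 1]
    return total / trials


if __name__ == "__main__":
    k, mu = 3, 1.0
    for n in (3, 4, 5, 6):
        print(n, kth_of_n_mean(k, n, mu), simulate_kth_of_n(k, n, mu))
```

Under these assumptions, adding workers for a fixed `k` lowers the mean completion time of a single task. The sketch ignores the extra load that redundant subtasks place on the system, which is the effect the queueing model in the paper is designed to capture.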

Keywords