SoftwareX (Dec 2024)

Version [1.0]- [SAMbA-RaP is music to scientists’ ears: Adding provenance support to spark-based scientific workflows]

  • Thaylon Guedes,
  • Marta Mattoso,
  • Marcos Bedo,
  • Daniel de Oliveira

Journal volume & issue
Vol. 28
p. 101927

Abstract

Researchers benefit from Apache Spark for executing scientific workflows at scale, but they often lack provenance support because of limitations in the framework’s design. This paper presents SAMbA-RaP, a provenance extension for Apache Spark. It focuses on (i) executing external, black-box applications with intensive I/O operations within the workflow while leveraging Spark’s in-memory data structures, (ii) extracting domain-specific data from in-memory data structures, and (iii) implementing data versioning and capturing the provenance graph of a workflow execution. SAMbA-RaP also provides real-time reports via a web interface, enabling scientists to explore dataflow transformations and content evolution while their workflows run.
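
As a rough illustration of items (i) and (iii), the sketch below runs a black-box program on in-memory data and records which data version was derived from which. It is not SAMbA-RaP's API and it does not use Spark. The names `ProvenanceGraph`, `run_black_box`, and `version_id` are hypothetical, and the "external application" is a one-line Python subprocess standing in for a real scientific program.

```python
import hashlib
import subprocess
import sys


def version_id(data):
    """Content-addressed version label: identical bytes get the same id."""
    return hashlib.sha256(data).hexdigest()[:12]


class ProvenanceGraph:
    """Hypothetical in-memory store: data versions are nodes, activities are edges."""

    def __init__(self):
        self.versions = {}  # version id -> content
        self.edges = []     # (source id, activity name, derived id)

    def add_version(self, data):
        vid = version_id(data)
        self.versions[vid] = data
        return vid

    def derived_from(self, vid):
        """Return (source id, activity) pairs that produced the given version."""
        return [(src, act) for src, act, dst in self.edges if dst == vid]


def run_black_box(cmd, data, graph, activity):
    """Feed in-memory bytes to an external command and record the derivation."""
    src = graph.add_version(data)
    done = subprocess.run(cmd, input=data, capture_output=True, check=True)
    dst = graph.add_version(done.stdout)
    graph.edges.append((src, activity, dst))
    return done.stdout


# Assumption: a trivial stand-in for an external black-box program.
UPPERCASE = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.write(sys.stdin.read().upper())",
]

graph = ProvenanceGraph()
result = run_black_box(UPPERCASE, b"acgt", graph, "uppercase")
```

Hashing the content to label versions is one simple design choice: it lets the graph be queried backwards from any output (`graph.derived_from(...)`) without the external program knowing it is being tracked.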

Keywords