FAIRly big: A framework for computationally reproducible processing of large-scale data

Adina S. Wagner; Laura K. Waite; Małgorzata Wierzba; Felix Hoffstaedter; Alexander Q. Waite; Benjamin Poldrack; Simon B. Eickhoff; Michael Hanke

doi:10.1038/s41597-022-01163-2

Scientific Data (Mar 2022)

Affiliations

Adina S. Wagner: Institute of Neuroscience and Medicine, Brain & Behaviour (INM-7), Research Center Jülich
Laura K. Waite: Institute of Neuroscience and Medicine, Brain & Behaviour (INM-7), Research Center Jülich
Małgorzata Wierzba: Institute of Neuroscience and Medicine, Brain & Behaviour (INM-7), Research Center Jülich
Felix Hoffstaedter: Institute of Neuroscience and Medicine, Brain & Behaviour (INM-7), Research Center Jülich
Alexander Q. Waite: Institute of Neuroscience and Medicine, Brain & Behaviour (INM-7), Research Center Jülich
Benjamin Poldrack: Institute of Neuroscience and Medicine, Brain & Behaviour (INM-7), Research Center Jülich
Simon B. Eickhoff: Institute of Neuroscience and Medicine, Brain & Behaviour (INM-7), Research Center Jülich
Michael Hanke: Institute of Neuroscience and Medicine, Brain & Behaviour (INM-7), Research Center Jülich

DOI: https://doi.org/10.1038/s41597-022-01163-2
Journal volume & issue: Vol. 9, no. 1
pp. 1 – 17

Abstract

Large-scale datasets present unique opportunities to perform scientific investigations with unprecedented breadth. However, they also pose considerable challenges for the findability, accessibility, interoperability, and reusability (FAIR) of research outcomes due to infrastructure limitations, data usage constraints, or software license restrictions. Here we introduce a DataLad-based, domain-agnostic framework suitable for reproducible data processing in compliance with open science mandates. The framework attempts to minimize platform idiosyncrasies and performance-related complexities. It affords the capture of machine-actionable computational provenance records that can be used to retrace and verify the origins of research outcomes, as well as be re-executed independent of the original computing infrastructure. We demonstrate the framework’s performance using two showcases: one highlighting data sharing and transparency (using the studyforrest.org dataset) and another highlighting scalability (using the largest public brain imaging dataset available: the UK Biobank dataset).

Published in Scientific Data

ISSN: 2052-4463 (Online)
Publisher: Nature Portfolio
Country of publisher: United Kingdom
LCC subjects: Science
Website: https://www.nature.com/sdata/
