Sustainable data analysis with Snakemake [version 2; peer review: 2 approved]
Felix Mölder,
Kim Philipp Jablonski,
Brice Letcher,
Michael B. Hall,
Christopher H. Tomkins-Tinch,
Vanessa Sochat,
Jan Forster,
Soohyun Lee,
Sven O. Twardziok,
Alexander Kanitz,
Andreas Wilm,
Manuel Holtgrewe,
Sven Rahmann,
Sven Nahnsen,
Johannes Köster
Affiliations
Felix Mölder
Algorithms for reproducible bioinformatics, Genome Informatics, Institute of Human Genetics, University Hospital Essen, University of Duisburg-Essen, Essen, Germany
Kim Philipp Jablonski
Swiss Institute of Bioinformatics (SIB), Basel, Switzerland
Brice Letcher
EMBL-EBI, Hinxton, UK
Michael B. Hall
EMBL-EBI, Hinxton, UK
Christopher H. Tomkins-Tinch
Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, USA
Vanessa Sochat
Stanford University Research Computing Center, Stanford University, Stanford, USA
Jan Forster
German Cancer Consortium (DKTK, partner site Essen) and German Cancer Research Center, DKFZ, Heidelberg, Germany
Soohyun Lee
Biomedical Informatics, Harvard Medical School, Harvard University, Boston, USA
Sven O. Twardziok
Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin, Humboldt-Universität zu Berlin, and Berlin Institute of Health (BIH), Center for Digital Health, Berlin, Germany
Alexander Kanitz
Biozentrum, University of Basel, Basel, Switzerland
Andreas Wilm
Microsoft Singapore, Singapore, Singapore
Manuel Holtgrewe
CUBI – Core Unit Bioinformatics, Berlin Institute of Health, Berlin, Germany
Sven Rahmann
Genome Informatics, Institute of Human Genetics, University Hospital Essen, University of Duisburg-Essen, Essen, Germany
Sven Nahnsen
Quantitative Biology Center (QBiC), University of Tübingen, Tübingen, Germany
Johannes Köster
Algorithms for reproducible bioinformatics, Genome Informatics, Institute of Human Genetics, University Hospital Essen, University of Duisburg-Essen, Essen, Germany
Data analysis often entails a multitude of heterogeneous steps, from the application of various command line tools to the use of scripting languages such as R or Python for the generation of plots and tables. It is widely recognized that data analyses should ideally be conducted in a reproducible way. Reproducibility enables technical validation and regeneration of results on the original or even new data. However, reproducibility alone is by no means sufficient to deliver an analysis that has lasting impact (i.e., is sustainable) for the field, or even just for one research group. We postulate that it is equally important to ensure adaptability and transparency. The former describes the ability to modify the analysis to answer extended or slightly different research questions. The latter describes the ability to understand the analysis in order to judge whether it is not only technically but also methodologically valid. Here, we analyze the properties needed for a data analysis to become reproducible, adaptable, and transparent. We show how the popular workflow management system Snakemake can be used to guarantee these properties, and how it enables an ergonomic, combined, unified representation of all steps involved in data analysis, ranging from raw data processing to quality control and fine-grained, interactive exploration and plotting of final results.