GMS Medizinische Informatik, Biometrie und Epidemiologie (Jul 2023)
Challenges for the development of automated RNA-seq analyses pipelines
Abstract
Background: Transcriptional changes are hallmarks of development and disease. RNA sequencing (RNA-seq) allows qualitative and quantitative analysis of RNA expression. Raw RNA-seq data must pass through a multi-step computational pipeline before meaning can be derived from the measurements. Such analyses are often performed with scripts. However, the use of workflow management systems (WFMS) should be encouraged in order to enhance the reproducibility of results, to establish best practices for data analysis, and to share data analysis workflows. In this work, we created RNA-seq data analysis workflows in three WFMS, namely Galaxy (free, open source), KNIME (free, commercial, and partially open source), and CLC (commercial, closed source).
Methods: These tools were compared using a variety of criteria ranging from installation to workflow execution and sharing. Four different workflows (WFs) performing RNA-seq data analysis were successfully constructed in all three WFMS. In summary, Galaxy currently provides the largest number of analysis tools for RNA-seq, while CLC offers the most intuitive visualization. KNIME lags behind in these two aspects but excels in other areas, such as machine learning.
Results: Since we had already decided on the three WFMS, many of the criteria we suggest for WFMS evaluation do not apply to our situation, and we focus on WF creation here. While it was possible to construct RNA-seq analysis WFs with all three WFMS, the constructed WFs differ. These differences led to disparate results, which were also sensitive to processing settings and, in the worst case, led to different biological interpretations. We further performed an in-depth analysis of the challenges of using the three WFMS and provide decision support for choosing a WFMS for RNA-seq analysis. In short, RNA-seq analysis is currently best performed using Galaxy, followed by CLC and then KNIME. The level of expertise with these WFMS should be taken into account during WFMS selection.
Finally, we share the WFs in the hope that doing so will reduce the use of scripts and lead to the development of best practices for RNA-seq data analysis.
Keywords