Workflow Systems for Large-Scale Scientific Data Analysis

Editors: Ulf Leser, Marcus Hilbrich, Sean R. Wilkinson, Rafael Ferreira da Silva

Size: 676 pages
Format: 17.0 x 24.0 cm
Publishing year: 2026

ISBN 978-3-98781-067-1

42.00 €

The past two decades have seen a steep increase in the computational requirements for analyzing scientific data sets. The reasons are manifold: typical data sets have grown enormously in size, the growing complexity of scientific questions has required more complex analysis methods, and the rise of methods based on machine learning and artificial intelligence has called for additional measures to handle model training and quality control. Furthermore, expectations in terms of reproducibility and reusability are much higher today than in the past, and issues like energy consumption and trustworthiness of results require additional attention. These trends have brought along the need to run analyses on large compute clusters and to apply advanced software infrastructures that support such diverse needs as much as possible. Scientific Workflow Management Systems (SWMS) are a class of systems created to cope with these requirements. An SWMS typically consists of multiple components, such as a workflow language and user interface to express complex analysis procedures as multi-step pipelines, virtualization and container technologies for task binaries to facilitate portability, and a workflow engine to execute analysis pipelines on distributed infrastructures in a robust and reproducible manner. SWMS rely on further components of cluster infrastructures, such as a distributed file system for robust data exchange and a resource manager for administering compute cores, memory, GPUs, and storage. When orchestrated properly, the interplay of these components leads to a reproducible, portable, and easily adaptable data analysis process.
SWMS emerged at the end of the last century, when scientists started to require scalability beyond single workstations. With the steep increase in data set sizes, the growing complexity of the research questions being studied, and the democratization of data science in general, their popularity has increased continuously since then. However, SWMS today work in a different environment than in the past. While first-generation SWMS were often designed as stand-alone applications, today they must interact with other infrastructures deployed in data centers to manage resources effectively and securely. Thus, system architectures have grown considerably in complexity, and the requirements for SWMS components have changed. However, a comprehensive and up-to-date description of the consequences of these developments, i.e., of the inner workings of current SWMS, is still lacking.
This book sets out to fill this gap. It is structured in four areas, devoted to introductory texts, concrete SWMS, important application areas of SWMS, and descriptions of advanced technological aspects, respectively. It features 25 chapters authored by 127 experts from 17 different countries. The book is intended to address both users of SWMS who want to gain insights into the functionality and premises of these systems (as well as their limitations) and developers of SWMS who want to learn about recent technological advancements.

—
Contents

1 – The anatomy of scientific workflow management systems

Ulf Leser

2 – An Extended, Consolidated View on Specification Languages for Data Analysis Workflows

Sebastian Müller, Ninon De Mecquenem, Christopher Lazik, Svetlana Kulagina, Jan Arne Sparka, Fabian Lehmann, Ben Sherman, Marcus Hilbrich, Lars Grunske

3 – Towards Next Generation Data Engineering Pipelines

Kevin M. Kramer, Valerie Restat, Sebastian Strasser, Uta Störl, Meike Klettke

4 – An Ecosystem of Services for FAIR Computational Workflows

Sean R. Wilkinson, Johan Gustafsson, Finn Bacall, Khalid Belhajjame, Salvador Capella, Jose Maria Fernandez Gonzalez, Jacob Fosso Tande, Luiz Gadelha, Daniel Garijo, Patricia Grubel, Björn Grüning, Farah Zaib Khan, Sehrish Kanwal, Simone Leo, Stuart Owen, Luca Pireddu, Line Pouchard, Laura Rodríguez-Navas, Beatriz Serrano-Solano, Stian Soiland-Reyes, Baiba Vilne, Alan Williams, Merridee Ann Wouters, Frederik Coppens, Carole Goble

5 – Tackling Analytical Variability with Workflomics

Vedran Kasalica, Peter Kok, Rob Marissen, Mario Frank, Magnus Palmblad, Anna-Lena Lamprecht

6 – Designing Benchmarks for Data Analysis Workflow Systems

Rafael Moczalla, Ilin Tolovski, Tilmann Rabl

7 – Reproducible Multi-Cloud Data Analysis with Nextflow

Paolo Di Tommaso, Ben Sherman

8 – Managing Distributed Scientific Workflows with Globus

Kyle Chard, J. Gregory Pauloski, Ryan Chard, Ian Foster

9 – Programming Task-Based Workflows with COMPSs

Rosa M. Badia, Javier Conejero, Jorge Ejarque, Daniele Lezzi, Francesc Lordan, Raül Sirvent

10 – Serverless Workflow Execution Models and Engines

Maciej Malawski, Bartosz Balis, Tomasz Szydło, Aleksander Slominski

11 – Benchmarking and Simulating Scientific Workflow Systems: A Review

Tainã Coleman, Henri Casanova, Frédéric Suter, Sean R. Wilkinson, Ketan Maheshwari, Rafael Ferreira da Silva

12 – Differences in Workflow Systems: A Use-Case Driven Comparison

Vasilis Bountris, Fabian Lehmann, Felix Kummer, Luis Neuhaus, Ulf Leser

13 – Portable and Scalable Workflows for Earth Observation Data Analysis with Nextflow

Fabian Lehmann, Katarzyna Ewa Lewińska, David Frantz, Dirk Pflugmacher, Florian Katerndahl, Felix Kummer, Patrick Hostert, Ulf Leser

14 – Reuse and Reproduce Bioinformatic Pipelines Using Scientific Workflow Systems

Sarah Cohen-Boulakia, Frédéric Lemoine, George Marchment, Marine Djaffardjy, Alban Gaignard, Clémence Sebe, Khalid Belhajjame

15 – Workflows in Materials Science

Daniel T. Speckhard, Martin Kuban, Christoph T. Koch, Joseph F. Rudzinski, Claudia Draxl

16 – pyiron – Developing and Managing Materials Science Workflows

Tilmann Hickel, Jan Janssen, Sarath Menon, Osamu Waseda, Liam Huber, Jörg Neugebauer

17 – Predicting the Performance of Scientific Workflow Tasks for Cluster Resource Management: An Overview of the State of the Art

Jonathan Bader, Kathleen West, Soeren Becker, Svetlana Kulagina, Fabian Lehmann, Lauritz Thamsen, Henning Meyerhenke, Odej Kao

18 – Optimizing Workflow Execution by Cost-effective I/O Monitoring, Bottleneck Analysis, and Proactive Resource Assignment

Joel Witzke, Ansgar Lößer, Jonathan Bader, Fabian Lehmann, Björn Scheuermann, Florian Schintke

19 – From Suspicious Results to Insights: A Study on Debugging Practices in Scientific Data Analysis Workflows

Anh Duc Vu, Christos Tsigkanos, Caroline Jay, Timo Kehrer

20 – Resource Allocation of DAWs using Mathematical Programming

Somayeh Mohammadi, Latif Pourkarimi, Somayeh Abdi, Ninon De Mecquenem, Ulf Leser, Knut Reinert

21 – Reprohackathons: Training Efforts to Increase Bioinformatics Reproducibility Using Scientific Workflow Systems

Sarah Cohen-Boulakia, George Marchment, Thomas Cokelaer, Frédéric Lemoine

22 – Interactivity in Scientific Workflows: A Survey

Nourhan Elfaramawy, Kedi Cao, Matthias Weidlich

23 – Provenance in Support of Workflows for Science

Paolo Missier, Débora Pina, Adriane Chapman, Bertram Ludäscher

24 – Energy-Aware Workflow Execution: An Overview of Techniques for Saving Energy and Emissions in Scientific Compute Clusters

Lauritz Thamsen, Yehia Elkhatib, Paul Harvey, Syed Waqar Nabi, Jeremy Singer, Wim Vanderbauwhede

25 – Privacy Concerns in Workflows and their Provenance: Where are We?

Ahmad Qadeib Alban, Khalid Belhajjame, Daniela Grigori
