Workflow Systems for Large-Scale Scientific Data Analysis

Editors: Ulf Leser, Marcus Hilbrich, Sean R. Wilkinson, Rafael Ferreira da Silva

Size: 676 pages
Format: 17,0 x 24,0 cm
Publishing year: 2026
ISBN 978-3-98781-067-1
42,00 

The past two decades have seen a steep increase in the computational requirements for analyzing scientific data sets. The reasons are manifold: typical data sets have grown enormously in size, increasingly complex scientific questions have required more complex analysis methods, and the growth of methods based on machine learning and artificial intelligence has called for additional measures to handle model training and quality control. Furthermore, expectations in terms of reproducibility and reusability are much higher today than in the past, and issues like energy consumption and trustworthiness of results require additional attention. These trends have brought along the need to run analyses on large compute clusters and to apply advanced software infrastructures that support such diverse needs as much as possible. Scientific Workflow Management Systems (SWMS) are a class of systems created to cope with these requirements. An SWMS typically consists of multiple components, such as a workflow language and user interface to express complex analysis procedures as multi-step pipelines, virtualization and container technologies for task binaries to facilitate portability, and a workflow engine to execute analysis pipelines on distributed infrastructures in a robust and reproducible manner. SWMS rely on further components of cluster infrastructures, such as a distributed file system for robust data exchange and a resource manager for administering compute cores, memory, GPUs, and storage. When orchestrated properly, the interplay of these components leads to a reproducible, portable, and easily adaptable data analysis process.
SWMS emerged at the end of the last century, when scientists started to require scalability beyond single workstations. With the steep increase in data set sizes, the growing complexity of the research questions being studied, and the democratization of data science in general, their popularity has increased continuously since then. However, SWMS today work in a different environment than in the past. While first-generation SWMS were often designed as stand-alone applications, today they must interact with other infrastructures deployed in data centers to manage resources effectively and securely. Thus, system architectures have grown considerably in complexity, and the requirements for SWMS components have changed. However, a comprehensive and up-to-date description of the consequences of these developments, i.e., of the inner workings of current SWMS, is still lacking.
This book sets out to fill this gap. It is structured in four areas, devoted to introductory texts, concrete SWMS, important application areas of SWMS, and descriptions of advanced technological aspects, respectively. It features 25 chapters authored by 127 experts from 17 different countries. The book is intended to address both users of SWMS who want to gain insight into the functionality and premises of these systems, as well as their limitations, and developers of SWMS who want to learn about recent technological advancements.