Workflow systems for large-scale scientific data analysis

Editors: Ulf Leser, Marcus Hilbrich, Sean R. Wilkinson, Rafael Ferreira da Silva

Length: 676 pages
Format: 17.0 x 24.0 cm
Year of publication: 2026
ISBN 978-3-98781-067-1
42,00 

The past two decades have seen a steep increase in the computational requirements for analyzing scientific data sets. The reasons are manifold: typical data sets have grown enormously in size, the growing complexity of scientific questions has required more complex analysis methods, and the rise of methods based on machine learning and artificial intelligence has called for additional measures to handle model training and quality control. Furthermore, expectations of reproducibility and reusability are much higher today than in the past, and issues like energy consumption and the trustworthiness of results require additional attention. These trends have brought with them the need to run analyses on large compute clusters and to apply advanced software infrastructures that support these diverse needs as far as possible.

Scientific workflow systems are a class of systems created to cope with these requirements. They typically consist of multiple components themselves and use other services commonly available on clusters: scientists use workflow languages to express complex analysis procedures as multi-step pipelines, wrap their task binaries in containers or virtual machines to facilitate portability, employ workflow engines to execute these pipelines on distributed infrastructures, integrate resource managers to manage CPUs, memory, or bandwidth within a cluster, and rely on distributed file systems for robust data exchange. When orchestrated properly, the interplay of these components leads to a reproducible, portable, and easily adaptable data analysis process.

Scientific workflow management systems (SWMS) emerged at the end of the last century, when scientists started to require scalability beyond single workstations. With the continuous increase in data set sizes and the democratization of data science in general, which brought thousands of new usage scenarios for large-scale data analysis, their popularity has increased especially over the last decade. However, SWMS today operate in a different environment than in the past. While first-generation SWMS were often designed as stand-alone applications, today they must interact with other infrastructures deployed in data centers to manage resources effectively and securely. As a result, system architectures have grown considerably in complexity, and the requirements on SWMS components have changed. Yet a comprehensive and up-to-date description of these developments, of the inner workings of current systems, and of current applications of SWMS technology across scientific domains is lacking.

This book sets out to provide introductory and advanced knowledge on SWMS. It is structured in four areas, devoted to introductory texts, SWMS applications, concrete SWMS systems, and descriptions of advanced technological aspects of SWMS, respectively. It features 25 chapters authored by more than 100 people from 12 different countries. The book is intended to address both users of SWMS who want to gain insights into the functionality and premises of SWMS, as well as their limitations, and developers of SWMS who want to learn about recent technological advancements.