SI2-SSI: Pegasus: Automating Compute and Data Intensive Science

2017 - present
There is a growing gap between the capabilities offered by on-campus and off-campus cyberinfrastructures (CI) and the ability of researchers to effectively harness these capabilities to generate simulated data and to process observational and instrumental data. Faculty and students struggle to manage data that does not fit on their laptops or cannot be processed in an Excel spreadsheet. Even national and international collaborations that are familiar with advanced computing technologies have trouble developing and sustaining their computational methods as the underlying CI changes and diversifies and as more complex methods are developed.

For more than 15 years, the Pegasus Workflow Management System has been designed, implemented, and supported to provide abstractions that enable scientists to focus on structuring their computations without worrying about the details of the target CI. To support these workflow abstractions, Pegasus provides automation capabilities that seamlessly map workflows onto target resources, sparing scientists the overhead of managing data flow, job scheduling, fault recovery, and adaptation of their applications. Automation enables the delivery of services that consider criteria such as time to solution while also taking into account efficient usage of resources and the throughput of tasks and data-transfer requests. The power of these abstractions was demonstrated earlier this year, when an international collaboration used Pegasus to manage, across a diverse set of resources, the compute- and data-intensive workflows that confirmed the existence of gravitational waves as predicted by Einstein's theory of relativity. Experience from working with diverse scientific domains - astronomy, bioinformatics, climate modeling, earthquake science, gravitational-wave science, and material science - has uncovered opportunities for further automation of scientific workflows.
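The separation between an abstract workflow and its mapping onto resources can be sketched in a few lines of Python. This is an illustrative toy, not the Pegasus API: every class and function name below is invented for the example, and a real planner also inserts data-staging, cleanup, and fault-recovery jobs.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    """One node of an abstract workflow: a named computation plus the
    logical files it reads and writes (no mention of any execution site)."""
    name: str
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)

@dataclass
class AbstractWorkflow:
    tasks: list

    def topological_order(self):
        # Order tasks so every producer of a file runs before its consumers.
        produced = {f: t for t in self.tasks for f in t.outputs}
        order, seen = [], set()

        def visit(t):
            if t.name in seen:
                return
            seen.add(t.name)
            for f in t.inputs:
                if f in produced:       # external inputs have no producer
                    visit(produced[f])
            order.append(t)

        for t in self.tasks:
            visit(t)
        return order

def plan(workflow, site):
    """Toy 'planner': bind each abstract task to a concrete job on one
    site, in dependency order."""
    return [f"run {t.name} on {site}" for t in workflow.topological_order()]

# Tasks are declared in any order; planning recovers the dependency order.
wf = AbstractWorkflow(tasks=[
    Task("analyze", inputs=["clean.dat"], outputs=["result.dat"]),
    Task("preprocess", inputs=["raw.dat"], outputs=["clean.dat"]),
])
print(plan(wf, "campus-cluster"))
# → ['run preprocess on campus-cluster', 'run analyze on campus-cluster']
```

The point of the abstraction is that the same `AbstractWorkflow` can be re-planned onto a different site, or a different CI entirely, without the scientist touching the workflow description.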
This project, which is a collaboration between the Information Sciences Institute at USC and the UW Center for High Throughput Computing (CHTC), will address these opportunities by innovating in the following areas: 1) expansion of automation methods to include resource provisioning ahead of and during workflow execution; 2) support for dynamic workflows, which change their execution path based on user input or other events; and 3) data-aware algorithms and data-sharing mechanisms in high-throughput environments and high-performance systems. To support a broader group of "long-tail" scientists, the project will also include usability improvements and outreach, education, and training activities.
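As one concrete illustration of the third area, a data-aware scheduler might place a task on the site that already holds the largest volume of its input data, minimizing wide-area transfer. The sketch below is a hypothetical heuristic written for this description, not an actual Pegasus algorithm; all names and data are invented.

```python
def data_aware_site(input_sizes, replicas, sites):
    """Pick the execution site holding the most bytes of a task's inputs.

    input_sizes: {logical file name: size in bytes}
    replicas:    {logical file name: set of sites holding a copy}
    sites:       candidate execution sites
    """
    def local_bytes(site):
        # Total size of this task's inputs already resident at `site`.
        return sum(size for f, size in input_sizes.items()
                   if site in replicas.get(f, ()))
    return max(sites, key=local_bytes)

# A large event file lives only at one site; a tiny config is everywhere.
sizes = {"events.h5": 50_000_000, "config.yml": 2_000}
where = {"events.h5": {"osg-site"}, "config.yml": {"campus", "osg-site"}}
print(data_aware_site(sizes, where, ["campus", "osg-site"]))
# → osg-site  (50 MB local there vs. 2 KB at campus)
```

Real data-aware scheduling must also weigh queue depth, network bandwidth, and storage quotas, but the heuristic captures the core idea: let data location, not just CPU availability, drive placement.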