SC22 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

Workshops Archive

Extending MPI API Support in MANA


Workshop: Third International Symposium on Checkpointing for Supercomputing (SuperCheck-SC22)

Authors: Tarun Malviya (Northeastern University), Zhengji Zhao and Rebecca Hartman-Baker (Lawrence Berkeley National Laboratory (LBNL)), and Gene Cooperman (Northeastern University)


Abstract: MANA is an MPI-Agnostic, Network-Agnostic transparent checkpointing tool for MPI applications and a recent breakthrough in transparent checkpointing. NERSC has been collaborating with the MANA team at Northeastern University and MemVerge, Inc. to enable MANA for NERSC’s top applications, with the goal of supporting the real-time workloads of DOE’s experimental facilities by checkpointing lower-priority jobs and resuming them later. MANA employs a novel split-process approach and works by intercepting MPI API calls, both to ensure that transparent checkpointing occurs at a state that is consistent across MPI processes and to achieve network agnosticism. Writing proper wrapper functions for the MPI APIs is therefore critical for MANA to checkpoint and restart MPI applications correctly and efficiently. While implementing a wrapper function is straightforward for most MPI APIs, some are not trivial to intercept correctly, and the major challenge is to ensure that an API behaves the same after interception as it did before. In this lightning talk, we will review the current status of MPI API support in MANA and present the challenges in supporting various parts of the MPI API, including communicators, objects, data types, environments, etc., as well as the roadmap for extending MPI API support to current and future versions of the MPI standard. What we learned from supporting MPI APIs in MANA will be helpful to similar approaches that intercept MPI APIs.
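To make the wrapper idea concrete, here is a minimal sketch in C of one way an interception layer might keep application-visible handles valid across a restart. It is an illustration under stated assumptions, not MANA's actual code: it does not link against MPI, and every name in it (virt_comm_t, real_comm_t, comm_table, wrapper_comm_create, wrapper_comm_size, simulate_restart, and the toy "real library" functions) is hypothetical. The assumed technique is a translation table: the application holds a stable virtual handle, and the wrapper maps it to whatever handle the underlying library currently uses.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these names come from MPI or MANA. */
typedef int virt_comm_t;  /* handle the application sees (stable)       */
typedef int real_comm_t;  /* handle owned by the underlying library     */

#define MAX_COMMS 16
#define WRAP_OK   0
#define WRAP_ERR  1

/* Toy "real library": handles are only valid for the current generation,
 * mimicking a library that is re-initialized when a job is restarted. */
int lib_generation = 0;

real_comm_t real_comm_create(int index) {
    return lib_generation * 100 + index;
}

int real_comm_size(real_comm_t real, int *size) {
    if (size == NULL || real < 0 || real / 100 != lib_generation) {
        return WRAP_ERR;  /* stale or invalid handle */
    }
    *size = 4;            /* fixed size, purely for illustration */
    return WRAP_OK;
}

/* Translation table: index is the virtual handle, value is the real one. */
real_comm_t comm_table[MAX_COMMS];
int comm_count = 0;

/* Wrapper: create a real handle, but hand a virtual one to the caller. */
virt_comm_t wrapper_comm_create(void) {
    assert(comm_count < MAX_COMMS);
    comm_table[comm_count] = real_comm_create(comm_count);
    return comm_count++;
}

/* Wrapper: translate virtual -> real, then forward to the real call. */
int wrapper_comm_size(virt_comm_t v, int *size) {
    if (v < 0 || v >= comm_count) {
        return WRAP_ERR;
    }
    return real_comm_size(comm_table[v], size);
}

/* Simulated restart: the library starts a new generation, so every real
 * handle is recreated while the virtual handles stay unchanged. */
void simulate_restart(void) {
    lib_generation++;
    for (int i = 0; i < comm_count; i++) {
        comm_table[i] = real_comm_create(i);
    }
}
```

In this sketch, a caller that obtains a handle from wrapper_comm_create() can keep using it with wrapper_comm_size() after simulate_restart(), whereas a real handle saved before the restart is rejected by real_comm_size(). That is the sense in which a wrapper has to preserve the behavior the application expects, even though the state underneath it has changed.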

MANA uses DMTCP as its underlying checkpointing tool and is implemented as a plugin within the DMTCP framework. MANA is an open-source project.




