Workshop: Third International Symposium on Checkpointing for Supercomputing (SuperCheck-SC22)
Authors: Gene Cooperman (Northeastern University), Dahong Li (MemVerge Inc), and Zhengji Zhao (Lawrence Berkeley National Laboratory (LBNL))
Abstract: Testing correctness of either a new MPI implementation or a transparent checkpointing package for MPI is inherently difficult. A bug is often observed when running a correctly written MPI application, and it produces an error. Tracing the bug to a particular subsystem of the MPI package is difficult due to issues of complex parallelism, race conditions, etc. This work provides tools to decide if the bug is: in the subsystem implementing of collective communication; or in the subsystem implementing point-to-point communication; or in some other subsystem. The tools were produced in the context of testing a new system, MANA. MANA is not a standalone MPI implementation, but rather a package for transparent checkpointing of MPI applications. In addition, a short survey of other debugging tools for MPI is presented. The strategy of transforming the execution for purposes of diagnosing a bug appears to be distinct from most existing debugging approaches.