ReMPI: A Record-and-Replay Tool for Debugging Non-Deterministic MPI Applications
Description
Debugging massively parallel applications remains highly challenging. As supercomputers grow larger and more complex, with higher degrees of parallelism, more parallelism options (e.g., heterogeneity), and emerging programming models, applications gain performance and scalability by using more asynchronous algorithms. These gains come at a productivity cost: asynchrony introduces non-determinism into parallel program execution (i.e., an application may not produce the same output in different runs), which makes debugging an even greater challenge. A particularly well-known source of non-determinism at large scale is the Message Passing Interface (MPI). Because network and system noise can affect the order in which messages are received, an application can take different computation paths from run to run. This complicates debugging, since computation paths and their results may differ between the original run (where a bug manifested) and the debugging runs. In this lightning talk, we introduce ReMPI (MPI Record-and-Replay Tool, https://github.com/PRUNERS/ReMPI), which facilitates debugging non-deterministic MPI applications. ReMPI records the execution of each MPI process as trace data, which includes the order of message receives. Then, during debugging, a replay mechanism uses the recorded traces to ensure that every MPI process observes the same message exchanges as in the recorded run.
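The record-and-replay idea described above can be illustrated with a toy, single-process simulation. This is only a sketch under stated assumptions: it does not use MPI or ReMPI, and the names Receiver, recv_any, and run are hypothetical, chosen for illustration. A random choice among pending senders stands in for network noise on a wildcard receive; in record mode the matched sender is appended to a trace, and in replay mode the trace dictates which sender is matched next.

```python
import random
from collections import deque


class Receiver:
    """Toy stand-in for one process doing wildcard receives (not ReMPI's API)."""

    def __init__(self, mode, trace=None):
        self.mode = mode  # "record" or "replay"
        self.trace = [] if trace is None else list(trace)
        self._cursor = 0  # next trace position to replay

    def recv_any(self, pending):
        """Receive one message; pending maps sender rank -> deque of messages."""
        if self.mode == "record":
            # Simulated noise: any sender with a pending message may arrive first.
            ready = [rank for rank, queue in pending.items() if queue]
            sender = random.choice(ready)
            self.trace.append(sender)  # record the observed receive order
        else:
            # Replay: force the match that was observed in the recorded run.
            sender = self.trace[self._cursor]
            self._cursor += 1
        return sender, pending[sender].popleft()


def run(receiver, seed):
    """Receive six messages and compute a result that depends on their order."""
    random.seed(seed)  # a different seed models a different noise pattern
    pending = {rank: deque(f"msg{i} from {rank}" for i in range(2))
               for rank in (1, 2, 3)}
    order, result = [], 0
    for _ in range(6):
        sender, _msg = receiver.recv_any(pending)
        order.append(sender)
        result = result * 10 + sender  # order-dependent computation
    return order, result


if __name__ == "__main__":
    recorder = Receiver("record")
    recorded = run(recorder, seed=1)
    replayed = run(Receiver("replay", trace=recorder.trace), seed=2)
    print(recorded == replayed)
```

In this sketch the replayed run reproduces the recorded receive order, and therefore the same order-dependent result, even though its simulated noise differs. A real tool has to intercept actual MPI receive calls and handle details this toy ignores.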
Event Type
Workshop
Time
Monday, 14 November 2022, 10:35am - 10:40am CST
Location
C143-149
Recorded
Reliability and Resiliency