SC22 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

Workshops Archive

pyDMTCP: Python Interface to DMTCP via SLURM


Workshop: Third International Symposium on Checkpointing for Supercomputing (SuperCheck-SC22)

Authors: Gabi Dadush (Nuclear Research Center Negev, Israel); Yehonatan Fridman (Ben-Gurion University of the Negev, Israel; Israel Atomic Energy Commission); Re'em Harel (Ben-Gurion University of the Negev, Israel; Nuclear Research Center Negev, Israel); and Gal Oren (Technion - Israel Institute of Technology; Nuclear Research Center Negev, Israel)


Abstract: Supercomputers have become increasingly important due to the growing demand for computational power and the growing amount of available data. As supercomputing systems become larger and serve many users simultaneously, the costs of building and maintaining them increase, and so does the probability of faults. Therefore, the efficiency and resilience of such systems are essential for providers and users alike. One primary tool that provides system resilience is DMTCP, a system-level Checkpoint/Restart (C/R) library that performs C/R operations seamlessly, without any source code modifications. Meanwhile, Python has become one of the major languages for application programming; hence, providing it with C/R capabilities is desirable in many systems. Accordingly, previous work has brought C/R to Python by supporting DMTCP C/R programmatically from within a Python program. Nevertheless, a particular class of Python codes is not self-contained but is instead designed to support other applications by scheduling and managing them and analyzing their results; examples include execution wrappers, pipelines, and parameter sweeps. Users of all types run this class of Python codes widely on HPC systems that use the SLURM job scheduler. In this work, we extend the previous integration of DMTCP with Python programs and introduce pyDMTCP, a Python module that enables Python wrappers of scientific applications to easily utilize DMTCP checkpointing through a Python interface, applied externally to the applications via SLURM. The interface also maps the entire HPC system according to several main parameters to allow fault-free and optimized C/R executions between different nodes.

The source code of pyDMTCP will be available at https://github.com/Scientific-Computing-Lab-NRCN/pyDMTCP.
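
To make the idea in the abstract concrete, the following is a minimal sketch of how a Python wrapper might compose DMTCP and SLURM command lines around an external application. It is not the pyDMTCP API, which this page does not describe: the names `CkptConfig`, `dmtcp_launch_argv`, `dmtcp_checkpoint_argv`, and `sbatch_argv`, the script name `sweep.py`, and the path `/scratch/ckpt` are all hypothetical. The sketch only builds argument lists and never executes them, and it assumes the standard `dmtcp_launch`, `dmtcp_command`, and `sbatch` command-line tools are installed on the target system.

```python
import shlex
from dataclasses import dataclass
from typing import List


@dataclass
class CkptConfig:
    """Hypothetical checkpoint settings; not part of pyDMTCP."""
    ckpt_dir: str            # directory where checkpoint images are written
    interval_s: int = 3600   # seconds between automatic checkpoints
    coord_port: int = 7779   # DMTCP coordinator port (7779 is DMTCP's default)


def dmtcp_launch_argv(app_argv: List[str], cfg: CkptConfig) -> List[str]:
    """Wrap an application command so that it runs under dmtcp_launch."""
    return [
        "dmtcp_launch",
        "--new-coordinator",          # start a coordinator for this run
        "-p", str(cfg.coord_port),    # coordinator port
        "-i", str(cfg.interval_s),    # periodic checkpoint interval
        "--ckptdir", cfg.ckpt_dir,    # where to store checkpoint images
        *app_argv,
    ]


def dmtcp_checkpoint_argv(cfg: CkptConfig) -> List[str]:
    """Ask the running coordinator for an immediate checkpoint."""
    return ["dmtcp_command", "-p", str(cfg.coord_port), "--checkpoint"]


def sbatch_argv(inner_argv: List[str], job_name: str, nodes: int = 1) -> List[str]:
    """Submit the wrapped command to SLURM as a one-line batch job."""
    return [
        "sbatch",
        f"--job-name={job_name}",
        f"--nodes={nodes}",
        f"--wrap={shlex.join(inner_argv)}",  # shell-quote the inner command
    ]


if __name__ == "__main__":
    cfg = CkptConfig(ckpt_dir="/scratch/ckpt", interval_s=600)
    wrapped = dmtcp_launch_argv(["python3", "sweep.py"], cfg)
    print(shlex.join(sbatch_argv(wrapped, job_name="sweep")))
    print(shlex.join(dmtcp_checkpoint_argv(cfg)))
```

Building argument lists rather than shell strings keeps quoting explicit and lets a wrapper inspect or log the exact command before handing it to `subprocess.run`. A real deployment would also need restart handling and coordinator placement across nodes, which this sketch leaves out.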




