Workshop: Third International Symposium on Checkpointing for Supercomputing (SuperCheck-SC22)
Authors: Yehonatan Fridman (Ben-Gurion University of the Negev, Israel; Israel Atomic Energy Commission); Yaniv Snir (Ben-Gurion University of the Negev, Israel; Google LLC); Matan Rusanovsky and Kfir Zvi (Ben-Gurion University of the Negev, Israel; Israel Atomic Energy Commission); Harel Levin (Nuclear Research Center Negev, Israel); Danny Hendler (Ben-Gurion University of the Negev, Israel); Hagit Attiya (Technion - Israel Institute of Technology); and Gal Oren (Technion - Israel Institute of Technology; Nuclear Research Center Negev, Israel)
Abstract: The recent entrance of the High-Performance Computing (HPC) world into the exascale era challenges how vast amounts of data are analyzed, manipulated, and stored. The already substantial performance gap between computing, memory, and storage widens rapidly when large-scale distributed applications run on new-generation supercomputers. The widest gap of all, between memory and storage, still spans 2-3 orders of magnitude. As a result, these applications struggle with two main storage-oriented tasks, diagnostics and checkpointing, in which data must be persisted during runtime for later use. Recently, two interdependent additions were made to the storage stack, non-volatile RAM (NVRAM) hardware and persistent memory file systems (PMFSs), and both are planned to be integrated into the upcoming Aurora exascale system. Fridman et al. (FTXS@SC’21) benchmarked the diagnostics (FIO, BT-IO) and checkpointing (SCR, DMTCP) use cases, as they appear on supercomputers, with the aid of NVRAM and several PMFSs, excluding block-oriented non-volatile devices. Instead, that strategy relies solely on RAM-NVRAM and even pure-NVRAM memory-storage configurations. We review these results and introduce how NVRAM can be utilized not only for checkpoint/restart (C/R) mechanisms and diagnostics via PMFSs, but also for Algorithm-Based Fault Tolerance (ABFT), using the PMDK library and MPI one-sided communication directly to byte-addressable NVRAM. We focus specifically on Exact State Reconstruction of iterative linear solvers. We show that this strategy utilizes the hardware properly and reliably, achieving the best-known performance for those use cases and, as such, suggesting a new approach to devising recoverable HPC algorithms.
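To make the idea of a recoverable iterative solver concrete, the following is a minimal sketch and not the authors' implementation. It does not reproduce the Exact State Reconstruction algorithm, PMDK, or MPI one-sided communication. It illustrates only the simpler underlying idea: a conjugate gradient (CG) solver keeps its state in a byte-addressable persistent region and resumes from exactly that state after a failure. The assumptions are that a memory-mapped temporary file stands in for NVRAM, that `flush()` stands in for a persist barrier, and that the helper names (`open_region`, `save_state`, `load_state`, `cg`, `demo`) and the 2x2 test system are hypothetical.

```python
import mmap
import os
import struct
import tempfile


def open_region(path, size):
    """Map a file into memory; a stand-in for a byte-addressable NVRAM region."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT)
    os.ftruncate(fd, size)
    region = mmap.mmap(fd, size)
    os.close(fd)  # the mapping stays valid after the descriptor is closed
    return region


def save_state(region, k, x, r, p):
    """Store the iteration count and the x, r, p vectors in the region."""
    payload = struct.pack(f"<q{3 * len(x)}d", k, *x, *r, *p)
    region[:len(payload)] = payload
    # Stand-in for a persist barrier. This write is NOT failure-atomic;
    # a real design would need, for example, double buffering.
    region.flush()


def load_state(region, n):
    """Read back the state written by save_state."""
    vals = struct.unpack(f"<q{3 * n}d", region[:8 + 24 * n])
    body = list(vals[1:])
    return vals[0], body[:n], body[n:2 * n], body[2 * n:]


def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cg(A, b, region, max_iter=50, tol=1e-12, resume=False, crash_at=None):
    """Conjugate gradient that persists its state after every iteration."""
    n = len(b)
    if resume:
        k, x, r, p = load_state(region, n)
    else:
        k, x = 0, [0.0] * n
        r = list(b)  # residual for the zero initial guess
        p = list(r)
        save_state(region, k, x, r, p)
    rs = dot(r, r)
    while k < max_iter and rs > tol:
        if crash_at is not None and k == crash_at:
            raise RuntimeError("simulated failure")
        Ap = matvec(A, p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
        k += 1
        save_state(region, k, x, r, p)
    return x, k


def demo():
    """Fail after one iteration, re-map the region, and resume the solve."""
    A = [[4.0, 1.0], [1.0, 3.0]]
    b = [1.0, 2.0]
    size = 8 + 24 * len(b)
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "solver_state.bin")
        region = open_region(path, size)
        try:
            cg(A, b, region, crash_at=1)
        except RuntimeError:
            pass
        region.close()
        region = open_region(path, size)  # "restart": map the same bytes again
        x, k = cg(A, b, region, resume=True)
        region.close()
    return x, k
```

Calling `demo()` solves the system across a simulated failure without repeating the first iteration. In a deployment of the kind the abstract describes, the mapped region would presumably live on NVRAM and be written through PMDK, possibly by a remote process via MPI one-sided operations; none of that is modeled here.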