Workshop: 12th Workshop on Fault-Tolerance for HPC at Extreme Scale (FTXS 2022)
Authors: Yehonatan Fridman (Ben-Gurion University of the Negev, Israel; Israel Atomic Energy Commission); Yaniv Snir (Ben-Gurion University of the Negev, Israel; Google LLC); Harel Levin (Mobileye, an Intel Company; Nuclear Research Center Negev, Israel); Danny Hendler (Ben-Gurion University of the Negev, Israel); Hagit Attiya (Technion - Israel Institute of Technology); and Gal Oren (Technion - Israel Institute of Technology; Nuclear Research Center Negev, Israel)
Abstract: HPC systems are a critical resource for scientific research. The increased demand for computational power and memory ushers in the exascale era, in which complex supercomputers consist of numerous compute nodes and are consequently expected to experience frequent faults and crashes.
Exact state reconstruction (ESR) was proposed as an alternative mechanism to alleviate the impact of frequent failures on long-term computations. ESR has been shown to provide exact reconstruction of iterative solvers while avoiding the need for costly checkpointing. However, ESR currently relies on volatile memory for fault tolerance, and must therefore maintain redundancies in the RAM of multiple nodes, incurring high memory and network overheads.
Recent supercomputer designs feature emerging non-volatile RAM (NVRAM) technology. This paper investigates how NVRAM can be utilized to devise an enhanced ESR-based recovery mechanism that is more efficient and provides full resilience, based on a novel MPI implementation of One-Sided Communication (OSC) over RDMA.