SC22 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

Workshops Archive

Analyzing the Energy Consumption of Synchronous and Asynchronous Checkpointing Strategies


Workshop: Third International Symposium on Checkpointing for Supercomputing (SuperCheck-SC22)

Authors: Grant Wilkins and Mikaila Gossman (Clemson University), Bogdan Nicolae (Argonne National Laboratory (ANL)), and Melissa Smith and Jon Calhoun (Clemson University)


Abstract: With exascale computing, the number of components that comprise high-performance computing (HPC) systems has increased by more than 70%, leading to a shorter mean time between failures (MTBF) and larger power budgets. These issues create the need for (1) checkpoint/restart (C/R) and (2) energy reduction techniques. C/R has evolved alongside software and hardware advances, so it is crucial to understand how its energy usage differs across storage tiers and between synchronous and asynchronous operation. In this paper, we compare the energy consumption of two leading state-of-the-art C/R libraries, VELOC and GenericIO. We perform weak and strong scalability tests of the C/R libraries and show that asynchronous C/R provides 4x greater throughput while using 33% less energy than synchronous C/R. Data size and throughput are directly correlated with energy consumption. Therefore, C/R developers should focus on improving or maintaining high throughput in order to reduce energy consumption and address exascale needs.




