Workshop: Third International Symposium on Checkpointing for Supercomputing (SuperCheck-SC22)
Authors: Ashley Tung, Haiyan Wang, and Yue Li (MemVerge Inc); Zhong Wang (Lawrence Berkeley National Laboratory (LBNL)); and Jingchao Sun (MemVerge Inc)
Abstract: Spot instances offer a cost-effective way to run applications in the cloud. However, long-running jobs are difficult to run on spot instances because the instances are subject to unpredictable evictions. Here, we present Spot-on, a generic software framework that supports fault-tolerant long-running workloads on spot instances through checkpoint and restart. Spot-on leverages existing checkpointing packages and is compatible with the major cloud vendors. Using a genomics application as a test case, we demonstrate that Spot-on supports both application-specific and transparent checkpointing methods. Compared with running applications on on-demand instances, Spot-on completes these workloads at a significantly lower computing cost. Compared with applications that rely on application-specific checkpoint mechanisms, applications protected by transparent checkpointing take less time to complete, leading to further cost reductions.