When you have a massively distributed computing job that can take months to run across thousands to hundreds of thousands of compute elements, one hardware or software crash can mean losing an enormous amount of work.
That’s why checkpointing – stopping all processes on all nodes in a cluster as an application is running to copy off the state of each node – was invented in the mid-1990s for HPC systems. Some of the initial checkpointing software projects included Libckpt from the University of Tennessee (which was ported from Unix workstations to clusters); libFT from AT&T Research Labs (remember that AT&T invented Unix and used it as the back-end of its massive, parallel phone billing applications, which were the megascale of the time); Porch from the Massachusetts Institute of Technology; and Catch and FTC-Charm++ from the University of Illinois. The work has continued to this day, notably including the VeloC multi-level, heterogeneous storage checkpointing software being developed under the auspices of the Exascale Computing Project in the United States. Another one is CRIU, short for Checkpoint/Restore In User Space, which was created by server virtualization provider Virtuozzo to back up Linux containers.
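Stripped to its essence, checkpoint/restart is just periodically serializing application state so that a restarted job resumes from the last save rather than from scratch. This toy Python sketch is not how any of the tools above actually work – the file name and state layout are invented for illustration – but it shows the basic pattern, including the atomic-rename trick that keeps a crash mid-write from corrupting the checkpoint:

```python
import os
import pickle
import tempfile

# Hypothetical checkpoint file location -- a toy, not DMTCP's format.
CKPT = os.path.join(tempfile.gettempdir(), "toy_ckpt.pkl")

def save_checkpoint(state):
    # Write to a temp file and rename it into place, so a crash
    # mid-write never leaves a half-written checkpoint behind.
    tmp = CKPT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CKPT)

def load_checkpoint():
    # Resume from the last checkpoint if one exists, else start fresh.
    if os.path.exists(CKPT):
        with open(CKPT, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "total": 0}

def run(crash_at=None):
    state = load_checkpoint()
    while state["step"] < 10:
        if state["step"] == crash_at:
            raise RuntimeError("simulated node failure")
        state["total"] += state["step"]   # the "work" of this step
        state["step"] += 1
        save_checkpoint(state)            # checkpoint after every step
    return state

if os.path.exists(CKPT):
    os.remove(CKPT)                       # start with a clean slate

try:
    run(crash_at=6)                       # first attempt dies partway through
except RuntimeError:
    pass

state = run()                             # restart resumes at step 6, not step 0
print(state["total"])                     # sum of 0..9 = 45
```

The difference with transparent checkpointing tools is that they capture this state (memory, file descriptors, sockets) from outside the process, so the application itself never has to write code like this.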
But one of the more popular open source checkpointing tools in the HPC arena today comes from the Distributed MultiThreaded CheckPointing (DMTCP) project at Northeastern University, which is spearheaded by professor Gene Cooperman. The project, which originally started as MultiThreaded CheckPointing, got its start in 2007 backing up the state of applications running on a single workstation and, as happened with AT&T’s libFT, moved from the workstation to distributed computing clusters. By 2009 it supported over twenty HPC applications and frameworks, including various MPI implementations as well as MATLAB, Python, and R in the early days, and the modern incarnation, which lives on SourceForge and GitHub, has hooks into the InfiniBand OFED drivers, the OpenMP and OpenGL programming models, and the Slurm workload scheduler. Significantly, DMTCP does not make any modifications to the Linux kernel – of course, it only works with Linux, which is all you need in HPC and AI – and it does not require any changes to applications, either.
Not every academic working on a sophisticated software project wants to commercialize their wares, and this is the case with Cooperman and his colleagues who maintain DMTCP – Kapil Arya, Rohan Garg, Jiajun Cao, and Artem Polyakov. Up until now, Intel and the National Science Foundation have supported DMTCP development, but there has been no mechanism for providing commercial-grade technical support.
So MemVerge, the company that has created a Memory Machine hypervisor to mash up main memory and persistent memory into a single storage medium that allows for snapshotting application state out of main memory – a kind of checkpointing – is going to contribute to the DMTCP project and provide commercial support for the tool, giving HPC centers some comfort when they depend upon it. Interestingly, the “Cori” and “Perlmutter” supercomputers at Lawrence Berkeley National Laboratory use a plug-in atop DMTCP called MPI-Agnostic Network-Agnostic (MANA) to do transparent checkpointing on these machines, which cleverly separates the state of the MPI application from the state of the MPI stack itself and only checkpoints the application state.
MemVerge has started a collaboration with Lawrence Berkeley to optimize MANA atop DMTCP, and we think there will probably be some talk of how to integrate this all together with the MemVerge memory hypervisor. The work will also include the ability to checkpoint CUDA applications, which is especially important to supercomputers that employ Nvidia GPUs as their math engines. We also think that the scale of the checkpointing will be expanded. While DMTCP has been demonstrated checkpointing applications with up to 1,000 MPI processes, Perlmutter has 761,856 CPU cores and GPU streaming multiprocessors combined. If every one of those runs an MPI rank, that’s a lot of scale. And even at one MPI rank per GPU, that is still 6,200 ranks on the Perlmutter machine just for the GPUs.
The roadmap is not yet set, but Charles Fan, co-founder and chief executive officer at MemVerge, tells The Next Platform that there are some interesting possibilities for mixing and matching technologies from these projects into MemVerge’s own products.
“Over the past few years, we have got to know the DMTCP people quite well, and we started collaborating with them on a few projects, and there is definitely the possibility of incorporating some of the open source code and techniques into our projects,” says Fan. “But we have not done that yet. At the heart of our Memory Machine product is the ability to capture a running application and snapshot it, and we want to stay a leader and be at the state of the art in such technologies, and the beauty of transparent checkpointing is that you can do it without any involvement of the application. Where we are contributing is doing checkpointing to memory instead of to disk, which can shorten the checkpointing downtime from minutes to seconds or even a single second. And by doing this, we think we can make checkpointing more broadly relevant in the datacenter.”