While the related topics of fault tolerance and resiliency do not garner the same attention as performance and efficiency, being able to recover from and work around failures is more important than ever, especially as applications take over ever-larger and increasingly heterogeneous machines. The reality of exascale computing has pushed this to the fore, but resiliency has been a critical concern for the largest HPC sites for years.
In the past, building resilient, fault-tolerant machines was left to the vendors, the OS, and various libraries and approaches that kept recovery and backups far from the minds of scientific software developers. That might need to change soon, especially since methods like checkpoint/restart consume an incredible amount of memory and strain much-needed I/O subsystems.
George Bosilca, a researcher at the Innovative Computing Lab at the University of Tennessee, says that this I/O strain can be relieved if programming paradigms give developers the means to write codes that support recovery and workarounds. But this means upsetting a tradition in which fault tolerance and resiliency are not something HPC developers have to give much thought to.
Bosilca has been invested in making fault tolerance and HPC resiliency more effective and efficient for almost two decades. “I don’t think we’re doing the right things in HPC. We’re not building our programming models and everything else that supports our applications with the concept of resiliency built in.” He adds that resiliency gets put off to the side, and that many errors have gone undetected because too few people along the chain were focused on it, since the overall scientific results were “good enough” for the mission at hand.
“The sheer number of applications makes resiliency even more critical. How we tackle that now is to either delegate it to someone else (hardware vendors, for instance) or we rely on some basic infrastructure proposed by the OS or a low-level runtime that can do checkpoint/restart.” He says that while checkpoint/restart is still a critical piece of overall resiliency, it needs to be backed by more legwork on the part of developers, especially since checkpointing is so expensive from an I/O standpoint.
“When there is a large scientific application running on a large part of the machine, we’re running at 80-90% of memory capacity, which means checkpoint/restart will have to save all of that. We’re talking petabytes of data regularly hitting the file system.” Bosilca’s approach is to insert resiliency into the programming model itself and let users decide how they want to deal with faults. The idea is not to eradicate checkpointing but to make applications less dependent on it, so they no longer take up the massive I/O resources that come with a brute-force approach of checkpointing and keeping everything regardless of cost.
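To get a feel for the scale Bosilca describes, here is a back-of-the-envelope sketch in Python. The node count, memory per node, and file system bandwidth are hypothetical values picked purely for illustration; they are not figures from Bosilca or from any particular machine. Only the 80-90% memory occupancy comes from his description.

```python
def checkpoint_cost(nodes, mem_per_node_gb, mem_fraction, fs_bandwidth_gbs):
    """Estimate a full-memory checkpoint.

    Returns (size in petabytes, write time in seconds), assuming the whole
    occupied memory is written once at the file system's aggregate bandwidth.
    """
    size_gb = nodes * mem_per_node_gb * mem_fraction
    write_seconds = size_gb / fs_bandwidth_gbs
    return size_gb / 1e6, write_seconds


# Hypothetical machine: 8,000 nodes with 512 GB each, 85% of memory in use,
# and a parallel file system sustaining 2,000 GB/s in aggregate.
size_pb, seconds = checkpoint_cost(
    nodes=8000, mem_per_node_gb=512, mem_fraction=0.85, fs_bandwidth_gbs=2000
)
print(f"checkpoint size: {size_pb:.2f} PB")
print(f"write time:      {seconds / 60:.0f} minutes")
```

Under these assumed numbers a single checkpoint is about 3.5 PB and ties up the file system for roughly half an hour, which is the kind of cost that shrinks if the application itself knows which data it actually needs to recover.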
“We like to talk about performance but no one wants to talk about how often a big machine fails or how many computations have to be restarted, which basically means a lot of waste happened on a system.” He points to work done on the Titan supercomputer a few years ago that showed there was an error every eight hours. If you’re running an application at scale for more than eight hours (certainly not unheard of on Titan and subsequent machines) and resiliency isn’t built in, the application may never complete. While those figures are some years out of date, the resiliency problem does not get better through some mysterious progress. Exascale necessitates a more efficient way to recover than massive I/O consumption for checkpointing. And while resiliency means more than just fault tolerance, there are plenty of elements that could also be baked into programming models. It just means developers will need to get on board, and therein lies the challenge.
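A rough way to see why an eight-hour failure interval is so punishing is to treat failures as independent events arriving at a constant rate. That exponential model is a simplifying assumption introduced here for illustration, not something taken from the Titan study. The sketch below computes the chance that a run of a given length sees no failure at all, and how many failures it should expect.

```python
import math


def failure_free_probability(run_hours, mtbf_hours):
    """Probability that a run sees zero failures, assuming exponentially
    distributed time between failures (an illustrative assumption)."""
    return math.exp(-run_hours / mtbf_hours)


def expected_failures(run_hours, mtbf_hours):
    """Average number of failures during the run under the same model."""
    return run_hours / mtbf_hours


MTBF_HOURS = 8  # one error every eight hours, as reported for Titan

for run_hours in (4, 8, 24, 72):
    p = failure_free_probability(run_hours, MTBF_HOURS)
    n = expected_failures(run_hours, MTBF_HOURS)
    print(f"{run_hours:>3} h run: {p:6.1%} chance of no failure, "
          f"{n:.1f} failures expected")
```

Under this model an eight-hour run finishes untouched only about a third of the time, and a day-long run almost never does, so without some form of recovery each failure means starting over.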
“We are definitely getting some pushback on this idea because everyone wants something that’s cheap from a productivity perspective. When you design an algorithm you don’t want to deal with it [resiliency/fault tolerance]. A nuclear reaction simulation is complicated enough without baking all that in from the beginning.”
Bosilca adds that his approach, which pushes MPI and other HPC-specific programming models to build in enough flexibility to deal with faults alongside some checkpointing, takes effort, but the efficiency and recovery benefits outweigh those hurdles. “There are people trying now in different applications to integrate these constructs and the results have been positive in terms of time to solution and reliability of the application. But there is no doubt that there’s a development cost and it needs to be paid up front.”
It’s a shame that the lessons of hyperscale resiliency can’t transfer directly to HPC. The legacy of HPC, which did not build these things in from the beginning, keeps it from having the kind of native resiliency and fault tolerance found in systems that run at big datacenters, those based on MapReduce or Spark for instance. And besides, none of the hyperscalers could afford that kind of gigantic dedicated storage and network bandwidth just to deal with faults. Unlike those hyperscale underpinnings, MPI doesn’t have enough baked in for large applications to cope with faults, Bosilca says.
“In HPC we’re kind of an outlier from a resiliency perspective because we decided early on not to care about having a programming model with all of this built in.”