Over the next few years, we can expect to hear even more about large-scale computing sites bumping up against the memory wall, although not necessarily where they might expect. The compute nodes are the obvious place, but as more sites invest heavily in high-performance I/O, the next memory blockade will be in storage.
Getting impressive I/O performance is no longer the hard part, at least on the surface. Between packing nodes with NVMe SSDs, adding Mellanox HDR NICs, and building rich storage stacks that can handle checksumming, erasure coding, compression, and encryption with a 50GB/s punch, it might seem that the problems are solved, at least well enough to satisfy current demands. But all of these methods were built with disk in mind. High performance might be possible, but the supporting infrastructure isn’t there. At least it wasn’t for Brad Settlemyer, senior scientist with the HPC design group at Los Alamos National Lab.
“What we were specifically running up against with storage was the ability to get the memory bandwidth needed to make multiple passes over the data for high quality compression, erasure coding, all the data services storage provides,” he explains. The LANL team went with one of Eideticom’s NoLoad computational storage processors (CSPs), which accelerated their Lustre-based infrastructure (the ZFS piece, specifically). The company’s CSP is an Alveo U50 accelerator card, something we talked about during The Next FPGA Platform event, held in person (believe it or not) in January 2020.
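To make the memory bandwidth point concrete, here is a rough back-of-the-envelope sketch. The 50GB/s ingest rate comes from the figure above; the per-service read and write multipliers are purely illustrative assumptions, not measurements from LANL or Eideticom. The idea is simply that every data service that runs on the host CPU makes its own pass over the data in DRAM, so memory traffic grows as a multiple of the I/O rate.

```python
# Back-of-the-envelope estimate of host memory traffic when storage data
# services run on the CPU. All multipliers below are hypothetical.

# For each service: (bytes read from DRAM, bytes written to DRAM)
# per byte of ingested data. Illustrative assumptions only.
SERVICES = {
    "checksum": (1.0, 0.0),         # one read pass, negligible output
    "compression": (1.0, 0.5),      # one read pass, writes a smaller copy
    "erasure_coding": (1.0, 0.25),  # one read pass, writes parity blocks
}


def memory_traffic(ingest_gb_s, services):
    """Return estimated DRAM traffic in GB/s for a given ingest rate."""
    per_byte = sum(reads + writes for reads, writes in services.values())
    return ingest_gb_s * per_byte


if __name__ == "__main__":
    traffic = memory_traffic(50.0, SERVICES)
    print(f"{traffic:.1f} GB/s of memory traffic")  # prints "187.5 GB/s of memory traffic"
```

Under these assumed multipliers, a 50GB/s storage stream turns into several times that in memory traffic, which is the kind of pressure that offloading the passes to a CSP is meant to relieve.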
With computational storage in an FPGA, LANL was able to do high-performance scientific compression at rates that went from 6% to 30%, a pretty striking improvement considering the expense of NVMe overall. In addition to the performance boost that comes from offloading less efficient, memory-bound compression to the CSP, Settlemyer says that there is a lot of untapped opportunity in other data services, especially in terms of pushing data analysis closer to compute.
With these improvements, one has to wonder why we’re not seeing more computational storage in HPC overall. While it could be that NVMe investments aren’t yet large enough to drive a shift, it could also be that getting these improvements is no simple matter. For the LANL team, it took some footwork to modify the Zettabyte File System (ZFS) that underpins parts of Lustre to get hardware acceleration inside the kernel. Luckily for the rest of HPC, the team has been working to open source much of the effort.
The real reason computational storage isn’t taking off as quickly as one might imagine, whether it comes as a CSP, in a device, or, as we’re seeing more of, in arrays, is this exact “legacy tax,” as Settlemyer calls it. “We’ve spent years hardening these legacy software stacks. Lustre has been a tremendous success, despite all the complaints. It’s stable and fast and designed to solve thousands of simultaneous nodes writing rapidly. But it was designed two decades ago.” He adds that putting in such footwork is central to the mission in supercomputing. “It’s the appeal of making HPC work rather than what I would argue would be going back to HPC’s roots of trying to have the highest performance by taking a bit more risk.”
“Where HPC has an opportunity to lead is in direct access to storage. We’ve already done this with network; we don’t need to mediate everything through the kernel to get low latency. The networks are accessed from userspace, and this has been adopted by the hyperscalers, who are using userspace RDMA libraries now. I don’t see why HPC can’t lead the same way with NVMe over fabrics, with thousands of nodes collaborating with thousands of NVMe devices directly without being mediated through some system software. It would mean you can actually just go and analyze data as fast as possible and accelerate discovery.”
LANL has long been at the heart of I/O innovation, from being among the originators of the burst buffer to proving new technologies in both media and systems software. The lab is part of the EMC3 Consortium as well, where it is working with the handful of companies spearheading some of the early computational storage and NVMe over fabrics initiatives, of which Eideticom is but one. What is notable and necessary about this, and about NVMe overall, is that there is a standards backbone to hold everyone upright. The vendors are taking cues from LANL, Amazon, and others to drive standards that help users find loopholes in the “legacy tax” system of HPC storage. It takes a lot of custom work to create something that isn’t custom, but Settlemyer says they are at the beginning of a new roadmap, one that promises to integrate the various CSPs, CSDs, and arrays into large storage workflows with a standards backbone that can let centers like LANL slot into new hardware-accelerated storage systems.
“Bringing all that compute to bear in analyzing data is going to be the next new vista in how we see data analysis transformed for HPC. Because of our history doing low latency, targeting programming models that are network centric, we’re well prepared for NVMeOF,” Settlemyer adds.