Exascale Storage Gets a GPU Boost

Alex St. John is a familiar name in the GPU and gaming industry given his role at Microsoft in the creation of DirectX technology in the 90s. And while his fame may be rooted in graphics for PC players, his newest venture has captured the attention of both the supercomputing and enterprise storage crowds, and for good reason.

It likely helps to have some name recognition when it comes to securing funding, especially for a venture with roots in the notoriously venture capital-starved supercomputing ecosystem. While St. John’s startup Nyriad may be a spin-out of technology developed for the Square Kilometre Array (SKA), the play is far more centered on hyperscale web companies and enterprise storage companies than on the supercomputing sites of the world. St. John tells The Next Platform that the company has raised $8.5 million from Data Collective and other VCs, even with an HPC hook.

At its most basic, Nyriad is capitalizing on the demise of RAID: it forces the storage controller into retirement and brings in the GPU to do that job alongside its normal compute gig in GPU-accelerated systems.

This idea may not be new in theory (although it is most often associated with FPGAs), but it has been implemented in a way that allows low-end GPUs to be used for storage-only purposes while also allowing the GPUs in supercomputers to shed some network fabric and storage for lower latency. The key is erasure coding, and the roots of that story are in the world’s largest telescope’s need for supercomputing speed, even if the market for the technology can go far below (and above) that scale.

In his free time a few years ago, St. John started working with scientific computing groups at SKA, specifically on optimizing Fourier transform codes for GPUs. Having been focused on software for most of his life, he says SKA’s emphasis on hardware architectures in terms of power and speed struck him as odd, especially since SKA’s dreams of building ever-bigger telescope projects were limited by the power consumption the new supercomputers would require. Teams needed to reduce power consumption by 70 to 90 percent, he says, and while it was clear that data movement was the bottleneck and the big energy consumer, the solution to those problems was not clear.

Of course, every architectural trend of the last several years has angled toward the same end goal in different ways (bringing data closer to the processing, cutting down on network hops, pushing for ever-greater density, and so on). And while St. John admits his storage expertise was nil before working with SKA, he recognized immediately that the parallel capabilities of a GPU, if exploited fully for I/O functions, could allow it to do the kind of store/compute double duty that could get SKA (and ostensibly a commercial market) closer to those efficiency goals.

“At that point, we had been through a lot of the algorithms and pipeline looking at ways to carve out power. We would have needed to cut out whole sections of the supercomputer to reach their goals. Moving data around the fabric and storage architecture was the big consumer; we knew we needed to dramatically increase density and do more locally,” St. John recalls. “So here are these big dumb storage boxes over this expensive fabric that needed to handle 50 petabytes per day. If we could get all the storage local to the compute nodes, that reduces the amount of network fabric, which reduces the power. And if you are smart about how to make data local, you can move less of it around.”

The “aha” moment came when St. John recalled that GPUs on big systems spend most of their time I/O bound, something that had pushed SKA toward FPGAs on supercomputers in the past. With a GPU, once the data is inside the chip there is no reason to move it; if the storage array and the network fabric were run directly by the GPU that was already doing all the processing, there would be no more passing data back and forth to the CPU with all of that copying.

The conclusion was that getting rid of the network and storage fabric in its current form meant that as a system scaled to SKA levels, density would increase and performance would go up while the power concerns eased. St. John and a small team put their heads together and, with some funding, built a prototype of the concept. In theory it all made sense, but there was one crucial piece of the puzzle that no one thought of until a student working on the project brought his father, a storage specialist, to the lab. He turned their thinking away from the hardware problem and toward erasure coding, and there was born the second “aha” moment for Nyriad.

On second glance, St. John saw how similar SKA’s GPU-accelerated Fourier codes and erasure coding were, and it was not long before the team saw that the math would let them use a GPU to compute storage arrays so parallel that, in theory at least, the speed of the storage array could match the GPU’s bus bandwidth. A simple port of SKA’s erasure coding algorithm to a GPU delivered 110 GB/sec, far faster than anyone knew was possible.
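
To make that parallelism concrete, here is a minimal sketch of erasure encoding as a matrix multiply. It is illustrative only, not Nyriad’s code, and it uses a small prime field for readability where production codes typically use GF(2^8) or GF(2^16); the point is that every codeword symbol is an independent dot product, exactly the shape of work a GPU spreads across thousands of ALUs.

```python
# Illustrative sketch only (not Nyriad's code): erasure encoding as a
# matrix multiply over a small prime field, chosen for readability.
# Every output symbol is an independent dot product, so a GPU can
# compute all of them in parallel.
P = 65537  # prime modulus for the toy field GF(P)

def vandermonde(rows, k):
    """Row i evaluates a degree-(k-1) polynomial at the point x = i + 1."""
    return [[pow(x, j, P) for j in range(k)] for x in range(1, rows + 1)]

def encode(data, m):
    """Turn k data symbols into k + m codeword symbols (non-systematic RS)."""
    k = len(data)
    G = vandermonde(k + m, k)
    return [sum(G[i][j] * data[j] for j in range(k)) % P for i in range(k + m)]

if __name__ == "__main__":
    data = [101, 202, 303, 404]     # k = 4 data symbols
    codeword = encode(data, m=2)    # any 2 of the 6 symbols can be lost
    print(codeword)
```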

That is certainly not to say this team was the first to port erasure coding to the GPU, but St. John insists nothing else came close to that speed. True enough, there have been efforts over the years to do just this with GPUs, but the questions were how to make it usable and whether there could be a market for such an approach. Cisco, Dell, and various national labs ran similar GPU efforts, and even though the results were promising, nothing was ever productized aside from CPU- and FPGA-based implementations from companies like 3PAR and Violin Memory.

“If I wanted to take 256 drives and erasure encode those quickly, the FPGA is great, if not better than the GPU. But the purpose of erasure coding across many drives is to use them simultaneously to get parallel speedups,” St. John says. When that happens and a drive fails, reconstructing the array means solving a massive system of sparse linear algebra equations, and that is not easy to do on an FPGA; it takes a lot of code just to make it look like a GPU. He says that building a chip just for this was too difficult for the vendors that got to FPGAs first, when a GPU can do the job better natively. “We realized that we could erasure code really fast and while anyone with an FPGA could also do this, we could keep running at full speed even as the array came apart because the GPU could solve the system of equations so quickly.”
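
What that reconstruction looks like can be sketched by continuing the toy prime-field example above (again, an illustration rather than NSULATE’s implementation): the surviving symbols define a linear system, and because any k rows of the Vandermonde generator matrix are invertible, any pattern of up to m lost drives can be solved for.

```python
# Toy recovery sketch (not NSULATE's code): solve the linear system
# defined by any k surviving symbols to rebuild the original data.
P = 65537

def vandermonde(rows, k):
    return [[pow(x, j, P) for j in range(k)] for x in range(1, rows + 1)]

def solve_mod_p(A, b):
    """Gauss-Jordan elimination over GF(P); A is k x k, b has length k."""
    k = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(k):
        # find a nonzero pivot and normalise its row
        piv = next(r for r in range(col, k) if M[r][col] % P != 0)
        M[col], M[piv] = M[piv], M[col]
        inv = pow(M[col][col], -1, P)
        M[col] = [(v * inv) % P for v in M[col]]
        # eliminate the column from every other row
        for r in range(k):
            if r != col and M[r][col] % P != 0:
                f = M[r][col]
                M[r] = [(a - f * c) % P for a, c in zip(M[r], M[col])]
    return [row[-1] for row in M]

def recover(surviving, k):
    """surviving: list of (codeword_index, symbol) pairs, any k of them."""
    G = vandermonde(max(i for i, _ in surviving) + 1, k)
    A = [G[i] for i, _ in surviving[:k]]
    b = [s for _, s in surviving[:k]]
    return solve_mod_p(A, b)

if __name__ == "__main__":
    k, m = 4, 2
    data = [101, 202, 303, 404]
    G = vandermonde(k + m, k)
    codeword = [sum(g * d for g, d in zip(row, data)) % P for row in G]
    # pretend the drives holding symbols 1 and 3 failed
    survivors = [(i, codeword[i]) for i in (0, 2, 4, 5)]
    print(recover(survivors, k))    # -> [101, 202, 303, 404]
```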

The product, NSULATE, is implemented as a Linux block device, which makes it compatible with ZFS, Ceph, Lustre, EXT4, and others; because it sits at the kernel level, those can run on top of it, and plans are underway this year to go beyond a universally compatible block device and allow file and object interfaces to sit on top as well. There are also SHA-2 and SHA-3 cryptographic proofs that every single bit is perfect, computed in real time.
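
Those per-block integrity proofs can be pictured with a short sketch (hypothetical, and nothing like NSULATE’s kernel-level code): hash every block with SHA-2 or SHA-3 on write and re-verify the digest on read, so a flipped bit anywhere is caught before the data is handed back.

```python
# Hypothetical illustration of per-block SHA-2/SHA-3 integrity checks,
# not NSULATE's kernel-level implementation. Each block is hashed on
# write and the digest is re-checked on read, so any corrupted bit is
# detected before the data is returned.
import hashlib

BLOCK_SIZE = 4096  # bytes; an assumed block size for the example

def block_digest(block: bytes, algo: str = "sha3_256") -> bytes:
    """Digest one block with SHA-3 (or pass 'sha256' for SHA-2)."""
    return hashlib.new(algo, block).digest()

def write_block(store: dict, index: int, block: bytes) -> None:
    """Store the block alongside its digest."""
    store[index] = (block, block_digest(block))

def read_block(store: dict, index: int) -> bytes:
    """Verify the stored digest before returning the block."""
    block, expected = store[index]
    if block_digest(block) != expected:
        raise IOError(f"integrity check failed on block {index}")
    return block

if __name__ == "__main__":
    store = {}
    write_block(store, 0, b"\x00" * BLOCK_SIZE)
    assert read_block(store, 0) == b"\x00" * BLOCK_SIZE
```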

For those who just want a storage box built on the technology, Nyriad recommends a low-end workstation GPU like the Quadro T4000. For the supercomputing set, the story is different, since some of those systems are already outfitted with GPUs. In that case, the idea is to install NSULATE on the same nodes those GPUs run on. As counterintuitive as it might seem, this reduces power across the system because erasure coding gets more efficient the faster it goes: the more parity that can be computed, the less redundant storage is needed. And since the Pascal or Volta GPU is already hot from its compute work, he says there is no added energy draw.

“Suppose I have a JBOD of 100 hard drives and I want to configure that with a RAID card. I might configure that as ten 8+2 arrays. That means I’m wasting 20 percent on RAID, with hard drives that are $1,000 each, and if the wrong drives fail I would have lost everything. But if I erasure encode that same array at 94/6, I can lose any random six drives and not lose my data, and I am at 6 percent overhead, saving 16 drives, or about $16,000, in redundant storage, and that is just inside one JBOD. Imagine this at exascale.”
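
Worked through as plain arithmetic, the comparison in that quote looks like the sketch below (a back-of-the-envelope check; as St. John acknowledges in the comments, the quoted dollar figure is off-the-cuff).

```python
# Back-of-the-envelope check of the JBOD comparison in the quote above.
# The exact drive count and dollar figure in the quote are off-the-cuff;
# the arithmetic below just shows the shape of the savings.
TOTAL_DRIVES = 100
DRIVE_COST = 1_000  # dollars per drive, the figure used in the quote

raid_parity = 10 * 2   # ten 8+2 RAID groups -> 2 parity drives per group
ec_parity = 6          # one 94+6 erasure-coded array -> 6 parity drives

print(f"RAID overhead:         {raid_parity / TOTAL_DRIVES:.0%}")   # 20%
print(f"Erasure-code overhead: {ec_parity / TOTAL_DRIVES:.0%}")     # 6%
saved = raid_parity - ec_parity
print(f"Drives saved: {saved} (about ${saved * DRIVE_COST:,})")     # 14 (~$14,000)
```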

Nyriad has already partnered with some OEMs to bring NSULATE to market, which could pose a threat to companies that do the same thing with massive ready-to-roll appliances, such as DDN. The block-device base means that virtualization and databases could see a big benefit right away, but with file system and object layers on top, the use cases open much wider; consider, for instance, the high-value media and entertainment market, which is largely object based.


6 Comments

  1. Isn’t the risk of correlated failures rather high within a RAID group of 100+ disks, given that they will either be mechanical disks purchased at the same time (i.e., from the same lot) or SSDs with essentially identical loads which will wear out simultaneously?

    That is, in this design the assumption that drive failures are independent of each other may not hold, and the resulting risk will have to be mitigated outside the erasure coding scheme…

    • You are correct, NSULATE supports up to 128 parity per array, which is what makes the use of many parallel drives practical. The GPU can also compute wear-leveling in real time, so for large arrays we can bias the wear.

  2. Interesting idea, but his numbers seem off by 2x or so: 100 drives may require 25+ GB/sec of PCIe bus bandwidth to fully utilize drive (sequential) speed, so PCIe 4.0 x16 is needed, and HDs (even 12TB SAS3 He) cost under $500 retail (not $1,000).
    Also, encoding like 94/6 makes sense performance-wise only for large files or objects (50+ MB) due to disk seek-time overhead.

    • Yes, the arrays can physically be much faster than the I/O supported by the bus or GPU. It’s a configuration exercise to find the ideal balance in a system configuration. In trying to express how adding a GPU saves power and cost, I indulged in a little off-the-cuff hyperbole to convey the idea that higher parity increases the efficiency of drive utilization.

      NSULATE does not place data the way a RAID controller does; as you observe, that approach doesn’t really scale. Getting great performance for large streaming files was straightforward. The GPU’s computing power has enabled us to take some creative liberties with small reads and writes that we believe will bear out in benchmarks. We think we’ve got a handle on it, but it’s a fair challenge.

  3. It looks as though NVDIMMs fit into your hardware scheme? They have super-capacitors and are good in case of a power outage…but are they necessary?
