NFS Might Be the Only Foundation for Exascale

SPONSORED If you’ve been living in a world of heavy-duty parallel file systems and scale-out NAS, chances are good you have been there for a while. Decades, perhaps. Back in those far-flung days when scalability meant making trade-offs, when the network was the bottleneck rather than storage, and hard drives were the (sometimes) steady workhorses they still are today.

And why bother looking beyond this vista after years of investment in Lustre or GPFS/Spectrum Scale and an HDD or even hybrid flash arrangement? It works, after all, you say; flash is too expensive and layering another parallel file system into the mix would create more complications than it would solve at this point.

Here’s why organizations, especially those with rapid scalability concerns, are gazing beyond the well-known pastures of traditional NAS (e.g. NetApp, Isilon).

But here is the funny part about this: all of the tried and true, familiar technologies backing that stable of storage resources are being upended by one technology that has been around just as long: NFS. That’s right, the network file system, which you already know is not a file system at all, but a protocol used to build all the fancy scale-out NAS installations that don’t actually scale beyond certain limits.

If you stopped paying attention to NFS when you switched to a parallel file system, you might have missed some important additions. The biggest is that it is no longer impossible to talk to multiple machines. RDMA support is now built in. And that whole east/west traffic nightmare with endless updates has been resolved.

When you take these innovations and pull them together under a single namespace over a monolithic pool of flash, the possibilities of NFS grow exponentially. In fact, this kind of approach might be the only thing to usher in the era of AI training at any scale, in addition to supporting companies and centers building big HPC systems. We’ll get to how this all comes together and what extra layers are required to do this at 500PB scale in a moment.

The last time many HPC and large enterprise shops shook things up in terms of storage architecture was during the rush to keep pushing data and compute together. The network was the bottleneck. At that time, HDDs were the norm, and although densities went up at a regular cadence, they could never keep pace with network capabilities. Many of the parallel file systems in use had their heyday with 1Gb/sec networking. It’s now 400Gb/sec, a 400x improvement, while the jump from HDD to SSD might be 30x. In other words, in the midst of this transition the cracks in this data colocation strategy started to show.
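To make that gap concrete, here is a minimal Python sketch that uses only the two rough figures quoted above: a 400x improvement in network speed and a roughly 30x gain from moving off HDD. The variable names are illustrative, and the inputs are the article's approximations, not measurements.

```python
# Rough figures taken from the paragraph above; both are approximations.
network_then = 1      # link speed in the parallel file system heyday
network_now = 400     # link speed today, in the same units
media_speedup = 30    # approximate gain from HDD to SSD

# How much faster the network got, and how far the media gain lags behind it.
network_speedup = network_now / network_then
gap = network_speedup / media_speedup

print(f"network: {network_speedup:.0f}x, media: {media_speedup}x, gap: {gap:.1f}x")
```

On these numbers the network has outrun the storage media by roughly an order of magnitude, which is the shift that undermines the old rationale for keeping data next to compute.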

For instance, how were sites, especially with a lot of transactional data, going to handle scaling with updates that needed to propagate across an increasing number of machines? “That kind of traffic within shared nothing architectures has created a law of diminishing returns as they grow beyond petabytes,” says Jeff Denworth, co-founder at VAST Data.

The on-the-fly solution for sites that wanted to scale was to just deploy a bunch of small storage clusters. That is also why parallel file systems persisted: such scale-out NAS systems could not resolve all the backend contention of east-west traffic and ended up blocking on certain nodes that other machines wanted to access.

As legacy Isilon and Dell/EMC users have seen in real time, in the last couple of years in particular, linear performance scaling peters out. While some have taken the approach of building a shared nothing database architecture to cover multiple nodes, contention reigns. The more clients accessing that database, the faster performance degrades, and the ripple effects through the cluster hit hard. What is also petering out is the tough choice between going with a lot of nodes with a little capacity on each to avoid nasty data rebuilds, or having a few big ones and hoping like hell that a rebuild doesn’t take too much downtime.

So, let’s revive NFS since we’re talking about old ways of doing things. And let’s talk about all the new tricks NFS can do, including how well it performs against a pool of shared flash without all the contention, inefficiency, lack of manageability, and bluntness.

If you’re new to NFS again, imagine this: a NAS where each node in a disaggregated, shared everything cluster is stateless. There’s a gleaming pool of SSDs underneath that every node can access, and in those SSDs are a series of transactional data structures that let all those different machines access the same ranges in a single namespace. What this means is no cache, and that means none of the cache coherence code that muddled things up in the old days. That coupling of low-latency flash and NVMe-oF in all its stateless glory means no more east/west traffic and a storage architecture that looks more Exascale-ready than the lumbering parallel file system machines that will hit that mark in the coming years.
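The idea of stateless nodes sharing one pool can be sketched as a toy model in Python. This is an illustration under stated assumptions, not VAST Data's implementation: the `SharedPool` and `StatelessNode` classes, the versioned compare-and-swap, and the `/data/log` path are all hypothetical stand-ins for the "transactional data structures" described above, and an in-process lock stands in for whatever the real storage fabric provides.

```python
import threading


class SharedPool:
    """Toy stand-in for the shared flash pool: one versioned store for all nodes."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}  # path -> (version, contents)

    def read(self, path):
        """Return (version, contents); a missing path reads as version 0, empty."""
        with self._lock:
            return self._data.get(path, (0, b""))

    def compare_and_swap(self, path, expected_version, new_contents):
        """Commit only if nobody has written since expected_version was read."""
        with self._lock:
            version, _ = self._data.get(path, (0, b""))
            if version != expected_version:
                return False
            self._data[path] = (version + 1, new_contents)
            return True


class StatelessNode:
    """Hypothetical front-end node: keeps no cached file data of its own."""

    def __init__(self, pool):
        self.pool = pool

    def read(self, path):
        # Every read goes to the shared pool, so there is no cache to invalidate.
        return self.pool.read(path)[1]

    def append(self, path, chunk):
        # Optimistic retry loop: re-read and try again if another node won the race.
        while True:
            version, current = self.pool.read(path)
            if self.pool.compare_and_swap(path, version, current + chunk):
                return version + 1


pool = SharedPool()
node_a, node_b = StatelessNode(pool), StatelessNode(pool)
node_a.append("/data/log", b"a")
node_b.append("/data/log", b"b")
print(node_a.read("/data/log"))
```

Because neither node holds private state, the two never exchange messages with each other. All coordination happens through the shared pool, which is the property the article credits with removing east/west traffic.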

VAST Data has had success already with some of the world’s largest banks, the NIH, DoE, and NASA, among many others. But what Denworth thinks is keeping the others in the dark is that they have not updated their notions of what NFS can do, how pooled flash can act as an archive and ready-to-go source of data for computation, and that the wealth of NAS-based systems just will not keep up with where they’re going next, whether it’s Exascale computing or massive-scale AI training.

It’s not just the big enterprises and HPC centers that need to take note of existing applications. If what we see written on the wall is correct, AI training will first become a growing workload and will eventually draw on an organization’s entire corpus of data. Instead of different data stored in separate bins, having that entire volume or archive pooled up and ready for compute in fast-to-access flash might be the only way to efficiently do machine learning at any scale. Only a certain amount of data can fit on board GPUs or other systems. As the space matures, and the data-overloaded companies stop building separate flash-based clusters just to handle training and put it all in the same unified pool, part of the complexity of training and a slice of the cost in management and storage/transfer go down.

With emerging workloads like AI or even record-scaling HPC applications, saying that with just a few changes, NFS is ready to take over might seem dramatic. The real alterations to the protocol to make it all just work are in VAST Data’s wheelhouse. All the considerations that used to apply for NFS (no coordination, need for caching, etc.) are legacy in an all-silicon flash world. VAST randomizes the I/O for better parallel performance across a lot of flash and lets its novel data reduction algorithm block similar elements together globally instead of in chunks in ways traditional compression cannot touch.

The mere act of turning off caches means you get “atomic levels of consistency” across 500,000 clients all accessing the same namespace. With NFS over RDMA and the option to use GPUDirect too on the NFS innovation side, streamlined storage is the name of the new game.

Here’s another funny point: just as NFS has matured to meet the demands of 2021, the argument that flash is too expensive is just an excuse that seasoned veterans make to keep their storage environment consistent. With all the innovations in NFS as VAST leverages them, the cost of that pool of flash is not much different from that of all the HDD needed for the same result. Add the efficiencies of data movement, memory savings, and time no longer spent optimizing endless storage tiers, and that cost is even lower beyond CAPEX.

There’s another secret to that price point conversation. If you’re scaling by adding to those pools of flash in burst buffer fashion, there is not a huge advantage in using the highest-end SSDs for the job. Run-of-the-mill flash, just as in a burst buffer, is up to the task. “Think about the idea of shared architectures where you have to make decisions about the value of certain data, what goes down to HDD. Using SSDs as basically a transactional layer in that tiered system means you’re fatiguing them more than if you had just one pool of flash for all data,” Denworth explains.

Flash sticker price is just one consideration. The costs that come with vast simplification are another. “People are changing how they think about legacy storage architecture, either shared nothing or controller-based systems. You can build a much simpler system in 2021 thanks to this revolution at the component level. And this is as fast, if not faster than a parallel file system because of the embarrassing parallelism of this architecture. It scales and with our novel data reduction algorithms, we can deliver in a way that brings flash to practical affordability and scale,” Denworth says.

There is a reckoning in large-scale storage. The speed of the network, the rise of new workloads like AI, and the increasing scalability requirements all mean a rethink of architectures that colocate data and compute. While we may not have anticipated in 2015 that NFS would be the root of salvation, 2021 is a different story entirely.

Sponsored by VAST Data Inc.
