We spent some time at the end of 2018 getting a handle on how parallel file systems in HPC need to evolve to meet the shifting demands in workloads driven by machine learning. Since that time, we have taken a close look at how some companies catering to HPC have been retuning their stacks to better handle traditional large files as well as the small file and metadata requirements of deep learning.
It is worth noting that this dual handling of both large and small files in HPC is not necessarily new. In the life sciences, where robust HPC resources are needed, this has been a perennial problem, especially with more data and complexity from genomics-driven workloads.
If AI is going to be integrated into more HPC workflows, however, some have argued that there has to be a better way than GPFS or Lustre. And as HPC-rooted storage companies like DDN and Panasas countered recently, there are ways around those limitations.
A relative newcomer to HPC, WekaIO, has architected its own way around the parallel file system wall for mixed workloads by making changes to how file systems function and taking advantage of low latency options like NVMe over fabric.
First, in their approach, all of the metadata is scale-out in ways that are similar to how object stores handle the job without losing POSIX compatibility—a big deal for the HPC centers of the world. We have explored this concept in the past but generally speaking, object has an easier time outside of the big file world because objects are immutable and the semantics are easier.
Second, WekaIO uses NVMe over fabrics with its custom file system to parallelize I/O not at megabyte granularity but in 4K units. This might be a bit new to the strict HPC shops, so let's put this in some context.
For most NVMe devices, writing takes around 30 microseconds and reading is closer to 100; this is just physics. But the NVMe over fabrics stack on 100 Gb/sec Ethernet or InfiniBand can move 4K in around a single microsecond. Whether the I/O goes to local or remote NVMe, there is some inherent latency, so for workloads with very small I/O patterns, WekaIO parallelizes those I/Os. Instead of running an 8K I/O from a local device, which takes longer than a 4K I/O, the company runs two concurrent 4K I/Os on two devices. If it needs to handle a larger I/O, say 64K (which is still relatively small for POSIX), it can run 16 concurrent 4K I/Os, which in theory promises a much faster result.
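To make the arithmetic concrete, here is a toy latency model in Python. It is a sketch, not WekaIO's implementation: the two constants are the approximate figures quoted above, the function names are hypothetical, and the assumption that a single device pays the full read latency for every 4K block is a deliberate simplification (real NVMe devices have internal parallelism).

```python
# Toy latency model for splitting a small read into concurrent 4K I/Os.
# The constants are the approximate figures quoted in the article; the
# linear per-block cost on one device is an assumption, not a measurement.

BLOCK_BYTES = 4 * 1024
DEVICE_READ_US = 100.0  # approximate NVMe read latency per 4K block
FABRIC_4K_US = 1.0      # approximate NVMe-over-fabrics cost to move 4K


def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def single_device_us(size_bytes):
    """Assumed cost when one device services every 4K block back to back."""
    blocks = ceil_div(size_bytes, BLOCK_BYTES)
    return blocks * DEVICE_READ_US


def striped_us(size_bytes, devices):
    """Assumed cost when 4K blocks are issued concurrently across devices.

    Each round reads one block per device in parallel and pays one
    fabric transfer on top of the device latency.
    """
    blocks = ceil_div(size_bytes, BLOCK_BYTES)
    rounds = ceil_div(blocks, devices)
    return rounds * (DEVICE_READ_US + FABRIC_4K_US)


if __name__ == "__main__":
    for size_kb, devices in [(8, 2), (64, 16)]:
        size = size_kb * 1024
        print(size_kb, devices, single_device_us(size), striped_us(size, devices))
```

Under these assumptions the striped path stays close to the latency of a single 4K read regardless of request size, as long as there are enough devices to cover every block in one round. The extra microsecond of fabric time is small next to the device latency, which is the point the numbers above are making.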
For those in HPC reading in megabytes, just a single megabyte is very small for GPFS or Lustre, given all the machinations they need to perform for that single element. But with this kind of approach it is a big enough file to make a difference, since the single megabyte is parallelized as, say, 16 64K I/Os directly to NVMe, each reading at its own device's throughput. In other words, instead of sending a single megabyte to local NVMe, the request gets split across devices and makes use of the fast networking for low latency.
WekaIO can also run alongside traditional parallel file systems, and further, it is possible to automatically reconfigure each I/O. As for the scalable metadata services, if this all works in practice, the system should be able to fulfill as many I/Os as required, and if more are needed, scaling up the file system should work.
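The one-megabyte split can be sketched as a small planning function. This is an illustration only: `plan_stripes`, the round-robin placement, and the device count are assumptions made for the example, and the article does not describe how WekaIO actually maps chunks to devices.

```python
def plan_stripes(offset, length, chunk_size, num_devices):
    """Split the byte range [offset, offset + length) into chunk-aligned
    pieces and assign each piece to a device round-robin by chunk index.

    Returns a list of (device, piece_offset, piece_length) tuples.
    Round-robin placement is an assumption made for illustration.
    """
    plan = []
    pos = offset
    end = offset + length
    while pos < end:
        chunk_index = pos // chunk_size
        chunk_end = (chunk_index + 1) * chunk_size
        piece_len = min(end, chunk_end) - pos
        plan.append((chunk_index % num_devices, pos, piece_len))
        pos += piece_len
    return plan


if __name__ == "__main__":
    # A 1 MB read split into 64K pieces across 16 hypothetical devices.
    for device, piece_offset, piece_len in plan_stripes(0, 1 << 20, 64 * 1024, 16):
        print(device, piece_offset, piece_len)
```

With these parameters the plan contains 16 pieces of 64K, one per device, so all of them can be issued at once. Requests that do not start on a chunk boundary simply produce a shorter first piece.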
To minimize idle time for compute clients, HPE has partnered with WekaIO for its high-performance shared storage. WekaIO Matrix3 includes the MatrixFS flash-optimized parallel file system, qualified on HPE Apollo 2000 Gen10 systems and HPE ProLiant DL360 Gen10 Servers with Mellanox interconnect features.
Matrix, WekaIO's file system and flash technology, targets the performance of all-flash arrays with the scalability and economics of the cloud. Matrix transforms NVMe-based flash storage, compute nodes, and interconnect fabrics into a high-performance, scale-out parallel storage system that is focused on I/O-bound use cases.
Eight of the WekaIO-enabled HPE ProLiant DL360 servers, interconnected with Mellanox 100 Gbps EDR networking and running the WekaIO Matrix file system software, are capable of delivering 30 GBps for sequential 1 MB reads and over 2.5 million IOPS for small 4K random reads. The infrastructure is capable of scaling to hundreds of storage nodes in a single namespace, and HPE has published some noteworthy results on ResNet and other performance metrics.
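A quick back-of-the-envelope check puts those figures in per-server terms. The sketch below assumes decimal gigabytes (10^9 bytes), 4K meaning 4,096 bytes, and the raw 100 Gbps line rate with no protocol overhead; none of these conventions are stated in the source.

```python
GB = 10**9  # decimal gigabyte, an assumed convention

servers = 8
seq_read_bytes_per_s = 30 * GB      # 30 GBps sequential 1 MB reads
random_read_iops = 2_500_000        # 4K random read IOPS
io_bytes = 4 * 1024                 # assuming 4K = 4,096 bytes
link_bits_per_s = 100 * 10**9       # one 100 Gbps EDR link, raw line rate

# Sequential throughput each server contributes, in GB/s.
per_server_seq_gb = seq_read_bytes_per_s / servers / GB

# Aggregate bandwidth implied by the small random reads, in GB/s.
random_read_gb = random_read_iops * io_bytes / GB

# Raw capacity of a single link, in GB/s.
link_gb = link_bits_per_s / 8 / GB

print(per_server_seq_gb, random_read_gb, link_gb)
```

Under those assumptions each server delivers about 3.75 GB/s of sequential reads against a 12.5 GB/s link, and the 4K random read figure works out to roughly 10 GB/s in aggregate.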
“This is all about taking advantage of the emergence of flash technology in the interim period when Lustre and GPFS were being developed and driven by the disk drive era,” Liran Zvibel, co-founder and CEO, tells The Next Platform.
Zvibel has been in the high performance storage realm for most of his career and was one of the architects responsible for the hardware and software stack for the XIV Storage System, which IBM acquired in 2007 and rolled into Spectrum Accelerate, which is part of Spectrum Scale, which is the thing we still shamelessly refer to here as GPFS, just like everyone else does.
“GPFS has been successful but it was designed in the 1990s for hard drives and a relatively small number of large files. IBM was not able to adapt it and bring it into the age of NVMe and related technologies that can take full advantage of fast networking,” Zvibel tells us. “We saw an opportunity to build this and invest in a radically new data structure and algorithms that make sense for file systems that can run on tremendous amounts of NVMe and fast networks. We designed this from the ground up and have over fifty patents along the way all with the goal of paving new ways for how POSIX file systems can leverage new technologies.”
The real bottleneck in mixed workloads is in the file system, which WekaIO tackled by building a new one from the ground up, designed to provide parallel storage performance using NVMe over fabrics while borrowing tricks similar to those Lustre and GPFS use to get more throughput. “To do that, we had to rethink how metadata works. We created a file system with no carry-over and it is now faster than a block array and more scalable than object. We want to show people that it is not necessary to change a workload to fit previous modes and keep running at scale and very fast.”
HPC is one area where mixed workloads are a storage issue, but WekaIO is seeing the bigger picture and focusing on bottlenecks that span supercomputing and commercial deep learning. For instance, GPU starvation is one issue they hope to tackle with their parallel Matrix file system. “Modern analytics platforms need to process large datasets to deliver the highest levels of accuracy to training and analytics systems,” Zvibel says.
“A high bandwidth, low latency storage infrastructure is essential to ensure the cluster is fully saturated with as much data as the application needs, otherwise it wastes expensive GPU resources. Our file system can saturate a GPU node and the integrated cloud tiering scales to exabytes of capacity in a single namespace.”
It would be nice to understand data durability and access reliability. Filesystems generally take a long time to reach maturity. If a flash device fails, is the data replicated elsewhere? If a server fails, does the entire cluster go down? I see lots of data on performance, but very little on reliability and durability, other than using S3 as backing store. Durability features frequently impact performance.
John Haller is correct to point out that there usually is a trade-off between performance and keeping data safe, or uptime & availability. However, WekaIO’s Matrix filesystem architecture was designed to deliver its extraordinary performance while also maintaining both data integrity and system resilience without compromise.
Files stored in Matrix are protected using WekaIO’s own error-correction implementation — different from but functionally equivalent to Reed-Solomon erasure coding. The result is more cost-effective than triple replication while also providing superior data durability depending on the configuration. And individual files can be dynamically rebuilt as needed.
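The cost comparison with triple replication comes down to simple ratios. The sketch below uses hypothetical stripe layouts (8 data plus 2 parity shards, and 16 data plus 4 parity shards) purely for illustration; WekaIO's actual configurations are not given here, and the function names are invented for the example.

```python
from fractions import Fraction


def erasure_overhead(data_shards, parity_shards):
    """Raw capacity consumed per unit of usable data for a k+m layout."""
    return Fraction(data_shards + parity_shards, data_shards)


def replication_overhead(copies):
    """Raw capacity consumed per unit of usable data with full copies."""
    return Fraction(copies, 1)


if __name__ == "__main__":
    # Hypothetical layouts, not WekaIO's documented settings.
    for k, m in [(8, 2), (16, 4)]:
        print(f"{k}+{m}: overhead {float(erasure_overhead(k, m))}x, "
              f"survives {m} lost shards")
    print(f"3 copies: overhead {float(replication_overhead(3))}x, "
          f"survives 2 lost copies")
```

In a Reed-Solomon-style k+m layout any m shards can be lost without losing data, so both example layouts match or exceed the two failures that triple replication survives while using 1.25 times the usable capacity instead of 3 times. That is the sense in which durability depends on the configuration chosen.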
Failure of whole storage nodes is rare, but the WekaIO Matrix cluster architecture is designed for resilience, and the same error-correction schema that protects data also allows the cluster to continue operating even while one or more nodes are missing — and to rapidly reconstruct the complete cluster integrity when replacement hardware is provided.
Lastly, John also asked about replicating data elsewhere, and I’d like to point out that in addition to supporting thousands of instantaneous snapshots with zero performance impact, the WekaIO Matrix filesystem can also export a special kind of snapshot to be stored externally on S3-API compatible object storage (the “snap-to-object” feature). On an as-needed basis, that special snapshot can be “rehydrated” by any other instance of the WekaIO Matrix filesystem. This provides a very efficient mechanism for propagating data to other locations, and subsequently that remote image of the data in a cluster can be periodically incrementally updated to reflect changes to the original data.