Long Live the HPC Parallel File System

When it comes to parallel file systems, few people understand the evolution of challenges better than Sven Oehme, who was part of the original team at IBM building GPFS.

Following almost a quarter century at Big Blue, Oehme has made the shift to DDN where he heads research, focusing on metadata, streaming I/O, and analytics-based optimization on both Lustre and GPFS as well as DDN’s own software stack for fast HPC storage.

For someone who has devoted a career to parallel file systems, hearing that a change in HPC workloads is stretching those file systems to their limits evokes strong opinions. Oehme says there is no chance distributed file systems like Lustre and GPFS will go away in the near future; besides, continuous improvements keep pushing the limits on latency and complexity abstraction. His assertion is that the problem has less to do with the file system and more to do with the choice of storage media for the workloads at hand. Because those workloads are changing, centers with increasingly mixed workloads need to rethink their storage.

“Many users have performance complaints but this is often less a problem of the file system and more of the storage media for the types of workloads that are running,” Oehme says. “If you look at some of these machine learning workloads you will see that the average I/O size compared to previous HPC applications came down significantly. We are no longer talking about megabytes, but kilobytes, anywhere from 30k to 128k. It’s no longer a question about the throughput of a system but rather what the random I/O capabilities of the media are. This is where something like flash has obvious advantages over hard disk. Running workloads with just HDD will not work well for these workloads.”
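To see why shrinking I/O sizes shift the bottleneck from throughput to random access, a back-of-the-envelope model helps. The sketch below is not from Oehme or DDN; the device figures (an 8 ms average access time and 200 MB/s streaming rate for a hard drive, 0.1 ms and 2,000 MB/s for a flash device) are illustrative assumptions, not measurements, and the function name is made up for this example.

```python
# Rough model: each random I/O pays a fixed access cost (seek plus
# rotational latency on a disk) before any data is transferred.
# All device numbers below are illustrative assumptions.

def random_io_throughput_mb_s(io_size_kb, access_time_ms, sequential_mb_s):
    """Effective MB/s when every request of io_size_kb lands at a random location."""
    io_size_mb = io_size_kb / 1024
    transfer_s = io_size_mb / sequential_mb_s   # time spent actually moving data
    total_s = access_time_ms / 1000 + transfer_s  # plus the fixed per-request cost
    return io_size_mb / total_s

# Assumed device characteristics (hypothetical, for illustration only).
HDD = {"access_time_ms": 8.0, "sequential_mb_s": 200.0}
FLASH = {"access_time_ms": 0.1, "sequential_mb_s": 2000.0}

for size_kb in (32, 128, 1024):
    hdd = random_io_throughput_mb_s(size_kb, **HDD)
    flash = random_io_throughput_mb_s(size_kb, **FLASH)
    print(f"{size_kb:>5} KB random I/O: HDD {hdd:7.1f} MB/s, flash {flash:7.1f} MB/s")
```

Under these assumptions the hard drive delivers only a few MB/s at 32 KB because nearly all of its time goes to positioning the head, while megabyte-sized requests recover a large share of its streaming rate. That is the pattern Oehme describes: the same media that served large sequential HPC I/O well struggles once requests drop into the kilobyte range.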

He continues, “If you look at the acceptance criteria for the CORAL machines, you need to be able to create files, write 32KB worth of data in it, and run this workload across a large number of nodes and sustain that for 20 minutes and reach a rate of 2.6 million sustained creates and writes per second. Those are massive numbers. But this is a benchmark, not a workload, and further, many of the systems these are deployed on rely entirely on hard drives. And so when you have a smart caching algorithm in the system you can absorb writes quickly, but if you have to read data back and it’s read randomly and not sequentially, which is what most previous HPC applications did, then you very quickly run into the physical limitations of a disk drive.”
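For a sense of scale, the figures Oehme cites can be multiplied out. The short sketch below uses only the numbers in the quote (2.6 million creates and writes per second, 32KB each, sustained for 20 minutes), treating 32KB as 32 × 1,024 bytes. The final step is a hypothetical: it assumes roughly 150 random I/O operations per second for a single hard drive, a typical ballpark rather than a figure from the interview, to estimate how many drives it would take to serve that operation rate as random reads with no help from a cache.

```python
import math

# Figures taken from the quoted CORAL acceptance criteria.
CREATES_PER_SECOND = 2_600_000
WRITE_SIZE_BYTES = 32 * 1024        # 32KB per file, read as 32 KiB
DURATION_SECONDS = 20 * 60          # sustained for 20 minutes

# Assumption (not from the article): random IOPS a single hard drive can deliver.
ASSUMED_HDD_RANDOM_IOPS = 150

bandwidth_bytes_per_s = CREATES_PER_SECOND * WRITE_SIZE_BYTES
total_files = CREATES_PER_SECOND * DURATION_SECONDS
total_bytes = bandwidth_bytes_per_s * DURATION_SECONDS
drives_for_random_reads = math.ceil(CREATES_PER_SECOND / ASSUMED_HDD_RANDOM_IOPS)

print(f"Implied write bandwidth: {bandwidth_bytes_per_s / 1e9:.1f} GB/s")
print(f"Files created in 20 minutes: {total_files / 1e9:.2f} billion")
print(f"Data written in 20 minutes: {total_bytes / 1e12:.1f} TB")
print(f"Drives needed at {ASSUMED_HDD_RANDOM_IOPS} random IOPS each: {drives_for_random_reads:,}")
```

The write side works out to roughly 85 GB/s and about 3.1 billion files over the run, which a caching layer can absorb. The drive count is the more telling number: at the assumed per-drive rate, matching that operation rate with uncached random reads would take on the order of seventeen thousand spindles.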

This, coupled with the wider use of storage class memory and RDMA-based techniques, means the future of HPC I/O is bright, even if there are some kinks to work out along the way. For instance, he says that as AI and other workload pressures build, file systems are less to blame for the limitations people talk about than some of the policy and POSIX-based restrictions that keep those systems from handling mixed workloads better. “With locking, which is where users are seeing more problems in real upcoming workloads, there is a significant amount of contention as the nodes are all trying to access the same data. The strictness of POSIX and that locking contention is probably more of a problem than the file system overhead itself,” Oehme explains.

Although flash is expensive, it is where the mixed workload problem for HPC storage gets sorted out, which helps explain why burst buffers have received so much attention in the last couple of years. DDN has its own IME burst buffer which, as Oehme says, allows users to put a flash tier on top of an existing parallel file system and gain a high-performance, low-latency tier that can work well for workloads that do not have giant datasets. “One of the fundamental issues something like IME solves is that it takes away the extreme coordination effort that has to happen in a typical distributed file system between clients. If you have multiple clients writing into a single file, IME solves this by essentially having a distributed synchronized clock across the nodes and IO just flows into the IME system and is sent very efficiently down to the distributed file system.”

We go into much more depth about some of the strengths and weaknesses of GPFS and Lustre generally as well as the future of flash hardware and its interaction with the main file systems in the video interview from SC18 below.