Memory-Like Storage Means File Systems Must Change
May 24, 2017 Timothy Prickett Morgan
The term software-defined storage is in the new job title that Eric Barton has at DataDirect Networks, and he is a bit amused by this. Barton is one of the creators of early parallel file systems for supercomputers and one of the people who took the Lustre file system from a handful of supercomputing centers to one of the two main data management platforms for high performance computing. To a certain way of looking at it, he has always been doing software-defined storage.
The world has just caught up with the idea.
Now Barton, who is leaving Intel in the wake of the chip giant ceasing its commercialized Lustre business, wants to help diversify and commercialize the file system that is at the heart of DDN’s Infinite Memory Engine (IME) burst buffer, and so he has taken the job of chief technical officer for software-defined storage at the high performance storage company.
Barton has spent a career that spans more than three decades on the bleeding edge of high-end storage, starting in 1985 with co-founding Meiko Scientific, a maker of parallel supercomputers based on transputers (remember those?) rather than ordinary vector or single-threaded processors as we know them. Lawrence Livermore National Lab stepped up and bought one of the Meiko Computing Surface clusters rather than the Connection Machine from Thinking Machines. The Computing Surface needed a file system, and Barton tells The Next Platform that Meiko was so short of staff that he had to write the parallel file system, obviously called PFS, himself.
“I didn’t know quite what to do,” Barton recalls. “Solaris had this lovely virtual file system concept, and I thought I could just stripe a file system across other file systems and that will be it. I had one file that I used as the namespace and other file systems where the data was striped across.”
This sounds simple enough, but getting a namespace to scale and therefore provide access to data chunks both large and small across the multiple nodes that comprise a parallel file system is a tricky business, indeed. That is why the Lustre file system was born, and that is why IBM, SGI, Sun Microsystems and others also spent a fortune developing parallel file systems over the decades when supercomputing was new and cool.
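The striping idea Barton describes can be sketched in a few lines. This is a minimal illustration of round-robin striping in general, not the PFS or Lustre code; the function name, the stripe size, and the number of targets are all assumptions made for the example.

```python
def locate(offset: int, stripe_size: int, num_targets: int) -> tuple[int, int]:
    """Map a byte offset in the logical file to (target index, offset on that target).

    Hypothetical helper: stripe units are dealt round-robin across the
    underlying file systems ("targets"), as in classic striping.
    """
    stripe_index = offset // stripe_size        # which stripe unit, counting from 0
    target = stripe_index % num_targets         # round-robin choice of target
    round_number = stripe_index // num_targets  # full passes over all targets so far
    local_offset = round_number * stripe_size + offset % stripe_size
    return target, local_offset


MIB = 1024 * 1024
# With 1 MiB stripe units over 4 targets, the fifth unit wraps back to target 0.
print(locate(4 * MIB + 5, stripe_size=MIB, num_targets=4))
```

The data path is the easy part. The single namespace file in Barton's description is the piece that does not scale, because every lookup funnels through it.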
Lustre was “essentially the same thing as PFS, but done properly,” Barton says with a laugh. “And if you had talked to me ten years ago, I would have said that Lustre would address the scalability issues of the namespace and the exascale file system of tomorrow will essentially be, fundamentally down in its basement, will be Lustre. And then 3D XPoint memory happened.”
The trick here is that 3D XPoint looks like memory, but it is persistent like disk or flash storage. Storage operations have gone from milliseconds down to microseconds, and while Barton says there is a lot of hype about flash and 3D XPoint, the new media are orders of magnitude faster than disk and the storage software layer has to change to take advantage of that difference. The emergence of the NVM-Express protocol, which allows compute and memory complexes to talk directly to flash storage rather than have the flash emulate a disk and go through the SCSI device driver stack, is an example of the kind of change that needs to be made in a system. The same kind of change will have to be made in larger file systems that are used to thinking in terms of petabytes of disk storage spread across tens of thousands of spinning disks.
“You have got this great big thickness of software that is between you and the hardware, and these new technologies, and 3D XPoint is the exemplary, forced me to see that everything has got to change,” explains Barton. “No longer do you have this big latency that lets you hide all of the functionality of the storage system. If you have this thickness of software between the application and the storage media, you deny the application the benefit of that media.”
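A little arithmetic shows why the thickness of software matters more as the media gets faster. The sketch below assumes a fixed software cost per I/O and three illustrative media latencies. None of these figures comes from Barton or DDN; they are round numbers chosen only to show the shape of the problem.

```python
def software_share(software_us: float, media_us: float) -> float:
    """Fraction of one I/O's total latency spent in software rather than in the media."""
    return software_us / (software_us + media_us)


# Assumed, illustrative numbers (microseconds); not measurements of any product.
SOFTWARE_US = 20.0
MEDIA_US = {
    "spinning disk": 5000.0,
    "flash SSD": 100.0,
    "memory-like media": 1.0,
}

for name, media_us in MEDIA_US.items():
    share = software_share(SOFTWARE_US, media_us)
    print(f"{name:>18}: {share:6.1%} of latency is software")
```

Under these assumptions the same software stack is a rounding error in front of a disk and the dominant cost in front of memory-like media, which is the point Barton is making.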
After Intel bought Whamcloud in July 2012 to put its muscle behind commercializing Lustre, Barton had an inside view on the development of 3D XPoint. He has also been working on the Distributed Application Object Storage (DAOS) project with colleagues at Seagate Technology, DataDirect Networks, the University of California Santa Cruz, Sandia National Laboratories, and Lawrence Berkeley National Laboratory. DAOS is being funded by the US Department of Energy’s Fast Forward Storage and IO Stack effort, and the idea is to create an abstraction layer between the Lustre file system and the underlying media, whether it is some kind of NVRAM, flash SSDs, or spinning disks. It is being designed to exploit ultra-low-latency media and interconnects, and it is basically an object storage system, like those made popular by the hyperscalers, with a flexible namespace method that blurs the lines between databases and file systems as we know them.
“DAOS was all interesting, but it is not resulting in products yet,” says Barton. “And I really want to work on products.”
DDN is a commercializer of the Lustre file system as well as a reseller of the Spectrum Scale (GPFS) file system from IBM, so Barton is obviously well acquainted with the company. But it was his collaboration on the DAOS project with Paul Nowoczynski, parallel and distributed storage systems architect at DDN, that led him to the internals of the IME burst buffer, to its potential as the foundation for a much broader set of storage products, and to the chance to help make that happen.
Barton says that the software internals of the IME burst buffer have two main components, and they are a ground-up implementation not based on any other open source projects or previous DDN products. “The dead hand of legacy software is not dragging this back into the swamp,” says Barton.
The IME software stack includes a bulk data store based on the object model that makes use of erasure coding to ensure that data is not lost without having to make two or three copies of the data. (The erasure coding overhead is on the order of 20 percent, not 3X.) The IME stack also has a highly elastic and scalable metadata store, which is what is used to keep track of what is stored in the object storage and where it is placed on the clustered storage in the burst buffer.
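The capacity arithmetic behind that comparison is simple enough to write down. The sketch below assumes a generic layout of k data chunks plus m parity chunks; the 10+2 layout is an assumption chosen because it yields 20 percent, and the article does not say which layout IME actually uses.

```python
def erasure_overhead(data_chunks: int, parity_chunks: int) -> float:
    """Extra raw capacity per unit of user data for a k+m erasure code."""
    return parity_chunks / data_chunks


def replication_overhead(copies: int) -> float:
    """Extra raw capacity per unit of user data when keeping `copies` full copies."""
    return float(copies - 1)


# Assumed layout: 10 data chunks protected by 2 parity chunks.
ec = erasure_overhead(data_chunks=10, parity_chunks=2)
rep = replication_overhead(copies=3)

print(f"10+2 erasure code: {ec:.0%} extra capacity, {1 + ec:.1f}X raw per usable byte")
print(f"3-way replication: {rep:.0%} extra capacity, {1 + rep:.1f}X raw per usable byte")
```

Under this assumed layout any two chunks can be lost without losing data, so the protection is comparable to keeping three copies at a fraction of the raw capacity.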
The IME stack uses a distributed hash table as the mechanism for delivering the global namespace that keeps track of the data as it comes into and exits the burst buffer, and the current implementation has its knobs and dials set specifically for the kind of burst buffer workloads expected in HPC centers. The idea is to take the randomized hot data that is coming off applications and sort it out and cool it down before pumping it out into a traditional parallel file system for longer term storage. Supporting terabytes per second of I/O requires a huge amount of metadata, and whether a file is “a big chunky thing or a crappy small stuff,” the IME stack’s metadata server does an I/O operation for this data. This namespace layer is itself a kind of key/value store and it would not be terribly difficult to layer on top of it the APIs for MongoDB or Cassandra, just to name two popular NoSQL databases.
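For readers who have not met a distributed hash table, the core trick is that any client can compute which node owns a key without asking a central server. The sketch below uses consistent hashing, one common way to build such a table. It is an illustration of the general technique only: the class, the node names, and the number of virtual nodes are assumptions, and nothing here describes how IME is actually implemented.

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent-hash ring: maps keys to the node that owns them."""

    def __init__(self, nodes: list[str], vnodes: int = 64) -> None:
        # Each node gets several points ("virtual nodes") to even out the load.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def owner(self, key: str) -> str:
        # The owner is the first ring point clockwise from the key's hash.
        idx = bisect.bisect_right(self._points, self._hash(key)) % len(self._points)
        return self._ring[idx][1]


ring = HashRing(["node-a", "node-b", "node-c"])  # hypothetical node names
print(ring.owner("/scratch/run42/checkpoint.0001"))
```

The useful property is that removing a node only reassigns the keys that node owned. Everything else stays where it was, which matters when the metadata store has to grow and shrink elastically.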
The neat bit is that this namespace layer can absorb data at different rates, depending on how busy the nodes in the storage cluster are and on what underlying hardware they have, so that performance can be balanced across a wide variety of hardware and application data needs. This capability is important for HPC style workloads because there is a saddle distribution of file sizes and count in most HPC workloads: there are zillions of small files that might only make up 1 percent to 5 percent of total capacity, and then there are thousands of big files (such as those used for checkpointing) that account for the vast majority of the capacity of the storage.
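To make the saddle distribution concrete, here is a back-of-the-envelope calculation. The file counts and sizes are invented for illustration and are not taken from any real HPC site; they were picked so that the small files land inside the 1 percent to 5 percent capacity range described above.

```python
KIB = 1024
GIB = 1024 ** 3

# Assumed, illustrative population of files.
small_count, small_size = 200_000_000, 8 * KIB  # zillions of tiny files
large_count, large_size = 5_000, 10 * GIB       # thousands of checkpoint-sized files

small_bytes = small_count * small_size
large_bytes = large_count * large_size

capacity_share_small = small_bytes / (small_bytes + large_bytes)
count_share_small = small_count / (small_count + large_count)

print(f"small files: {count_share_small:.4%} of files, {capacity_share_small:.1%} of capacity")
```

In this made-up population nearly every file is small, yet almost all of the bytes sit in the large files, so the metadata path and the bulk data path are stressed by entirely different parts of the workload.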
Getting read and write performance for both types of files (and anything in between) is difficult. (Storage startup Qumulo, which was founded by the techies who created Dell/EMC’s Isilon storage, is tackling this bi-modal storage issue, too. And so are others, as we have discussed recently.)
But even HPC workloads are changing fast, so DDN’s products have to react to this. Barton elaborates:
“Metadata is being used so much, that I kind of feel that the large stuff is going to dominate less. It is always going to be important that you always have to handle the large stuff well, that you deliver the streaming bandwidth of the media and that you have efficient erasure codes so you only have 20 percent overhead on the capacity to guarantee availability. But more and more, problems are becoming more irregular and people are dealing with more complex systems. At the beginning of HPC, the problems were very regular and the software was very simple. Now, you look at things like multi-physics codes and it is hugely complex, and it is going to get more and more like that and that is going to put more and more pressure on HPC delivering the kind of IOPS that commercial applications gave up on. Commercial applications have completely abandoned the idea of ever streaming data for their databases out of their storage media. It is all about IOPS. Once you relax that, once you allow yourself not to worry about using up the full bandwidth to get your checkpointing done, and you stop thinking about it as this many gigabytes and instead as this many trillions of chunks, then everything will be in terms of IOPS. The capacity is going to be less squeezed, but the number of entities and therefore the rate of the IOPS is absolutely going to dominate. But you will still have some bulk data requirement.”
Barton says that DDN is not just going to build a burst buffer out of this software stack. Even though the company has a background in HPC and a burst buffer is the first obvious thing to do, DDN wants to take the technology at the heart of the burst buffer further and address much wider markets.
This could include – and Barton is not making promises here – an object storage platform, an elastic block storage system, a graph database, and NoSQL key/value databases. The idea here is not to grab open source code and graft it onto the IME stack, but rather to support the APIs for accessing storage that are used with the popular open source storage systems and databases and let the IME stack handle the transformations. The Swift and S3 object storage APIs, the Ceph and Cinder block storage APIs, and the Cassandra and MongoDB NoSQL APIs are the obvious ones to support, and it is likely that DDN will support these atop its “Wolfcreek” storage platform running the IME stack with a diversity of memory, NVRAM, SSD, and disk media to meet the throughput, latency, and capacity goals of the storage system.
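The approach of supporting an API rather than grafting on someone else's code can be pictured as a thin translation layer over a key/value store. The sketch below is purely hypothetical: the class and method names are invented for illustration and only loosely echo object-storage-style operations, and the in-memory dictionary stands in for whatever the real namespace layer does.

```python
class KeyValueStore:
    """Stand-in for a namespace layer that behaves like a flat key/value store."""

    def __init__(self) -> None:
        self._data: dict[bytes, bytes] = {}

    def put(self, key: bytes, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: bytes) -> bytes:
        return self._data[key]

    def keys_with_prefix(self, prefix: bytes) -> list[bytes]:
        return sorted(k for k in self._data if k.startswith(prefix))


class ObjectFacade:
    """Hypothetical object-style API that translates calls into key/value operations."""

    def __init__(self, store: KeyValueStore) -> None:
        self._store = store

    @staticmethod
    def _key(bucket: str, name: str) -> bytes:
        return f"{bucket}/{name}".encode("utf-8")

    def put_object(self, bucket: str, name: str, body: bytes) -> None:
        self._store.put(self._key(bucket, name), body)

    def get_object(self, bucket: str, name: str) -> bytes:
        return self._store.get(self._key(bucket, name))

    def list_objects(self, bucket: str) -> list[str]:
        prefix = f"{bucket}/".encode("utf-8")
        return [k[len(prefix):].decode("utf-8") for k in self._store.keys_with_prefix(prefix)]


objects = ObjectFacade(KeyValueStore())
objects.put_object("checkpoints", "step-001", b"state")
print(objects.list_objects("checkpoints"))
```

A block, graph, or NoSQL front end would follow the same pattern in this picture: a different facade, the same store underneath.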