When it comes to advanced technologies at the high end of compute, networking, and storage, Lawrence Livermore National Laboratory is one of the world’s pathfinding testbeds. Trying new things at scale is a big part of the mandate for the lab, which, among other things, is the US Department of Energy’s site for running the simulations and models that manage the effectiveness of the US military’s nuclear weapons arsenal.
While Lawrence Livermore certainly has its fair share of capability-class supercomputers at the top of the Top 500 charts, the lab has a lot of different systems, which are used for both classified and unclassified work, and it has over 250 petabytes of storage that holds the data for these many systems. To get a sense of the size and scope of the storage systems in use at Lawrence Livermore, we had a long conversation recently with Robin Goldstone, HPC strategist in the Advanced Technologies Office at Lawrence Livermore.
Goldstone started at the lab back in 1995 as an HPC systems integration lead and was responsible for the design, deployment, and support of some of the first Linux clusters running at scale on the planet as well as several generations of exotic supercomputers based on myriad technologies. In 2006, she was tapped to be the associate program leader for networks and convergence at Lawrence Livermore, taking control of the strategic planning for the lab’s network connectivity. In 2011, Goldstone was elevated to her current position, where she is involved in developing the HPC strategy at the lab, including its exascale systems as well as commodity clusters and the next-generation I/O and storage systems that support them.
The file systems at Lawrence Livermore are at the moment dominated by the Spectrum Scale (also commonly called GPFS) parallel file system, but the lab has a long history of fostering the development of and the use of the open source Lustre parallel file system. And, more recently, the lab is also testing out the all-flash NFS scale out file system from VAST Data to see where this might fit into its future exascale-class storage needs.
The big GPFS parallel file system at Lawrence Livermore is attached to the “Sierra” hybrid CPU-GPU system that was built by IBM and installed in full in 2018. That machine weighs in at 125 petaflops of peak double precision performance and it has 150 PB of capacity to do its classified work. A smaller unclassified chip off Sierra, called “Lassen,” weighs in at 22.5 petaflops and has its own proportionately sized GPFS storage. The rest of the parallel file systems used on various clusters at Lawrence Livermore are predominately based on Lustre, the open source alternative to GPFS that was created by Peter Braam of Carnegie Mellon University and cultivated by the lab. Goldstone says that the lab also has NFS file systems for user home and scratch directories, mostly NetApp filers, but the lab is using some of its COVID-19 funding to expand its use of flash-based Universal Storage from VAST Data. And, of course, there is a mammoth tape library that archives all of the data the lab generates and can’t throw away.
“We have a little bit of everything,” Goldstone tells The Next Platform. “We try to have our file systems be mounted across all of our clusters – the exception being against these very large systems like Sierra where that file system is dedicated to that computing system. But in general, our users move around from one system to another, and they use different systems depending on availability. In some cases, there is specific hardware that they might want to use, and they want to have their data with them wherever they go. And it’s advantageous to us as well because otherwise they are going to make copies of it everywhere.”
Portability is a key aspect of computing and storage at Lawrence Livermore.
“We really want our users to develop codes to be portable so they can run on different systems,” says Goldstone. “This whole notion of portability applies to their applications, and it also applies to storage, and in that regard, we want a common storage interface.” And that interface, of course, is a POSIX file system. And whatever parallel file system abstraction is used, and whatever underlying file system is used on the nodes in the storage cluster, it all ultimately has to look like plain old POSIX to the applications.
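To make the portability point concrete, here is a minimal sketch of what a common POSIX interface buys an application. It is an illustration, not code from the lab: the `roundtrip` helper and the `checkpoint.bin` file name are hypothetical, and a temporary directory stands in for a Lustre, GPFS, or NFS mount point. The same open, write, fsync, and read calls would be issued unchanged whichever file system sits underneath.

```python
import os
import tempfile


def roundtrip(directory: str, payload: bytes) -> bytes:
    """Write payload under `directory` and read it back with POSIX-style calls.

    `directory` is a hypothetical stand-in for any mounted file system;
    nothing here depends on which one it is.
    """
    path = os.path.join(directory, "checkpoint.bin")  # hypothetical file name

    # Create (or truncate) the file and write the payload.
    fd = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_TRUNC, 0o600)
    try:
        # A small payload is written in one call here; production code
        # would loop until every byte has been written.
        os.write(fd, payload)
        os.fsync(fd)  # ask the file system to make the data durable
    finally:
        os.close(fd)

    # Reopen read-only and fetch the bytes back.
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.read(fd, len(payload))
    finally:
        os.close(fd)


# A temporary directory plays the role of the mount point in this sketch.
with tempfile.TemporaryDirectory() as scratch:
    result = roundtrip(scratch, b"simulation state")
```

Pointing `roundtrip` at a different mount point is the only change needed to move between storage systems, which is the property Goldstone is describing.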
Lawrence Livermore has not said much about the storage in the “El Capitan” exascale-class machine that it will be installing in late 2022 and ramping up production on in 2023. But this being a “Shasta” Cray EX system with a Slingshot Ethernet interconnect, it was natural enough to think it would have a Lustre parallel file system. Cray backed Lustre even before it was acquired by Hewlett Packard Enterprise last year, and HPE wants to keep the storage money for itself rather than send it over to Big Blue for a parallel file system that is 2.7X as big as the one used on Sierra. El Capitan is going to have in excess of 2 exaflops of peak double precision floating point compute, and will have a very large file system as part of its $500 million hardware budget – around 400 PB in a disk array, as we reported a year ago as Cray was announcing its ClusterStor E1000 arrays, with around 2.5 TB/sec of aggregate bandwidth.
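The capacity comparison can be checked with a few lines of arithmetic. This sketch uses only figures quoted in this article (150 PB on Sierra, around 400 PB and around 2.5 TB/sec on El Capitan) and assumes decimal units, with 1 PB equal to 1,000 TB. The fill-time figure is a back-of-the-envelope derivation, not a number from the lab or from HPE.

```python
# Figures quoted in the article; decimal units are an assumption.
SIERRA_CAPACITY_PB = 150
EL_CAPITAN_CAPACITY_PB = 400
EL_CAPITAN_BANDWIDTH_TB_PER_SEC = 2.5
TB_PER_PB = 1000

# How much bigger the El Capitan file system is than Sierra's.
capacity_ratio = EL_CAPITAN_CAPACITY_PB / SIERRA_CAPACITY_PB

# Time to stream the full capacity once at the aggregate bandwidth.
fill_seconds = EL_CAPITAN_CAPACITY_PB * TB_PER_PB / EL_CAPITAN_BANDWIDTH_TB_PER_SEC
fill_hours = fill_seconds / 3600

print(f"capacity ratio: {capacity_ratio:.1f}X")
print(f"time to fill at full bandwidth: {fill_hours:.1f} hours")
```

The ratio rounds to the 2.7X cited above, and at the quoted bandwidth it would take a little under two days to write the whole file system once.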
The future Lustre parallel file system attached to El Capitan splits the difference between using local flash on compute nodes as baby burst buffers for each node and virtualizing that up to look like a global burst buffer, as was done with GPFS by IBM on Sierra, and actually building a giant, pooled burst buffer as some supercomputing centers have implemented.
“It’s a model where there are small pools of flash storage that are local to a set of nodes and can connect those nodes via PCI switching infrastructure, but can also be used to build a distributed file system, something like a small-ish Lustre file system that’s local and dedicated to a single user’s job. So I can’t speak a lot more about that technology. It’s still under development, but it’s going to be another way to try to solve that problem.”
Lawrence Livermore has used the Zettabyte File System, or ZFS, that Sun Microsystems created for its Solaris Unix operating system, as the underlying file system on the local node that supports Lustre, and was in fact instrumental in the porting of ZFS, which is open source, to the Linux operating system. And for the El Capitan system, HPE and Cray will be working with the lab to see how to take advantage of the object storage layer in Lustre, where the data is actually stored, and how to use flash as the media to capture small files quickly and therefore improve the overall performance of Lustre across diverse file sizes.
If portability of applications and data is important within Lawrence Livermore, then experimentation across the various HPC labs that are funded by the US Department of Energy – there are seventeen such labs, but Lawrence Livermore, Oak Ridge National Laboratory, Los Alamos National Laboratory, Sandia National Laboratories, Argonne National Laboratory, and Lawrence Berkeley National Laboratory are the big ones in terms of supercomputer budgets and installations – is equally important.
“We have different DOE labs with different missions and different applications and different use cases,” explains Goldstone. “But we do try to collaborate and look at these different technologies and adopt different ones in different places so that everybody has a chance to try out these technologies and see if there is one that is better than the other.”
And some of that experimentation at Lawrence Livermore means trailblazing with VAST Data’s Universal Storage, which debuted in February 2019 and which is designed to look like an NFS file system but have the huge storage capacity and I/O performance of a parallel file system.
“VAST Data is very interesting to us,” Goldstone says, adding that the company’s Universal Storage has a lot of promise. “One thing that’s very interesting is from the client side, it looks like NFS, meaning it uses the standard NFS client that’s part of Linux and is part of the OS stack that runs on all of our clusters, versus something like Lustre or GPFS, where you have a client that’s provided with the file system, but it is a piece of software that has to be integrated into our software stack.” Meanwhile, on the backend, says Goldstone, Universal Storage looks nothing like an NFS file system as we know it, with all of its limitations. “It’s a scale out architecture, and again, it has lots of innovative features. It’s making use of 3D XPoint as a landing area where data initially gets deposited and it is persistent immediately. It has performance characteristics closer to memory, but the persistence of flash. And then behind that is another layer of very low cost flash. They’re using cheap, commodity flash and managing the erasure coding and wear leveling of that flash very efficiently so that they can make that flash be used for a long period of time.”
The other attractive thing about VAST Data, according to Goldstone, is that it can scale out in two ways. The CNodes, which project the storage out to the client, can be scaled out to increase the performance of the storage, and the storage capacity can be scaled out separately as needed in HA Enclosures that sit on the other side of the NVM-Express fabric that links all of this together. That independent scaling of the compute that drives the storage and of the storage itself is one draw. The different projections are another, with NFS and S3 object storage being the key ones that matter at Lawrence Livermore right now.
“Object storage hasn’t been a big area of interest at Lawrence Livermore for our applications, but we do have new and different types of applications that we’re trying to support where people want object storage. It’s just available with another interface with VAST, so we can just turn it on and let people try it.”
And so the lab is encouraging users who have capacity quotas on Lustre or GPFS to give Universal Storage a whirl. At the moment it doesn’t have any quotas, precisely because Lawrence Livermore wants to entice users to try it.
Things that take off at Lawrence Livermore have a habit of taking off elsewhere, as GPFS and Lustre did.