Oak Ridge National Lab’s forthcoming “Frontier” supercomputer will be blazing a multitude of new trails when it goes live in 2022. While it is easy to focus on the massive scale of the compute resources, that scale presents new challenges from an I/O standpoint.
“Scale is always a problem and everything we do is at extreme scale,” ORNL IO Working Group lead for the Frontier system, Sarp Oral, tells The Next Platform. “First, the vendors cannot test a system of this scale before it’s deployed so we will have to trailblaze at deployment time to work out the kinks and make sure the system is stable.” As the lead for all things I/O, Oral is central to a dual-pronged approach to storage: a flash-based in-system storage layer and a center-wide Lustre- and ZFS-based file system called Orion.
The Frontier supercomputer will test some serious I/O capabilities. The broader flash environment will comprise over 5,000 NVMe devices for a peak read/write of 10 TB/s and over 2 million random-read IOPS. Capacity-wise, those NVMe drives add up to 11.5 petabytes. Of particular interest is a flash-based metadata tier of 480 more NVMe devices to handle the mixed workloads and the new file system’s structure, all while adding another 10 petabytes of capacity.
It would be unimaginably expensive to outfit all of Frontier with high-end NVMe-based technology, of course. The bulk of the storage environment comprises 47,700 disks for almost 700 petabytes of capacity and peak read/write of around 4.6 TB/s.
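The economics behind that tiering become clear with some back-of-envelope arithmetic on the figures above. The sketch below (illustrative only, using only numbers quoted in this article) divides each tier's aggregate peak bandwidth by its device count:

```python
# Back-of-envelope throughput math for Frontier's storage tiers,
# using the aggregate figures quoted in the article. Illustrative only.

def per_device_bw_gbps(aggregate_tbps: float, device_count: int) -> float:
    """Average peak bandwidth each device must sustain, in GB/s."""
    return aggregate_tbps * 1000 / device_count

# Flash performance tier: 10 TB/s across ~5,000 NVMe devices
flash_bw = per_device_bw_gbps(10, 5000)

# Disk capacity tier: 4.6 TB/s across 47,700 hard disks
disk_bw = per_device_bw_gbps(4.6, 47700)

print(f"flash: {flash_bw:.2f} GB/s per device")   # ~2 GB/s per NVMe drive
print(f"disk:  {disk_bw * 1000:.0f} MB/s per device")  # ~96 MB/s per disk
```

Each NVMe drive only needs to sustain roughly 2 GB/s (comfortably within PCIe Gen4 NVMe territory), while each hard disk contributes on the order of 100 MB/s — a reminder that the disk tier buys capacity, not speed.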
“On a machine of this size,” Oral explains, “even if all applications are doing proper large-scale I/O, at the end of the day, a centralized file system will see a mesh of all those requests, random or otherwise, and from a file system perspective, everything will become mixed, random I/O even if the applications aren’t presenting that.” This approach will provide a standard backbone to match increasingly mixed workloads, from traditional modeling and simulation to machine learning. The performance tier can tackle both workloads with lower latency than disk, and the end result will be a more uniform environment for applications of all stripes.
“Our workloads will increasingly be a mix of traditional modeling and simulation and AI-type workloads, so it’s not just large message performance we need to think about, but also all the small message random read,” Oral adds.
Orion will have 40 Lustre metadata server nodes and 450 Lustre object storage service (OSS) nodes. Each OSS node will provide one object storage target (OST) device for performance and two OST devices for capacity—a total of 1,350 OSTs systemwide.
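The OST count above follows directly from the node layout, and it hints at how the disks are spread across the capacity targets. A quick sketch (the disks-per-OST figure is our own inference, assuming all 47,700 drives sit behind the capacity OSTs — the article does not state this):

```python
# Illustrative arithmetic for Orion's Lustre layout as described:
# 450 OSS nodes, each serving one performance OST and two capacity OSTs.
OSS_NODES = 450
PERF_OSTS, CAP_OSTS = 1, 2           # per OSS node

total_osts = OSS_NODES * (PERF_OSTS + CAP_OSTS)
print(total_osts)                    # 1,350 OSTs systemwide

# Assumption (not stated in the article): the 47,700 hard disks all
# sit behind the capacity OSTs. Each capacity OST would then front:
disks_per_cap_ost = 47700 // (OSS_NODES * CAP_OSTS)
print(disks_per_cap_ost)             # ~53 drives per capacity OST
```

Under that assumption, each of the 900 capacity OSTs would aggregate roughly 53 drives, which is consistent with ZFS pooling many disks behind each Lustre target.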
An extra 160 nodes will serve as routers, providing peak read/write speeds of 3.2 terabytes per second to all other OLCF resources and platforms.
The in-system storage layer will employ compute-node local storage devices connected via PCIe Gen4 links to provide peak read speeds of more than 75 TB/s, peak write speeds of more than 35 TB/s, and more than 15 billion random-read input/output operations per second (IOPS). OLCF engineers are working on software solutions to provide a distributed per-job namespace for the devices.
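The usage pattern such a layer enables is the classic burst-buffer flow: a job writes checkpoints to fast on-node NVMe first, then drains them to the center-wide file system. The sketch below is hypothetical — OLCF's actual per-job namespace software is still in development, and the function, paths, and API here are our own illustration, not OLCF's:

```python
# Hypothetical sketch of the node-local storage pattern: write to fast
# on-node NVMe first, then drain a copy to the parallel file system.
# All names here are illustrative, not part of any OLCF software.
import shutil
from pathlib import Path

def checkpoint(data: bytes, local_dir: Path, pfs_dir: Path, name: str) -> Path:
    """Write to node-local NVMe, then drain a copy to the parallel FS."""
    local = local_dir / name
    local.write_bytes(data)      # fast, low-latency on-node write
    dest = pfs_dir / name
    shutil.copy2(local, dest)    # drained to Lustre; here, a plain copy
    return dest
```

In practice the local directory would be something like a job-private mount on the node's NVMe and the destination a Lustre path, with the drain step handled asynchronously by staging software rather than an inline copy.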
“Orion is pushing the envelope of what is possible technically due to its extreme scale and hard disk/NVMe hybrid nature,” said Dustin Leverman, leader of the OLCF’s High-Performance Computing Storage and Archive Group. “This is a complex system, but our experience and best practices will help us create a resource that allows our users to push science boundaries using Frontier.”