For Arm-Driven Supercomputing, Nvidia is Right on Time

While Marvell’s ThunderX family of server-class processors might not have taken high performance computing by storm from the outset, where there was interest and demand, it was fierce and committed. Luckily, all that effort on optimizing Arm-based architectures for HPC isn’t lost, and in fact, it might have a better shot, especially on machines that might include other accelerators. In short, Nvidia saved the day for labs that invested big in the future of Arm. Los Alamos (LANL) is a perfect example.

LANL will be among the first to receive a supercomputer equipped with Nvidia’s “Grace” CPUs, rounding out the lab’s fleet of machines designed for both classified weapons and broader unclassified scientific research.

The system is expected by the end of next year, with completion in the first half of 2023. Its capabilities will provide some of the Arm-specific processor tailoring the lab was banking on from the ThunderX family (cancelled by Marvell last year), along with the memory bandwidth that drives LANL’s big system decision-making.

LANL is famously experimental with mid-sized supercomputers but, for its mission-critical weapons simulation machines, tends to stick with standards. The current production workhorse, Trinity, is all-CPU, and the future leadership-class system, Crossroads (2022), is also Intel-based (Sapphire Rapids). For the lab’s High Performance Computing Division Leader, Gary Grider, the Nvidia Grace-based machine will provide a “look into the future” of where LANL might go, especially since it appears to have all the elements from other processors cooked into one: processor customization potential, solid memory bandwidth, and from a systems perspective, both the Cray environment (via HPE) and compilers from Nvidia/PGI.

Even though there weren’t many early heavy-hitters serious about ThunderX for large clusters, LANL (and Sandia as well as U. of Bristol with its “Isambard” machine) were quite serious. Grider tells us they were already working with Marvell on ThunderX4 when the entire family of processors was set aside. Luckily, he says Nvidia was there right as that was happening, providing a new route to the most important Arm features a lab like LANL wants: incredible memory bandwidth and the ability to fine-tune the processor for specific work.

“It was really a shame that Marvell left ThunderX on the table because that was going to be an incredible server-class Arm chip with good memory bandwidth. We were interested in Arm because we think tailoring processors is a way for us to get much higher efficiencies—having some of our sauce in the process was where we wanted to go. When Marvell fell through, we started shopping to see what we could do next and honestly, Nvidia needed this, they needed something they could do special things with and needed control over their own destiny in that space. I’m not going to say who infected who, but we’ve been on-board with Nvidia for a long time, since almost before the Marvell blow-up.”

“From a processor point of view, memory bandwidth is king for us. You can see this if you look at Crossroads with Intel SPR and HBM. We wanted an Arm processor because we want to add some of our own things and that’s a high cost with Intel. With Arm, it’s not, all things relative. That’s what we’ve been chasing. If you think about memory latency, you can’t fix it but you can hide it and the only way to do that is to put elements into the processor to do things to help with your problems like scatter/gather, one-sided memory operations, and so on, things that are not application-specific but genre-specific, so to speak,” Grider explains. “That’s why we were going down the ThunderX road in the first place. Intel makes a great processor and SPR will be good but we want to go beyond that and tailor it to our needs. We’re starting with ‘Grace’ but the real target is what is beyond this chip that might make a real difference for us.”

Grider adds that there are other benefits from Nvidia being on time with a replacement strategy processor-wise. For instance, he points to the SVE unit approach since LANL codes do vectorize to some extent. They’ve always used AVX and they needed first-class vectorization along with the memory bandwidth, which was what Marvell really promised, especially with what Grider says he could see on the ThunderX4 roadmap. He thinks the “Grace” processor can fill all these gaps but “how much of that lands in this chip versus the follow-on remains to be seen.”

While LANL knows what is most promising about the forthcoming machine (which dollar-for-dollar should be a 6-7X improvement over the Trinity supercomputer), there are still some remaining areas of exploration. For instance, even though HPE is putting the system together, Grider says they have not yet decided on the interconnect or rack type beyond confirming it will be water cooled. He adds that they will likely have some GPUs on the machine as well, with the CPUs being put to the test for some experimental weapons workloads.

“It doesn’t have to be pressed into immediate weapons certification work, this is a LANL not NNSA machine. It’s unusual for us to do this since we’re a production-oriented weapons site. We will get to play around a bit and use it for some forward-looking work.”

Some of that forward-looking exploration includes keeping system balance via a strong I/O strategy. NVMe and computational storage are at the heart of new approaches Grider wants to try. “We’re one of only a few user players in the NVMe standards space, along with Amazon, so we’re pushing hard for computational storage and doing demos with different companies. But what I really want is user space access directly to the NVMe devices in a secure way. Nvidia comes close with GPU Direct Storage, but the security piece is weaker because NVMe isn’t fully fleshed out on user space access.”

Grider says if it were possible to take the GPU Direct concept and match it with secure computation, it would be the best of all worlds. “How far we’ll get by 2023 with all of this is hard to say with standards and prototyping but we’ll have a good sense in six months or so.” He says that, given the experimental nature of the “Grace”-based system, they might end up setting one-third of the storage system aside to test next-generation secure computational storage.

“The approaching era of exascale AI is bringing an unprecedented wave of innovation in supercomputing,” said Ian Buck, vice president and general manager of accelerated computing at NVIDIA. “NVIDIA’s long-standing partnership with Los Alamos National Laboratory will deepen with NVIDIA Grace.”

“This is a major engagement for the Laboratory,” said Bob Webster, deputy director for Weapons at Los Alamos. “The ongoing close collaboration with NVIDIA to innovate for greater efficiency and more in-depth analytics will benefit both high-fidelity 3D simulation, new modes of computing for mission, and shape our future computing procurements.”

Michael A Bruzzone says:

April 15, 2021 at 1:07 pm

Nvidia Grace surely known among these club members establishes Nvidia procurement of ARM has been a ruse. Regardless of Nvidia cross ARM licensee PDK apprehension, Nvidia vertical by horizontal in server CPU space and regardless CPU competitive in relation GPU complementary nubs the acquisition from a regulatory perspective. Nvidia has found a better way to take control and/or get what Nvidia wants from the ARM road map and ARM Holdings avoids being accused of block marketing?

What a deal.

Mike Bruzzone, Camp Marketing

Jim Cownie says:

April 16, 2021 at 4:02 am

In the context of Arm and HPC it seems a little weird not to mention either that the number 1 machine in the Top 500 is Arm based (https://www.top500.org/system/179807/), or the Fujitsu A64FX technology from which it is built. Especially since you mention that “memory bandwidth drives LANL’s big system decision-making”, and that is precisely an area in which A64FX shines. Such machines are available from Cray (oops… HPE), and are installed both in Isambard 2 (https://www.nextplatform.com/2020/03/08/isambard-2-is-about-driving-technology-diversity/) and at Stony Brook (https://news.stonybrook.edu/stony-brook-matters/alumni/stony-brook-installs-new-supercomputer/).

- Nicole Hemsoth says:
  
  April 16, 2021 at 10:02 am
  
  Was more focused on LANL’s journey with this but good point, Jim. Thanks.
