It might have been difficult to see this happening a mere few years ago, but the National Nuclear Security Administration and one of its key supercomputing sites are looking past Intel to Arm-based supercomputers in hopes of reaching efficiency and memory bandwidth targets needed for nuclear stockpile simulations.
Los Alamos has recently invested heavily in standard Intel-based machines, without the bells and whistles of other leadership-class national lab machines, in order to stay focused on mission rather than tuning for exotic systems. That has changed with the introduction of Marvell (formerly Cavium) 64-bit ThunderX2 processors into the NNSA supercomputer fleet, but Gary Grider, Deputy Division Leader on the HPC side at LANL, tells us that the jump to Arm was far easier than one might imagine, even with some of the world’s most complex simulation codes.
The first Arm machine at LANL is not a huge one, at a few hundred nodes, Grider says, but it is running real codes (versus mini-apps) and arriving at accurate results. The new system, which went live in the last few weeks, is called “Thunder” and was integrated by Cray as one of its Aries interconnect-based XC50 systems. Cray, among a few others in HPC, has been instrumental in working with the Arm developer community to deepen support for Arm in the Cray operating environment.
All of this work on Cray and Arm’s sides paid off in what Grider described as a surprising launch on a different ISA with the lab’s codes. He says it took very little work and that in early tests they are seeing a 3X improvement in system efficiency over the Broadwell and Haswell machines currently on site. He also tells The Next Platform that it took far less time to stand up the XC50/Marvell cluster than the Broadwell machine. That efficiency number is key here because he says that overall efficiency (percentage of peak) with Broadwell and Haswell on their codes was sub-1% in some cases, with the top end being 3%. The XC50 is more like 3-6%, he says. To those outside of HPC that might sound like an abysmal number. It is, but that is the case for most supercomputing codes no matter what the underlying architecture.
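For readers who want the arithmetic behind "percentage of peak", here is a minimal sketch in Python. The function name and every number in it are hypothetical, chosen only so the results fall inside the ranges Grider quotes; none of them are measurements from LANL systems.

```python
def percent_of_peak(sustained_flops: float, peak_flops: float) -> float:
    """Efficiency as a percentage: sustained application rate over theoretical peak."""
    if peak_flops <= 0:
        raise ValueError("peak_flops must be positive")
    return 100.0 * sustained_flops / peak_flops


# Hypothetical figures for illustration only (not LANL data).
PEAK_FLOPS = 1.0e12          # assumed theoretical peak of one node, FLOP/s
SUSTAINED_BEFORE = 1.0e10    # assumed sustained rate on an older node, FLOP/s
SUSTAINED_AFTER = 3.0e10     # assumed sustained rate on a newer node, FLOP/s

before = percent_of_peak(SUSTAINED_BEFORE, PEAK_FLOPS)
after = percent_of_peak(SUSTAINED_AFTER, PEAK_FLOPS)

print(f"before: {before:.1f}% of peak")
print(f"after:  {after:.1f}% of peak")
print(f"relative improvement: {after / before:.1f}x")
```

The point of the sketch is that a move from roughly 1% to roughly 3% of peak is a 3X gain in efficiency even though both numbers look tiny in absolute terms.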
The processor ecosystem for HPC has been looking richer lately with the real viability of AMD’s role in the market confirmed. Grider says that Rome looks appealing but the momentum for Arm was already in play at the time they were securing this new machine. In other words, AMD is not being discounted for future systems, for the same efficiency and memory bandwidth reasons; its chips were simply not ready when the Thunder system was being planned.
Of course, it also helps that Marvell is a U.S.-based company with a valid 64-bit Arm offering, given the sensitive nature of these workloads. While Fujitsu has a very compelling processor option in the same vein, it is clear why the NNSA would not go that route. As for other Arm-based chips, Grider says they were not server-class in one way or another. “The Arm ecosystem is healthy and robust because it has an enormous industrial base, Arm is used in almost every device we use. That healthy market means a stable base for these server-class chips within an ecosystem that is competitive within its own ecosystem and now against Intel and AMD. While there wasn’t a real HPC ecosystem for this until recently with Cray and others working to build that, we now have a real alternative.”
“We have important work to do to certify the stockpile every year and with the current nuclear posturing, perhaps even more to do. What we are saying is we need more than 1% efficiency. So we are starting to buy machines and fund efforts that work toward generating much higher efficiency machines that require less effort from programmers. There is a lot of talk about co-design, but the hardware people have just gone to design something for another market and we are expected to change our codes. Co-design needs to do both things for both sides. It wasn’t happening so we have started down this path and want anyone else doing complex simulations to see that there is something more amenable than what the Linpack-focused approaches are providing.”
Grider says there is nothing wrong with those Linpack-centric machines with GPU floating point prowess given the rise in machine learning workloads in HPC. However, for strict simulations at scale, this is not a viable path. LANL will be working with Marvell to build more features necessary for their simulation workloads into future generation chips.
In February, LANL will issue an RFP for the revised version of its Crossroads HPC program, which will invest in the lab’s next capability and capacity machines for the 2021 timeframe. Grider says that the focus there will be on efficiency, memory bandwidth capability, and ease of use by programmers. From the sound of it, Arm might find one of its most striking HPC use cases in the center of one of the most challenging simulation workloads.
“Every time we get more memory bandwidth, we go faster. But when you give us FLOPS, we don’t. We just wait for memory more often and we are looking for a better way to do this for our codes.”
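One common way to reason about the behavior Grider describes is the roofline model, which caps a kernel's attainable rate at the lower of the machine's peak FLOPS and its memory bandwidth multiplied by the kernel's arithmetic intensity (FLOPs per byte moved). The sketch below is an illustration of that model, not something LANL said it uses, and every number in it is a hypothetical placeholder.

```python
def attainable_flops(peak_flops: float, mem_bandwidth: float,
                     arithmetic_intensity: float) -> float:
    """Roofline bound: min(compute ceiling, bandwidth * FLOPs-per-byte)."""
    return min(peak_flops, mem_bandwidth * arithmetic_intensity)


# Hypothetical machine and kernel (illustrative values, not measurements).
PEAK_FLOPS = 1.0e12      # assumed compute ceiling, FLOP/s
MEM_BANDWIDTH = 1.0e11   # assumed memory bandwidth, bytes/s
INTENSITY = 0.25         # assumed FLOPs per byte for a memory-bound code

baseline = attainable_flops(PEAK_FLOPS, MEM_BANDWIDTH, INTENSITY)
more_flops = attainable_flops(2 * PEAK_FLOPS, MEM_BANDWIDTH, INTENSITY)
more_bandwidth = attainable_flops(PEAK_FLOPS, 2 * MEM_BANDWIDTH, INTENSITY)

print(f"baseline:            {baseline:.2e} FLOP/s")
print(f"2x peak FLOPS:       {more_flops:.2e} FLOP/s")
print(f"2x memory bandwidth: {more_bandwidth:.2e} FLOP/s")
```

Under these assumptions the kernel sits on the bandwidth side of the roofline, so doubling peak FLOPS leaves the attainable rate unchanged while doubling memory bandwidth doubles it.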