There are just two Arm-powered supercomputers on the latest TOP500 rankings: the “Astra” system at Sandia National Laboratories and Fujitsu’s new A64FX prototype. The latter captured the number one spot on the Green500 list, becoming the first non-accelerated system to do so since 2012. There are also a handful of smaller Arm-based clusters, such as the University of Bristol’s Isambard supercomputer, which has yielded some impressive performance results on an array of HPC microbenchmarks.
Those achievements have to be weighed against the small number of Arm-based HPC machines actually in the field. Numbers do matter, and a lot of industry watchers, including us, had speculated that the architecture would be enjoying wider adoption at this point. At our HPC Day event prior to SC19, we spoke with Brent Gorda, senior director of the HPC business at Arm, about what may be behind the slow rollout and what the company is doing to speed adoption.
As Gorda has told us before, developing a new processor is not a quick process and can require $100 million or more just to get to first silicon. “The primary reason is that hardware takes a long time to come out of the oven,” Gorda tells The Next Platform. “I’m a software guy and I look at these hardware timelines and I scratch my head and think that’s got to be a really hard problem.”
As he reminded us, the Fujitsu project to develop the A64FX processor for the upcoming “Fugaku” supercomputer for the RIKEN lab in Japan was a ten-year effort. Of course, some of that extended timeline has to do with the fact that it was Fujitsu’s first Arm server design and one that includes the Scalable Vector Extension (SVE) instruction set, which had never been implemented by anyone before. The project was not just for that one system. Fujitsu will also use the A64FX to power its newly announced FX1000 and FX700 HPC systems, while Cray will offer the processor as an option in its “Storm” CS500 cluster line.
The ThunderX2 processor, originally available from Cavium and now owned by Marvell, is a more generic Arm design and is currently the one offered by most OEMs selling Arm-based HPC systems. (The ThunderX2 is really the “Vulcan” chip that Cavium picked up from Broadcom when that rival chip maker exited the Arm server processor field ahead of its own planned launch.) The ThunderX2 powers Sandia’s Astra supercomputer, Bristol’s Isambard system, and a handful of smaller HPC clusters in the United Kingdom, the rest of Europe, and the United States. But it wasn’t available until May 2018, three years after the original ThunderX predecessor debuted.
The slow cadence of these initial Arm development efforts is understandable, inasmuch as they represent first- or second-generation designs. But if these chips are to compete with their x86 competition, the Arm vendors will have to pick up the pace. Both Intel and AMD produce new generations of server silicon on a two-year (or less) cadence, which puts the current Arm server chipmakers at a disadvantage.
As Gorda points out, Arm Holdings, the entity owned by SoftBank, doesn’t produce chips. It provides the IP for doing so. More broadly, Arm drives an ecosystem for its namesake architecture that involves myriad software and hardware vendors.
One example of how that has paid off is the decision by Nvidia to bring Arm into CUDA’s inner sanctum so that its GPUs can work with yet another host processor. As a consequence, the Arm ISA will be supported throughout Nvidia’s software stack. As we reported back in June when it was announced, this has the potential to provide tighter integration between Arm processors and Nvidia IP, such as NVLink or InfiniBand from Mellanox Technologies (soon to be part of Nvidia), which could result in the same type of heterogeneous server designs we see today in supercomputers pairing IBM Power CPUs with Nvidia GPUs. At SC19, Nvidia announced that most of the Arm support in its CUDA libraries for HPC and AI is now available in beta form, setting the stage for broader adoption of the CPU-GPU combo.
Arm also can help speed up chip development and reduce costs for its chipmaking partners more directly by offering standard IP building blocks, such as a Compute Express Link (CXL) connector, to drop into an otherwise custom implementation. That approach is even more appealing for platforms built from chiplets, where the IP blocks can be assembled with other custom, semi-custom, and standard chiplets as part of an Arm-based package. This model is being used by the European Processor Initiative (EPI) to develop its home-grown HPC processor.
Outside of HPC proper, there are a number of Arm server platforms that are making their way into the broader market, including the Graviton processor developed by AWS for cloud duty, the Kunpeng 920 processor from Huawei Technology’s chip arm HiSilicon, and Ampere’s Skylark design inherited from Applied Micro.
For both HPC and non-HPC designs, the larger challenge for Arm is to align its own Neoverse roadmap for server IP with that of its chip and foundry partners. If all goes as planned, Arm will be releasing a new Neoverse platform every year based on progressively advanced process nodes, starting with “Ares” in 2019, at 7 nanometers, followed by “Zeus” in 2020, using a refined 7 nanometer process, and “Poseidon” in 2021, at 5 nanometers. Each iteration is designed to deliver roughly a 30 percent performance boost over its predecessor, along with new features.
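It is worth noting that a per-generation uplift like that compounds. A minimal back-of-the-envelope sketch (ours, not Arm's, and assuming the stated ~30 percent target is hit each year) shows what the roadmap would imply relative to Ares:

```python
# Hypothetical illustration: compounding Arm's stated ~30 percent
# per-generation Neoverse uplift across the announced roadmap.
UPLIFT = 1.30  # ~30 percent per generation, per Arm's public target

perf = 1.0  # normalize "Ares" (2019) to 1.0x
for name in ["Zeus (2020)", "Poseidon (2021)"]:
    perf *= UPLIFT
    print(f"{name}: {perf:.2f}x relative to Ares")
# Zeus lands at 1.30x and Poseidon at 1.69x the Ares baseline,
# roughly what Intel and AMD deliver over a comparable span.
```

Whether the silicon partners can actually realize those gains on schedule is, of course, the open question the rest of this piece circles around.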
Exactly how Arm brings its partners along and which chipmakers have the capacity to keep pace with the Neoverse roadmap remains to be seen. But Gorda is optimistic that things will start to coalesce in coming years. “I think what you’ll see over time is that there is not only an increase in the frequency that will better match the process node and our own roadmap schedule,” Gorda says, “but also in the breadth of offerings that are available to the market.”