Strong-Armed Into HPC, Like It Or Not

If you are an HPC center in Europe, and particularly one that is publicly funded, you are thinking about Arm-based CPUs in your supercomputers. And that is despite Arm Holdings being a British company and all of the issues with the United Kingdom and its Brexit separation from the European Union.

Arm is still the closest thing to a European architecture that companies can deploy, and it is a licensable architecture – even if it is not an open one in the strictest sense – and that stands in stark contrast to the X86 architecture that has dominated HPC compute for three decades now.

This is particularly true given the A64FX processor designed by Fujitsu, with its fat SVE vector engines, which is used in the “Fugaku” supercomputer at RIKEN Lab in Japan, and given the intent by Arm Holdings to add substantial vector processing performance in its upcoming “Zeus” V1 core, which has already been added to the 64-core Graviton3 (code-name unknown) processor from Amazon Web Services.

But interestingly, the first use of the Arm architecture in stock HPC systems might be as a babysitter to accelerators, and that ironically means that Ampere Computing’s 80-core “Quicksilver” Altra CPUs and 128-core “Mystique” Altra Max CPUs could start seeing some action. That is particularly so given the high throughput, deterministic performance, and low price Ampere Computing is charging for these CPUs relative to X86 alternatives, as evidenced by the 40 percent to 45 percent better bang for the buck that Microsoft and Google are both delivering on Altra instances compared to Intel “Ice Lake” Xeon SP and AMD “Milan” Epyc 7003 instances. Every euro or pound not spent on the CPU in a hybrid CPU-GPU system is a euro or pound that can be spent on accelerators, memory, network, or storage.

And that is why E4 Computer Engineering, based outside of Milan in Italy and one of the scrappy supercomputer suppliers in Europe playing to its niches and often up against Atos, Hewlett Packard Enterprise, and Lenovo, is bringing Ampere Computing’s Altra and Altra Max CPUs to its systems.

As you well know, Ampere Computing has been very clear that it is designing processors expressly for hyperscalers and cloud builders, who want better security isolation between cores (and therefore instance types) and processors that have all their cores running at the same speed all the time so the performance is more predictable than with machines that set their own speeds based on workload. We have said all along that Ampere Computing’s path may lead it outside of its target hyperscaler and cloud builder customers, particularly given the success of the Graviton family at AWS, and that for many workloads, cheap cores with enough math and good throughput are what the HPC center will need in a CPU where the accelerator does most of the calculating work.

Eventually, we think, Ampere Computing will want a piece of the HPC and AI pie directly and will bring vector engines into some of its future processors so they can be used in all-CPU clusters running HPC and some AI workloads. Ampere Computing has its Altra and Altra Max CPUs in Alibaba, Baidu, Tencent, Microsoft, and Google and will not be able to sell into AWS but can probably make its way into Facebook and Apple. The point is, to expand its total addressable market, Ampere Computing is going to have to go where the market is leading it.

“At this moment, we see three driving forces for Arm in HPC and AI,” Fabrizio Magugliani, head of strategic planning and business development for E4 Computer Engineering, tells The Next Platform. “The first one is the European Processor Initiative, which has selected the Arm ISA for the ‘Rhea’ general purpose processor. E4 is a member of the European Processor Initiative, and we will integrate the Rhea CPU into systems. The second degree of freedom is the fact that for most of the scientific workloads today, the processor is basically the driver of the GPUs and both Rhea and Altra support Nvidia’s CUDA offload. And third, with AI applications, again the workload is driven mostly by GPUs, and an Arm CPU is a very good solution because it shows a good TDP while driving the same performance as the top-level Intel Xeon processors. So more and more HPC users will endorse the Arm ecosystem because it has a comparable level of performance as top level X86 CPUs and an overall lower total cost of ownership.”

Magugliani adds that E4 already has a couple of customers who have deployed Ampere Computing Altra Max CPUs into their systems, but he cannot name names because of confidentiality agreements with these early adopters.

To help foster more widespread deployment of Altra and Altra Max CPUs in hybrid CPU-GPU systems, E4 has worked with Ampere Computing and Nvidia to put together what it calls the Nvidia Arm HPC Developer Kit, which puts an Altra CPU and an A100 GPU accelerator on a system node and bundles the Nvidia HPC SDK toolkit on top of it so customers can load and go with testing HPC workloads on accelerated systems. And, incidentally, Magugliani says that E4 has some other customers who are marrying the Altra and Altra Max processors with Xilinx FPGAs from AMD, too, so the adoption of Arm CPUs in hybrid systems is not restricted to GPUs, whether they come from Nvidia, AMD, or Intel. The EPI’s own STX accelerator, which we have written about here and which turbocharges the math used for stencil tensor operations commonly used in the oil and gas industry, could also be well paired to an Ampere Computing Arm processor.

All of this work with Ampere Computing is a good hedge against any further delays in bringing the Rhea CPU and its “Cronos” follow-on from SiPearl, under the auspices of the EPI effort, to market.

Hubert says:

July 28, 2022 at 10:05 pm

The EPI should, I think, collaborate scientifically with RIKEN (like Fujitsu-Siemens in the SPARC days) to improve the A64FX for higher performance and lower power consumption; for example by porting the stencil-based STX (reportedly 3x faster and 5x more power efficient than a GPU, per your report) to the A64FX.

- Eric Olson says:
  
  July 30, 2022 at 7:32 am
  
  I agree that pushing the A64FX design forward is a good idea.
  
  In my opinion the heterogeneous compute environment that comes from mixing GPU accelerators with CPUs makes efficient use and programming such an engineering challenge that only the biggest projects benefit. At the same time there may be more science in projects with large scale computing requirements but smaller teams of software engineers.
  
  While I suspect security and administration are also more difficult for systems using GPU accelerators, some algorithms simply require a tighter coupling between the things CPUs are good at and the things GPUs are good at that can’t be achieved with a unified-memory coherent-cache architecture. To solve such problems it really helps for everything to be further combined into a unified instruction set.
  
Hubert says:

August 5, 2022 at 3:46 am

Apologies for the late reply … I completely agree with you that this integrated approach has great advantages in terms of code development and deployment. It took me a while to understand better why the A64FX is not the performance leader that I expected … although it is very close, from the better perspective provided by HPCG. The dense matrices of HPL (regular top500) are one thing, but the sparse ones of HPCG (2nd list in top500) are a better fit to the numerical solution of PDEs (fluids, heat, contaminant transport) by finite difference and finite elements. Dense matrices are favored by accelerators while sparse ones require more address-generation gymnastics (or stencils) from the CPU. Top500 doesn’t give power consumption for HPCG so assuming that the machines use the same amount of power for both tests gives 1.5 MJ/PetaFlop for EPYC, 1.8 MJ/PF for A64FX, 2.5 MJ/PF for Xeon, and 3.5 MJ/PF for Power9. In HPCG then, the A64FX performs quite close to the highly-tuned EPYC (and better than the other archs). Also, if EPYC does indeed run at 1.5 MJ/PF in HPCG, then Frontier’s score would be 14 PF/s in HPCG, which would make it #2 to Fugaku (maybe that is why its HPCG score has not been reported?)!
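
The arithmetic behind the energy-per-petaflop figures in the comment above can be sketched roughly as follows. Dividing system power in megawatts by HPCG throughput in petaflops per second gives megajoules per petaflop, and dividing power by an assumed MJ/PF figure gives an implied HPCG score. The power and HPCG inputs here are approximate values assumed for illustration; they are not stated in the article or the comment. The function names are hypothetical, and only the 1.5 MJ/PF figure comes from the comment itself.

```python
def energy_per_pflop_mj(power_mw: float, hpcg_pflops: float) -> float:
    """Energy cost in MJ per petaflop: MW / (PF/s) = (MJ/s) / (PF/s) = MJ/PF."""
    return power_mw / hpcg_pflops


def implied_hpcg_pflops(power_mw: float, mj_per_pflop: float) -> float:
    """HPCG rate in PF/s implied by a power draw and an energy cost per petaflop."""
    return power_mw / mj_per_pflop


# ASSUMED inputs, approximate and for illustration only (not from the article).
FUGAKU_POWER_MW = 29.9      # assumed system power for Fugaku
FUGAKU_HPCG_PFLOPS = 16.0   # assumed HPCG result for Fugaku
FRONTIER_POWER_MW = 21.1    # assumed system power for Frontier

# Taken from the comment: EPYC-based systems at about 1.5 MJ/PF on HPCG.
EPYC_MJ_PER_PF = 1.5

fugaku_mj_per_pf = energy_per_pflop_mj(FUGAKU_POWER_MW, FUGAKU_HPCG_PFLOPS)
frontier_hpcg = implied_hpcg_pflops(FRONTIER_POWER_MW, EPYC_MJ_PER_PF)

print(f"Fugaku (A64FX): {fugaku_mj_per_pf:.2f} MJ/PF")
print(f"Frontier implied HPCG: {frontier_hpcg:.1f} PF/s")
```

Under these assumed inputs, the A64FX figure lands near the 1.8 MJ/PF quoted in the comment and the implied Frontier score lands near 14 PF/s. Like the comment, the sketch assumes a machine draws the same power on HPCG as on HPL, which may not hold in practice.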
