Tuning Up ARM To Do The HPC Math

Timothy Prickett Morgan

8 years ago

For ARM processors to take off in the HPC arena, a whole bunch of pieces have to come together to create a platform that can compete against more established architectures. While many have obsessed – and correctly so – over the availability of production-grade 64-bit chips and Linux operating systems, to a certain extent the availability of compilers and their companion math libraries is just as important in the rarified air of HPC.

ARM Holdings, the commercial entity behind the ARM RISC instruction set and licensable processor components, is very eager for ARM chips from its various partners to take off in HPC. In fact, HPC is one of the two target areas where ARM Holdings believes that its eponymous architecture has a chance to take off in the datacenter and build some momentum, with the other area being hyperscale datacenter operators and their service provider peers.

To help accelerate the adoption of ARM chips for HPC workloads, ARM Holdings has been working with Numerical Algorithms Group for the past two years to port the latter company’s Fortran compiler and related Numerical Library to the 64-bit ARMv8-A architecture. But now ARM Holdings is taking it even one step further and is licensing NAG’s Numerical Library and its related software test suite tools so it can distribute those to customers in both open source and commercially supported variants.

Computational mathematics for high end servers and HPC are very important to ARM, and linear algebra routines are important for computational mathematics,” explains Darren Cepulis, datacenter architect and server business development manager at ARM Holdings explains to The Next Platform. “We endeavor to create a set of core math libraries that people can build higher level routines off of. These libraries will be optimized not just for our 64-bit implementations, but also those of our partners. As you know, different architectures and different memory subsystems can impact the performance of different math routines, and so it is important to have a set of libraries that are tuned for the hardware that you are running on.”

To that end, ARM is taking the BLAS, FFT, and LAPACK linear algebra and matrix math routines developed by NAG, which Cepulis says are the most widely used math routines in use on X86 platforms in the HPC space today, and tuning them up for ARM. These three libraries are not the full numerical library from NAG, but it is the key part that will get ARM started for optimized HPC application execution. It is not clear when and if ARM Holdings will license the full Numerical Library set from NAG, but Cepulis is clear that ARM software engineers will be doing further tuning of these three key HPC routines to squeeze more performance out of the ARMv8-A architecture. This optimization work is not a one-off thing, mind you. The architectures of the chips are changing at a steady pace, and there are going to be more implementations of the ARMv8 architecture coming to market this year and next, so the testing and tuning of the math routines will get broader and deeper. Cepulis says that the optimization work will be ongoing for the next couple of years, given the number of implementations that are coming down the pike and the number of compilers with which the math libraries need to integrate.

The math libraries that ARM has licensed will currently work on anything that supports the 64-bit AArch64 architecture, but they have been tuned to work better with ARM’s own Cortex-A57 cores and any chip that makes use of them and the ThunderX processors from Cavium Networks. ARM will be tuning the math libraries up to work with its Cortex-A72 cores next, and presumably Applied Micro’s X-Gene processors, which are also being aimed at HPC workloads, will be next. Others like Broadcom and Qualcomm, which are working on their own beefy ARM server chips, will no doubt join the party, as could others such as Phytium, Marvell, and AMD.

At the moment, the key compilers for ARM server chips are the open source GNU Compiler Collection (GCC) and the Low Level Virtual Machine (LLVM) compiler framework as well as the EKOPath compiler suite from PathScale, which previewed this suite at last year’s SC14 supercomputing conference. The preview of the EKOPath suite included C99, C++ 2003, C++11, Fortran 90/95, and partial support for Fortran 2003 and 2008, and also included BLAS libraries that PathScale ported over from the X86 architecture. These PathScale compilers support OpenMP 4.0 and OpenACC parallel programming extensions, and also support Power CPUs and Nvidia and AMD GPUs in addition to ARM and X86 CPUs. NAG’s Fortran compiler is also supported on ARM chips, and so does Python, which is increasingly used in HPC environments thanks to its own math and scientific algorithm libraries.

“We do an awful lot of GCC and LLVM work,” says Cepulis. “We have been tuning those for the past year and a half. We have increased performance on AArch64 by 15 percent in that time, so it is important to get the latest compilers to get the best performance. If you are on GCC 4.9, then you are 15 percent behind what is out there in the latest trunk.

ARM is not licensing or selling the either the PathScale or NAG compilers, by the way. If you want to use them, you will have to license them from either of those companies. But by licensing and open sourcing the NAG libraries it is making them available to the GCC and LLVM communities.

It is a little tough to find out all of the different components of the HPC stack that are working on or being moved over to the ARM architecture, and to that end ARM has set up an HPC ecosystem section on its web site (which now runs on ARM chips, finally) that brings it all together. (You can find out more about the ARM Performance Libraries based on the licensed NAG routines at this link.) Open source variations of linear algebra libraries are, of course, available out there on the Internet – OpenBLAS, BLIS, Atlas, and others come to mind – and many have been ported to ARM, and you can get them through the new ARM HPC site, too, or through various compiler and library projects.

For hybrid computing mixing ARM processors and Tesla GPU accelerators, Nvidia does all of the work to get its CUDA parallel programming environment to work in hybrid fashion across the compute engines, and at the moment the X-Gene processors from Applied Micro and the ThunderX processors from Cavium both can offload work to Tesla GPUs using CUDA.

As part of its commercial offering, ARM has identified 35 different open source packages and libraries that HPC shops care about and is porting the ones that have not been moved yet from X86 architectures over to AArch64 and tuning those that have been, including profiling tools and MPI libraries. ARM is wrapping up these packages with the three math libraries that it has licensed from NAG and selling it as a supported product (compiled into binaries) with an annual subscription for around $2,000 per programmer seat. There are no runtime licenses or royalties that have to be paid for applications that make use of these packages and libraries, and Cepulis says that a cluster license is available for national labs and academic clusters where lots of people might be coding.