Tuning Up ARM To Do The HPC Math

November 23, 2015 Timothy Prickett Morgan Code, HPC 5

For ARM processors to take off in the HPC arena, a whole bunch of pieces have to come together to create a platform that can compete against more established architectures. While many have obsessed – and correctly so – over the availability of production-grade 64-bit chips and Linux operating systems, to a certain extent the availability of compilers and their companion math libraries is just as important in the rarified air of HPC.

ARM Holdings, the commercial entity behind the ARM RISC instruction set and licensable processor components, is very eager for ARM chips from its various partners to take off in HPC. In fact, HPC is one of the two target areas where ARM Holdings believes that its eponymous architecture has a chance to take off in the datacenter and build some momentum, with the other area being hyperscale datacenter operators and their service provider peers.

To help accelerate the adoption of ARM chips for HPC workloads, ARM Holdings has been working with Numerical Algorithms Group for the past two years to port the latter company’s Fortran compiler and related Numerical Library to the 64-bit ARMv8-A architecture. But now ARM Holdings is taking it even one step further and is licensing NAG’s Numerical Library and its related software test suite tools so it can distribute those to customers in both open source and commercially supported variants.

“Computational mathematics for high end servers and HPC are very important to ARM, and linear algebra routines are important for computational mathematics,” Darren Cepulis, datacenter architect and server business development manager at ARM Holdings, explains to The Next Platform. “We endeavor to create a set of core math libraries that people can build higher level routines off of. These libraries will be optimized not just for our 64-bit implementations, but also those of our partners. As you know, different architectures and different memory subsystems can impact the performance of different math routines, and so it is important to have a set of libraries that are tuned for the hardware that you are running on.”
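To make the point about memory subsystems concrete, here is a minimal sketch of loop blocking (tiling) for a matrix multiply, written in plain Python for readability. It is not ARM's or NAG's code; the function names and the `block` parameter are hypothetical, chosen only to show the kind of knob a tuned library would set differently for each chip so that the tiles in flight fit in that chip's caches.

```python
def matmul_naive(a, b):
    """Reference triple loop: C[i][j] = sum over k of A[i][k] * B[k][j]."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_blocked(a, b, block=4):
    """Same arithmetic as matmul_naive, visited tile by tile.

    `block` (hypothetical name) is the tuning knob: the result does not
    depend on it, only the order in which memory is touched.
    """
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i0 in range(0, n, block):            # tile of rows of A and C
        for k0 in range(0, m, block):        # tile of the shared dimension
            for j0 in range(0, p, block):    # tile of columns of B and C
                for i in range(i0, min(i0 + block, n)):
                    row_c = c[i]
                    for k in range(k0, min(k0 + block, m)):
                        aik = a[i][k]
                        row_b = b[k]
                        for j in range(j0, min(j0 + block, p)):
                            row_c[j] += aik * row_b[j]
    return c


# Small non-square inputs so that tiles do not divide the sizes evenly.
a = [[float(i + j) for j in range(6)] for i in range(5)]
b = [[float(i * j + 1) for j in range(7)] for i in range(6)]
reference = matmul_naive(a, b)

# Every tile size yields the same product as the reference loop.
same = all(matmul_blocked(a, b, block=s) == reference for s in (1, 2, 4, 8))
```

In an interpreted language the tile size has little effect on speed. In a compiled kernel, the best value depends on cache sizes and memory bandwidth, which is why one binary rarely performs equally well on every implementation of an architecture.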

To that end, ARM is taking the BLAS, FFT, and LAPACK linear algebra and matrix math routines developed by NAG, which Cepulis says are the most widely used math routines on X86 platforms in the HPC space today, and tuning them up for ARM. These three libraries are not the full numerical library from NAG, but they are the key part that will get ARM started for optimized HPC application execution. It is not clear whether or when ARM Holdings will license the full Numerical Library set from NAG, but Cepulis is clear that ARM software engineers will be doing further tuning of these three key HPC routines to squeeze more performance out of the ARMv8-A architecture. This optimization work is not a one-off thing, mind you. The architectures of the chips are changing at a steady pace, and there are going to be more implementations of the ARMv8 architecture coming to market this year and next, so the testing and tuning of the math routines will get broader and deeper. Cepulis says that the optimization work will be ongoing for the next couple of years, given the number of implementations that are coming down the pike and the number of compilers with which the math libraries need to integrate.
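For readers who have not called these libraries directly, the sketch below shows in plain Python the kind of operation each of the three families provides: a BLAS-style vector update, a fast Fourier transform, and a LAPACK-style dense linear solve. These are textbook reference versions written for illustration, with hypothetical function names. They are not the NAG routines or their interfaces, and production libraries are far more heavily optimized.

```python
import cmath


def axpy(alpha, x, y):
    """BLAS level 1 style update: return alpha * x + y for two vectors."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def fft(x):
    """Recursive radix-2 FFT. Assumes len(x) is a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        # Twiddle factor applied to the odd-indexed half.
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def solve(a, b):
    """LAPACK-style dense solve of A x = b.

    Gaussian elimination with partial pivoting on an augmented matrix.
    Assumes A is square and non-singular (a singular A would divide by zero).
    """
    n = len(a)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        # Pick the row with the largest entry in this column as the pivot.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


updated = axpy(2.0, [1.0, 2.0], [3.0, 4.0])
spectrum = fft([1.0, 1.0, 1.0, 1.0])
solution = solve([[2.0, 1.0], [1.0, 3.0]], [3.0, 5.0])
```

Higher level packages are typically built on calls like these, so the speed of a handful of kernels on a given chip carries through to the applications on top.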

The math libraries that ARM has licensed will currently work on anything that supports the 64-bit AArch64 architecture, but they have been tuned to work better with ARM’s own Cortex-A57 cores (and any chip that makes use of them) as well as the ThunderX processors from Cavium Networks. ARM will be tuning the math libraries up to work with its Cortex-A72 cores next, and presumably Applied Micro’s X-Gene processors, which are also being aimed at HPC workloads, will follow. Others like Broadcom and Qualcomm, which are working on their own beefy ARM server chips, will no doubt join the party, as could others such as Phytium, Marvell, and AMD.

At the moment, the key compilers for ARM server chips are the open source GNU Compiler Collection (GCC) and the Low Level Virtual Machine (LLVM) compiler framework as well as the EKOPath compiler suite from PathScale, which previewed this suite at last year’s SC14 supercomputing conference. The preview of the EKOPath suite included C99, C++ 2003, C++11, Fortran 90/95, and partial support for Fortran 2003 and 2008, and also included BLAS libraries that PathScale ported over from the X86 architecture. These PathScale compilers support OpenMP 4.0 and OpenACC parallel programming extensions, and also support Power CPUs and Nvidia and AMD GPUs in addition to ARM and X86 CPUs. NAG’s Fortran compiler is also supported on ARM chips, and so is Python, which is increasingly used in HPC environments thanks to its own math and scientific algorithm libraries.

“We do an awful lot of GCC and LLVM work,” says Cepulis. “We have been tuning those for the past year and a half. We have increased performance on AArch64 by 15 percent in that time, so it is important to get the latest compilers to get the best performance. If you are on GCC 4.9, then you are 15 percent behind what is out there in the latest trunk.”

ARM is not licensing or selling either the PathScale or NAG compilers, by the way. If you want to use them, you will have to license them from either of those companies. But by licensing and open sourcing the NAG libraries, it is making them available to the GCC and LLVM communities.

It is a little tough to find out all of the different components of the HPC stack that are working on or being moved over to the ARM architecture, and to that end ARM has set up an HPC ecosystem section on its web site (which now runs on ARM chips, finally) that brings it all together. (You can find out more about the ARM Performance Libraries based on the licensed NAG routines at this link.) Open source variations of linear algebra libraries are, of course, available out there on the Internet – OpenBLAS, BLIS, Atlas, and others come to mind – and many have been ported to ARM, and you can get them through the new ARM HPC site, too, or through various compiler and library projects.

For hybrid computing mixing ARM processors and Tesla GPU accelerators, Nvidia does all of the work to get its CUDA parallel programming environment to work in hybrid fashion across the compute engines, and at the moment the X-Gene processors from Applied Micro and the ThunderX processors from Cavium both can offload work to Tesla GPUs using CUDA.

As part of its commercial offering, ARM has identified 35 different open source packages and libraries that HPC shops care about and is porting the ones that have not been moved yet from X86 architectures over to AArch64 and tuning those that have been, including profiling tools and MPI libraries. ARM is wrapping up these packages with the three math libraries that it has licensed from NAG and selling the bundle as a supported product (compiled into binaries) with an annual subscription for around $2,000 per programmer seat. There are no runtime licenses or royalties that have to be paid for applications that make use of these packages and libraries, and Cepulis says that a cluster license is available for national labs and academic clusters where lots of people might be coding.

BuildItFat says:

November 23, 2015 at 6:04 pm

It’s great to see Arm Holdings working on their ARMv8A ISA software stack, but it’s not going to be ARM Holdings’ Reference design(A53, A57, A72) cores that dominate in the ARM based server room! It’s going to be the Custom Micro-Architectures that are engineered to run the ARMv8A ISA that will comprise most of the ARM server market.

Let’s not forget that Apple Beat ARM Holdings to the 64 bit ARMv8A ISA market with a Custom designed Apple A7 Cyclone Micro-Architecture that was engineered to run the ARMv8A ISA, and the Cyclone was twice as wide order superscalar as any of ARM holdings reference designs. So the Apple A7 was able to keep up 6 instructions per clock relative to the ARM holdings reference design core that could only make 3 instructions per clock. The Apple Cyclone had more in common with Intel’s Haswell in execution resources than the Cyclone had with any of Arm Holdings’ reference design cores.

We may see with AMD’s custom K12 Micro-Architecture an ARMv8A ISA running core with SMT capabilities. And a custom wide order superscalar design K12 that can beat even Apple’s Custom designs in the IPC metric, while having AMDs Greenland graphics to accelerate computations HSA style for server workloads. AMD is not the only one besides Apple with a top tier architectural license from ARM Holdings, and is only licensing the ARMv8A ISA from ARM Holdings. There are others doing up their own server custom ARMv8A running Micro-Architectures that will allow the ARMv8A ISA to make greater inroads into the server market.

One need only make reference to the Power8/Power RISC processor core designs and see that the ARMv8A RISC ISA with a custom micro-architecture could very well be bumped up a few notches with a custom wider order superscalar front end like the Power8, and plenty of execution pipes on the execution side, and be able to offer the same levels of performance. Add to that SMT for better processor thread utilization and the ARMv8A ISA running on a custom micro-architecture could very well make the server grade.

It will be interesting just to see what Jim Keller’s design team did with the K12 custom ARMv8A ISA running micro-architecture especially if they use the same CPU core execution design tenets for the K12 that they used for the Zen x86 based core, including bringing SMT abilities to a custom ARM core designed to run the ARMv8A ISA. And as Apple’s P.A. Semiconductor folks proved with the Cyclone and A8/A9 cores Jim Keller is not the only one with the ability to fatten up a custom CPU core’s design. The ARMv8A ISA with enough investment in a fat custom designed micro-architecture core has every bit as much potential as a Power8, or other, RISC ISA based design to make inroads into the server market.

The software stack built up around the ARM ISA/ARMv8A ISA is what has contributed more to ARM Holdings’ success in the marketplace than even the ARM reference designs of the last few years, because it’s that software stack that made it easy for Apple/others to adopt/create the various ARM reference/custom designed cores that run the various ARM 32/64 bit ISAs and save Billions(US) in software development costs over the years that the ARM based market has developed. AMD will also have its Boltzmann Initiative to ease the pain of CUDA migration to other APIs on AMD’s GCN GPU accelerators, be they x86 based, or ARM based, server APUs or discrete GPU accelerated server products.

- OranjeeGeneral says:
  
  November 24, 2015 at 7:37 am
  
  Well I don’t know if you beef up an ARM design to be like a Power design what will you end up with? Just another Power clone with the same issues (power) and zero benefits? Doesn’t sound to me like a wise strategy.
  
  - SupplyChainsAndBalls says:
    
    November 24, 2015 at 6:14 pm
    
    For you anything that is not dependent on x86 appears unwise, even with the tablet/phone industry dominated by ARM/RISC processors. So an ARM as a power clone that beats even the OpenPower price point is bad for you, when we all see what prices an x86 server market dominated by a single supplier will bring to the overpricing of server kit! Hopefully AMD will be able to make inroads into the x86, and custom ARM server SKU market and bring back even better price to performance ratios. It’s not as much about single ISAs, even though RISC designs(IOT, phone up to supercomputers) appear to be able to span a wider market than CISC, it’s about the supply chain of server SKUs not dominated by a single interest.
    
    IBM sure knew that on the hardware side at the dawn of the PC market, forcing the cross licensing of the x86 16/32 bit ISA on Intel to AMD/others! It’s just that IBM did not do so on the software side, and abandoned the PC market for higher margins. So a power equivalent micro-architecture engineered to run a fat superscalar design that can run the ARMv8A ISA on a custom core would be all around good for the server marketplace!
    
    The very reason the PC/Laptop/computing market is so lopsided and unhealthy on the x86 side is that there are too few licensees for the x86 ISA, not so for Power, ARM, or MIPS. I say let the licensed IP business model take the computing world, and no single parts supplier get any CPU/SOC market between the devil and the deep blue sea ever again. Single parts suppliers in such control over any market’s supply chain is an anathema to progress and innovation.
    
    - OranjeeGeneral says:
      
      November 26, 2015 at 5:06 am
      
      Well, you’re making big assumptions. Yes, it would be interesting to see ARM on HPC, but not as just another Power clone. Nothing is gained via that besides maybe on the price point, as you mentioned. Power architecture isn’t as great as many people believe if you look deeper under the hood. What we need is something that scales really well but does not break the power envelope. IBM’s Power isn’t that. ARM might be that; who knows, it is still to be seen. x86 has evolved, and the Xeon Phi direction Intel is going in, alongside the memory architecture, looks most promising right now.
      
      Besides, you better check your history on the IBM PC. They actually tried to close down everything and were not happy about clones of their architecture appearing in the early days (I was there and remember pretty well). They tried to challenge that at least twice in history. Ever heard of the MicroChannel fiasco? IBM is as untrustworthy as Intel is when it comes to open architectures.
      
      So far ARM has brought only watered-down architectures to the masses, moving more local computation to cloud computation since their local processors are simply too weak.
      
      - AnIndustryBegsToDiffer says:
        
        November 26, 2015 at 2:13 pm
        
        You are avoiding the Custom ARM micro-architectures that are just engineered to run the ARMv8A ISA, and expecting that most will not know the difference between the ARM Holdings reference designs, and the custom micro-architectures that are engineered to run the ARMv8A ISA. Your stanch defenestrations of anything not Intel and x86 belies logic. Those non ARM Holdings CPUs/SOCs can and are being beefed up to allow the custom ARMv8A ISA based micro-architectures to approach Intel’s “i” series in execution resources(Apple A7) on the way to becoming even more powerful AMD(K12) and other ARM custom ARMv8A ISA running server micro-architectures that are engineered to run the ubiquitous ARMv8A 32/64 bit ISA! It is simply a matter of time until someone with the funding takes the ARMv8A ISA and engineers a fat Power8 style version of a very wide order superscalar custom micro-architecture to take that ARMv8A ISA deeper into Intel’s and IBM’s territory in the server room.
        
        AMD for one will have the advantage in 2 years time of having both x86, and ARMv8A custom micro-architectures that share the same execution resources and design tenets. It will just be that AMD’s Zen micro-architecture will be engineered to run the x86 32/64 bit ISA, and AMD’s custom K12 will be engineered to run the ARMv8A ISA. And AMD will have APU-on-an-interposer versions in development, and coming to market, that will pair both x86 and custom ARMv8A running micro-architectures with Arctic Islands GPU Graphics/HSA accelerators! So expect that new Arctic Islands GPU micro-architecture to have even better GPU asynchronous compute processing thread management with the Arctic Islands’ brand new GPU ISA/micro-architecture! So compute workloads that once required a CPU will now be even easier to offload to AMD’s newest GPU iteration. That Arctic Islands GCN/Newest and those ACE units will have even more CPU like asynchronous compute ability!
        
        The entire industry, that has embraced the ARM holdings Licensed Business model, will take the ARMv8A ISA and move it up into the Power8’s performance metric range, and the Power8 ISA is just another RISC ISA design, same as ARM and MIPS. I’ll also expect that Imagination Technologies(IT) will not be standing still with their MIPS RISC ISA designs, They already offer SMT capable MIPS core options, along with PowerVR HSA aware GPUs that even offer process virtualization hardware on their GPUs. IT will be very close to having a GPU design that could be taken by a licensee and made into a discrete SKU, but more than likely the SOC/APU market will see laptops mostly forgo discrete GPUs in favor of SOCs, or AMD’s APUs and even more powerful APUs on an interposer! Those AMD server SKUs on an Interposer will usher in a new and viable way of fabricating multiple separate CPU, GPU, other processing dies and placing them on an interposer based module, and will allow for much more efficient producing of individual CPU/GPU/Other dies to be fabricated on processes that best suit the individual processing capabilities of the specific processor’s usage model.
        
        This will allow for modular and more efficient construction of tailored processing power, small to large and mobile to supercomputer. Expect that on the GPU side AMD will be able to engineer modular GPU dies that are fabricated in smaller die sizes that will be able to be added to the interposer in increasing die amounts to scale from small to large and mobile to HPC/workstation by simply adding more CPU core modules, and more GPU/other core unit modules, with the interposer’s silicon substrate hosting the very wide coherent connection fabrics that will make these separate processor dies perform computing tasks as if they were all on a single monolithic die. That Interposer based technology and HBM will be too attractive to pass on for the entire computing industry, Intel included.
        
        AMD has really positioned itself for its future, and its return to the HPC/Server market, and expect that Uncle Sam’s continued funding of exascale research will again see AMD improving on its exascale offerings, with any technology developed from the grants finding its way down into AMD’s other product lines, commercial to consumer. AMD will be a little late to the ARM based industry’s custom party, but the current Seattle ARM A57 Reference based server cores will be supplanted by AMD’s custom K12 cores shortly, while Zen will be there for the x86 market. AMD will be using Seattle to get the software stack ducks in order for its custom K12 offerings, and AMD’s GPUs will be making inroads into the HPC/Server market assisted by AMD’s Boltzmann Initiative with its CUDA to other languages migration tools.
        
        The HSA foundation, and HSA, is beginning to come into its own with the software stack beginning to catch up with the hardware, on not just AMD’s version of HSA aware GPUs and APUs. One need only look at what AMD’s Mantle graphics/HSA API has morphed into on the Vulkan, and DX12 side of the API equation that has implications across the entire SOC/APU and graphics/GPGPU industry.
        
        Grant wise, that government investment in exascale research will do for computing what the space program did for the aerospace industry, with the technology improvements spreading across the entire marketplace. AMD, Intel, IBM and others are on the receiving end of some increased funding that hopefully will foster much more healthy competition. Intel’s time as the supplier to most of the industry for CPU/SOC needs is coming to an end as the other ISAs become more widespread, due to the Licensed IP business model introduced to computing by ARM Holdings and practiced by others including just recently by IBM itself.
        
        What about Intel’s TSX and other troubles? You appear to have an agenda that presupposes infallibility on your choice while others can do no good! That Phi from Intel is only based on ATOM with added AVX, and the Phi’s DP abilities will be eclipsed even more so by GPUs at 14nm/16nm than it was by GPUs at 28nm!
