Arm server development is a reality, and a growing one at that, not just from a performance point of view but also, perhaps more importantly, from an ecosystem point of view.
Be it the Marvell ThunderX2 processor or the Ampere eMAG Skylark processor, the hyperscale, cloud and enterprise ecosystems are willing to adopt these new processors to further improve their TCO or dollars/core.
The all-important ecosystem is catching up with Arm, which is key to the momentum needed to make Arm servers a sustainable reality. With AWS launching its own Arm instances, based on Graviton processors, there’s the much-needed push to make the software ecosystem more widely accepted in the industry. Not just that, AWS even announced bare-metal offerings for EC2 A1 instances.
Slowly but steadily, Arm has also made a mark for itself in high performance computing, something we expect to see in full force at this year’s Supercomputing Conference. Arm has the most traction in terms of deployments and software development in HPC in the United States, Europe and Japan with each region leading the way along different trajectories to deploy systems based on the Arm architecture for their supercomputers.
All of this has taken time and extended development, of course. The first wave of Arm-based servers arrived between 2010 and 2014 and was more experimental in nature than real production systems.
The first 64-bit Arm architecture, ARMv8-A, was introduced in 2011, and since then the Arm server ecosystem has seen lots of ups and downs. In November 2010, ZT Systems had launched a 1U data center Arm server based on 32-bit Cortex-A9 cores, which was supposed to be a more energy-efficient and denser solution than Intel servers. Then came Calxeda with its 32-bit Arm server offering, the EnergyCore ECX-1000, which did not see adoption; Calxeda eventually went defunct in 2013. In 2011 AppliedMicro launched the X-Gene 1 processor, followed by X-Gene 2 in 2014. Samsung, Cavium (now Marvell) and AMD came up with their own Arm processors, which tried to penetrate the server market but could not generate tangible interest among end users.
Arm servers have since undergone a transformation in terms of development, and early signs of this were seen in a semi-secret project within Broadcom that was taking shape as ‘Project Vulcan’. The idea was to develop a serious, world-class 64-bit Arm server to take on Intel in the HPC and cloud markets.
In late 2016, when Avago gave up on Broadcom’s ambitions to develop a first-class Arm server, Cavium jumped in, brought the Vulcan IP and team on board and fully funded the Vulcan project, which was rechristened ‘Cavium ThunderX2’ and is now ‘Marvell ThunderX2’. In more ways than one, the ThunderX2 is a serious contender to Intel and AMD in the HPC, hyperscale and cloud businesses.
To make things better for the Arm ecosystem, in 2017 a brand new company, Ampere Computing, bought the X-Gene assets and reintroduced the X-Gene processor as the Ampere eMAG processor. It should be mentioned that Qualcomm tried its hand at building a true data center Arm server, Centriq, based on the Falkor architecture, and given Qualcomm’s standing, it could, with time, have made its data center server project a success. However, for reasons unknown to many, it chose to significantly scale back its investment, and many personnel from Qualcomm’s Centriq project were hired by Ampere Computing in Raleigh. Huawei has a very compelling Arm server offering in the Kunpeng 920, a 7 nm, 64-core CPU.
Figure 1: Diverse Arm architectures (source)
The question many have is whether the Arm server ecosystem is mature enough to be excited about.
The ecosystem has come a long way to become a stable one. However, it has many miles to go to reach the same level as x86. Given this momentum, it would not be surprising if the likes of Google, Facebook and Tencent are actively experimenting with Arm platforms. Amazon and Microsoft have already invested in Arm platforms in their respective clouds, AWS and Azure.
Figure 2: Commits to Linux GitHub repository for x86 vs. arm64 as of 13th November, 2019
The contributions towards enabling aarch64 in the Linux operating system have steadily increased since 2012, while the growth rate for x86 has not been as consistent. These are good indications that the Arm ecosystem is growing and here to stay.
An ongoing debate among software engineers is whether to implement business logic in a monolithic architecture or to break the same logic down into multiple pieces. There is a growing trend of organizations moving to a ‘microservices’ architecture for various reasons, be it unit testing, ease of deployment or server performance, among many others. Also, microservices-based architectures are relatively easy to scale compared to a monolith. Linaro, Arm and Arm server manufacturers are leading this charge. Also, Packet is providing the developer community with a platform to develop and sustain the ecosystem.
If there’s one area where Arm servers have taken the biggest strides, it is definitely High Performance Computing (HPC). The Arm ecosystem for HPC is also more developed than Arm’s ecosystem for cloud datacenters.
The momentum for Arm in HPC was driven by many centers, notably by Dr. Simon McIntosh-Smith, the University of Bristol and Cray, who hosted the 1st Isambard Hackathon in Bristol in November 2017 to optimize HPC applications for ThunderX2-based servers. This was promptly followed by a 2nd Isambard Hackathon in March 2018.
Most HPC applications compile and run ‘out of the box’ on Arm-based servers, with support from the Arm compilers, GCC, Open MPI and OpenMP.
I participated in both hackathons, representing Cavium Inc. and assisting developers, architects and engineers in optimizing their codes and applications for ThunderX2 processors. Collectively, we optimized key HPC applications such as NAMD, UM-NEMO, OpenFOAM, NWChem and CASTEP, and compared the results against Intel CPU architectures such as Broadwell and Skylake. Prof. McIntosh-Smith and his team did a detailed study identifying the opportunities and benefits of Arm servers relative to the incumbent Intel servers, and found compelling performance per dollar for the Arm-based servers.
Figure 3: Cray-Isambard performance comparison on mini-apps
Figure 4: Cray-Isambard performance comparison on key Archer applications
Figure 5: Cavium Inc. published HPC Performance comparison vs. Intel Skylake CPUs (2017)
This was the significant movement that Arm servers needed in the HPC space. The two Isambard hackathons also fast-tracked Arm HPC development, with Arm optimizing its compilers as well as its math libraries in collaboration with Arm server manufacturers like Cavium Inc. (now Marvell). There is tremendous movement in the optimization of the Arm HPC performance libraries. Arm has invested in optimizing GEMM, SpMM, SpMV and FFT routines, as well as support for SVE, in collaboration with developers and silicon manufacturers like Marvell. Arm Allinea Studio has successfully established itself as a ‘go-to’ tool for Arm server workload analysis, similar to what VTune is for Intel.
Another major milestone was the Vanguard Astra Arm-based supercomputer at Sandia National Laboratories, powered by DoE, Cavium and HPE. It is the first Arm-based supercomputer to make the TOP500 list; it ranked 156th in June 2019 and 198th in November 2019. The building blocks are HPE Apollo 70 platforms and Marvell ThunderX2 CPUs with a 4x EDR InfiniBand interconnect. The Astra supercomputer is made up of 2592 compute servers, i.e. 145k cores and 663 TB of memory. The US DoE is making a concerted effort to invest in diverse as well as future-proof technologies such as Arm on its path towards achieving exascale computing.
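To put the Astra figures in per-server terms, here is a minimal sketch that simply divides the totals quoted above by the number of compute servers. The helper name `per_node_share` is made up for illustration, "145k" is treated as the rounded value 145,000, and 1 TB is taken as 1000 GB, so the results are approximate.

```python
# Totals as quoted in the text; the core count is a rounded figure.
NODES = 2592
TOTAL_CORES = 145_000
TOTAL_MEMORY_TB = 663


def per_node_share(total: float, nodes: int) -> float:
    """Return the average share of a system-wide total per compute server."""
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return total / nodes


def astra_summary() -> dict:
    """Average cores and memory (in GB, assuming 1 TB = 1000 GB) per server."""
    return {
        "cores_per_node": per_node_share(TOTAL_CORES, NODES),
        "memory_gb_per_node": per_node_share(TOTAL_MEMORY_TB * 1000, NODES),
    }


if __name__ == "__main__":
    summary = astra_summary()
    print(f"cores per server:  {summary['cores_per_node']:.1f}")
    print(f"memory per server: {summary['memory_gb_per_node']:.1f} GB")
```

With these rounded inputs, the averages come out at roughly 56 cores and about 256 GB of memory per compute server.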
Figure 6: Astra, the Arm based supercomputer debuted on the TOP500 list in November 2018
Europe and Asia are taking huge strides in deploying Arm-based clusters and systems for HPC and research. Be it the Monte-Carlo, Isambard or CINECA-E4 projects in Europe or Japan’s Arm-based Fugaku supercomputer, it’s just the beginning of a new era of Arm in HPC. Cray is betting big on the A64FX Arm chip built by Fujitsu. The A64FX prototype is number one on the Green500 list and 160th on the TOP500 list.
HPC workloads tend to be highly parallelizable in nature, and Arm CPUs provide an opportunity to leverage lots of cores at reasonable price points. Further, having competition in the CPU market helps all buyers, not just HPC shops, negotiate the best resources for their workloads.
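Why the degree of parallelism matters so much for many-core CPUs can be illustrated with Amdahl's law, which bounds the speedup of a workload by its serial fraction. The sketch below is illustrative only: the parallel fractions and core counts are hypothetical values, not measurements from any system mentioned here.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Upper bound on speedup from Amdahl's law.

    speedup = 1 / ((1 - p) + p / n), where p is the fraction of the
    runtime that can be parallelized and n is the number of cores.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Hypothetical workloads: half parallel, mostly parallel, almost fully parallel.
    for p in (0.50, 0.95, 0.99):
        for n in (16, 64):
            print(f"p={p:.2f} cores={n:3d} speedup<={amdahl_speedup(p, n):6.2f}")
```

A workload that is only half parallel can never run more than twice as fast however many cores it gets, whereas one that is 99% parallel keeps benefiting as the core count grows, which is why high core counts suit typical HPC codes.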
Marvell is a pioneer in more ways than one in introducing the Arm server ecosystem to the hyperscale world, with Marvell and Microsoft partnering on ThunderX2 platforms for Azure. Oracle has invested $40 million in Ampere Computing, which is home to the ARMv8 eMAG processor. Oracle also plans to massively expand its datacenter footprint in the coming months, and this investment in Ampere could mean potential deployment of eMAG processors in Oracle data centers.
In the recent past, there’s been a slew of announcements regarding enhancements to the Arm ecosystem. VMware announced 64-bit Arm support. DDN officially announced professional support for Lustre on Arm servers in 2018. AMI announced firmware support for Marvell ThunderX2 Arm-based servers in March 2019.
NVIDIA announced CUDA support for Arm at ISC19 and backed it up with a major announcement: a reference design that enables organizations to build GPU-accelerated Arm-based servers. This is a big shift towards enabling Arm to be successful in the HPC and accelerated computing segment. Imagine a system with power-efficient Arm-based CPUs, GPUs for training and AI ASICs for inference. Machine learning and artificial intelligence pose interesting opportunities, and the collaboration with NVIDIA will enable this segment for Arm-based solutions.
Like Intel, AMD and Arm, Ampere Computing too has created a developer program to build and expand its cloud ecosystem. This will enable further and faster integration of Arm servers in the hyperscale and datacenter world in a much more open and collaborative way.
While the ecosystem still needs more time to grow and mature, it is steadily moving towards that nirvana of ‘It just works’. With the emergence of Arm in the computer architecture world, along with RISC-V and many semiconductor start-ups, it’s only a matter of time until aarch64 is the new normal, like x86. That is what the whole community is striving towards.
Once developers are convinced that their software stack ‘just works’ on Arm servers, it will be a big win for the Arm server ecosystem, and I for one am willing to make the bold claim that for many workloads, especially HPC, ‘It just works’.
About the Author
Indraneil Gokhale is a Performance Engineer and leads the Hardware Engineering team at Box Inc. Indraneil has previously worked at Cavium (now Marvell), Uber and Intel. Indraneil has experience in optimizing HPC applications and workloads for x86 and aarch64 architectures. He has published white papers and book chapters on optimizing the Weather Research and Forecasting (WRF) application. Indraneil holds a Master’s degree in Electrical Engineering from Auburn University, USA and a Bachelor’s degree in EEE from Jawaharlal Nehru Technological University, Hyderabad, India.