Hope Springs Eternal For Arm Servers

IT organizations are funny creatures, indeed. On the one paw, they are eternally optimistic about the prospects for new technologies, and on the other paw, they are extremely resistant to change because of the economic and technical risks that change entails.

For more than a decade now, the people who control infrastructure inside of datacenters and the companies that create server processors and/or platforms based on them have been contemplating the possibility of Arm servers in the datacenter. And Arm Holdings, the company that invented the architecture that took over our smartphones, our tablets, and a slew of embedded controllers that knocked out the PowerPC architecture (itself the follow-on to the Motorola 68K architecture popularized in workstations, some PCs, and embedded controllers three decades ago), put some pretty tall stakes in the ground several years ago, boasting in 2015 that the Arm architecture could, through its several licensees and then the OEMs and ODMs that build machines for users, get 20 percent of server shipment share by 2020. A year later, Arm Holdings upped that target to 25 percent of shipment share by 2020.

Well, that clearly didn’t happen. But, it is safe to say, despite all of the churning that has happened in the Arm server space in the past decade, that 2020 was the best year for Arm server chip and system sales yet and that the prospects for an increasing share for Arm chips in servers are still good. But not for the reasons many of us expected, and not through the channels that seemed so obvious when Calxeda truly got the Arm server chip ball rolling way back in November 2011.

There has been a lot of water under the bridge since those days. A year later, Samsung was rumored to be graduating from its client CPUs to Arm server chips, but that effort never launched. Applied Micro launched its X-Gene Arm server chip, with all kinds of tweaks and not just a licensed core from Arm Holdings, in August 2012. Applied Micro got through two generations, stumbled on the third, and its X-Gene design and its people fight on as the foundation for Ampere Computing and its Altra chips. Cavium Networks went lateral from its network processors to server processors in 2014 with the ThunderX designs and gained some traction with its follow-on ThunderX2 chips, though those latter chips were not really its own design, but rather the “Vulcan” designs from networking chip rival Broadcom, which bowed out of the market in December 2016 – just like Qualcomm did with its “Amberwing” Centriq Arm server chips in May 2018 after barely getting started. And after saying that the “Triton” ThunderX3 was going to be a custom part, Marvell, which bought Cavium a few years back and inherited this Arm server chip business as part of that deal, decided to heck with it all and shut down the ThunderX3 effort entirely as 2020 came to a close.

AMD was going to build Arm and X86 chips alike – remember the “Seattle” Opteron A1100 Arm server chips for microservers and the “Project SkyBridge” plan to have X86 and fatter K12 Arm server chips occupy the same sockets? There are rumors going around that AMD might revive its K12 effort, but we think if AMD does that, it will be a special case (and likely not for servers) and it will be because some of the big clouds or client device makers have asked for it. With AMD readying its “Milan” third generation Epyc server chips for launch this quarter – the company did a preview of these 64-core processors, based on the Zen3 cores, at the CES 2021 virtual event this week – we don’t think AMD is going to do anything to muddy the server waters as it gets into position to finally take big chunks of market share from Intel’s Xeon SP server CPUs, which are being adversely affected by Intel’s 10 nanometer and 7 nanometer manufacturing process issues.

All of the change that is happening in the Arm server space – and trust us, it is hard to be patient here – is happening against the backdrop of Nvidia trying to acquire Arm Holdings for $40 billion, and facing some tough questions from antitrust regulators in the United Kingdom and in China. Nvidia is committed to the idea that Arm-based serial computing should be a big part of the datacenter, as we discussed at length with Nvidia co-founder and chief executive officer Jensen Huang back in October. Huang thinks that datacenters will need many kinds of CPUs going forward, and that the Arm licensees – and possibly Nvidia itself – can deliver this variety by working together. Nvidia absolutely is not buying Arm Holdings to become the dominant or only Arm server chip supplier, and Huang is very clear about this. But that does not rule out Nvidia making its own CPU for special use cases. Imagine, for instance, an Nvidia Arm server chip with native NVLink ports to allow for shared memory across CPUs and GPUs and, possibly, even DPUs that also have NVLink ports on them. (In fact, if we were designing an Nvidia DPU, we would be heavy on the NVLink ports and relatively light on the Arm CPU cores and try to get off the PCI-Express bus at the next stop.)

At this point, as we begin a new year – one where a recession is not just possible but likely, and where companies will therefore be looking to cut IT costs in any way they can without sacrificing performance or function – it is probably a good idea to review the state of the Arm server chip collective and how that relates to the compute battlefield that is still utterly dominated by X86 iron.

Right now, it is safe to say that Amazon Web Services, with its Graviton2 Arm CPUs, which launched in December 2019 and which it uses in its own DPUs and now as CPUs inside of servers, is both the dominant maker (in the sense that it designs the chips, has them fabbed, and commissions ODMs to build machines) and the dominant user (in the sense that it uses them internally in its datacenters and is now selling raw Graviton and Graviton2 capacity to customers on its EC2 compute service) of Arm servers in the world. We do not know what Amazon’s long-term goal is – raw infrastructure will be based on the X86 architecture for a long time as customers need it, but many of the services that AWS sells could run on Arm chips and no one would be any the wiser. But Andy Jassy, chief executive officer at the cloud division of Amazon, said at the re:Invent 2020 event in December that its Graviton2 instances offered 40 percent better price/performance than its Intel Xeon SP and AMD Epyc instances, and we all know that Google has said that it would change compute architectures just to get 20 percent better bang for the buck.
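To put that 40 percent figure and Google’s 20 percent threshold on the same footing, here is a minimal sketch of the arithmetic – our own back-of-the-envelope interpretation, not an AWS or Google calculation:

```python
# Our own back-of-the-envelope arithmetic, not an AWS or Google figure.
# "40 percent better price/performance" means 1.4X the work per dollar,
# so the same amount of work costs roughly 1/1.4 of what it did before.
x86_cost_per_unit_of_work = 1.0                          # normalized baseline
graviton2_cost_per_unit_of_work = x86_cost_per_unit_of_work / 1.4

savings = 1.0 - graviton2_cost_per_unit_of_work
print(f"Cost for the same work: {graviton2_cost_per_unit_of_work:.2f}X")  # ~0.71X
print(f"Savings versus the baseline: {savings:.0%}")                      # ~29%
```

On that reading, Graviton2 comfortably clears the bang-for-the-buck threshold that Google has said would justify a change in compute architectures.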

The rise of Graviton and then Graviton2 – and what we presume will be Graviton3 at the end of this year and Graviton4 at the end of next year – is one of the reasons why Marvell canned the Triton ThunderX3 chip at the end of last year, and the other is the rumor that Microsoft is working on its own Arm chips for clients and servers. And this is an important lesson: chip makers who bank on a hyperscaler or cloud builder to be their big customer may be pulled up short. Microsoft has used Qualcomm Snapdragon Arm chips in its Surface Pro laptops (which run Windows) and has a Windows Server variant that it could run on the Qualcomm Centriq server chips as well as the ThunderX2 and ThunderX3 from Marvell. Microsoft said three years ago that it wanted half of its datacenter capacity to be on Arm eventually, and either it got tired of trying to bolster these chip makers or they gave up and left it holding the bag. Either way, that probably won’t happen again, and if Microsoft is considering designing its own Arm server chips, then it is reasonable to think it may do the same for clients and game consoles and maybe even DPUs and other devices that it can sell to customers. (If we had to guess, Microsoft decided to go its own way for much the same reason that AWS did, and that put the kibosh on both Qualcomm and Marvell in Arm server chips.)

That brings us to Apple, which doesn’t have the server volumes of AWS, Google, or Microsoft but is still a hyperscaler in its own right. Apple is a case study in how to use chip design and emulation to make architectural transitions easier and cheaper on both itself and its customers. When Apple shifted its PCs off the PowerPC architecture (which it helped create with IBM and Motorola in 1991 as a follow-on to the Motorola 68K and IBM RISC processors) to Intel X86 chips in 2005, this was done with the help of the QuickTransit CPU emulator, which was launched in 2004 by a company called Transitive and which was acquired by IBM in 2008 because it was completely dangerous to Big Blue’s proprietary and RISC/Unix systems businesses. The good news was that Apple had a license to QuickTransit, which it tweaked to make its Rosetta translation environment so that code compiled for PowerPC chips would run on X86 chips without modification. And it worked, and worked brilliantly. This time around, as it moves its PC customers from X86 chips to its own eight-core Arm M1 processors for its Macs, the Rosetta2 emulation environment is giving customers more oomph at fewer watts – and at lower cost and higher profit to Apple, presumably. And no money at all to Intel.

There is nothing preventing Apple from making an M1 processor for edge devices or making a beefier S1 processor, perhaps with 2X or 4X or 8X as many cores, slapping Rosetta2 on them, and running them inside of Open Compute-style servers in its own datacenters. Apple has the QuickTransit license, which is key. Others do not, but IBM could create a lot of mischief if it decided to make it available on the cheap to those porting X86 code to other architectures – including its own Power architecture.

Apple could become a server vendor once again – Steve Jobs shut the business down a decade ago – and probably do something very interesting at that because it is a big consumer of servers now, not just a slick device manufacturer.

If AWS is the volume Arm server chip supplier today, Fujitsu has a claim to stake here, too, with the A64FX Arm processor, which has fat vector engines and an integrated Tofu D 6D mesh/torus interconnect. The “Fugaku” supercomputer at the RIKEN lab in Japan has 158,976 of these A64FX processors across more than 400 racks, and had a total budget of $910 million. Let’s say the servers were half of that, and the CPUs were half of the cost of the nodes; that would be around $230 million in CPU revenues for Fujitsu – a lot of money for Arm server chips, indeed. But AMD is probably going to be at a $500 million to $600 million sales rate in Q4 2020 for Epyc processors, and Intel is probably at least 10X that amount. The vast majority of A64FX processors that Fujitsu will ever sell have already been sold, but there could be a couple of tens of thousands more chips sold to HPC customers around the globe. (That’s a shame, and it would be wonderful if it were not true. We would love to see the A64FX and its Tofu D interconnect more widely deployed because it is an elegant, efficient architecture for all-CPU workloads.)
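For what it is worth, the napkin math checks out. Here is a minimal sketch of that estimate, with the 50 percent splits being our assumptions rather than disclosed figures:

```python
# Back-of-the-envelope estimate of Fujitsu's A64FX take from Fugaku.
# The 50 percent splits are assumptions, not disclosed figures.
total_budget = 910e6      # total Fugaku budget, in US dollars
server_share = 0.50       # assume the servers are half of the total budget
cpu_share_of_node = 0.50  # assume the CPUs are half of the cost of each node

cpu_revenue = total_budget * server_share * cpu_share_of_node
print(f"Estimated A64FX revenue: ${cpu_revenue / 1e6:.0f} million")  # ~$228 million
```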

Of the credible vendors that have a chance at reasonably high volume sales, the remaining one is Ampere Computing, which has aggressive plans, as we discussed last summer. The “Quicksilver” Altra and Altra Max processors come in 80-core and 128-core variants using 7 nanometer processes from Taiwan Semiconductor Manufacturing Co (as all Arm chip makers do), the “Mystique” follow-ons based on 5 nanometer etching are due in 2021, and the “Siryn” kickers to these, presumably based on a refined 5 nanometer process but possibly on 3 nanometer technology, are coming in 2022. Ampere is well funded by private equity firm The Carlyle Group and has a bunch of ex-Intel people who want to take a bite out of the datacenter. (We will be talking to Ampere Computing later this week to get an update on what is going on. Stay tuned.)

That leaves the wild cards. This includes Nuvia, an upstart Arm server chip maker that we profiled back in February 2020 and that we talked to on Next Platform TV back in July 2020. Nuvia has not given out a lot of details about its plans, but it is quite possible that Google, Facebook, and other hyperscalers and cloud builders are looking to the Nuvia team, which is led by people from Apple (most importantly, Gerard Williams, who led the Arm CPU designs for the iPad and iPhone), from Google (for its commercial products), and from AMD (for its GPUs). The impression we get from Nuvia is that it will be creating a stripped-down CPU that only does what hyperscalers need, much as Innovium has done with switch ASICs, to boost performance per clock and aggregate performance across a large number of cores. If Nuvia succeeds, then Google, Facebook, and maybe even Apple do not have to create their own Arm processors; they can buy them from Ampere Computing and Nuvia.

There is, of course, the SiPearl effort in the European Union to create a line of homegrown Arm server chips aimed at the HPC market, which we covered in detail here back in April. It makes perfect sense that Europe would want its own Arm server chip, and it remains to be seen when this will be accomplished and how it will be adopted. Europe would probably like an advanced fab, too, to compete with TSMC and Samsung – and perhaps a resurgent Intel, should that resurgence come to pass – in chip manufacturing.

That leaves the two chip makers in China that still seem to matter. The first is Phytium Technology, which unveiled its “Earth” and “Mars” Arm server chips in August 2016. The Mars processor was etched in TSMC’s 28 nanometer process and had 64 Armv8 cores across eight panels with a total of sixteen DDR3 memory controllers. Mars was a beast for its time, and in 2019 the Mars II shrank the design to 16 nanometers to make the chip a lot less costly and to crank the clocks a bit. The word on the street is that Phytium will shrink to TSMC 5 nanometer processes this year and boost the core count to 128. There is very little chance any organization outside of China will ever see these chips.

Ditto for the Arm server chips created by the HiSilicon division of Huawei Technology. HiSilicon jumped into the Arm server fray back in 2016 with the Hi1616, and took it up a notch in January 2019 with the Kunpeng 920, which has 64 cores running at 2.6 GHz, eight memory channels, and 640 Gb/sec of aggregate PCI-Express bandwidth. The Kunpeng 930 is reportedly due this year, with more cores and simultaneous multithreading on those cores as well as support for the Scalable Vector Extension (SVE) math units that Arm Holdings created in conjunction with Fujitsu for the A64FX processors used in the Fugaku supercomputer. These Kunpeng 930 chips will be made using TSMC’s 5 nanometer process. The Kunpeng 950, due in 2023, will presumably use a 3 nanometer process (and possibly have more cores), but it could end up on an advanced 5 nanometer process (and with maybe only a modest increase in core count), depending on what TSMC can put into the field.

Oh, and there is chatter that Intel may get back into Arm chips – which would be funny indeed. Intel inherited the StrongARM chip business from Digital Equipment Corp and sold it off to Marvell, and it will have been strongarmed back into the market by outside forces if this chatter turns out to be true.


