The Other Way To Bring Arm CPUs To Servers

There are at least two – and possibly more – paths to making Arm processors competitive in the datacenter with the incumbent X86 processors from Intel and, now, AMD.

The first path, and the one taken by most of the Arm collective to date, is to create a better CPU based on Arm cores and adjacent technologies that results, in the end, in a server that looks and smells and tastes more or less like the X86 server that has been common in the datacenter for the past two decades – right down to the management controllers and peripherals. Down this path, the differentiation is on aggregate throughput, price/performance, and an aggressive cadence of future processor designs that Intel has not been able to deliver with its Xeons in recent years but that AMD has managed pretty well for its first two generations of Epyc processors.

The other, and certainly less traveled, path to bring Arm servers into the datacenter is to take low-powered Arm CPUs and architect a different kind of system that doesn’t require the beefy X86 processors that are standard in the datacenter today, but can still handle a lot of distributed computing workloads with a lower cost and better efficiency. This is an inherently riskier path, and one that reanimates the wimpy versus brawny core debates of the past decade, as well as a healthy dose of skepticism regarding microservers versus servers now that we think on it. But after building some experimental Arm servers that test out these ideas, Bamboo Systems is raising its first war chest from private equity (as opposed to academic and government funding) and is going to try to put the idea of distributed systems based on low-powered Arm processors to the test in the real market, not the one of ideas.

Bamboo Systems is not a new company so much as a more focused and funded one. The company was formerly called Kaleao, which we talked about way back in August 2016 when John Goodacre, a professor of computer architectures at the University of Manchester and also formerly the director of technology and systems at Arm Holdings, pivoted his microserver-based cluster designs, then known as the EuroServer project, from hyperscaler workloads to include HPC workloads.

At the time, more than three years ago, Goodacre fervently believed that many of the key technologies developed to parallelize supercomputing applications – including the Message Passing Interface (MPI) protocol for sharing work across a cluster and the Partitioned Global Address Space (PGAS) memory addressing scheme – would have to be integrated into the programming model of a future exascale system, no matter what workloads it runs and no matter if it is at an HPC center or a hyperscaler. There is just no other way to bring millions of threads to bear at the same time.
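To make that concrete, here is a minimal sketch of the MPI style of work sharing Goodacre is talking about, written against the mpi4py Python bindings purely for illustration – this is not code from the EuroServer project or from Bamboo Systems. Each rank chews on its own slice of a problem and the partial results are reduced back to rank 0, and the pattern is the same whether the ranks are a handful of brawny Xeon sockets or thousands of wimpy Arm nodes:

    # Illustrative only: a trivial MPI job using the mpi4py bindings.
    # Summing squares is a stand-in for real work; the point is the pattern of
    # splitting work across ranks and reducing the partial results.
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()   # this process's ID within the job
    size = comm.Get_size()   # total number of ranks launched across the cluster

    N = 1_000_000
    # Each rank sums a strided slice of the range 0..N-1.
    local_sum = sum(i * i for i in range(rank, N, size))

    # Combine the partial sums on rank 0.
    total = comm.reduce(local_sum, op=MPI.SUM, root=0)

    if rank == 0:
        print(f"{size} ranks computed a total of {total}")

Launched with something like mpirun -np 64 python sum_squares.py, the same script scales from one fat node to a rack of skinny ones; PGAS takes the complementary approach, exposing all of that distributed memory as a single partitioned address space rather than passing messages explicitly.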

Goodacre and his team started the EuroServer project way back in 2014, and many of the ideas from that platform, as well as from some other projects, were stitched together to create a commercial product from Kaleao called KMAX. Now, in the wake of raising $4.5 million in pre-Series A funding – isn’t that called angel funding? – Kaleao is renaming itself Bamboo Systems and taking a very long view toward becoming a system vendor that will be at the right place at the right time when Moore’s Law finally does run out of gas in the next decade.

The first KMAX systems shipped in 2017 under the radar, and the company uncloaked those designs back in April 2014, which we covered in detail here. The KMAX clusters were based on the relatively modest Exynos 7420 processor developed by Samsung. This chip was created by Samsung for its smartphones, and it includes a four-core Cortex-A57 complex from Arm running at 2.1 GHz paired with a less brawny four-core Cortex-A53 complex running at 1.5 GHz. The Cortex-A53 cores are used for system and management functions, and only the Cortex-A57 cores are used for compute. The Exynos 7420 chips are etched using 14 nanometer processes and are made by Samsung itself; they support low power LPDDR4 main memory and also have a Mali-T760 MP8 GPU embedded in the complex. You can do a fair amount of interesting work with them.

The KMAX compute node has four of these Exynos 7420 processors, and the architecture is what Goodacre calls “fully converged” in that the node has compute, storage, and networking all bundled on it and, importantly, with FPGAs – specifically the Zynq FPGAs from Xilinx – supporting the PGAS and MPI memory schemes across nodes using the embedded networking as well as offloading certain network functions from the CPU complex. Each blade has four of these KMAX nodes on it, for an aggregate of 128 cores, 64 GB of memory, and 2 TB of embedded flash per blade, delivering 80 GB/sec of I/O bandwidth and handling somewhere on the order of 10 million I/O operations per second; up to a dozen blades fit into a 3U chassis. An additional 32 TB of NVM-Express flash storage can be attached to each blade. Here’s the neat thing about the KMAX design: A standard 42U rack holds 14 of these 3U KMAX enclosures, for a total of 10,752 worker cores (and an equal number of smaller utility cores), 10.5 TB of main memory (1 GB per worker core), 344 TB of local flash, 5.2 PB of NVM-Express flash with about 50 GB/sec of aggregate bandwidth, and a total of 13.4 Tb/sec of aggregate Ethernet bandwidth across the tiered network that is embedded in the system boards.
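For those who like to check the arithmetic, here is a quick back-of-the-envelope roll-up of those figures in Python; the per-node, per-blade, and per-rack inputs come straight from the specs above, and the small gap against the quoted 344 TB of local flash presumably comes down to rounding in the original spec sheets:

    # Back-of-the-envelope roll-up of a KMAX rack from the per-blade figures above.
    SOCS_PER_NODE = 4            # Exynos 7420 chips per KMAX node
    NODES_PER_BLADE = 4
    BLADES_PER_CHASSIS = 12      # per 3U enclosure
    CHASSIS_PER_RACK = 14        # per 42U rack

    WORKER_CORES_PER_SOC = 4     # Cortex-A57 compute cores (the A53s are utility cores)
    MEM_GB_PER_BLADE = 64
    FLASH_TB_PER_BLADE = 2
    NVME_TB_PER_BLADE = 32

    blades = BLADES_PER_CHASSIS * CHASSIS_PER_RACK
    socs = blades * NODES_PER_BLADE * SOCS_PER_NODE

    print("worker cores per rack:", socs * WORKER_CORES_PER_SOC)                  # 10,752
    print("main memory per rack (TB):", blades * MEM_GB_PER_BLADE / 1024)         # ~10.5
    print("local flash per rack (TB):", blades * FLASH_TB_PER_BLADE)              # 336
    print("NVM-Express flash per rack (PB):", blades * NVME_TB_PER_BLADE / 1024)  # ~5.3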

Using the high density KMAX-HD variant (which is a little deeper than a standard rack), a single KMAX chassis can do the hyperscale work (think caching, web serving, and such) of two dozen Dell PowerEdge servers (admittedly ones using slightly vintage Xeon E5 processors) at about one quarter the power, one third the cost, and one eighth the space. Presumably the next generation of Bamboo Systems machines, due this year, will meet or exceed those fractional multiples.
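Put some rough numbers on those ratios and the pitch gets clearer. The two dozen servers and the one quarter, one third, and one eighth figures are Bamboo’s; the roughly 450 watt draw and $7,000 street price we assume here for a two-socket Xeon E5 pizza box are our own placeholders, so treat the absolute outputs as indicative only:

    # Indicative only: translate the claimed consolidation ratios into absolute terms.
    BASELINE_SERVERS = 24         # Dell PowerEdge machines replaced, per the claim above
    WATTS_PER_SERVER = 450        # assumed loaded draw for a two-socket Xeon E5 node
    PRICE_PER_SERVER = 7_000      # assumed street price in dollars
    RACK_UNITS_PER_SERVER = 1     # assumed 1U pizza boxes

    kw = BASELINE_SERVERS * WATTS_PER_SERVER / 1000
    cost = BASELINE_SERVERS * PRICE_PER_SERVER
    space = BASELINE_SERVERS * RACK_UNITS_PER_SERVER

    print(f"X86 baseline:       {kw:.1f} kW, ${cost:,}, {space}U")
    print(f"KMAX-HD equivalent: {kw / 4:.1f} kW, ${cost / 3:,.0f}, {space / 8:.0f}U")

Note that one eighth of two dozen 1U machines works out to 3U – the same height as a KMAX enclosure – if the baseline boxes really are 1U pizza boxes.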

According to Goodacre, datacenters consume 3.5 percent of the world’s energy today, and the amount of energy consumed is expected to grow by 3X to 5X over the next five to ten years. Yes, there are some very large error bars on those predictions. The point is, that is a lot of energy and, importantly, datacenters will overtake the airline industry as a producer of greenhouse gas emissions this year, and by 2023, datacenters will be emitting somewhere between 4X and 5X as much as the airline industry. That may not be a big deal in the United States or China, but energy efficiency has always been a bigger motivator for compute in Europe, and these numbers will resonate more strongly there. (This also explains, in part, why Arm took off as it did with embedded and handheld devices and why Goodacre did his pioneering server work where he did.) But hyperscalers and cloud builders all do the same math, and they certainly will be watching how successful Bamboo Systems is at peddling fully converged microserver clusters.

“The server business is an $80 billion-plus market, it’s huge,” Tony Craythorne, the new chief executive officer at Bamboo Systems, reminds The Next Platform. Craythorne was most recently in charge of worldwide sales at data management software maker Komprise and also ran parts of the business at Brocade Communications, Hitachi Data Systems, and Nexsan. “We all know that the Intel processor owns the majority of the server market. But in the past few years, some things have changed. Software design has moved from very efficient C and C++ code to far less efficient interpreted languages like Go and Python and a software stack dominated by containers and Kubernetes. At the same time, artificial intelligence workloads, and machine learning in particular, are putting extreme strain on the Intel architecture because it was not designed to run those applications. People are managing these workloads by throwing more and more compute at the problems, which is great for the Dells, the HPEs, and the Supermicros of the world, but not so good for the datacenters.”

We don’t know by how much, but if the numbers that Bamboo Systems is citing are right, datacenter energy consumption is growing faster than aggregate datacenter compute. As Goodacre and Craythorne see it, this is an opportunity. And more precisely, this is the opportunity.

But Bamboo Systems can’t just slap a new label on the KMAX prototype machines and be done with it. Later this year – the company is not saying when – the updated microservers will shift from the Samsung processors to an unspecified, off-the-shelf Arm processor that Goodacre says “is considerably faster,” and he hints that something with between 8 and 16 cores per operating system image is probably the sweet spot for balancing compute capacity, memory bandwidth, power consumption, and heat dissipation; he adds that something along the lines of the original 16-core Graviton processor created by Amazon Web Services – but not the new 64-core Graviton2 – is the goal. Goodacre won’t say what chip it is, but says that it is already available in the market today. The Tegra “Carmel” Arm chip from Nvidia (embedded in its “Xavier” Jetson AGX autonomous car platform) tops out at eight cores. The Marvell Armada chips top out at four cores, even the high-end Armada 8K and Armada XP versions. And the Qualcomm Snapdragon 865 has eight of the “Kryo” 585 cores on it. The odds favor the Qualcomm chip, but Nvidia is an outside possibility, particularly for workloads that need a certain amount of GPU oomph. There is no reason that blades could not come with either or both, depending on the compute needs. (This is not meant to be an exhaustive list, so forgive us if we have forgotten one.)

We have seen many interesting microserver-style processors and systems come and go over the years here at The Next Platform, and we ask the same question now that we have asked all along: Why is this going to work now when it did not in the past?

“I think the key is that you have to make the software look the same,” explains Goodacre. “People really only view a system as the software that it supplies them, so if it looks the same it doesn’t matter that it has a higher number of nodes under it with clever resource management software.”

Both Goodacre and Craythorne are realistic that it is going to take time for enterprises to test out the ideas in the Bamboo Systems architecture and find the right applications in their stacks to test and then roll into production. And so, the company will be focusing on machine learning and artificial intelligence, IoT and edge computing, smart storage, web infrastructure, content delivery, and data analytics applications and, equally importantly, on making it easy for customers to consume testbed machines so they can eventually ramp to proofs of concept and into production. Bamboo Systems is in for the long haul, and true to its namesake, it hopes to be able to take root and spread at a steady, organic pace. The fact that the company expects there to be a lot more margin in this system for resellers than is possible in the X86 server market won’t hurt, either. We all know who got the lion’s share of the X86 server margin for the past decade or more: Intel.

One last note: The third way to bring Arm processors to servers is the way that AWS has done it with its Nitro SmartNICs, which offload storage and networking functions from the server processor. And such SmartNICs can be used in conjunction with either the brawny or the wimpy Arm processors discussed above.

2 Comments

  1. The range they’re talking about smells a lot like the NXP LX2160A range, which is widely available and is being used for an Arm-based developer platform/desktop by SolidRun. Solid part, but I still don’t see the utility of such platforms as servers because there’s a baseline level of (single-core) performance necessary for useful applications, and those (A72-based) parts just don’t have it. Likewise, the Qualcomm part has 1 fast core, 3 mostly fast cores, and 4 horrifyingly slow cores, and that asymmetry is also suboptimal for servers.

    • Whilst there are certain operations in a datacenter that require a fast single core, that is not the majority of workloads. This is why we have virtualisation and containers today. All that compute capability is going to waste. Worse, that virtualisation itself causes an overhead and consumes power and CPU cycles. With more cores, we have less overhead due to virtualisation.

      However, it’s not just about the processor. Bamboo have re-imagined the server and removed unnecessary components that draw power. This means it is possible to get a compute density of around 5:1 compared to traditional servers while using about 20 percent of the power.
