Applied Micro Finds ARM Server Footing, Reaches Higher
October 4, 2016 Timothy Prickett Morgan
One of the frustrating facts about peddling any new technology is that the early adopters that discover a strategic advantage in that technology want to keep that secret all to themselves. Word of mouth and real-world use cases are big factors in the adoption of any new technology, and anything that hampers this actually causes the adoption to move slower than it otherwise might.
But eventually, despite all of the secrecy, there comes a time when the critical mass is reached and adoption proceeds apace. We have been waiting for that moment for a long time now for 64-bit ARM processors, and while many have written this sentence before, next year could be the year of the ARM server based on the ramping of existing ARM chips in servers and storage happening this year.
Networking chip maker Applied Micro started down the road of ARM server chip development back in 2009, when the global economy was suffering from the Great Recession and IT spending was taking a dive. Had the ARM collective caught the server bug a few years earlier and delivered a 64-bit server chip at that time that was capable of supporting server virtualization, IT history might have been radically different. But ARM was only starting to move down the server path and could not take advantage of the strong impetus to change that a recession brings, and as it turns out, Intel’s “Nehalem” Xeon platform and VMware’s ESXi hypervisor were the big beneficiaries of the Great Recession.
Should another recession hit in 2017 or 2018, then it could be the ARM collective that benefits this time around, particularly if the ARM server chips and the software running atop them mature as expected. Hopefully, it won’t take another global recession to bring some diversity to compute for serving and storage.
From the looks of things at Applied Micro, the uptake of its X-Gene processors for server and storage workloads is gathering some steam, and with next year’s “Skylark” X-Gene 3 chip, we could see broader adoption of the X-Gene family. We told you about the X-Gene 3 chips back in November last year, when they were unveiled, and Applied Micro reached out to The Next Platform to give us an update on the Skylark chip and its X-Tend NUMA clustering as well as to talk about some design wins it has had with the current X-Gene 1 and X-Gene 2 chips.
“It takes a long time for the market to adopt a new technology,” concedes Kumar Sankaran, who is associate vice president of software and platform engineering at Applied Micro. “To get a value proposition and TCO savings over X86 platforms that companies use today, it takes a good amount of time for that to happen. In 2016, we are seeing things starting to ramp.”
This includes a design win at Hewlett Packard Enterprise, which is putting an X-Gene 1 processor inside of its StoreVirtual 3200 disk arrays, which are kicker devices to its existing Xeon-based StoreVirtual 4330 appliances. These appliances are not particularly useful for large scale deployments, but anything that cranks up the volume on X-Gene chips makes the X-Gene family more credible as a processor and makes Applied Micro the money it needs to continue to invest in the product. Applied Micro is also showing off its “Mudan” scale-out object storage platform, which it developed in conjunction with Red Hat, and another system that was designed to host the Redis key/value store and Memcached caching software and that is actually in production at a large (unnamed) enterprise. Finally, Applied Micro has developed its own “SmartNIC” that can be used as a host bus adapter with integrated caching for storage, or as a network adapter with embedded encryption and decryption or possibly other kinds of functions running on an X-Gene processor. This SmartNIC device is not a reference platform that someone else is putting together, but a product that Applied Micro is making and selling itself.
The StoreVirtual win could be a significant boost to Applied Micro’s X-Gene shipments – how much is hard to say. The important thing is that it is a concrete example of a major IT supplier putting a 64-bit ARM server chip into a product that is aimed at millions of customers. The StoreVirtual 4330 appliance has two nodes in a 2U form factor, with each node using a Xeon E5-2600 processor and delivering about 60,000 I/O operations per second (IOPS) across mirrored 7.2 TB of disk storage. With dual controllers, the StoreVirtual 4330 costs $33,000. The newer StoreVirtual 3200 is a two-node machine as well, but the machines are not clustered for redundancy but are run in active-active mode delivering a full 14.4 TB of capacity in a 2U form factor. Both machines run the LeftHand (now called StoreVirtual) storage operating system, which provides snapshots, thin provisioning, and replication services for the storage – features that are not common in devices aimed at SMBs, and certainly not at the $6,000 to $14,000 price tag that HPE is targeting for the StoreVirtual 3200. The adoption of the X-Gene 1 chip is instrumental in bringing the price of a fully loaded array down to $14,000, and that is because RAID 5 and RAID 6 disk controller functionality and 10 Gb/sec Ethernet links are built into the X-Gene 1 system on chip and don’t have to be added to the system.
We think Intel could probably build a StoreVirtual 3200 using the Xeon-D system on a chip it created for Facebook microservers as well as for networking devices and storage arrays. But again, the important thing is that HPE did not do that.
The Mudan object storage platform was rolled out by Red Hat and Applied Micro back at the Red Hat Summit in June, and it is being made by Mitac, an original design manufacturer (ODM) based in Taiwan that owns the Tyan motherboard business. Interestingly, Mitac also sells a microserver using a sled design called Datun, which packs eight X-Gene 1 or X-Gene 2 processors onto a 1U sled, which allows for 384 of the X-Gene chips and up to 3,072 cores to be put into a non-standard 48U rack. To our eye, it looks like Mitac is reusing the same Datun platform and replacing some of the compute with disk and flash storage. The Mudan sled is based on the X-Gene 1 chip and puts a dozen 3.5-inch disk drives on each sled plus two SSDs for caching the disks and booting the operating system for the X-Gene 1 processors. Without naming customers, Sankaran says that a number of hyperscalers and financial services companies have deployed the Mudan platform for cold storage. All that Sankaran can say is that one of those financial services companies is based in the US.
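The Datun density figures are easy to check. What follows is a minimal sketch of that arithmetic, assuming every rack unit in the 48U rack holds a sled (no space set aside for switches or power shelves) and that each chip is an eight-core part, as both the X-Gene 1 and X-Gene 2 are. The function and constant names are ours, for illustration only.

```python
# Figures quoted for the Mitac Datun microserver.
CHIPS_PER_SLED = 8   # X-Gene 1 or X-Gene 2 processors per 1U sled
CORES_PER_CHIP = 8   # both X-Gene 1 and X-Gene 2 are eight-core parts
RACK_UNITS = 48      # the non-standard 48U rack


def rack_density(rack_units, chips_per_sled, cores_per_chip):
    """Return (chips, cores) for a rack filled with 1U sleds.

    Assumption: every rack unit holds one sled.
    """
    chips = rack_units * chips_per_sled
    cores = chips * cores_per_chip
    return chips, cores


chips, cores = rack_density(RACK_UNITS, CHIPS_PER_SLED, CORES_PER_CHIP)
print(f"{chips} chips, {cores} cores per rack")
```

Under those assumptions the result matches the 384 chips and 3,072 cores cited above.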
On the data store and caching front, Applied Micro is showing off a server design aimed at Redis key/value store and Memcached caching workloads, and pitting it against a single socket implementation of Intel’s “Haswell” Xeon E5-2620 v3 server. The Applied Micro node in this Redis/Memcached system is based on an eight-core X-Gene 2 processor running at 2.4 GHz with 64 GB of memory and two 480 GB SSDs from Intel. The 1U chassis has two of these nodes with four disk drives each, all running the CentOS 7.2 clone of Red Hat Enterprise Linux. Here is how these two systems stack up:
While Sankaran did not provide list prices for the two Redis/Memcached systems, he did provide relative performance, processor wattage, and costs of the two systems, and as you can see, Applied Micro says it can deliver the same performance at the board level with 60 percent lower power at the CPU level and at 40 percent lower cost. The Xeon E5-2620 v3 costs $417 each when bought in 1,000-unit trays from Intel, and if that 40 percent discount applies at the processor level, that means the X-Gene 2 costs around $250 each. While the relative pricing on the processors is interesting, what really matters is how the prices of the nodes stack up against each other.
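The X-Gene 2 price is our inference, not a number from Applied Micro, and the arithmetic behind it is simple. Here is a minimal sketch, assuming the 40 percent cost reduction applies directly to the processor price rather than to the whole node. The function name is hypothetical.

```python
XEON_E5_2620_V3_PRICE = 417.00  # Intel price per chip in 1,000-unit trays, in dollars
COST_REDUCTION = 0.40           # Applied Micro's claimed cost advantage


def implied_price(reference_price, reduction):
    """Price implied by applying a fractional reduction to a reference price.

    Assumption: the reduction applies to the CPU price itself.
    """
    return reference_price * (1.0 - reduction)


x_gene_2_price = implied_price(XEON_E5_2620_V3_PRICE, COST_REDUCTION)
print(f"Implied X-Gene 2 price: ${x_gene_2_price:.0f}")
```

If the 40 percent figure instead describes the full node, the processor price could be quite different, which is why node-level pricing is the more useful comparison.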
This particular Redis/Memcached system is actually deployed at a Tier One cloud service provider, with the system comprising a few thousand nodes. This is real, and it is not a small cluster by any stretch of the imagination. (At 84 nodes in a rack, that is at least 25 racks for this cluster and perhaps more. A capability-class supercomputer is around 200 racks, just for reference.) This is a real benchmark from a real customer, and they are seeing a 30 percent to 35 percent TCO advantage at the rack level including equipment, power, cooling, and other costs.
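For a sense of scale, the rack count can be sketched from the node density. The 84 nodes per rack is from the estimate above, and it is consistent with two nodes in each 1U chassis across 42 rack units. The cluster sizes in the loop are hypothetical stand-ins for "a few thousand nodes", since the actual count was not disclosed.

```python
NODES_PER_RACK = 84  # two nodes per 1U chassis, 42 chassis per rack


def racks_needed(nodes, nodes_per_rack=NODES_PER_RACK):
    """Whole racks required to house the given number of nodes."""
    # Integer ceiling division, so a partly filled rack still counts.
    return -(-nodes // nodes_per_rack)


# Hypothetical cluster sizes; the real node count is not public.
for nodes in (2100, 3000, 4000):
    print(f"{nodes} nodes -> {racks_needed(nodes)} racks")
```

A cluster of 2,100 nodes fills exactly 25 racks at this density, and anything larger pushes the count up from there.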
That leaves the final design win, which is not really a design win at all since it is Applied Micro using its own processor in its own SmartNIC server card. What it is, however, is Applied Micro getting into the server adapter business in its own right when this card is available sometime in the first half of 2017.
The multi-function accelerator card is based on an eight-core X-Gene 2 running at 2.4 GHz, and it has up to 32 GB of memory across its two SODIMM slots, two 10 Gb/sec ports, and six SATA ports. Customers can dial the power draw of the card to between 25 watts and 45 watts, depending on the number of cores activated and the clock speeds they set. The card uses a standard open source Linux driver with security acceleration libraries embedded in it. Here is how the performance stacks up when it is used as a SmartNIC with encryption acceleration:
This SmartNIC will plug into any server, including Xeon machines, ironically. The target price for this device is somewhere between $900 and $1,100, says Sankaran.
The Update On X-Gene 3
While the eight-core “Storm” X-Gene 1 and “Shadowcat” X-Gene 2 processors are interesting for relatively lightweight workloads, the X-Gene 3 chip, as we have pointed out, will be aimed at heavier duty jobs in the datacenter made possible by the much brawnier design of the SoC and the much larger number of cores on the device.
The Skylark X-Gene 3 is set to sample by March 2017, which is the end of Applied Micro’s fiscal year and just in time for the Open Compute Summit next year. Production shipments through ODM and OEM customers are expected to follow about a year later, according to Sankaran.
Last fall, we went over the X-Gene roadmap in detail and how the X-Gene 3 fit in, so we are not going to repeat all of that here. The X-Gene 3 is making a big jump to 16 nanometer FinFET processes from Taiwan Semiconductor Manufacturing Corp, and that is allowing for the core count to be pumped up to 32 on the device. The news is that with early test chips back, Applied Micro feels confident enough to provide an estimated integer performance based on the SPECint_rate_2006 CPU test, and that number is around 550 as you can see in the chart. Applied Micro is also now saying that this chip will run at between 110 watts and 125 watts – lower than a typical Xeon does today, but not incredibly lower.
By the way, that SPEC integer rating on a 32-core X-Gene 3 running at 3 GHz is about six times higher than for the X-Gene 2 with eight cores running at 2.4 GHz. Given the core counts and the clock speeds, you would expect about 5X more performance, so that extra performance is coming from tweaks to the core design and other factors like more memory channels and bigger caches. Applied Micro is not providing SPECfp_rate_2006 floating point performance estimates for the X-Gene 3 at this time, but Sankaran says that it should be 6X higher than X-Gene 2.
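The 5X expectation comes from multiplying the core count ratio by the clock speed ratio. This is a minimal sketch of that back-of-the-envelope estimate, assuming throughput scales linearly with both cores and clocks, which real chips rarely manage. The function name is ours.

```python
def naive_scaling(cores_new, ghz_new, cores_old, ghz_old):
    """Throughput ratio expected from core count and clock speed alone.

    Assumption: perfectly linear scaling with both cores and clocks.
    """
    return (cores_new / cores_old) * (ghz_new / ghz_old)


# X-Gene 3: 32 cores at 3 GHz. X-Gene 2: 8 cores at 2.4 GHz.
expected = naive_scaling(32, 3.0, 8, 2.4)
claimed = 6.0  # approximate SPECint_rate_2006 ratio cited by Applied Micro

# Whatever is left over is attributable to everything except cores and clocks.
residual = claimed / expected
print(f"expected {expected:.2f}X, claimed {claimed:.1f}X, residual {residual:.2f}X")
```

On these numbers, roughly 20 percent of the gain has to come from the core design, memory channels, caches, and other factors.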
Here’s the thing: this X-Gene 3 chip should be able to stand toe-to-toe with Intel’s “Skylake” Xeon E5 v5 processors, which will have 28 cores, and the future Power9 from IBM, which will have 24 cores and probably run at slightly lower clock speeds.
Those current and future Intel Xeon and IBM Power chips implement NUMA functionality in their transistors, but for its first pass, Applied Micro is gluing together systems in NUMA setups over the PCI-Express bus. It can do glueless NUMA on two sockets, but four-socket machines (available next year) and eight-socket machines (on the roadmap) require a PCI-Express switch. The X-Tend setup imposes a 10 percent to 20 percent overhead on the NUMA cluster, with the overhead rising along with the socket count as it does on all NUMA machines. If it works, it will allow Applied Micro’s partners to build systems with 2 TB, 4 TB, and 8 TB memory footprints.
There are a lot of other possibilities, of course. Sankaran says that Applied Micro is in discussions with customers to possibly add NVLink ports, important for HPC workloads where GPUs accelerate the compute, to future X-Gene chips. It all depends on the case that customers make and the volume of sales. (We would say that any advantage over Intel is a good one, as IBM clearly believes.) Applied Micro is also planning to support the CCIX standard that will come out of the consortium, headed up by Xilinx, that seeks to provide a consistent means of adding coherence between various processing elements. This CCIX support will come in a future version of the X-Gene chip. Applied Micro will also be adding HBM2 memory in a future X-Gene chip. Our guess is that the X-Gene 4 will probably look a bit like Intel’s original plan with the “Knights Landing” many core processor in that it will support close memory (in this case HBM2) and far memory (in this case DDR4 or DDR5) and let customers operate in modes that include one, the other, or both.