The Carlyle Group, the publicly traded investment firm that has invested in nearly 300 companies that have a net worth of $170 billion and which itself could make around $4 billion in management fees and income from those investments for 2017, does not invest in any technology lightly.
So the fact that it has acquired the X-Gene server processor assets that were created over many years by Applied Micro and briefly owned last year by IT supplier MACOM means that Carlyle believes Arm servers have a shot in the datacenter and that its investors want to get a piece of the action.
As we have previously reported, those Applied Micro server chip designs were acquired by a company called Project Denver LLC, but despite the name this Carlyle acquisition vehicle did not in any way also buy any assets from Nvidia’s long-defunct “Project Denver” effort to put an Arm server core on a GPU, which was unveiled seven years ago as a future Tesla product and quietly put on the backburner a few years later. (The hybrid CPU-GPU effort lives on in the Tegra line of platforms for gaming and other kinds of embedded systems.) We asked Kumar Sankaran, who used to be associate vice president of software and platform engineering for the compute business at Applied Micro and who has the same job at Ampere, the name that the Carlyle subsidiary is taking, about the Project Denver link and he tells us it is just a coincidence.
It would have been more fun if it were not and some Nvidia technology was also in the mix, but with Carlyle backing the Applied Micro technology and promising to advance it in future Ampere chips, giving Qualcomm and Cavium some competition in Arm chips and giving Intel and AMD a run for the datacenter money, too, we can still have plenty of fun here. By the way, Applied Micro had lots of networking technology, but none of this was acquired by Carlyle from MACOM. This is strictly a processing deal, and according to Sankaran, Ampere has no interest in developing its own switch ASICs although it could create its own network interface cards running at 25 Gb/sec, 50 Gb/sec, or 100 Gb/sec speeds and embed them on its future system on chip units. (What the plan is, Sankaran is not saying.)
The team that Ampere has put together has some heavy hitters on it, starting with Renee James, who was president at Intel and was passed over for the chief executive officer position at the chip giant in favor of Brian Krzanich; she held a number of positions at Intel, and was previously in charge of the McAfee security software unit at the company. Chi Miller, a long time finance specialist at Intel and formerly the director of finance at Apple, has joined Ampere as its chief operating officer and chief financial officer, keeping a tight rein on the purse strings, we presume. Rohit Vidwans, who spent 26 years at Intel and led the development of many generations of Atom and Xeon processors, has joined Ampere as executive vice president of hardware engineering, and Atiq Bajwa, a 30 year veteran of Intel who was the head of X86 architecture, is the chief architect for the Arm upstart. And Greg Favor, who was a fellow at AMD and part of the K6 and K7 development teams and who was the lead architect at Applied Micro for its X-Gene processors, is a senior fellow at Ampere. This obviously is a pretty serious team.
The following chart explains why:
Like Qualcomm with its “Amberwing” Centriq 2400 processors and selected models of Cavium’s ThunderX and ThunderX2 processors, Ampere is focusing on the cloud server market, by which the company means what we would call the hyperscale and cloud segments. The server CPU market is growing slowly, as you can see, at 3 percent compounded annually, but the cloud portion of that (using Ampere’s definition, not ours) is growing at 15 percent compounded annually and will comprise around 50 percent of server chip shipments by 2021.
It is very unlikely that there will be huge demand from IT shops for raw Arm server instances from a public cloud, but given significant total cost of ownership advantages, Arm processors can be – and will be – used behind the scenes, running various data manipulation and storage workloads.
Ampere knows this because Applied Micro put over 25,000 proof of concept and test machines into the field with the prior “Storm” X-Gene 1 and “Shadowcat” X-Gene 2 processors. The X-Gene 1 was implemented in the 40 nanometer process from Taiwan Semiconductor Manufacturing Corp, and offered eight beefy, custom Armv8 cores running at 2.4 GHz; the X-Gene 2 was supposed to scale up to 16 cores with RoCE enhanced networking, but it only made it into the field with eight cores running at 2.8 GHz despite the shift to 28 nanometer etching – a very significant process shrink.
With the “Skylark” X-Gene 3 chip, which was unveiled as a project in November 2015 and which we profiled in detail when it was further unveiled in October 2016, Applied Micro was shifting to 16 nanometer FinFET processes from TSMC, allowing for a substantial increase in core counts and a slight increase in clock speeds, plus a doubling of the memory channels to eight and a speed bump to DDR4. Applied Micro finally laid it all out there, saying it would have 32 cores on the X-Gene 3, with a more traditional L2 and L3 cache architecture and supporting up to 1 TB of DDR4 memory running at 2.67 GHz, which is 33 percent more bandwidth and capacity than “Skylake” Xeon SP Silver and Gold processors from Intel. The X-Gene 3 chip also sported integrated SATA I/O ports and 42 lanes of PCI-Express peripheral bandwidth across eight controllers.
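To make the memory arithmetic concrete, here is a minimal sketch of the peak bandwidth comparison. It rests on assumptions that are not stated above: that "2.67 GHz" means 2,667 megatransfers per second, that each DDR4 channel has a 64-bit (8 byte) data path, that the Skylake Xeon SP has six memory channels per socket, and that both chips run their memory at the same speed. The function and variable names are ours, for illustration only.

```python
# Theoretical peak DRAM bandwidth per socket:
# channels x transfer rate x bytes moved per transfer.
BYTES_PER_TRANSFER = 8  # assumption: 64-bit data path per DDR4 channel


def peak_bandwidth_gbs(channels: int, mega_transfers_per_sec: float) -> float:
    """Return theoretical peak bandwidth in GB/sec (decimal gigabytes)."""
    return channels * mega_transfers_per_sec * BYTES_PER_TRANSFER / 1000.0


# Eight channels on the X-Gene 3, per the text above.
xgene3_bw = peak_bandwidth_gbs(channels=8, mega_transfers_per_sec=2667)
# Assumption: six channels at the same memory speed on the Skylake Xeon SP.
skylake_bw = peak_bandwidth_gbs(channels=6, mega_transfers_per_sec=2667)

advantage = xgene3_bw / skylake_bw - 1.0

print(f"X-Gene 3: {xgene3_bw:.1f} GB/sec")
print(f"Skylake:  {skylake_bw:.1f} GB/sec")
print(f"Advantage: {advantage:.0%}")
```

With memory speed held equal, the gap comes down to the channel count alone: eight channels against six is one third more.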
By any measure, this is a beefy server chip, and it was absolutely intended to be competitive with any Xeon or Epyc or Centriq or ThunderX2 alternative. It was not really aimed at the heftier Xeon SP Platinum models or IBM’s Power9. But that’s fine. There is plenty of TAM to chase.
The X-Gene 3 chips started sampling on time in March 2017 and were expected to start shipping in production quantities sometime in the first quarter of this year. (The X-Gene 1 and X-Gene 2 chips have been put out to pasture, Sankaran tells The Next Platform.) Given all of the changes at Applied Micro and MACOM in the past six months, the schedule for the delivery of the X-Gene 3 chip has shifted out to the second half of 2018, but Sankaran says that this is not a test platform, but something that Ampere expects customers to put into production. The chip, when it comes out, will not be called X-Gene, but will have another brand that will carry forward from that point. And, importantly, it is being tweaked from its original design: it is being implemented in a follow-on 16 nanometer process and will be able to turbo up to 3.3 GHz clock speeds.
Here is how Ampere is ranking its chip against the competition, however vaguely:
The interesting bit in that table above is that Ampere is providing integer performance, as gauged by the SPECint_rate2006 CPU test, as well as a list price and a thermal design point for what we presume is the top bin, 32 core part. A server with four of Intel’s top end, 28 core Xeon SP-8180M chips, which clock at 2.5 GHz, is rated at 5,530 on the same test, and that means each socket delivers a rating of around 1,383. The Intel chip costs $13,011 at list price, and that works out to $9.40 per SPEC unit of performance on the same test, as well as 6.74 units of performance per watt. The X-Gene 3 chip shown above has a lot less oomph at 500 units of performance, but it delivers it at a cost of $1.90 per unit of performance and at a 4.0 performance per watt level. By those numbers, the X-Gene 3 delivers about 40 percent less performance per watt than the Intel chip (put the other way, Intel has a roughly 69 percent advantage), but nearly a factor of five better bang for the buck goes to the Ampere X-Gene 3.
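For anyone who wants to check the math, here is a minimal sketch of the per-socket calculation. The $950 price and 125 watt figure for the X-Gene 3 are back-calculated from the $1.90 per unit and 4.0 units per watt quoted above, and the 205 watts for the Xeon is what the 6.74 units per watt figure implies; none of these three inputs is stated directly in the text, and the class and field names are ours.

```python
from dataclasses import dataclass


@dataclass
class Chip:
    name: str
    spec_int_rate: float  # SPECint_rate2006 rating per socket
    list_price: float     # US dollars
    watts: float          # thermal design point

    @property
    def dollars_per_unit(self) -> float:
        return self.list_price / self.spec_int_rate

    @property
    def units_per_watt(self) -> float:
        return self.spec_int_rate / self.watts


# Four-socket rating of 5,530 split evenly across the sockets.
# Assumption: 205 W, implied by the 6.74 units per watt quoted above.
xeon_8180m = Chip("Xeon SP-8180M", 5530 / 4, 13011, 205)

# Assumption: $950 and 125 W, back-calculated from $1.90 per unit
# and 4.0 units per watt at a rating of 500.
xgene3 = Chip("X-Gene 3", 500, 950, 125)

for chip in (xeon_8180m, xgene3):
    print(f"{chip.name}: ${chip.dollars_per_unit:.2f} per unit, "
          f"{chip.units_per_watt:.2f} units per watt")

cost_ratio = xeon_8180m.dollars_per_unit / xgene3.dollars_per_unit
watt_ratio = xeon_8180m.units_per_watt / xgene3.units_per_watt
print(f"Cost per unit, Intel over Ampere: {cost_ratio:.2f}x")
print(f"Performance per watt, Intel over Ampere: {watt_ratio:.2f}x")
```

Dividing the system rating evenly across sockets is itself a simplification, since multi-socket scaling is rarely perfect, but it matches how the per-socket figures above were derived.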
This is not the real comparison, of course, because an Armv8 core is not a Xeon SP core, but this does show the worst case scenario. If you are looking at compute density per rack and cost per rack, all loaded up, and compare the Xeon SP Gold and Silver models, the gaps probably won’t be as large. Take a Xeon SP-6152 Gold, which has 22 cores running at 2.1 GHz. A two socket machine is rated at 2,120, so that is 1,060 units of integer performance per socket at 140 watts and at a price of $3,655, or $3.45 per unit of performance and 7.6 units of performance per watt. The latter is again better than the Ampere X-Gene 3 is delivering. But you can buy two Ampere chips and have the same performance and spend less money, so there is that.
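The Gold comparison can be sketched the same way. This is an illustration under one loud assumption: the X-Gene 3 list price of $950 is back-calculated from the 500 unit rating and the $1.90 per unit quoted earlier, not taken from a price list. The helper name is ours.

```python
def per_socket_metrics(system_rating, sockets, list_price, watts):
    """Return (rating per socket, dollars per unit, units per watt)."""
    rating = system_rating / sockets
    return rating, list_price / rating, rating / watts


# Xeon SP-6152 Gold: two-socket rating of 2,120, $3,655 and 140 watts per chip.
gold_rating, gold_cost, gold_eff = per_socket_metrics(2120, 2, 3655, 140)
print(f"Gold 6152: {gold_rating:.0f} units, ${gold_cost:.2f} per unit, "
      f"{gold_eff:.1f} units per watt")

AMPERE_RATING = 500  # per chip, from the text
AMPERE_PRICE = 950   # assumption: 500 units x $1.90 per unit

# Two Ampere chips against one Gold socket.
pair_rating = 2 * AMPERE_RATING
pair_price = 2 * AMPERE_PRICE
print(f"Two X-Gene 3 chips: {pair_rating} units for ${pair_price}")
print(f"Savings against one Gold 6152: ${3655 - pair_price}")
```

Two Ampere chips land slightly below one Gold socket on the rating, so "the same performance" is approximate, and the pair would draw more power than the single 140 watt Xeon.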
It all will come down to cases, and Ampere is betting that the gaps will be large enough to make inroads into the hyperscale and cloud datacenters that it is targeting. And you have to remember to look at things at a system level, not just at the CPU level, too.
“The pain point for most people is reducing the total cost of ownership,” explains Sankaran. “In the markets where we are chasing – Web tier, data analytics, big data, storage – our go to market has largely been to show the customers a lower TCO for the same amount of performance per rack. The TCO is comprised of three vectors: power, performance, and price. If you turn these knobs in various ways, you can get significantly higher savings, and that is why companies will change their architectures. The TCO you need to achieve at the system level varies based on the workload. In storage, anything greater than 5 percent better TCO is a big deal because the storage platform costs are so dominated by the storage media. The story is different with big data or Web infrastructure, where the costs are compute. In that case, anything greater than 15 percent savings on TCO is a game changer for changing architectures.”
This is consistent with what Google has told us, which is that it would change architectures for a 20 percent savings.
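The thresholds Sankaran cites can be expressed as a simple test. This is a toy sketch, not anyone's actual TCO model: the 5 percent and 15 percent cutoffs come from his remarks above, while the function names, the two workload labels, and the sample rack costs are hypothetical.

```python
# Minimum fractional TCO savings that Sankaran cites as significant.
TCO_THRESHOLDS = {
    "storage": 0.05,  # costs dominated by the storage media
    "compute": 0.15,  # big data and Web infrastructure
}


def tco_savings(incumbent_tco: float, challenger_tco: float) -> float:
    """Fractional savings of the challenger relative to the incumbent."""
    if incumbent_tco <= 0:
        raise ValueError("incumbent TCO must be positive")
    return (incumbent_tco - challenger_tco) / incumbent_tco


def clears_threshold(workload: str, incumbent_tco: float,
                     challenger_tco: float) -> bool:
    """True when the savings exceed the cutoff for that workload type."""
    return tco_savings(incumbent_tco, challenger_tco) > TCO_THRESHOLDS[workload]


# Hypothetical rack costs in arbitrary units: an 8 percent saving.
print(clears_threshold("storage", 100.0, 92.0))
print(clears_threshold("compute", 100.0, 92.0))
```

The same 8 percent saving is enough to matter for a storage platform but falls short for a compute-heavy one, which is the point of the quote.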
Ampere currently has two more generations of chips on the whiteboard, which will follow on the heels of the X-Gene 3. The next one up, which will be the first one designed by the recently assembled Ampere team, will have a similar power envelope – somewhere around 125 watts, all in – but offer higher performance and therefore higher performance per watt. After that, it is hard to say. Ampere will probably stick with TSMC and ride its process curve down, but Sankaran says the company is always looking at the process roadmaps for the competition. That is Intel, which doesn’t really do a lot of fab work for others; Samsung, which does a little; TSMC, which is all foundry for others; and GlobalFoundries, which is also all foundry and which has a shot to win some business if TSMC can’t keep shrinking its processes as fast.