Getting Logical About Cavium ThunderX2 Versus Intel Skylake

Timothy Prickett Morgan

6 years ago

Any processor that hopes to displace the Xeon as the engine of choice for general purpose compute has to do one of two things, and we would argue both: It has to be a relatively seamless replacement for a Xeon processor inside of existing systems, much as the Opteron was back in the early 2000s, and it has to offer compelling advantages that yield better performance per dollar per watt per unit of space in a rack.

The “Vulcan” ThunderX2 chips, at least based on the initial information that is available in the wake of their launch, appear to do that.

Different customers look at components of that performance per dollar per watt per unit of rack space differently, but all four are always a part of the equation even if one of them is zeroed out because money is no object (think high frequency trading) or space and therefore compute density is not a concern (a rural enterprise datacenter close to a major metropolitan area does not have the same pressure as a hyperscaler) or screaming performance is not required (and hence the fact that Intel sells so many middle bin parts to millions of companies each year).

For this reason, it is probably also a good sign that Cavium has cooked up over 40 different SKUs of the Vulcan chips for its initial launch, and stands in stark contrast with the Qualcomm Centriq 2400 line, which had a mere four SKUs at launch last fall and which may be in the process of being shut down or sold off by the world’s second largest maker of smartphone chips after Samsung Electronics.

Incidentally, both Samsung and Qualcomm had an urge, like so many others, to tap into datacenter profits with Arm server chips; Samsung never got off the drawing board before quitting, and Qualcomm has spent four years at it thus far. Broadcom shuttered the Vulcan Arm server chip project and sold it off to Cavium, which has revamped it as one of two distinct ThunderX2 chips. Calxeda, the original Arm server upstart, went bust trying to make the jump from 32 bits to 64 bits in servers, AMD has gone cold on its “K12” Arm chip, and Applied Micro was sold twice and is trying to re-emerge with the “Skylark” X-Gene 3 chip under a new company called Ampere.

“I Have The Phaser, Captain, And I Do Not Intend To Simply Disappear As So Many Of Your Opponents Have In The Past”

Cavium looks to be a survivor in this brutal battle of datacenter compute, particularly in the hyperscale, public cloud, and HPC markets that it is initially targeting with the Vulcan variants of ThunderX2.

We did an initial analysis of these 32 core Vulcan ThunderX2 chips here, and drilled into the original homegrown 54 core ThunderX2 chips (code-name unknown) there. And last week, concurrent with the general availability of the Vulcans, we did a deeper dive into their architecture and promised to get into the latest performance specs provided by Cavium for the Vulcans.

Back when Cavium put out its initial benchmarks for the Vulcan ThunderX2 chips, the results were for single socket machines. With the general availability of the chips, Cavium has tested workhorse two socket machines on compute and memory bandwidth tests, as before, and also has run a few HPC tests and shared results from the University of Bristol, which is one of the champions of Arm in HPC and which put out numbers showing how the Vulcans can compete with Intel Skylake Xeons in the HPC realm. Last time, these benchmarks were done on single socket machines, and the Isambard Project has now completed tests on dual socket machines and graciously shared the results with Cavium to help it make a case for the Vulcans.

Since memory bandwidth is so critical to certain workloads, we will start there, with the STREAM Triad test, which is the touchstone for gauging the relative memory bandwidth of systems. Here is how Cavium stacked up a pair of its top bin 32 core Vulcans running at 2.5 GHz against a pair of Skylake Xeon EP-8176 Platinum processors, which have 28 cores each running at 2.1 GHz and are at the top of the bin for the SKUs with balanced energy efficiency. (You can see all of the Skylake SKUs here.) Take a look:

The memory in both systems are running at the highest 2.67 GHz speeds that are available on both machines, and the fact that the Vulcan chip has eight memory controllers compared to the six in the Intel Skylakes is what accounts for the vast amount of the memory bandwidth difference seen between the Xeon ad the ThunderX2. In theory, with 33 percent more DDR4 memory controllers, the ThunderX2 chip should do 33 percent better in terms of memory bandwidth, with the same DIMM capacities and speeds. On the actual test, Cavium is getting a 23.5 percent advantage on bandwidth, and with more tuning it should be able to push that higher. On Intel’s own STREAM Triad tests, which we revealed last summer, a pair of “Broadwell” Xeon E5-2699 v4 processors topped out at about 135 GB/sec on STREAM Triad, and with the top bin Xeon SP-8180M Platinum chips it could still do about 225 GB/sec with the latency held at around 130 nanoseconds on memory access times. So Intel can tune up STREAM Triad a bit better than Cavium can on the Xeons, which is not surprising. The two socket results are consistent with the single socket results, by the way.

For the SPEC integer and floating point compute tests on dual socket machines, Cavium compared the performance of a pair of its more standard 2.2 GHz 32 core ThunderX2 chip to a pair of Intel’s Skylake Xeon SP-6140 Gold chips, which have 20 cores each running at 2.5 GHz and only 27.5 MB of cache on the die activated. (Cavium is quoting internal benchmarks it has run but not yet submitted to SPEC against Intel results that have been submitted, which is not exactly kosher but we have to get the data we can get.) These are both volume SKUs, not top bin (and therefore not very expensive, relatively speaking) parts. The Xeon chip is rated at 150 watts but that does not include the southbridge chipset for linking out to I/O, while the ThunderX2 chip is rated at 180 watts and includes all I/O controllers embedded. In the past, Cavium has shown relative performance on the SPEC tests, but this time, it is showing absolute numbers:

As you can see, Cavium has done a lot of work in tuning up the GNU open source compilers (GCC) to run well on ThunderX2 chips, and in this case is getting nearly the same performance as Intel gets using its own compilers on its own Xeon chips in the integer test. The GNU compilers are not as well tuned on the Intel chips, but they are often preferred by hyperscale and cloud customers and more than a few HPC centers.

As for floating point math, the custom Armv8 cores in the Vulcan chips have a pair of 128-bit NEON math units, and the Xeon SP Gold chips have a 512-bit AVX-512 unit with two fused multiply add (FMA) units activated. (Some of the Skylake chips only have one FMA turned on.) On the SPEC floating point test, the ThunderX2 can beat the Intel chips using GCC compilers, but Intel pulls ahead on its own iron using its own compilers by about 26.5 percent over the ThunderX2 using GCC compilers. The important thing is that Cavium is working with Arm Holdings, which now owns software tools maker Allinea, to create optimized compilers that goose the performance of integer and floating point jobs by around 15 percent, which will put ThunderX2 ahead on integer performance (for these parts anyway) and close the gap considerably on floating point (with about a 10 percent gap still to the advantage of Intel).

Intel charges $3,072 each for that Xeon SP-6148 processor when they are bought in 1,000-unit trays, and Cavium is charging $1,795 for that 32 core, 2.2 GHz Vulcan. If you assume that Cavium and Arm Holdings can get that 15 percent performance boost from optimized compilers – a big if we realize – and assume the Intel compilers are used on the Intel chips, then the ThunderX2 as tested will cost $14.59 per SPEC rating unit across two processors on the integer test, compared to $28.44 per unit for the pair of Xeon SP-6148s. That is a big gap. And while Intel has the performance advantage on the SPEC floating point test, Cavium has the price/performance advantage on these two chips, costing $20.14 per unit of floating point performance compared to $31.35 per unit for the pair of Intel Xeon SPs. The difference of $2,500 per system is a big deal, particularly at stingy hyperscale and HPC shops.

For a more commercial workload, Cavium chose the SPEC JBB Java middleware and database benchmark test, and the higher core and thread count and the higher memory bandwidth – and the better balance between the two – really showed through and gave the Vulcans a significant advantage over the two-socket Xeon-SPs.

Once again, Cavium is looking at volume SKU comparisons as above. A few notes here on these comparisons. First, there are two ways to run the SPECjBB benchmark: one where you focus on boosting the number of transactions pushed through the system, and another where you pay attention to the latency and try to minimize big tails. Intel has published results for the Xeon SP-8180M Platinum chip on these tests, but has not done so for the Xeon SP-6148, so Cavium had to estimate the performance of this volume part from that of the top bin part. (If Cavium didn’t do it, we would have to or you would have to.) Finally, these are lab results on the ThunderX2 system that have not yet been approved by the SPEC people. Take this all with grains of salt, of course. The ThunderX2 machine has a 30.5 percent or 30.9 percent performance advantage, either way the test was run and estimates done, and the ThunderX2 does a unit of work for about 40.5 percent of the cost (at the raw CPU level) of a Xeon SP.

“In Every Revolution, There Is One Man With A Vision”

Because Hewlett Packard Enterprise is pushing the ThunderX2 in its Apollo 70 supercomputer nodes, the company has run the very tough High Performance Conjugate Gradients (HPCG) benchmark on both Xeon and ThunderX2 nodes, and Gopal Hegde, vice president and general manager of the datacenter processor group at Cavium, shared the results of HPE’s tests with The Next Platform. Using the GCC 7.2 compilers on the volume bin ThunderX2, a two-socket node achieved a rating of 35 gigaflops, compared to 36 gigaflops for the pair of Xeon SP-6148s. That is called spitting distance, and the price difference means the Vulcan chips are doing the work at a 60 percent better bang for the buck.

Finally, there are some new results coming out of the Isambard Project at the University of Bristol. Back in November last year, the Isambard team shared the performance specs on various HPC microbenchmarks on single-socket servers using Broadwell and Skylake Xeons and ThunderX2s. The tests were conducted on 18 core Broadwell, 22 core Skylake, and 32 core ThunderX2 processors. Here is the rematch, doubled up, with everything normalized to the Broadwell performance on the first set of tests:

The deltas are about the same as on the single socket machines, and what it really shows is the effect of the NUMA interconnects from either Intel or Cavium when lashing two CPUs together in a shared memory system.

Here is another chart with a series of higher level, full blown applications that the Isambard team has tested on the same iron, the test results that are expected to appear soon in a paper that the Isambard team has yet to publish:

In some cases, the Skylake chips win, in others the ThunderX2 chips win, depending on how the code has been tweaked to the architecture and the nature of the underlying architectures. But the key takeaway here is that ThunderX2, at least on these volume parts, is definitely in the running and will be able to demonstrate about 85 percent of the performance across real HPC workloads at around 42 percent better performance per dollar on average across those eight real HPC applications shown above, and in around the same wattage and in around the same space. If the compilers can be improved through Allinea as expected, the performance gap will close.

This is all significant, and HPC shops and hyperscalers alike should consider it as any wise Vulcan – even one in a parallel universe – would do. And it shows that, as we have been saying all along, that someone will try to come along and make this about price and that any technology that has 50 percent gross margins will generate the very competition that removes that profit. This is how the technology business works.