Putting more and more cores on a single CPU and then putting two of those CPUs in a standard workhorse server yields the best price/performance for certain kinds of compute-hungry workloads. These days, that is particularly true for customers who want top bin Xeon parts and for whom the cost of the processor is no object because it cuts the total number of server nodes that have to be deployed.
But this is not the only way to pack the most compute density into a rack. A case can be made for middle bin parts, particularly for workloads that scale well across many nodes: those that are naturally embarrassingly parallel, like genomics, or, better still, those that are just a huge number of independent serial jobs running at scale, like EDA code or Memcached farms. That is why AMD and Qualcomm, which have both delivered credible server processors to market as the year is winding down (the “Naples” Epyc 7000 series and the “Amberwing” Centriq 2400, respectively), are focused on single-socket server performance to attack the Xeon base. Cavium has also fielded the “Vulcan” variant of its ThunderX2 chip, which has shown very good performance against certain Xeons and which, like the Epyc 7000s, can scale across two processor sockets like many of the Xeon SPs.
Because of Intel’s relatively high pricing on “Skylake” Xeon SP Platinum processors, which have all of the bells and whistles as well as high core counts, large memory capacity, and the biggest number of UltraPath Interconnect (UPI) NUMA links on each socket, AMD, Qualcomm, and Cavium are able to make pretty good technical and economic arguments in favor of their processors. In some cases, this will warrant either a vendor change or an instruction set change along with a vendor change, if for no other reason than to give Intel competition inside of an account. Even Intel, when examining the performance of its Skylake Xeons against the Naples Epycs, concedes that AMD has the upper hand on a number of different workloads.
For today’s episode of Battle for Datacenter Compute, we are examining how Qualcomm and one of its server application partners, Cloudflare, have stacked up the new Centriq 2400s against the past two generations of Xeon chips from Intel. As we have always contended, it takes a slew of benchmarks to get a sense of what modern CPUs, many of them packed with accelerators, are best at, and we know full well that every vendor brings its own skills and assumptions to the tests. We bring a salt shaker to any vendor-supplied benchmarks – much more than just a grain – but some information is better than no information, and we have always contended that such tests are really just the precursor to doing your own tests on your own code or just threatening to move to get some leverage from a server maker.
Cloudflare provides a content delivery network, application acceleration, and security layer that sits between various SaaS application services as well as raw APIs and more generic web sites that are part of the e-commerce stack. Cloudflare, whose services themselves run in a distributed cloud that comprises 118 locations around the world linked by 10 Tb/sec of network bandwidth, is precisely the kind of customer that would otherwise have reflexively bought Xeon servers and that Qualcomm is specifically targeting with the “Falkor” 64-bit Arm cores and the accelerators on the “Amberwing” system-on-chip design. On announcement day for the Amberwing chip, Vlad Krasnov, a system engineer at Cloudflare and a former crypto and algorithm engineer at Intel, put out a very detailed post on the performance tests that he had done pitting middle bin two-socket servers based on Intel’s “Broadwell” Xeon E5 v4 chips and the new Skylake Xeon SPs against the Centriq 2400. (Krasnov did not reveal results for the Cavium ThunderX2, but he has machines based on that Arm processor in his lab.)
As part of that blog, Krasnov created this handy little comparison table, which lines up the salient feeds and speeds of the Broadwell, Skylake, and Falkor cores and their respective CPUs:
As you can see, the feeds and speeds of these three chips are different, but not so much that you would expect radically different raw performance. Qualcomm has the advantage in terms of process, having jumped to Samsung’s 10 nanometer manufacturing, while Intel is still hanging back with its 14 nanometer processes for both the Broadwell and Skylake Xeons.
Given the workloads that Cloudflare supports, the performance tests that it ran are what you would expect: OpenSSL public key encryption and symmetric key encryption; data compression using gzip and brotli; the Go language, which was created by Google and is used to code a lot of infrastructure software these days (the Kubernetes container controller is a big one); the LuaJIT just-in-time compiler for the Lua language, which glues together the Cloudflare stack; and the NGINX web server.
In these tests, Cloudflare tested a single-socket server using the Centriq 2452 with 46 cores running at 2.5 GHz against a two-socket Broadwell Xeon E5-2630 v4 with ten cores running at 2.2 GHz per socket and another two-socket Skylake Xeon SP-4116 Silver with twelve cores running at 2.1 GHz per socket. The Xeons have HyperThreading simultaneous multithreading turned on, so that is 40 Broadwell threads versus 48 Skylake threads versus 46 Amberwing threads. The two Broadwell chips cost $1,334 together, the two Skylake chips cost $2,004 together, and the single Amberwing chip costs $1,383. The three setups are in the same ballpark on clock speeds, thread count, and price. Krasnov correctly presented the performance metrics per core and per system, so we can see the differences in these two aspects of the systems immediately. In general, the per-core performance on the Amberwing chip is near that of the Broadwell core on a lot of these workloads, and the Skylake core has more oomph than both. But with real cores rather than hardware threads, and more of them, the Amberwing chip often has more system throughput on encryption and compression workloads, while the Go performance for certain routines needs some work.
The gzip test that Cloudflare did really shows how the Amberwing chip flies. Here is the per core performance:
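The thread counts and the "same ballpark" claim on price can be checked with a few lines of arithmetic. The sketch below uses only the core counts and CPU prices quoted above; the dictionary layout and the cost-per-thread metric are our own illustration, not something taken from Krasnov's post.

```python
# Core counts and CPU prices as quoted in the text; "smt" is threads per core
# (2 for the Xeons with HyperThreading on, 1 for the Centriq).
# The layout and the cost-per-thread metric are illustrative assumptions.
systems = {
    "Broadwell E5-2630 v4": {"sockets": 2, "cores": 10, "smt": 2, "cpu_cost": 1334},
    "Skylake SP-4116": {"sockets": 2, "cores": 12, "smt": 2, "cpu_cost": 2004},
    "Centriq 2452": {"sockets": 1, "cores": 46, "smt": 1, "cpu_cost": 1383},
}


def threads(cfg):
    """Total hardware threads in the system."""
    return cfg["sockets"] * cfg["cores"] * cfg["smt"]


def cost_per_thread(cfg):
    """Total CPU cost divided by total hardware threads."""
    return cfg["cpu_cost"] / threads(cfg)


for name, cfg in systems.items():
    print(f"{name}: {threads(cfg)} threads, ${cost_per_thread(cfg):.2f} per thread")
```

Keep in mind that a Xeon thread is half of a physical core while a Centriq thread is a whole core, so cost per thread is only a rough yardstick.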
And here is the per system performance:
On the NGINX web server, the Skylake Xeon setup has 31.5 percent better performance than the Broadwell Xeon machine, while the Amberwing Centriq has 24.1 percent better performance than that same Broadwell machine. The fun bit is that once the wall power of the three systems was measured, the Amberwing chip did 214 requests per second per watt, compared to 99 for the Skylake chip and 77 for the Broadwell chip.
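Those two sets of NGINX numbers can be combined to back out how much wall power each system drew relative to the Broadwell box. This is a rough sketch that assumes the throughput percentages and the requests-per-second-per-watt figures come from the same test runs, which the text does not state explicitly; the function name is our own.

```python
# NGINX throughput relative to the Broadwell system, from the percentages above.
rel_throughput = {"Broadwell": 1.0, "Skylake": 1.315, "Centriq": 1.241}

# Requests per second per watt at the wall, as quoted above.
req_per_sec_per_watt = {"Broadwell": 77, "Skylake": 99, "Centriq": 214}


def implied_relative_power(name, baseline="Broadwell"):
    """Wall power relative to the baseline: power = throughput / efficiency.

    Assumes both figures were measured on the same runs.
    """
    rel_efficiency = req_per_sec_per_watt[name] / req_per_sec_per_watt[baseline]
    return rel_throughput[name] / rel_efficiency


for name in rel_throughput:
    eff = req_per_sec_per_watt[name] / req_per_sec_per_watt["Broadwell"]
    power = implied_relative_power(name)
    print(f"{name}: {eff:.2f}x efficiency, {power:.2f}x wall power vs Broadwell")
```

Under that assumption, the Centriq system would be drawing well under half the wall power of the Broadwell system while delivering more requests per second, and the Skylake system would be drawing about the same power as the Broadwell one.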
This is the kind of comparison that Qualcomm launched its Arm server assault to make. Yes, we know Intel could test top bin Xeon SP-8180M Platinum chips and put 56 cores and 112 threads in a two-socket server on any of these workloads. But those chips cost 10X what the Broadwells that Cloudflare used in its tests cost, and they will not have 10X the performance; it will be something more like 3.4X with GCC compilers and maybe 4X to 4.5X with Intel’s compilers and tuning. That top bin Xeon SP will also burn a lot more juice. There is just no getting around that.
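The price/performance penalty of going top bin follows directly from those multiples. Here is a minimal sketch that uses only the 10X price figure and the 3.4X to 4.5X performance estimates from the paragraph above; the function name is ours.

```python
def relative_cost_per_performance(price_multiple, perf_multiple):
    """How much more each unit of performance costs versus the baseline."""
    return price_multiple / perf_multiple


# 10X the price of the Broadwell pair, at the three performance estimates above.
for label, perf in [("GCC", 3.4), ("Intel compilers, low", 4.0), ("Intel compilers, high", 4.5)]:
    ratio = relative_cost_per_performance(10.0, perf)
    print(f"{label}: {ratio:.2f}x the cost per unit of performance")
```

Even at the optimistic end of that range, each unit of performance on the top bin part costs more than twice what it does on the middle bin Broadwells.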
In its own benchmarks, Qualcomm lined up three Skylake Xeons against its three Amberwing Centriqs, thus:
The benchmarks are a little generic at the moment out of Qualcomm, but the performance per thread on these three different bake offs looks like this on the SPECint_rate2006 integer test:
Here is what it looks like at performance per watt using the thermal design point (TDP) maximum heat dissipation for each chip:
Obviously, we want to see measured wall power for the SPEC tests, not theoretical maximums. Qualcomm showed this for the subtests in the SPEC integer test, with the median power draw being 65 watts compared to the theoretical peak of 120 watts:
And finally, here is how the three sets of processors compared on cost per unit of SPEC integer performance:
There are a couple of things to note in these charts. Qualcomm has estimated the performance of the Intel systems by taking results using Intel’s own compilers and estimating results for the same system using the open source GCC compilers. The Intel compilers tend to deliver 15 percent to 20 percent higher performance on the same applications, particularly on the SPEC compute benchmarks. Moreover, Qualcomm is scaling the performance results from two-socket Xeon machines down to one-socket Xeon systems. As far as we know, no one is making one-socket Skylake systems. But we would have had to do similar math, so we understand why Qualcomm did all of this.
The main point is that the Amberwing Centriq chips, at least on this initial pass and with a limited set of benchmarks, can get their foot in the door. And now that is two credible Arm server chips and one credible X86 competitor, plus now a Power9 with more hybrid capabilities than any of these chips. Intel has some serious competitive heat coming.
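The kind of math described here can be sketched as follows. This is not Qualcomm's actual method, which the company has not spelled out in this detail; it simply assumes that the Intel compiler advantage is a flat 15 percent to 20 percent and that throughput scales linearly with socket count. The input score of 1,000 is a made-up placeholder, not a published SPEC result.

```python
def estimate_gcc_single_socket(icc_two_socket_score, compiler_uplift, sockets=2):
    """Estimate a one-socket GCC score from a multi-socket Intel-compiler score.

    Assumptions (ours, for illustration only):
      - Intel compilers deliver (1 + compiler_uplift) times the GCC score.
      - Throughput scales linearly with the number of sockets.
    """
    gcc_score = icc_two_socket_score / (1.0 + compiler_uplift)
    return gcc_score / sockets


hypothetical_score = 1000.0  # placeholder, not a real benchmark result
for uplift in (0.15, 0.20):
    estimate = estimate_gcc_single_socket(hypothetical_score, uplift)
    print(f"uplift {uplift:.0%}: estimated one-socket GCC score {estimate:.1f}")
```

The spread between the two uplift assumptions is only a few percent of the final figure, so the choice between 15 percent and 20 percent matters less than the linear socket scaling assumption.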