Knights Landing Can Stand Alone—But Often Won’t

It is a time of interesting architectural shifts in supercomputing, but one would be hard-pressed to prove that using the mid-year list of the top 500 HPC systems in the world. We are still very much in an X86-dominated world, with a relatively stable number of accelerated systems to spice up the numbers, but there are big changes afoot, as we described in depth when the rankings were released this morning.

This year, the list began to lose one of the designees in its roster of “accelerator/offload” architectures as the Xeon Phi moves from offload engine to host processor for at least some of the early users, mainly national labs, with many more expected to follow as the year rounds out and the November rankings emerge. The move from coprocessor to host processor is one thing, but a new sort of heterogeneity might emerge from this: augmenting systems built around powerful HPC-centric host processors with beefier general purpose CPUs. More on that in a moment.

There are 13 systems on the June 2017 Top 500 list released today that use “Knights Landing” Xeon Phi processors as their main engines, and there are another 14 that have earlier “Knights Corner” Xeon Phi coprocessors as accelerators for plain vanilla Xeon CPUs.

At the moment, big HPC shops building clusters with the Knights Landing processors are taking two approaches to networking the chips to each other. Four of the systems use the “Aries” interconnect created by Cray for its XC systems (and acquired by Intel back in 2013), and the remaining nine use the first-generation Omni-Path interconnect from Intel (which is based on the InfiniBand networking Intel acquired from QLogic, with a smattering of Aries technology added in). Only Cray sells Aries, which is one of the conditions of the acquisition, but it can also sell Omni-Path, and in fact the “BeBop” 1.1 petaflops cluster that was installed at Argonne National Laboratory and is making its debut on the June 2017 list is a Cray CS system running Omni-Path. (Cray CTO Steve Scott, who is one of the world’s experts on interconnects, explained the role of various interconnects in HPC last year.) Cray will happily sell either, but for very large scale clusters where latency and adaptive routing are important, Aries is better than Omni-Path or InfiniBand, Scott contends.

The one thing that Knights Landing customers seem to be in universal agreement on so far, at least as gauged by the Top 500 list, is that the top bin Xeon Phi 7290, which has all 72 cores activated and running at 1.5 GHz for a peak performance of 3.46 teraflops at double precision, is not worth the money or the extra heat it generates (burning up even more money).

Seven of the thirteen machines use the Xeon Phi 7250, which has 68 cores running at 1.4 GHz for 3.05 teraflops, and only Intel’s own “Endeavor II” cluster has the integrated Omni-Path fabric on the Xeon Phi package. One machine that we will talk about in more depth momentarily is the forthcoming Stampede2 supercomputer, which is using the 68-core part with some interesting results and observations (other systems use the 64-core KNL).

Three of the systems on the Top 500 are using Xeon Phi 7230 chips, which have 64 cores running at 1.3 GHz for 2.66 teraflops a pop, and another three are using the lowest bin Xeon Phi 7210, which has 64 cores running at 1.3 GHz as well but which has a slower MCDRAM memory speed and therefore lower memory bandwidth. That low bin Xeon Phi part has about twice the bang for the buck based on Intel’s pricing, which we have previously analyzed.
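
Those peak figures fall straight out of the core counts and clocks: each Knights Landing core has two 512-bit vector units, and with fused multiply-add that works out to 32 double precision operations per core per cycle. The short sketch below, which ignores turbo and any AVX frequency offsets, reproduces the numbers quoted above.

```c
#include <stdio.h>

/* Each Knights Landing core has two 512-bit vector units; with fused multiply-add
   that is 8 double precision lanes x 2 FLOPs x 2 units = 32 FLOPs per core per cycle. */
static double peak_tflops(int cores, double ghz, int flops_per_cycle) {
    return cores * ghz * flops_per_cycle / 1000.0;
}

int main(void) {
    const int knl_dp_flops_per_cycle = 32;

    printf("Xeon Phi 7290 (72 cores @ 1.5 GHz):      %.2f teraflops\n",
           peak_tflops(72, 1.5, knl_dp_flops_per_cycle));
    printf("Xeon Phi 7250 (68 cores @ 1.4 GHz):      %.2f teraflops\n",
           peak_tflops(68, 1.4, knl_dp_flops_per_cycle));
    printf("Xeon Phi 7230/7210 (64 cores @ 1.3 GHz): %.2f teraflops\n",
           peak_tflops(64, 1.3, knl_dp_flops_per_cycle));
    return 0;
}
```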

Knights Landing and Companions in Context

In previous years, the headline for the list of the world’s top supercomputers has centered on the latest, greatest accelerator. We do not think that organizations are interested in the Xeon Phi as an offload engine, and Intel doesn’t seem all that interested, either, but if you want to use it that way, Intel does make a PCI-Express card version of the new Xeon Phi chips.

The offload model will be alive and well for the GPU generations to come, but what made the next-generation Xeon Phi, Knights Landing (KNL), compelling from the outset was that it was to be self-hosted. This, matched with the ability to take advantage of Cray’s Aries interconnect or Intel’s Omni-Path, has left the supercomputing world waiting for the first real results of KNL on both real-world applications and microbenchmarks, which we described for one of the first KNL/Aries machines, the Theta supercomputer at Argonne National Lab. Other large systems with this combination have appeared with their initial benchmarks, including the Trinity and Cori machines, both of which are designed to tackle big science problems at national labs.

For a university-centered supercomputing site that handles wide-ranging workloads, like the Texas Advanced Computing Center (TACC), which has almost ten months of KNL data from the first 500 nodes it received at the beginning of its Stampede2 supercomputer project, the application considerations are a bit different. The first phase of Stampede2 is underway, bringing the cluster to 4,200 Knights Landing nodes with Omni-Path. In the second phase later this year, TACC will round out the system with 1,736 Skylake-based nodes, and it will add 3D XPoint DIMMs in 2018 when they are finally available (they were initially expected to ship with the Skylake Xeons). The machine currently sits at #12 on the Top 500 list, but it will add far more floating point capability, quite possibly in time for the next incarnation of the rankings in November. Ultimately, TACC predicts the system will be capable of 18 petaflops of theoretical peak performance.

Comparing the microbenchmark results from those machines and Stampede2 is not apples-to-apples. They use the Cray Aries interconnect while Stampede2 uses Omni-Path. Also, Theta uses the 7210 variant of KNL with 64 cores whereas Stampede2 is using the 7250 with 68. Even so, the results on the microbenchmarks are not that much different, despite what we know is quite a difference in price. “We get slightly higher performance but nothing dramatic in Stream and Linpack,” says Dan Stanzione, TACC Executive Director.
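
For context on what that Stream figure measures, here is a minimal sketch of a STREAM-style Triad kernel. It is not TACC’s benchmark harness: the official STREAM benchmark runs four kernels and reports the best of several trials, and the array size below is just a placeholder chosen to be much larger than cache.

```c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N (1L << 26)   /* 64M doubles (512 MB) per array; placeholder size, well beyond cache */

int main(void) {
    double *a = malloc((size_t)N * sizeof(double));
    double *b = malloc((size_t)N * sizeof(double));
    double *c = malloc((size_t)N * sizeof(double));
    const double scalar = 3.0;

    /* First-touch initialization so pages end up near the threads that use them */
    #pragma omp parallel for
    for (long i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (long i = 0; i < N; i++)
        c[i] = a[i] + scalar * b[i];   /* Triad: two arrays read, one written */
    double t1 = omp_get_wtime();

    double bytes = 3.0 * sizeof(double) * (double)N;  /* a and b read, c written */
    printf("Triad: %.1f GB/s\n", bytes / (t1 - t0) / 1.0e9);

    free(a); free(b); free(c);
    return 0;
}
```

On a flat-mode Knights Landing node, one would typically pin the arrays into the 16 GB of on-package MCDRAM (for example with numactl) so the kernel measures the high-bandwidth memory rather than the DDR4.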

“To me, what is more interesting than any microbenchmarks is the real application performance we can get,” he adds. “There is huge variation among the different applications, but when we had good parallel code to start, something that is MPI nested with OpenMP for good strong and weak scaling and lots of threads, the Knights Landing has been the most cost and power efficient way to bring performance. With these good codes, we are getting 2-3X versus our Stampede 1 Xeons.” Keep in mind, of course, that these are Sandy Bridge processors he is talking about here. Other codes get 5-6X, but there are also others that don’t perform as well. These tend to fall in the camp of codes that don’t scale well at all. Stanzione points to the genomics code BLAST as a particularly good example. And it is here where the plot thickens.
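
For readers who have not written one, the sketch below shows the shape of an “MPI nested with OpenMP” code in its simplest form: MPI ranks each own a chunk of the problem and OpenMP threads fan out across the cores within a rank. It is a generic toy reduction, not one of TACC’s applications, and the problem size and one-chunk-per-rank decomposition are arbitrary assumptions.

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided;
    /* FUNNELED: only the master thread of each rank makes MPI calls */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const long n_global = 1L << 30;      /* total elements across all ranks */
    long n_local = n_global / size;      /* each rank owns a contiguous chunk; remainder ignored */
    long offset = rank * n_local;

    /* Thread-parallel work on the rank-local chunk (here: a simple partial sum) */
    double local = 0.0;
    #pragma omp parallel for reduction(+:local)
    for (long i = 0; i < n_local; i++)
        local += 1.0 / (double)(offset + i + 1);

    /* One collective per rank combines the per-node results */
    double global = 0.0;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("harmonic partial sum = %.6f (ranks=%d, threads per rank=%d)\n",
               global, size, omp_get_max_threads());

    MPI_Finalize();
    return 0;
}
```

Built with something like mpicc -fopenmp, a code in this style would typically run with one or a handful of MPI ranks per Knights Landing node and OMP_NUM_THREADS set to cover the remaining cores, which is roughly the configuration Stanzione is describing.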

TACC does not just handle the big HPC community codes like WRF or NAMD. Its large machines, including the first-generation Stampede system, are NSF workhorses across a large number of domains, some of which do not have established, highly parallelized codes. This means TACC has to go heterogeneous, mixing Xeon and Xeon Phi in its system, to serve this other base of users, for whom performance and efficiency will be much higher on the high-clock Skylake. “If you look at what the DoE talks about most, those codes that don’t scale don’t get mentioned much, but we have a lot of single-node jobs in our workflow as a center. It’s most of the jobs (although maybe 20-30% of the cycles) that can’t use 60 or more cores. There is a clock rate difference between Xeon Phi and the new Xeons, but it is not huge per core if you’re running full-on AVX512 instructions on both chips. But then again, good codes run AVX512. If you have some hacked together Python, for instance, you’re going to love the much higher clock rate on the new Xeons. This drove our design of Stampede2 with a two-thirds Xeon Phi and one-third Xeon mix.”

Herein lies the point. While many of the leadership-class DoE machines that have a solid sense of their workloads (many of which fall into that massively parallel/MPI and OpenMP camp) can get away with having an all Knights Landing-based system, this cannot be the case for everyone else. For a center like TACC, which hosts many NSF and other scientific workloads that run the application and code-base gamut, there is a new sort of heterogeneity born of necessity—being able to serve more general purpose workloads alongside true HPC.

“My theory is that there are two types of nodes on this system and very few, if any, will use a mix of both at once. It will be one or the other. I think the Knights Landing will handle the bigger MPI jobs and the serial codes will run on the future Xeons. Since 60-70% of our cycles go to those large MPI based codes, it is no accident that we have 60-70% of the machine to serve that,” Stanzione tells The Next Platform.

The other point to make is that the national lab-level systems taking the all-KNL approach are not like the world’s largest enterprises. In fact, TACC probably looks more like an enterprise, with a mixed bag of workloads. Even so, for an HPC-heavy mix, Stanzione says it is difficult to beat the Knights Landing in some key ways.

Stanzione cannot talk about per-node pricing, but recall that this is ultimately a roughly 6,000 node system between the Knights Landing and the forthcoming 1,736 Skylake nodes, all for $30 million, a price that includes the network, storage, and other elements. If one just looks at the projected Skylake cost (which we hear could be in the ballpark of $10,000 to $12,000 if rumors of the price bump prove true), those will be about 60 percent more expensive per node since, unlike the KNL nodes, they are dual-socket nodes that require 16 DIMMs (versus six with the single-socket KNL machines). The Skylakes also require a bigger power supply, a more complicated motherboard, and other bells and whistles that really drive a big cost difference. “We obviously pay a small fraction of the list price from Dell, but it is still around a 60-65% difference in node cost. And for me, it is not about absolute speed, but how much performance can I buy across a system for a fixed budget. That is what drives part of our decision-making.”
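
To make that “performance for a fixed budget” framing concrete, here is a toy version of the arithmetic. Every number in it is a placeholder assumption rather than a figure from TACC or Intel, except for the roughly 60 to 65 percent node cost gap and the observation that a two-socket Skylake node and a KNL node land at similar peak performance, both of which come from the discussion above.

```c
#include <stdio.h>

int main(void) {
    const double budget        = 10.0e6;   /* hypothetical slice of a system budget, dollars */
    const double knl_node_cost = 10000.0;  /* hypothetical KNL node price, illustration only */
    const double skl_node_cost = knl_node_cost * 1.65;  /* ~60-65% more per node, per the article */
    const double knl_node_tf   = 3.05;     /* Xeon Phi 7250 peak from the earlier calculation */
    const double skl_node_tf   = 3.0;      /* assume a two-socket Skylake node lands near KNL peak */

    double knl_nodes = budget / knl_node_cost;
    double skl_nodes = budget / skl_node_cost;
    printf("KNL:     %5.0f nodes, %7.0f TF peak for the budget\n", knl_nodes, knl_nodes * knl_node_tf);
    printf("Skylake: %5.0f nodes, %7.0f TF peak for the budget\n", skl_nodes, skl_nodes * skl_node_tf);
    return 0;
}
```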

As a reminder, the two-socket Skylake will have about twice the performance of a two-socket Haswell or Broadwell and about the same as Knights Landing. With AVX512 on the new Skylake, it really levels the playing field to some degree, which is why we have projected that there will be centers that forgo this heterogeneity between Intel parts and go with pure Skylake. For HPC-centric centers, however, the memory bandwidth of Knights Landing, which is the centerpiece for these workloads, will remain higher. And that definitely still matters.
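
A rough sense of why that bandwidth gap persists comes from simple channel arithmetic. The configurations below are illustrative assumptions, not measurements: a four-channel DDR4-2400 Broadwell socket, a six-channel DDR4-2666 Skylake socket, and a commonly cited ballpark for the 16 GB of on-package MCDRAM on Knights Landing.

```c
#include <stdio.h>

/* Peak DDR bandwidth per socket: transfer rate (MT/s) x 8 bytes per transfer x channels */
static double ddr_gbs(double mts, int channels) {
    return mts * 8.0 * channels / 1000.0;
}

int main(void) {
    double broadwell  = ddr_gbs(2400.0, 4);  /* assumed: 4 channels of DDR4-2400 per socket */
    double skylake    = ddr_gbs(2666.0, 6);  /* assumed: 6 channels of DDR4-2666 per socket */
    double knl_mcdram = 450.0;               /* commonly cited ballpark for MCDRAM per node */

    printf("Broadwell socket: %6.1f GB/s (two sockets: %6.1f GB/s)\n", broadwell, 2 * broadwell);
    printf("Skylake socket:   %6.1f GB/s (two sockets: %6.1f GB/s)\n", skylake, 2 * skylake);
    printf("KNL MCDRAM:       %6.1f GB/s per node\n", knl_mcdram);
    return 0;
}
```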

When asked about the rationale for just waiting for Skylake, given these features, most notably AVX512, Stanzione said it was an interesting tradeoff, but of course, Skylake is arriving later than expected. “If I have a lot of good MPI codes, I can get a lot more performance for a lot less money with KNL. But if I have code that doesn’t scale, I have to think about how much it would cost to refactor that code. For a small parallel cluster, the new Xeons can be very attractive if your software represents a big expense. But if you’re spending millions on a machine and it only costs hundreds of thousands to make the code better, that is a relatively smaller expense.” The latter case is where KNL will shine, but the market for that is limited to a relatively small base of customers operating at scale, even if it will drive up Intel’s performance share of the top systems.

“The great thing is there are a lot more architectural choices, but the bad thing is that there are a lot more architectural choices,” Stanzione laughed. “The key is to benchmark very carefully and look at the balance of hardware cost to the people cost to do the code work necessary.”


4 Comments

  1. “As a reminder, the two-socket Skylake will have about twice the performance of a two-socket Haswell or Broadwell and about the same as Knights Landing.”

    Are you sure Skylake is twice as fast?
    AFAIK, AVX512 is just instructions and the execution units are still 2 x 256-bit.
    Can you point to benchmarks for this claim?

    • Not sure if the vector units have been widened or not on Skylake Xeon; there is nothing official saying so, but it seems to miss a few instructions that KNL has. Apparently it is going to lack the reciprocal, exponential, and prefetch instructions.

    • You are confusing the desktop Skylake implementation with the server one, which does have 512-bit vectors.

      • Right, the desktop i7 has one 512-bit unit locked, but it might be available in some i9 parts.

        But the first 512-bit vector unit should be split between the 256-bit units at ports 0 and 1.

        The measured latency of port 5 is different from what is measured at the first port(s)…

        So there are still too many unknowns. I guess we will have to wait for Hot Chips or IDF for more accurate information.
