To a certain extent, the “Knights” family of parallel processors, sold by Intel under the brand name Xeon Phi, was exactly what it was supposed to be: a non-mainstream product that tried out a different architecture from that of the company’s mainstream Xeon family of server processors and that was aimed at the high performance computing jet set that is, by definition, supposed to take risks on new architectures.
That is not the same thing as saying that the Xeon Phi line was the success that Intel or its many Xeon Phi customers had hoped it would be. It wasn’t.
And so, with Intel facing tremendous issues with its 10 nanometer manufacturing processes, the chip maker has quietly sunsetted the last of the Xeon Phi chips, the “Knights Landing” Xeon Phi 7200 series that debuted in June 2016, in the very unceremonious way of issuing a Product Change Notification. This is just a little more obscure a killing than the launch of the “Knights Mill” variant of the Knights Landing chips, which late last year debuted with half-precision math support by having the feeds and speeds and prices of the devices added to Intel’s product catalog with nary a peep out of the company.
That 10 nanometer process was supposed to be used to etch the “Knights Hill” Xeon Phi processors that were originally slated to debut about a month or two ago in the “Aurora” pre-exascale system at Argonne National Laboratory. Knights Hill was killed off late last year in favor of an as-yet-secret processor architecture for a follow-on Aurora A21 system being built by Intel with the help of Cray for delivery in 2021. (We have expressed our thoughts about what this architecture might look like, with a mix of serial and parallel compute units and a mix of stacked and regular DRAM memory, and maybe even Optane 3D XPoint persistent memory DIMMs.)
All of this change leaves HPC customers that had bet on the Knights family in the lurch, but this is something that HPC centers are accustomed to. It is not even a bit ironic that the Knights architecture that was pitched as a replacement to IBM’s massively parallel BlueGene family of machines, which Big Blue spent billions of dollars developing and from which it got back many billions of dollars in sales, was itself spiked on what looked like a whim. IBM killed off BlueGene and its custom, massively parallel processor and funky interconnect because it needed to cut costs and to cope with process manufacturing issues and competitive pressures. Intel is killing off the Xeon Phi line because the delay in 10 nanometer manufacturing, which both GlobalFoundries and Taiwan Semiconductor Manufacturing Corp can beat with their impending 7 nanometer wafer baking, has radically altered all of Intel’s costs, its processor roadmaps, and its competitive position against AMD and Cavium. This is all very familiar. It is just unexpected from Intel, which has been the manufacturing leader for chips for so long that no one can remember who else might have been.
The product change notification comes as no surprise, of course. Once the Knights Hill Xeon Phi processors were mothballed and Avinash Sodani, chief architect of the Knights Landing chip who walked us through the architecture way back in November 2015, took a job at Cavium, the jig was up. And when Raja Koduri, the chief architect of AMD’s GPU processors, joined Intel last November, the plan for accelerated computing seemed to be obvious. Intel wants to make massively parallel GPUs to take on Nvidia and AMD directly, not with an oblique shot based on the X86-centric Xeon Phi line, which ironically has its roots in the failed “Larrabee” X86-based GPU project that Intel undertook in the 2000s as Nvidia was breaking into the datacenter with its Tesla coprocessors. Intel has said nothing about this, but is having a datacenter compute event on August 8 and may divulge more. It certainly needs to clear up its roadmaps.
In any event, customers can still order Knights Landing chips, but after August 31, the orders are not cancelable or returnable; Intel will ship Xeon Phi 7200s until July 19, 2019. After that, kaput. End of life. There is one last change that was expected to be coming, based on a roadmap for HPC customers that our peers at Anandtech got their hands on. According to Anandtech, this roadmap was leaked by accident by Intel presenters at the Central South University in Changsha City, Hunan Province in China. (We were lucky to score a similar roadmap that was accidentally published by Intel back in 2015 for the “Skylake” Xeon SPs and their “Purley” server platforms. You can’t win them all.) Here is that purported Intel HPC server and interconnect roadmap that Anandtech scored:
There has been some talk about a special variant of the future 10 nanometer “Ice Lake” processor, called the Ice Lake-H, that was designed to replace the Xeon Phi chip. This was rumored to be a dual-chip module with a total of 44 cores and eight memory controllers, possibly with HBM2 stacked memory. This more recent roadmap above does not bear this out and we think that is because Intel has to make do with 14 nanometer processes until 10 nanometer Xeon chips start ramping maybe a year from now. Maybe later.
This roadmap above is interesting on a lot of fronts. First, the Xeon Phi Knights Landing chips will be equipped with “Apache Pass” Optane 3D XPoint memory DIMMs, which as you know were supposed to be part of the Skylake/Purley launch and which are still not shipping. There is a refresh of the Knights Landing and Knights Mill processors that supposedly happened at the beginning of this year, but heaven only knows what that was. After that, in the HPC part of the business, Intel is focusing on the 14 nanometer “Cascade Lake-AP” processor that is being coupled with the “Walker Pass” second generation Optane DIMMs, and that looks like it is slated for around May 2019. Beyond that, about a year later, there is another generation of AP processor (it should be “Ice Lake-AP,” but that name is not on the roadmap) coupled with a third generation of Optane DIMMs; it will carry the “Pass” name, but we don’t know which one. (Hopefully not the one that the Donner party used.) The AP is short for “Advanced Processor,” and everyone is speculating that Intel will be taking a page out of the AMD Epyc playbook and putting multiple Xeon chips into a single package and interconnecting them.
Let’s say that Intel could do this with Skylake-SP processors, just to show how it might work. As you know from our coverage on the Skylake Xeons back in July 2017, there are actually three different Skylake Xeon chips. The Low Core Count (LCC) version has up to ten cores in a mesh interconnect grid that is inspired by the grid used on the Knights Landing Xeon Phis, replacing the ring interconnect that the Xeons used for many years. The High Core Count (HCC) version tacks another eight cores on the bottom of the grid, boosting it to 18 cores. Like this:
The Extreme Core Count (XCC) variant of the Skylake-SP processor adds two columns of five more cores each, boosting the core count to 28. Like this:
If we have learned anything from AMD’s Epyc line, it is that it doesn’t make sense to take two huge monolithic chips and put them in the same socket. AMD has four eight-core chips in the Epyc package, making a 32 core baby NUMA system that has eight memory controllers and 128 lanes of PCI-Express 3.0 I/O bandwidth. If Intel did make a two-way XCC package using Skylake chips, it would have 56 cores, a dozen memory controllers, 3 TB of memory, and 96 lanes of PCI-Express traffic coming out of one socket. The Purley platform might not be able to handle that; we don’t know. Hence the need for an Advanced Platform that might be different. If Intel took four of the LCC modules and lashed them together, it could do 40 cores, 24 memory controllers, 6 TB of memory, and 192 lanes of I/O, which does not seem balanced. A pair of HCC modules is interesting, in that it would provide 36 cores, 12 memory controllers, 3 TB of memory, and 96 lanes of I/O. A process shrink might have enabled Intel to put four of these into a package, but again, the socket would need a huge number of pins to get all of the I/O. The answer might be to crimp the I/O, or to use a lot of it to interconnect the chips as AMD does.
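The package arithmetic above is easy to check with a few lines of Python. This is only a sketch: the per-die figures (six memory controllers, 1.5 TB of memory, 48 PCI-Express lanes) are inferred by halving the two-way XCC totals quoted above, and the `Die` and `package` names are made up for illustration. It also assumes that every lane and memory channel on every die is brought out of the socket, which a real multichip module probably would not do.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Die:
    """One Skylake-SP die; per-die defaults are inferred, not Intel specs."""
    name: str
    cores: int
    mem_controllers: int = 6    # assumption: half of the two-way XCC total of 12
    max_mem_tb: float = 1.5     # assumption: half of the two-way XCC total of 3 TB
    pcie_lanes: int = 48        # assumption: half of the two-way XCC total of 96


def package(die: Die, count: int) -> dict:
    """Naively sum the resources of `count` identical dies in one socket."""
    return {
        "cores": die.cores * count,
        "mem_controllers": die.mem_controllers * count,
        "max_mem_tb": die.max_mem_tb * count,
        "pcie_lanes": die.pcie_lanes * count,
    }


LCC = Die("LCC", cores=10)
HCC = Die("HCC", cores=18)
XCC = Die("XCC", cores=28)

if __name__ == "__main__":
    for die, count in [(XCC, 2), (LCC, 4), (HCC, 2)]:
        print(f"{count} x {die.name}: {package(die, count)}")
```

Under these assumptions, four LCC dies carry four times the I/O and memory controllers of a single die but only 40 cores, which is the imbalance the paragraph describes.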
It would be funny as all get-out if the Cascade Lake-AP multichip modules fit into a Knights Landing socket. That would solve a whole lot of problems. Hmmmm.
On the Xeon front, the “Cascade Lake” kicker to the Skylake-SP processors for the Purley server platform, which is implemented in a more refined 14 nanometer process, comes at the very end of this year (that sure looks like mid-December to us, just barely qualifying for shipping this year), and the basic feeds and speeds of the chip stay the same at 28 cores maximum and six DDR4 memory channels. At the very end of 2019 comes another Xeon kicker, also presumably etched in a 14 nanometer process, called “Cooper Lake-SP.” We don’t know anything about this chip yet, but we presume that it shares the same socket and server platform as the 10 nanometer Ice Lake-SP chips; this new platform is purportedly codenamed “Whitley.” We do know from this roadmap that Cooper Lake-SP will use an Optane DIMM codenamed “Barlow Pass,” which will also be available on the Ice Lake-SP processors. The tail end of 2019 will see another future Xeon chip, and a different Optane DIMM.
What we don’t see on this roadmap is any mention of HBM stacked memory. The expectation is that for HPC workloads in particular, Intel would shift to a hybrid memory architecture that had low capacity, high bandwidth stacked memory close to the processor, and high capacity, low bandwidth DDR4 or DDR5 main memory as well as Optane DIMMs sitting further out on a different set of memory controllers. Like this idea that was espoused by the National Energy Research Scientific Computing Center at Lawrence Berkeley National Laboratory for a potential architecture for its future “NERSC-9” supercomputer:
We don’t see anything like this on the Intel roadmap further above. It was a nice dream and, we think, the right idea. But not if Intel has completely shifted to a hybrid CPU-GPU architecture, as we think that it will.
The Ice Lake-SP chip is the interesting one. All of the rumors had it that Ice Lake-SP would have eight memory controllers and a larger socket to give more lines out to memory and I/O, and importantly, all the talk was about how Ice Lake-SP would support up to 32 GB of HBM2 memory with up to 650 GB/sec of memory bandwidth. A variant of Ice Lake, again called Ice Lake-H, was supposed to take two 22 core modules and put them into one package, presumably with double the memory bandwidth going out to HBM2 memory. We suspect that these chips supported Optane DIMMs off regular memory controllers and in regular memory slots. We also see that Intel had to put in a Cascade Lake-AP as a bridge to the Ice Lake-AP because of the delays in the 10 nanometer rollout.
Some of the mystery around the future Omni-Path 200 networking from Intel, which we talked about recently, is also cleared up. Intel told us that the switches would have more ports than the current 48 ports in the Omni-Path 100 series switches, and now we see that the Omni-Path 200 is going to have 64 ports, a 33 percent increase and significantly higher radix than the 40 ports on the Quantum InfiniBand switches from Mellanox Technologies. Intel has a mammoth 2,048 port Omni-Path director switch coming, called “Santiam Forrest” and an adapter card with a single port running at 200 Gb/sec that eats a PCI-Express 4.0 x16 slot. This all implies that the Whitley platform in late 2019 will finally give Intel faster peripheral links.
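To put the radix numbers in context, here is a small Python sketch of the arithmetic. The percentage comes straight from the port counts above. The endpoint count uses the standard formula for a non-blocking two-tier fat tree, in which each leaf switch splits its ports evenly between hosts and uplinks; that topology is our assumption, not something Intel has stated, and the function names are made up for illustration.

```python
def port_increase_pct(old_ports: int, new_ports: int) -> float:
    """Percentage growth in switch port count."""
    return (new_ports - old_ports) / old_ports * 100.0


def two_tier_fat_tree_endpoints(radix: int) -> int:
    """Maximum hosts in a non-blocking two-tier fat tree of radix-`radix` switches.

    Each leaf uses radix/2 ports for hosts and radix/2 for uplinks, and each
    spine switch can reach at most `radix` leaves, so hosts = radix * radix / 2.
    """
    return radix * radix // 2


if __name__ == "__main__":
    print(f"48 -> 64 ports: {port_increase_pct(48, 64):.0f} percent more ports")
    for name, radix in [("Omni-Path 100", 48),
                        ("Omni-Path 200", 64),
                        ("Quantum InfiniBand", 40)]:
        print(f"{name}: radix {radix}, "
              f"up to {two_tier_fat_tree_endpoints(radix)} endpoints in two tiers")
```

Under that assumption, a 64 port switch tops out at 2,048 endpoints in two tiers, the same as the port count of the director switch, although nothing on the roadmap confirms how that director is built internally.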
Here is the thing to consider as Knights sunsets. The AVX-512 floating point unit, the mesh interconnect on the die, the integration of Omni-Path controllers, and the integration of high bandwidth stacked memory into the processor were all prototyped on Knights, and three of these features have gone mainstream in the Xeons and the fourth might in the Ice Lake generation of Xeons. And equally importantly, all of the work with the Knights coprocessors and processors does, in fact, give Intel an intellectual property foundation on which to make its own discrete GPUs. They just might not have so many X86 cores on them this time. Perhaps none at all.
And perf/$ is?
It’s all very well, but AMD’s Zen fabric architecture now has a product spread from a single 4 core CCX on an APU, to a pair of CCXs on desktop AM4 CPUs, 4 in a workstation, 8 in a server and 16 in a dual socket, and inevitably, 32 in a quad socket server in time.
That’s currently up to 64 cores, yet all based on the same cheap, efficient, mass produced Lego block chiplet: the 4 core CCX core complex.
Exciting stuff that has the world agog.
But wait, out of the blue at Computex, these numbers of CCXs have now doubled, as we see initially in TR, and these new CPUs run fine on existing platforms.
The eye glazing Intel products discussed seem like customised, expensive ~monolithics for low volume and high price.
Having the best unit becomes irrelevant once you can team units. The issue becomes, which team does the job cheapest?
Quantity has a quality all of its own, as they say.
You didn’t mention FPGAs. They are so much more efficient than CPU cores. They are in the HPC mix for sure. There are many algorithms where a CPU core is a big waste of power and performance. As soon as an FPGA manufacturer lets the world understand the bitstream instead of hiding it, people will realize the true power of FPGA based computing. Imagine if Intel would not tell anyone what the instruction set of a device was. There would be no GCC compilers, no MSVC, no Portland Group compilers. There would be only Intel compilers with no pressure to improve performance and no real innovation. That’s where FPGAs are at. When that mold gets broken, FPGAs will dominate the data center.
Fully disagree, we are deep in research for crypto mining manipulations.
Three weeks ago got a designed Xeon Phi 7210 racks. August 7th 2018, $20 profit minus $2.7 consumption on $0.10 per kWh 1170w, $500 per month with a cost of $3650.
All due we have made ASIC algo miners for argon2d and yesscrypt.
Was hard to get, due hand made only, minimum order $100k