As readers of The Next Platform already know, the future “Knights Landing” Xeon Phi massively parallel processor was sighted at the Open Compute Summit a few weeks back, with an Intel motherboard called “Adams Pass” being dropped into an Open Compute sled and shown off by partner Penguin Computing as well as by Intel itself. The actual Knights Landing processor was not shown, and the memory and peripheral slots were covered up on the demo boards and sleds, so that left a lot of people guessing.
Some of the guessing is now over, since Intel has decided to reveal some of the feeds and speeds of the Knights Landing chip ahead of its launch sometime in the second half of this year.
It is no coincidence that the Knights Landing sled appeared ahead of the GPU Technology Conference hosted by Nvidia and the OpenPower Summit that occurred at the same venue in San Jose. Both Nvidia and IBM are pushing their respective Tesla GPU coprocessors and Power8 processors hard as the key components in hybrid systems for accelerating modeling, simulation, analytics, machine learning, and other workloads. It will be a few months before Intel launches the Knights Landing chips, which will be available as processors in their own right and as coprocessors that hook to CPUs like Intel’s Xeons through PCI-Express 3.0 links.
Intel hosted an event for techie journalists and analysts in its Hillsboro, Oregon chip plant to talk a bit about the Xeon Phi and the HPC and commercial uses of the Knights Landing variant, and The Next Platform was on hand to get some insight as well as some new feeds and speeds on the chip.
The first thing you will notice is that Knights Landing is a big chip, and in fact it is one of the biggest packages that Intel has ever manufactured.
“At over 8 billion transistors, it is a big honking die,” says Hugo Saleh, director of marketing and industry development for the Technical Computing Group at the chip maker. And despite the fact that we have been anticipating this for many years and that Intel has been clear for more than a year that Knights Landing is a processor in its own right, Saleh says some people don’t quite get that yet and there are rumors going around that Knights Landing is just a coprocessor. “This is a full server processor. It is an enterprise-class, performant, reliable processor. Anything that a Xeon can do, a Knights Landing can do.”
That includes not only running various Linux distributions, but also running Microsoft’s Windows Server operating system. That’s because the heavily modified “Silvermont” Atom cores at the heart of the Knights Landing chip support all of the instructions that a regular Xeon does, with the exception of the TSX transactional memory feature that is just now coming to market in the Xeon D chip for microservers that Intel just announced. TSX support was embedded in the “Haswell” Xeon E5-2600 processors that Intel launched last September, but the company found a bug and disabled it. This bug has been fixed in the “Broadwell” cores that are in the Xeon D chips, and presumably will be fixed in upcoming Xeon E5 and E7 processor launches that are expected later this year.
Big Chip, Big Bandwidth
We did not have a ruler on hand to measure the size of the Knights Landing package, but it is about as wide as a credit card and a little bit taller. This package includes the Knights Landing processor in the center and has eight banks of what Intel calls “near memory” that is right next to the Knights Landing and that provides high bandwidth, local capacity to the cores on the Knights Landing die. (That transistor count above is just for the Knights Landing chip and obviously does not include the 16 GB of near memory on the package.)
When we saw the Adams Pass board, minus the Xeon Phi chip, at the Open Compute Summit, Intel was hiding the memory configuration, but no longer. In fact, Saleh gave us a tour of the Intel lab and showed us a Knights Landing server node all snuggled into a Supermicro four-node chassis sled, and it booted up a standard Linux operating system that showed 60 cores and 240 threads fired up and ready to do work. Here’s what Knights Landing looks like on the Adams Pass board:
Intel has not said how many watts Knights Landing will burn, but the chip is, understandably, hot enough that Intel is not trying to put two Knights Landing packages on one of these server sleds. It probably can be done with different cooling, and that means it probably will be done for HPC shops that want to get the most flops in a rack.
As you can see, the Knights Landing processor has six DDR4 memory slots, which was a bit of a mystery until now, and also has two PCI-Express peripheral slots in the front. Intel has been saying that the Knights Landing chip will have main memory of its own (what it is calling far memory) that has capacity similar to that of a regular Xeon chip, and now we know precisely how much memory that is:
The Knights Landing chip has six memory channels, which are spread out three each in two memory controllers on the die. Those memory controllers support DDR4 memory with capacities up to 64 GB per stick, which yields up to 384 GB of far memory for the processor. The package has up to 16 GB of that high-bandwidth memory that Intel is creating in conjunction with memory partner Micron Technology. This is known as MCDRAM and is often abbreviated HBM, but this is not to be confused with the High Bandwidth Memory that Nvidia, AMD, and Hynix are creating. (As far as we know, Intel’s MCDRAM memory is a variant of Hybrid Memory Cube memory that has a proprietary interconnect between the processor interconnect and that memory, so it is neither HMC nor HBM memory, properly speaking, but its own thing.) It is our guess that the Knights Landing SKUs will include variants with 8 GB and 16 GB of this memory. We will get into the nature of this memory and its modes in a minute. The important thing is that this near memory has very high bandwidth and will allow for applications running across the cores to take full advantage of the threads on the die.
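The capacity figures follow from simple multiplication. Here is a minimal sketch in Python that uses only numbers quoted in this article; the variable names are ours, and the one-DIMM-per-channel layout is inferred from the six slots on the Adams Pass board.

```python
# Far memory: six DDR4 channels, one slot per channel, up to 64 GB per stick.
DDR4_CHANNELS = 6
MAX_DIMM_GB = 64

# Near memory: eight on-package MCDRAM units of 2 GB each.
MCDRAM_UNITS = 8
MCDRAM_UNIT_GB = 2

far_gb = DDR4_CHANNELS * MAX_DIMM_GB
near_gb = MCDRAM_UNITS * MCDRAM_UNIT_GB

print(f"far memory:  {far_gb} GB")
print(f"near memory: {near_gb} GB")
print(f"far memory is {far_gb // near_gb}x the capacity of near memory")
```

The near memory is small relative to the far memory, which is why the way it is addressed matters so much.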
Avinash Sodani, chief architect of the Knights Landing chip at Intel, tells The Next Platform that the DDR4 far memory has about 90 GB/sec of bandwidth, which is on par with a Xeon server chip. That is not enough to keep the 60-plus hungry cores on the Knights Landing die well fed. The eight memory chunks that make up the near memory, what Intel is calling high bandwidth memory or HBM, deliver more than 400 GB/sec of aggregate memory bandwidth, a factor of 4.4 more bandwidth than is coming out of the DDR4 channels. These are approximate memory bandwidth numbers because Intel does not want to reveal the precise numbers and thus help people figure out the clock speeds ahead of the Knights Landing launch. That near memory on the Knights Landing chip delivers about five times the STREAM memory bandwidth benchmark performance that the DDR4 far memory delivers when the near memory is turned off.
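To put those bandwidth numbers in per-core terms, here is a rough sketch in Python. It assumes the 60-core configuration we saw booted in the lab and treats Intel's approximate figures as exact, so the results are ballpark values only.

```python
CORES = 60                  # cores visible on the lab system; shipping parts may differ
DDR4_GB_PER_SEC = 90.0      # approximate far memory bandwidth, per Sodani
MCDRAM_GB_PER_SEC = 400.0   # "more than 400 GB/sec", so treat this as a floor

ratio = MCDRAM_GB_PER_SEC / DDR4_GB_PER_SEC
ddr4_per_core = DDR4_GB_PER_SEC / CORES
mcdram_per_core = MCDRAM_GB_PER_SEC / CORES

print(f"near/far bandwidth ratio: {ratio:.1f}x")
print(f"far memory per core:  {ddr4_per_core:.2f} GB/sec")
print(f"near memory per core: {mcdram_per_core:.2f} GB/sec")
```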
That near memory’s bandwidth is also considerably larger than the 288 GB/sec aggregate main memory bandwidth that IBM has for its twelve-core Power8 processor. As we reported last week when talking to IBM about the HPC space, a lot of supercomputing workloads scale more in line with memory bandwidth than they do with raw floating point performance. So this high memory bandwidth provided by the near memory in the Knights Landing processor is a big deal, competitively speaking.
As for performance, Intel has talked a bit about this before and has been promising that the chip would deliver more than 3 teraflops at double precision for floating point operations, or about three times the performance of the current “Knights Corner” chip, which has 61 cores and delivers a little more than 1 teraflops. Intel is not divulging the clock speeds or core counts for the Knights Landing chips, probably because it will not get perfect yield on the chips using its 14 nanometer process and some chips may not have the full complement of cores available. The rumor for the past year and a half is that the Knights Landing chip has as many as 72 cores, but Intel has not confirmed this. Suffice it to say that Intel will be dialing up the cores and clocks as necessary to hit its 3 teraflops-plus performance goal with the initial Knights Landing chips, and if there are latent cores that it can turn on later for a performance boost, it will do that.
The interesting new bit of data is that the Knights Landing chip will deliver around 6 teraflops of performance with single precision floating point math, which puts it in the ballpark with Nvidia’s new “Maxwell” Titan X GPU card, which scales up to 7.1 teraflops single precision. The difference is that Nvidia does not have very much double precision on the Titan X and it does not yet have a Tesla variant of its Maxwell chip that offers double precision performance. Instead, Nvidia put two “Kepler” GPUs on a card last fall to create the Tesla K80 coprocessor, which has 1.87 teraflops double precision and 5.6 teraflops single precision. The K80 can turbo boost up to 2.91 teraflops double precision and 8.73 teraflops single precision if the thermal envelope in the server allows for it. The K80 has 24 GB of GDDR5 memory for the two GPUs to share and 480 GB/sec of aggregate memory bandwidth from that memory block.
See our related story on where Intel expects Knights Landing chips to be initially adopted and how the market will expand from there.
Saleh also disclosed that the Knights Landing processor will have around three times the single-threaded performance (meaning the X86 work, not the floating point work) of the custom Pentium P54C cores used in the Knights Corner variant of the Xeon Phi chips. Without getting too specific, Saleh said that this very big increase in performance was due to a combination of clock speed increases enabled by the process shrink, a radically improved core, and bigger and faster cache and main memories. Our guess is that the clock speeds of the Knights Landing chips will be in the range of 1.2 GHz to 1.3 GHz to keep the heat dissipation down and that the heavily modified compute unit inside of the future Xeon Phi processor coupled to that big jump in near memory bandwidth is what is making all the difference when it comes to performance.
Intel revealed a few more things about the future Xeon Phi chip at the briefing. First, the chip will have on-die PCI-Express 3.0 controllers and specifically will have up to 36 lanes of I/O. (It would be funny to hook Xeon Phi coprocessor cards onto Xeon Phi processors, wouldn’t it?) This peripheral capacity will no doubt be used to attach local storage and possibly networking to the chip. Intel will have three variants of the Knights Landing chip, as shown below:
Because some customers still want to deploy Xeon Phi coprocessor cards instead of running their applications solely on Xeon Phi processors, Intel will be creating a coprocessor variant of the device. There will also be a variant that comes in a server package that fits into the Adams Pass motherboard and perhaps others that want to support the Xeon Phi socket, and there is yet another version that will have ports for Intel’s 100 Gb/sec Omni-Path interconnect on the package. (The picture of the Xeon Phi package above is the one without Omni-Path links.) Intel has not divulged how many links will come off the Knights Landing processor, but considering that many supercomputer shops have dual-rail networks, it stands to reason that the networked version will have two ports coming off it.
Taking A Look Under The Knights Landing Hood
The starting point for the Knights Landing processor is the heavily modified Silvermont Atom core, which has been so changed that Sodani says that it would probably be more accurate to call it a Knights core. Because Intel has streamlined the Silvermont core radically while maintaining full compatibility with the Xeon processors as far as Linux and Windows applications are concerned, there is room to put lots of AVX floating point processing oomph into the chip.
The basic Knights Landing component is called a tile, and each tile has two of those modified Silvermont cores, each with 32 KB of L1 instruction cache and 32 KB of L1 data cache. Each core is topped with a pair of custom 512-bit AVX vector units, which support all of the same floating point math instructions as the ones used in the Xeon chips even though they are not literally lifted out of the most current Xeons. With more than 60 cores, that means Intel is putting more than 120 of these AVX units on a die. “That kind of power density is hard to attain, even with a Broadwell,” says Sodani. “This is more efficient.”
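A back-of-the-envelope estimate shows how core counts and clocks map to peak flops. This is a sketch under stated assumptions, not Intel's math: we assume each 512-bit unit retires one fused multiply-add per lane per cycle (counted as two flops), and we plug in the rumored core counts and our own clock speed guesses. The function name and parameters are ours.

```python
def peak_gflops(cores, ghz, elem_bits=64, vector_bits=512,
                units_per_core=2, flops_per_lane=2):
    """Estimate peak gigaflops for a chip built from wide vector units.

    flops_per_lane=2 assumes one fused multiply-add per lane per cycle,
    which Intel has not confirmed for Knights Landing.
    """
    lanes = vector_bits // elem_bits   # 8 lanes for doubles, 16 for floats
    return cores * units_per_core * lanes * flops_per_lane * ghz

# Core counts and clocks here are guesses and rumors, not disclosed specs.
for cores, ghz in [(60, 1.2), (72, 1.3)]:
    dp = peak_gflops(cores, ghz)
    sp = peak_gflops(cores, ghz, elem_bits=32)
    print(f"{cores} cores @ {ghz} GHz: {dp:.0f} GF double, {sp:.0f} GF single")
```

Under these assumptions, 72 cores at 1.3 GHz lands just shy of 3 teraflops double precision and 6 teraflops single precision, so the real parts would need slightly higher clocks, or different per-cycle throughput than assumed here, to clear Intel's stated targets.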
The Knights core is an out-of-order processor, and both the integer and floating point units use this technique, which has been common on server-class processors for decades. The out of order depth of the Knights core is more than twice that of the Silvermont core, and those L1 caches are also bigger than that on the Silvermont chip. Those cores, by the way, are real cores and they can do real X86 work. They are not just setting up work for the AVX units, and this is one thing that makes Knights Landing a real server processor. “Our single-thread performance is actually pretty respectable,” says Sodani.
Two of the cores with their dual AVX units each are linked to each other on the tile by a shared L2 cache that weighs in at 1 MB of capacity, and a hub chip links the tile to the other tiles on the die. The tiles are linked to each other over a 2D mesh, and it is this mesh that provides the cache coherency between the L2 caches on the die. With 60 cores, that is at least 30 MB of L2 cache, and with 72 cores max, if that is indeed the number, then Knights Landing would peak at 36 MB of L2 cache. The L2 caches are separate and private, but coherent across the mesh. In plain English, what that means is that an operating system will see this as one processor with one cache and one memory space. (Well, if you want to.)
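The L2 totals fall out of the tile layout. A minimal sketch in Python, with a helper name of our own choosing and the 72-core figure still only a rumor:

```python
def l2_total_mb(cores, cores_per_tile=2, l2_per_tile_mb=1):
    """Aggregate L2 capacity: one shared 1 MB L2 per two-core tile."""
    tiles = cores // cores_per_tile
    return tiles * l2_per_tile_mb

for cores in (60, 72):
    print(f"{cores} cores -> {cores // 2} tiles -> {l2_total_mb(cores)} MB of L2")
```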
The 2D mesh on the Knights Landing chip has two DDR4 memory controllers, and each one of the 2 GB near memory MCDRAM units has its own controller, too, which hangs off the end of the mesh. The way the routing works, all hub routers on the mesh can move along the Y axis and then the X axis on the grid, and always in that fashion, which helps Intel keep the contention on the mesh down to a minimum.
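The Y-then-X rule is a form of what is generally called dimension-ordered routing. The sketch below is an illustration only: the coordinate scheme and the function name are hypothetical, since Intel has not disclosed the mesh dimensions or how tiles are numbered.

```python
def yx_route(src, dst):
    """Return the hops from src to dst, moving along Y first, then X.

    Tiles are addressed as (x, y) grid coordinates, which is an assumption
    made for illustration and not Intel's documented layout.
    """
    x, y = src
    target_x, target_y = dst
    path = [(x, y)]

    # Phase 1: travel along the Y axis until the target row is reached.
    step = 1 if target_y > y else -1
    while y != target_y:
        y += step
        path.append((x, y))

    # Phase 2: travel along the X axis to the target column.
    step = 1 if target_x > x else -1
    while x != target_x:
        x += step
        path.append((x, y))

    return path

print(yx_route((0, 0), (2, 1)))
```

Because every message between a given pair of tiles takes the same path, the route is fully deterministic, which makes traffic on the mesh easier to reason about.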
The interesting thing about the Knights Landing processor is that it will have three memory modes. The first mode is the 46-bit physical addressing and 48-bit virtual addressing used with the current Xeon processors, only addressing that DDR4 main memory. In the second mode, which is called cache mode, that 16 GB of near memory is used as a fast cache for the DDR4 far memory on the Knights Landing package. The third mode is called flat mode, and in this mode the 384 GB of DDR4 memory and 16 GB of MCDRAM memory are turned into a single address space, and programmers have to allocate specifically into the near memory. Intel is tweaking its compilers so Fortran can allocate into the near memory using this flat addressing mode.
You might be thinking, as we were, why Intel doesn’t create a two-socket version of Knights Landing. If one socket is good, why not two? Intel could, in theory, put QuickPath Interconnect ports on the Knights Landing chip and make a two-socket or even four-socket variant with shared memory across multiple sockets.
“Actually, we did debate that quite a bit, making a two-socket,” says Sodani. “One of the big reasons for not doing it was that given the amount of memory bandwidth we support on the die – we have 400 GB/sec plus of bandwidth – even if you make the thing extremely NUMA aware and even if only five percent of the time you have to snoop on the other side, that itself would be 25 GB/sec worth of snoop and that would swamp any QPI channel.”
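Sodani's arithmetic is easy to reproduce. The sketch below is our own illustration of the reasoning, with a hypothetical helper name; it simply scales on-package memory bandwidth by the fraction of accesses that would have to cross to the other socket.

```python
def snoop_gb_per_sec(memory_gb_per_sec, remote_fraction):
    """Cross-socket traffic if a given fraction of accesses must snoop remotely."""
    return memory_gb_per_sec * remote_fraction

REMOTE_FRACTION = 0.05  # the "five percent of the time" in Sodani's example

for bandwidth in (400.0, 500.0):
    traffic = snoop_gb_per_sec(bandwidth, REMOTE_FRACTION)
    print(f"{bandwidth:.0f} GB/sec of memory bandwidth -> {traffic:.0f} GB/sec of snoops")
```

At exactly 400 GB/sec the snoop traffic works out to 20 GB/sec; the 25 GB/sec he cites corresponds to about 500 GB/sec, which fits with the "400 GB/sec plus" wording.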
As is ever the way with systems, the bottleneck has shifted, and in this case from inside the chip to the point-to-point interconnect. For now, Intel is not supporting cache coherency across the Omni-Path fabric, either. Again, there is just so much memory bandwidth that such coherency would swamp the interconnect.
Now that the feeds and speeds are out, The Next Platform has done a separate analysis concerning what kinds of workloads the Knights Landing processors and coprocessors might be suitable for. So check that out, too.
All the cache in the world won’t help for big data problems. 75% of this chip will not be useful for applications that have to go through a petabyte database and find some information. AVX and cache are useless for the text-based unstructured data analytics that is overwhelming the data centers.
It is clear that Intel only focuses on floating point with AVX/AVX2/AVX512, since AVX brings a performance boost only when there is a large number of computations to be made. For integers and text, AVX brings only a very small benefit. To boost text-based algorithms in the same way as FP-based algorithms, some instructions need lower latencies, and some new instructions are required to get rid of the large and slow sequence of instructions needed to get something done.
For example, there is a fast and potentially powerful instruction VPCMPEQB which compares 32 1-byte integers which takes only 1 clock cycle to do 32 parallel compares. One could call this a great feature but sadly it is not because it takes 4 more slow instructions to do something useful with the result. So instead of 1 clock cycle a program needs at least 12 clock cycles and a potentially slow conditional branch.
I can appreciate Intel not yet wanting to share absolute performance numbers, but have they shared, even roughly, how much faster a basic access might be – say as a cache fill to first access – for data residing in “near memory” relative to “far memory”? I am assuming the “flat addressing” mode when I ask this question.
Not that I know of. We just know the relative bandwidths and that each 2 GB near memory chunk has its own controller hanging off the mesh.
Intel Xeon PHI is but a stepping stone to Intel’s next gen chip which will incorporate TBytes of R-RAM memory ON DIE !
At the same time it will shrink the server package down to a credit card sized package. Xeon PHI is but an architectural stepping stone to Exabyte computing 1×10^18.
Isn’t living in the future wonderful? I will take two of these puppies, then. OK, maybe 100,000 and do something really useful….
Nvidia can’t release anything with double precision until GP100 comes out on the 16nm node. GM200 doesn’t have the capability physically on the chip itself since they got stuck with 28nm.
It will be interesting to see how GP100 and GV100 compare to Knights Landing and Knights Hill.
nVidia has nothing to do with the creation of HBM. HBM is AMD and Hynix’s baby. nVidia is going to be just a user of the chips, that’s all.