The natural place for Intel to launch the next iteration of its “Knights” family of parallel X86 processors is at one of the two major supercomputer conferences held each year: the ISC conference in Germany and the SC conference in the United States. Many people had hoped to see the “Knights Landing” Xeon Phi processors, which have been anticipated for a few years now, launched at this week’s SC15 conference in Austin, Texas. But Intel is not yet ready, so we have to wait.
Well, not everybody has to wait. As Charles Wuischpard, general manager of the HPC Platform Group within Intel’s Data Center Group, explained in a prebriefing ahead of the SC15 conference, some early adopter customers are getting their hands on early versions of the Knights Landing chips, even though they have not been formally announced. This has been Intel’s practice for a number of years for both its Xeon and Xeon Phi compute engines, with hyperscalers or HPC shops (or sometimes both) getting to play with early silicon.
“We are making great progress and meeting some of the commitments that we made,” explained Wuischpard. “Part of this is just getting silicon out and into the hands of our partners and our early users.”
Three sites have pre-production, Top500-class supercomputer systems installed. Cray needs a large system as a precursor to delivering the “Cori” system at the National Energy Research Scientific Computing Center (NERSC) and the “Trinity” system being shared by Los Alamos National Laboratory and Sandia National Laboratories. The Cori and Trinity machines will be installed in the first quarter of 2016, and according to Wuischpard, a large Xeon Phi machine is up and running in Cray’s lab, testing a suite of applications “at full functionality.”
Cray was awarded a $174 million contract back in July 2014 to build Trinity, which is a joint effort between the Alliance for Computing at Extreme Scale (ACES) at Los Alamos National Laboratory and Sandia National Laboratories as part of the National Nuclear Security Administration’s Advanced Simulation and Computing Program (ASC). Trinity is using a mix of “Haswell” Xeon E5 v3 and Knights Landing Xeon Phi processors for its compute elements and will be installed at Los Alamos. The Trinity system is based on Cray’s next-generation “Shasta” XC40 system, and is expected to have 9,346 dual-socket Xeon E5 v3 nodes and over 9,000 Knights Landing nodes, with more than 2 PB of DDR4 main memory and 42.2 petaflops of aggregate peak performance across its compute elements. About 30.7 petaflops of that compute should be coming from Knights Landing, and the remaining 11.5 petaflops come from the Haswell Xeons.
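The Trinity figures above can be cross-checked with a little arithmetic. The sketch below uses only the numbers quoted in this article, plus one assumption: because the Knights Landing node count is given only as “over 9,000,” it uses 9,000 as a lower bound, which makes the per-node Xeon Phi figure an upper bound rather than a spec.

```python
# Peak figures quoted in the article for Trinity, in petaflops.
knl_pf = 30.7
haswell_pf = 11.5
total_pf = knl_pf + haswell_pf

# 9,346 Haswell nodes is quoted. The article only says "over 9,000"
# Knights Landing nodes, so 9,000 is an ASSUMED lower bound.
haswell_nodes = 9346
assumed_knl_nodes = 9000

# Share of aggregate peak that comes from the Xeon Phi side.
knl_share = knl_pf / total_pf

# Implied peak per node, in teraflops (1 PF = 1,000 TF).
haswell_tf_per_node = haswell_pf * 1000 / haswell_nodes
knl_tf_per_node_max = knl_pf * 1000 / assumed_knl_nodes  # upper bound

print(f"aggregate peak:          {total_pf:.1f} PF")
print(f"Knights Landing share:   {knl_share:.1%}")
print(f"Haswell node peak:       {haswell_tf_per_node:.2f} TF")
print(f"Knights Landing node:  <= {knl_tf_per_node_max:.2f} TF")
```

Run as written, this shows that the two halves add up to the quoted aggregate and that a bit under three quarters of Trinity’s peak comes from the Xeon Phi nodes.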
Trinity will use the “Aries” interconnect created by Cray and acquired by Intel several years ago, and will have 82 PB in its parallel file system with 1.6 TB/sec of system bandwidth. The system has a DataWarp burst buffer from Cray that weighs in at 3.65 PB and delivers 3.28 TB/sec of sustained bandwidth, and the entire system should consume under 10 megawatts of juice. Los Alamos started taking delivery of the Xeon E5 nodes back in February. Los Alamos and Sandia expect Trinity to deliver at least eight times the performance on its applications compared to the “Cielo” supercomputer currently at Los Alamos, which was installed by Cray in 2011 and delivers a peak performance of 1.37 petaflops across its 143,104 AMD Opteron cores and “Gemini” interconnect.
Cray won the $70 million contract to build the Cori system at NERSC back in April 2014, and like Trinity, it is a hybrid Cray XC40 machine that mixes and matches Haswell Xeon E5 nodes and Knights Landing Xeon Phi nodes. The Haswell Xeon E5 portion of the machine has already been installed and has 1,630 nodes with a total of 52,160 cores with 203 TB of aggregate main memory on the nodes and 28 PB of scratch storage with more than 700 GB/sec of peak I/O bandwidth. The Cori phase 1 machine has a 750 TB burst buffer based on non-volatile memory that delivers 750 GB/sec of I/O bandwidth. The Aries interconnect linking the Haswell nodes, which uses a dragonfly topology, delivers 5.6 TB/sec of global bandwidth in the current Cori phase 1 configuration. By the summer of 2016, Cori will be augmented with 9,304 Knights Landing Xeon Phi processor nodes and have 1,920 Xeon E5 nodes, and 384 burst buffer nodes. The whole thing will fit in 64 cabinets, and oddly enough, we have not seen a peak number-crunching performance figure for the machine published anywhere, but it should be somewhere around 34 petaflops based on the performance figures given for Trinity and the specs given for Cori.
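One plausible way to arrive at an estimate of roughly 34 petaflops for Cori is to scale Trinity’s quoted peak figures by node count. The sketch below is an illustration under stated assumptions, not a published spec: it assumes Cori and Trinity use comparable processors per node, and it assumes 9,000 Knights Landing nodes for Trinity (the article says only “over 9,000”), which slightly overstates the per-node Xeon Phi peak.

```python
# Trinity figures quoted in the article.
TRINITY_KNL_PF = 30.7
TRINITY_HASWELL_PF = 11.5
TRINITY_HASWELL_NODES = 9346
TRINITY_KNL_NODES_ASSUMED = 9000  # ASSUMPTION: article says "over 9,000"

# Cori figures quoted in the article (final configuration).
CORI_KNL_NODES = 9304
CORI_XEON_NODES = 1920


def estimate_cori_peak_pf():
    """Scale Trinity's per-node peak to Cori's node counts (petaflops)."""
    knl_pf_per_node = TRINITY_KNL_PF / TRINITY_KNL_NODES_ASSUMED
    xeon_pf_per_node = TRINITY_HASWELL_PF / TRINITY_HASWELL_NODES
    knl_part = CORI_KNL_NODES * knl_pf_per_node
    xeon_part = CORI_XEON_NODES * xeon_pf_per_node
    return knl_part, xeon_part, knl_part + xeon_part


# Sanity check on the installed phase 1 machine: cores per Haswell node.
phase1_cores, phase1_nodes = 52160, 1630
cores_per_node = phase1_cores // phase1_nodes

knl_part, xeon_part, cori_total = estimate_cori_peak_pf()
print(f"phase 1 cores per node: {cores_per_node}")
print(f"Knights Landing part:   {knl_part:.1f} PF")
print(f"Xeon part:              {xeon_part:.1f} PF")
print(f"estimated Cori peak:    {cori_total:.1f} PF")
```

The phase 1 core count divides evenly into 32 cores per node, which is consistent with two 16-core sockets, and the scaled total comes out close to the 34 petaflops cited above.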
In addition to Cray, the Bull systems unit of French systems integrator Atos has also received early versions of the Knights Landing Xeon Phi chips from Intel for the foundational work on the Tera 1000 system that Atos is building for the Commissariat à l’énergie atomique et aux énergies alternatives (CEA), the French Alternative Energies and Atomic Energy Commission. The first phase of the Tera 1000 machine was installed last week ahead of the SC15 supercomputing event, and it includes a mix of Xeon E5 v3 nodes and pre-production Knights Landing Xeon Phi nodes. This first phase is expected to have about twice the peak performance of the current Tera 100 machine, which has 4,730 two-socket Xeon E5 v1 nodes linked by QDR InfiniBand running at 40 Gb/sec and delivering 1.25 petaflops peak. In mid-2016, Atos will roll out its own Bull Exascale Interconnect for the Tera 1000 machine, and in 2017 phase two will launch with more than 8,000 Knights Landing Xeon Phi processors added to the complex. The final configuration is expected to have in excess of 25 petaflops of performance.
The third facility that is getting early access to the Knights Landing chips is Sandia National Laboratories, which has a bunch of machines with earlier generations of Xeon Phi coprocessors and which is working with Penguin Computing on the machines. This particular test machine is a rack of Xeon Phi nodes based on Intel’s new Omni-Path interconnect, which Sandia is using to test its codes.
At the ISC15 conference in Germany last summer, Intel had said to expect first commercial shipments of the Knights Landing chips before the end of the year, but made no promises about when it would actually do the launch. The company has been doing a rolling thunder release of features for both the Knights Landing Xeon Phi and Omni-Path for the past year, and never promised to do the Knights Landing launch at the SC15 supercomputing conference, but that is clearly where many had expected it. With over 8 billion transistors on a chip that uses Intel’s latest 14 nanometer technologies, it is fair to guess that Intel is working on getting the yields up before it commits to general availability, and that this is the gating factor to the formal announcement for the Knights Landing chips.
Back at ISC15, Intel confirmed that the Knights Landing chip would have 72 cores, and that it would come in a variant that plugged into a socket, another one that plugged into the socket with dual integrated Omni-Path interfaces, and a third that would be packaged as a PCI-Express coprocessor card like current Xeon Phi accelerators. Then a month later, at the Hot Chips 27 conference, we learned that the Knights Landing chip actually has 76 cores, with four being spares, which are there to help with yields and which might eventually be activated for compute. We are not going to review all of the feeds and speeds of the Knights Landing design here, but suffice it to say that it is a sophisticated processor and one of the largest circuits that Intel has ever made, so it is no surprise to us that it is taking a bit longer to get it to market.
“The general availability is still expected in the first half of 2016,” confirmed Wuischpard. “The thing that we are wrestling with is that we have actually got our production volumes in the factory right now for all of the first deliveries and we have quite a bit to deliver even pre-GA. So we are going to have an early ship program and we have already got a number of orders against that, and we expect the GA with more than 50 system providers. And when we look at the application suite that really needs to be tuned and optimized, there are about 80 to 100 that support the majority of the workloads in the HPC segment. We have got active collaborations there.”
Intel is also going to be creating a single-socket Xeon Phi workstation, with the appropriate main memory and PCI-Express peripheral slots, that it will make available to developers so they can port and test their code without trying to gain access on an early adopter machine like the ones mentioned above and the ones that will no doubt follow in the first part of 2016 ahead of and after the official Knights Landing launch. This development machine will not be a server with Xeon processors and Xeon Phi coprocessors, but a real workstation with all of the software and developer tools needed to port and test code.
Given that the Knights Landing implementation is available as a standalone processor as well as a coprocessor, you might be thinking that Intel expects that a lot of the machinery built using Knights Landing will be a mix of Xeon and Xeon Phi systems clustered together and working side-by-side, but not with the Xeon Phi being linked as a coprocessor to the Xeons – which is what the Trinity, Cori, and Tera 1000 systems above look like to our eye and what the Tianhe-2A supercomputer does not.
“You have the duality of running in a coprocessor mode through a PCI-Express connected device or running in a true native bootable mode, and what we have seen is that by far the larger interest is in native mode processing,” said Wuischpard regarding the Knights Landing chip. “If you look at the large supercomputers that have been announced such as Trinity, Cori, and on and on, they are almost exclusively made up of Xeon Phi processors running in native mode. If you look at the HPC industry going forward, I think that you are going to see people that run Xeon because it keeps getting better and better, and people that will decide to run in a sort of mixed mode with half of the nodes being Xeon and half being Xeon Phi, and then there will be those who will say based on their workloads that it will be best to run 100 percent on Xeon Phi. Even within that, I think the need for coprocessors will diminish over time as you will be able to achieve various levels of performance and compute density in the various configurations mentioned above.”
We happen to think – and have been saying all along – that the uptake for Xeon Phi could be stronger than Intel originally anticipated, and that may be another reason that it is coming to market a little bit later than many had expected. Intel might see strong demand and want to meet it. (This is a relative measure in a world that consumes roughly 20 million Xeon processors a year, of course. The three big supers mentioned above have on the order of 27,000 Xeon Phi chips in them.)
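As a rough check on that chip count, the Knights Landing node counts quoted earlier in this article can simply be added up. Two of the three are given only as “over” a round number, so the sketch below treats those as assumed lower bounds and the total as a floor, and it assumes one Xeon Phi chip per node.

```python
# Knights Landing counts quoted in the article.
# ASSUMPTION: "over 9,000" and "more than 8,000" are taken as lower bounds,
# and each node is assumed to hold a single Xeon Phi chip.
knl_counts = {
    "Trinity": 9000,    # "over 9,000 Knights Landing nodes"
    "Cori": 9304,       # quoted exactly
    "Tera 1000": 8000,  # "more than 8,000" processors in phase two
}

total_knl_floor = sum(knl_counts.values())

# Annual Xeon volume cited in the article ("roughly 20 million").
xeons_per_year = 20_000_000
share_of_xeon_volume = total_knl_floor / xeons_per_year

print(f"Xeon Phi chips (floor): {total_knl_floor:,}")
print(f"vs. annual Xeon volume: {share_of_xeon_volume:.2%}")
```

The floor lands a little under the 27,000 cited, which fits with two of the three machines having more nodes than the round numbers used here, and it is a small fraction of a percent of annual Xeon volume.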
Xeon Phi performance is going to drive demand if Intel gets the price right and can get sufficient yields on this monster chip using its 14 nanometer process.
Back in August, we showed off some performance benchmarks that compared a single Xeon Phi with 72 cores against a two-socket Xeon E5 v3 server using 14-core E5-2697 v3 processors on a variety of raw processor and application workload benchmarks. The single Xeon Phi has about 2.5X the peak raw double-precision teraflops of a pair of Xeon E5s, and can also run the AlexNet neural network training algorithm about 2.5X as fast and the STREAM memory bandwidth test about 3.5X as fast. If you adjust this for performance per watt, the gap is even larger.
If Intel prices the Knights Landing chip very aggressively – making it less than twice the price of that Xeon E5 mentioned above, for instance – the uptake could be quite large indeed. In fact, we suspect that there are some hyperscalers that are early testers of the Knights Landing chips right now, even though Intel did not mention that, and that Intel is taking time to work out how to price the future “Broadwell” Xeons and Knights Landing Xeon Phis to present their true value while remaining competitive with GPUs, FPGAs and other acceleration technologies.
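To see why that pricing threshold matters, here is a minimal sketch of the price/performance arithmetic. It is unit-free and uses no real prices: the Xeon Phi price is expressed as a multiple of one Xeon E5’s price (a hypothetical parameter), and the 2.5X performance ratio is the figure cited above for one Xeon Phi against a pair of Xeon E5s. It also assumes the comparison is chip cost only, ignoring the rest of the server.

```python
def perf_per_dollar_advantage(perf_ratio_vs_pair, phi_price_in_xeon_units):
    """Return Xeon Phi performance per dollar relative to a two-socket Xeon.

    perf_ratio_vs_pair: Xeon Phi performance divided by the performance
        of a PAIR of Xeon E5s (about 2.5 in the benchmarks cited).
    phi_price_in_xeon_units: hypothetical Xeon Phi price, expressed as a
        multiple of the price of ONE Xeon E5 (2.0 means "twice the price").
    """
    if phi_price_in_xeon_units <= 0:
        raise ValueError("price multiple must be positive")
    pair_price_in_xeon_units = 2.0  # two Xeon E5 chips in the server
    price_ratio = phi_price_in_xeon_units / pair_price_in_xeon_units
    return perf_ratio_vs_pair / price_ratio


# Break-even case named in the text: exactly twice the price of one Xeon E5.
at_twice_price = perf_per_dollar_advantage(2.5, 2.0)

# A hypothetical more aggressive case: priced the same as one Xeon E5.
at_parity_price = perf_per_dollar_advantage(2.5, 1.0)

print(f"at 2x one Xeon E5's price: {at_twice_price:.2f}x per dollar")
print(f"at 1x one Xeon E5's price: {at_parity_price:.2f}x per dollar")
```

Under these assumptions, any price below twice that of a single Xeon E5 means the Xeon Phi costs less than the pair it outruns, so its advantage per dollar exceeds its raw performance advantage.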