There is little doubt that 2017 will be a dense year for deep learning. With a sudden new wave of applications that integrate neural networks into existing workflows (not to mention entirely new uses) and a fresh array of hardware architectures to meet them, we expect the space to start shaking out its early winners and losers and show a clearer path ahead.
As we described earlier this week, Intel has big plans to integrate the Nervana ASIC and software stack with its Knights family of processors over the next several years. This effort, codenamed Knights Crest, is a long-term initiative backed by investments starting as early as 2017 with the roll-out of a 28nm Nervana chip fabbed at TSMC. That roll-out follows the original plans the company’s former CEO (now head of the AI Solutions Group at Intel), Naveen Rao, described to us back in August 2016, immediately after the Intel acquisition announcement. At that time, Rao pointed to a strategy that Jason Waxman, VP and GM of the Xeon and Cloud Platforms group at Intel, told us this week they will stick with in the coming year and beyond.
As Rao said in August, “in the future, we would love to get access to the better 14 nm process technologies. As a startup, we had to use the most basic, bare-bones thing possible. But even with inferior process technology it is possible to beat a more general purpose processor.” Additionally, with the capabilities of 3D XPoint memory, what this tiny chip can do for deep learning (deeper piece on the architecture here) could be nothing short of incredible, assuming the workload doesn’t change significantly in the years required to reach that level of integration.
We chatted with Rao again in the lead-up to Intel’s AI Day last week to get a better sense of what an integrated product might look like, and what could shift the tides for deep learning both algorithmically and architecturally. At a high level, he pointed to the expected performance gains of a coupled Knights Mill and Nervana product. “We are not just talking about something that is going to be 10% better here; we’re setting the bar really high for what will be possible in terms of a processor designed for neural networks. This represents a true commitment to the deep learning community and the space in general from Intel, and that is exciting from an industry standpoint,” he says. “This is only the beginning, but we are enabling an order of magnitude greater performance in the coming year and by 2020, two orders of magnitude…This will be based on a combination of silicon, software, and algorithmic innovation.”
Recall that Knights Mill, which will appear sometime in 2017, is a variant of the Xeon Phi family of processors, which has roots in high performance computing. This processor, however, is designed specifically for deep learning, with mixed precision capabilities (32-bit and 16-bit, and we suspect 8-bit support and more memory as well, though that is not confirmed). Rao and team will get their 14nm wish, although we predict that seeing this on a 10nm process in time for even late 2020 will be a stretch.
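To see why reduced precision is attractive, and why accumulating results in a wider format still matters, consider the minimal NumPy sketch below. It is purely our own illustration of the general mixed precision idea, not Intel’s or Nervana’s implementation; the array sizes and values are arbitrary.

```python
# Illustrative only: why accumulation precision matters when compute is done
# in reduced precision. Not Intel or Nervana code; sizes and values are arbitrary.
import numpy as np

# Summing many small values is where low precision visibly loses information.
x = np.full(20_000, 0.001, dtype=np.float16)

fp16_sum = np.float16(0.0)
for v in x:                  # accumulate entirely in FP16: the running total stalls
    fp16_sum = fp16_sum + v  # once increments fall below the rounding step of the sum

fp32_sum = x.astype(np.float32).sum()        # same data, FP32 accumulation

print("FP16 accumulate:", float(fp16_sum))   # far short of the true ~20.0
print("FP32 accumulate:", float(fp32_sum))   # close to 20.0
```

Hardware that stores and multiplies in 16-bit (or 8-bit) while accumulating in a wider register gets most of the bandwidth and density benefit without this kind of drift, which is the usual rationale for mixed precision in training chips.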
When asked about this need for deep learning workloads to hum together on the same system, Rao tells The Next Platform that Knights Mill coupled with Nervana technology will have important capabilities to this end cooked in. “Knights Mill has direct access to large memory footprints with its DDR4 interface, which other platforms don’t have because they use add-in cards. This means we can have many active models in memory and do inference against them with very low latency.” He adds that it is possible to train very large models this way on a bootable platform that doesn’t rely on such add-in cards or on managing the interface with host memory. “The Nervana engine is a breakthrough in terms of how computation is organized and can scale; it will be the highest density compute chip ever made by a vast margin; you wrote about the numbers from before, and we are committed to building exactly that.”
We asked Rao what the technical hurdles are for integrating a Xeon and their own architecture. “It’s not difficult per se; it’s more a problem of what is the best, most optimized way to do it. That takes some thought in a rapidly evolving space to make sure we’re heading in the right direction. It’s not that a host processor on X86 can’t be tightly coupled with what we’re building today, it just has to be done the right way.” This is a reasonable question since Nervana’s chips and Xeons are architected quite differently. “It is all dense compute logic, ultimately,” Rao says. While this wasn’t the deep technical answer we were hoping for, he does say that finding the right balance between how tightly things are integrated and leaving enough flexibility to build a wide range of products that can handle real use cases is a hurdle. “The architecture can lend itself to many different applications, from autonomous cars to data science in the cloud, but we have to make sure we have a flexible methodology so we can quickly make these products fit into different niches as well.”
“There is a commitment on Intel’s roadmap to increase the performance we have stated in the past by a big margin compared to other products out there. We have lots of I/O on the chip, with high-speed links between chips to distribute computing across them and make it appear as one big machine. This is something that is not possible today and will let us build larger models and work faster on even larger datasets,” Rao says. Of course, on the scalability and distributed front, GPU maker Nvidia is showing gains of its own by scaling its Pascal-equipped DGX-1 appliances on an InfiniBand network, turning in impressive floating point performance on the most recent Top 500 supercomputer list.
With all of this in mind, much of the emphasis on pulling in the Nervana assets is still more of a software problem than a hardware one—a fact Rao says applies to the upcoming 28nm parts for 2017 all the way to the future integrated Knights Crest product. Rao says the software effort is 10X the hardware one once the chip architecture basics are solved. When it comes to future integration, “the hardware will be the enabling first step once we get those first basic primitives into silicon and make the software robust,” he says, adding that the teams are “investing heavily in distributing workloads across chips, which is hard and represents one of the biggest design goals.”
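To make the “distributing workloads across chips” goal concrete, the dominant pattern today is data parallelism: each device computes gradients on its shard of a batch, the gradients are averaged across devices (an all-reduce over those high-speed links), and every device applies the same update. The sketch below is a generic, hypothetical illustration of that pattern in NumPy, not a description of Intel’s or Nervana’s actual interconnect or software.

```python
# Hypothetical data-parallel training loop: gradients computed per "device"
# on its shard of the batch, then averaged (an all-reduce) before a shared
# weight update. Generic illustration only, not an Intel/Nervana API.
import numpy as np

rng = np.random.default_rng(0)
n_devices, n_features = 4, 8
w = rng.standard_normal(n_features)          # model weights, replicated on every device

def local_gradient(w, X, y):
    """Least-squares gradient computed on one device's shard of the batch."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

# Fake per-device shards of a minibatch (in practice these stream from data loaders).
shards = [(rng.standard_normal((64, n_features)),
           rng.standard_normal(64)) for _ in range(n_devices)]

for step in range(200):
    grads = [local_gradient(w, X, y) for X, y in shards]  # runs in parallel on real hardware
    g = np.mean(grads, axis=0)                            # all-reduce: average across devices
    w -= 0.05 * g                                         # identical update everywhere
```

The averaging step is the part the chip-to-chip links are meant to accelerate; as models and device counts grow, that communication, rather than the local math, tends to become the bottleneck, which is why Rao calls distribution one of the biggest design goals.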
The Nervana team will be working with Intel on the all-important software elements for its 2017 and future products as well, with the Neon framework at the center. Other efforts, including the Nervana graph compiler, will move to open source in the next few months. “If you decompose neural networks into various execution components, it shrinks into a graph,” Rao explains. “This is done in TensorFlow and other frameworks. Intel is committed to building this as an open source project in cooperation with other key frameworks, including TensorFlow, Caffe, Torch, and others.”
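For readers unfamiliar with the “decompose into a graph” idea, the sketch below shows the gist: each operation becomes a node, edges are data dependencies, and a compiler or runtime walks that graph to execute it, or to optimize and retarget it for different hardware. This is a toy example of our own in plain Python; it is not the Nervana graph compiler’s or Neon’s actual API.

```python
# Toy computation graph: operations become nodes, edges are data dependencies,
# and a runtime walks the graph to execute it. Purely illustrative; not the
# Nervana graph compiler or Neon API.
import numpy as np

class Node:
    def __init__(self, op, inputs=(), value=None):
        self.op, self.inputs, self.value = op, tuple(inputs), value

def const(x):     return Node("const", value=np.asarray(x, dtype=np.float32))
def matmul(a, b): return Node("matmul", (a, b))
def add(a, b):    return Node("add", (a, b))
def relu(a):      return Node("relu", (a,))

def evaluate(node):
    """Recursively evaluate a node; a real compiler would instead lower the
    graph to kernels for a target (CPU, GPU, or an accelerator)."""
    if node.op == "const":
        return node.value
    args = [evaluate(i) for i in node.inputs]
    if node.op == "matmul": return args[0] @ args[1]
    if node.op == "add":    return args[0] + args[1]
    if node.op == "relu":   return np.maximum(args[0], 0.0)
    raise ValueError(f"unknown op {node.op}")

# A one-layer network expressed as a graph: relu(x @ W + b)
x, W, b = const(np.ones((1, 4))), const(np.eye(4)), const(np.zeros(4))
print(evaluate(relu(add(matmul(x, W), b))))
```

In a real framework, that intermediate graph is where optimizations such as op fusion and hardware-specific code generation happen, which is how one software layer can sit underneath TensorFlow, Caffe, Torch, and others.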
Now that we have a better sense of what this Nervana roadmap looks like for Intel, the question is how this might change the current way of thinking about architecture for these workloads, if at all. When it comes to deep learning system architectures, the dominant mode is to have two separate clusters, each with unique capabilities. As of now, the training side is more than likely a GPU cluster outfitted with a range of GPUs, from the high-end HPC-focused K80s to the lower-end TitanX cards, at least while many shops await the latest Pascal generation GPUs. On the inference side, the processor matters far less; this is not where the bulk of the complicated, compute- and memory-driven work happens. While those in the supercomputing set are looking to push both of these cluster elements together for mixed workloads (for instance, simulations that hand off to a deep learning component in the workflow, something supercomputer maker Cray sees on the horizon), one can imagine that any strategy for future chips will provide both training and inference capabilities in the same package.
Intel is not looking to Nervana to be the end-all when it comes to the entire deep learning workflow, especially not at this early point. In addition to beefing up its model training chip offerings, there is an important piece for FPGAs, as we saw last week with the announcements of some FPGA-based deep learning accelerators (and, we hear, in-package FPGA variants paired with Skylake in the near future). “We have different use cases and we’re thinking about both training and inference. Nervana can dispatch training well and efficiently and FPGAs can do single inference with low latency. We can have pools of these things available for the workloads they are suited to, so it’s not that the Altera piece has been left on the floor; it’s part of this whole rack design that will suit many different workloads,” Waxman tells The Next Platform.
It is noteworthy (and smart) that a CPU maker is getting behind technology that threatens to minimize the role of the CPU over time. As we have described before, memory is the next platform for at least some compute workloads, and deep learning is one of them. For fun, we asked Waxman if the CPU would be increasingly pushed into housekeeping duty as its main function, at least for such architectures and workloads. “The way we see it,” he says, “given the rise of cloud computing, there will always be two factors in balance. First, there is great demand for general purpose computing where you need different functions, algorithms, and use cases, so having a general purpose CPU that’s high performance and repurposable is important. Further, having that as the base to add application-specific accelerators like this one goes hand in hand with that.” Ultimately, he says, it is a fine balance between general purpose and application specific, something Intel is anticipating with its rack-based approach to datacenters running deep learning and mixed, complex workloads.
It is difficult to assess how risky it is to work on an integrated architecture for such a rapidly evolving space. In just the last year, the number and complexity of neural network frameworks increased dramatically, with the ecosystem in essence recreating itself around new approaches, TensorFlow being just one example. With no clear winners on the algorithmic and framework side, and no standard architecture for training today outside of GPUs, it is fair to say this whole area could be turned on its head by the time 2020 rolls around. However, Intel’s strategy is backed by companion developments on the CPU front, including Knights Mill (and whatever comes after it), and we can be sure the chipmaker will balance the requirements of specialized architectures with a continued eye on the needs of general purpose workloads.
With Nvidia already claiming a 60x efficiency lead over Intel CPUs running Intel’s MKL 2017 and Intel Caffe, a jump of “one order of magnitude” simply wouldn’t be enough.
Considering that Nvidia will release Volta in 1H 2017, one wonders how Intel could keep up.
@jimmy: correct me if I am wrong, but the order of magnitude jump is over what Nvidia can do, not what Intel can do.
link to 60x claim: http://images.nvidia.com/content/pdf/tesla/184457-Tesla-P4-Datasheet-NV-Final-Letter-Web.pdf
That 60x and 12x is typical Nvidia marketing speak, as there is no detail on exactly what they used for the comparison. A standard Caffe install is single threaded unless you actually spend some time fixing its Makefile to link against a multithreaded OpenBLAS, so I can believe the 12x against that. Compared to Intel’s MKL/Caffe using Intel’s DNN replacement for cuDNN and running on a Xeon Phi, I don’t think the numbers look that good anymore, though I haven’t verified it myself. That’s why vendor claims always have to be taken very sceptically, no matter whether it is Nvidia, Intel, IBM, or ARM.
The software is exactly detailed: Intel MKL 2017 as well as Intel Caffe, i.e., the best they’ve got.
Intel is crushed.