Intel has been making some interesting moves in the community space recently. Free licenses for its compiler suite can now be had by educators and open source contributors, as can rotating 90-day licenses for its full System Studio environment for anyone who takes the time to sign up.
In the AI space, Intel recently announced that its nGraph code for managing AI graph APIs has also been opened to the community. After opening it up last month, Intel has followed up on the initial work on MXNet with further improvements to TensorFlow.
The Next Platform spoke with Arjun Bansal, vice president of AI software and general manager of the AI Lab at Intel, to find out more. Bansal came from Nervana, which was scooped up by Intel in 2016 for $408 million. Interestingly, Bansal has a background in actual neuroscience at Brown and computer science at Caltech.
In reviewing the recent announcement, our eyes were immediately drawn to the “images per second” numbers for TensorFlow and the now ubiquitous benchmark graphs, which was not hard given Intel’s claim that the newly released, simplified bridge code can deliver up to 10X performance improvements over previous TensorFlow integrations. This sort of step function is what we are always on the lookout for here. The performance of silicon against the ResNet50 image recognition benchmark is one of the key ways to show how fast your AI is, akin to a Linpack for AI, if you will. Like any benchmark, ResNet50 is complicated and you need to look at the whole picture. Examining just plain raw horsepower and updated numbers doesn’t quite show you what is actually going on here.
Looking at the performance graph made the science folk here at The Next Platform wince just a little. The now ubiquitous Y-axis overload showing different values aside, we focused a little closer on the 68.9 images per second number for the Xeon-based TensorFlow XLA nGraph. Our quick back-of-the-envelope comparison with TensorFlow’s own results shows that a single 2014 vintage Tesla K80 card (which has two GPUs) comes in at around 52 images a second. This somewhat elderly Tesla K80, now two generations back (three if you count “Maxwell” GPUs), could just about keep pace with the pair of new “Skylake” Xeon SP Platinums used in Intel’s test. Assuming you can find a K80 on eBay (we found one for $1,500), that comes in at about 29 bucks per image per second. The high-grade Xeon 8180 silicon lists at about $10,000 per chip without the memory expansion, and a pair of them was needed to provide the 56 cores used in Intel’s test. So even at 69 images a second you are right up against $290 per image per second. That works out to be about an image per second per core, give or take.
Extending our new “dollars per image” game to the Tesla “Volta” V100 accelerator tells a similar story. Even though these cards list for about $11,500 apiece, one V100 card gets you just shy of 600 images a second, which lands at about $19 per image per second. This is all before you start to go outside of a single board, distribute to larger systems, or build out at some serious scale as the DGX-2 does. Based on these numbers, Nvidia still gets to keep the AI benchmark crown.
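For readers who want to check the arithmetic, the “dollars per image” figures can be reproduced with a few lines of Python. This is only a rough sketch built from the prices and throughput numbers quoted above. The function and variable names are ours, and the V100 is rounded up to 600 images a second even though the measured figure is just shy of that.

```python
# Back-of-the-envelope "dollars per image per second", using only the
# prices and throughput figures quoted in the article.

def dollars_per_image_per_second(price_usd, images_per_second):
    """Hardware price divided by ResNet50 throughput."""
    return price_usd / images_per_second

# (label, total hardware price in USD, images per second)
# The V100 throughput is rounded up to 600; the article says "just shy of 600".
systems = [
    ("Tesla K80 (eBay)", 1_500, 52.0),
    ("2x Xeon SP 8180 (list)", 2 * 10_000, 68.9),
    ("Tesla V100 (list)", 11_500, 600.0),
]

results = {
    label: dollars_per_image_per_second(price, ips)
    for label, price, ips in systems
}

for label, cost in results.items():
    print(f"{label:<24} ${cost:,.2f} per image per second")

# Throughput per core for the 56-core dual-Xeon test system.
images_per_core = 68.9 / 56
print(f"Xeon throughput per core: {images_per_core:.2f} images per second")
```

These are hardware prices only. Memory, chassis, power and networking are all left out, as they are in the text.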
However, this isn’t the real issue at play here.
Anyone doing these types of benchmarks seriously, or working on the front line of AI in production settings, takes a very different approach. Facebook, for example, released its own numbers showing it could train ResNet in an hour using its “Big Basin” GPU servers stuffed full of Pascal P100s, with a solid 50 Gb/sec network to help move things about, mostly so it can now recognize pictures of your cat quicker than ever before. Distributed systems and tightly integrated systems with storage and elegant networks are how to really increase velocity for AI methods. It’s all a whole lot bigger than a single piece of silicon.
The AI Ecosystem
So why release these numbers, and what is this actually about? We said it was complicated, and it is. It became clearer when we spoke with Bansal. He stated that the real problem is that the number of pieces of silicon and the number of unique platforms in this game are increasing dramatically. When Bansal was at Nervana, there was a one-to-one relationship between the code and the silicon. It’s not the same game anymore. There is a whole AI hardware ecosystem, ranging from Movidius at the edge in the embedded, drone and mobile space, through FPGAs from the Altera acquisition and the yet-to-be-released Nervana Neural Processor, to good old standard X86 Xeons, and they all need to be supported out of Bansal’s shop. Not only that, we have spoken before about the lack of standards in AI, and it’s true. Folks are trying to effectively design the PDF for neural networks. When you boil off the performance numbers and really dig into the Intel announcement, it’s actually about mobility of the graph. It’s all about the API. It’s also about releasing Intel software as open source to help advance the field by allowing folks to look behind the curtain and help with development. We counted seven frameworks and five platforms, and that’s a lot of combinations of things to support. Interestingly, Intel also wants to target Nvidia silicon with its nGraph code. It is a fascinating new world of complex systems integration on the go here.
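The support burden comes down to simple counting. As a rough sketch, assume every framework would otherwise need its own integration with every hardware platform, while a shared graph layer needs only one bridge per framework and one backend per platform. Only the counts of seven and five come from our tally. The placeholder names below are hypothetical.

```python
from itertools import product

# Placeholder names only: the article gives the counts (seven frameworks,
# five platforms), not the full lists.
frameworks = [f"framework_{i}" for i in range(1, 8)]
platforms = [f"platform_{i}" for i in range(1, 6)]

# Without a shared layer, every framework/platform pair needs its own integration.
direct_integrations = list(product(frameworks, platforms))

# With a shared graph layer, each framework needs one bridge into it
# and each platform needs one backend behind it.
shared_layer_integrations = len(frameworks) + len(platforms)

print(len(direct_integrations))    # 35
print(shared_layer_integrations)   # 12
```

The gap between the product and the sum widens with every framework or chip added, which is the practical argument for making the graph portable.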
Bansal noted that Intel researchers started the project a couple of years ago, primarily as a way to support work on NNP, when they noticed a shift in the industry toward graph-based approaches. The move to Intel allowed them to expand the code base and target new silicon. Their vision is to have a wide API and to simplify the engineering, taking months and months out of the development cycle. ONNX, the Open Neural Network Exchange, has taken a crack at some of this, but misses the piece where the graph hits the hardware.
The nGraph software is essentially a “DSO”, or dynamic shared object. It takes its lead from the Apache webserver days, when software could be built outside of and separate from the main httpd codebase, but here it separates the graph from the framework, be it TensorFlow, Caffe, MXNet or another. Discussions on the XLA development list are also interesting, as you can follow the Intel team working with the community on how best to “insert” the DSO into the XLA code base. There is a lot of complexity even when systems are open. XLA itself is also a rapidly moving and interesting target for providing the right levels of performance for graph-based methods. The interaction between how TensorFlow builds HLO with a just-in-time compiler and the bootstrapping of the nGraph software into this pipeline results in some tricky engineering.
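The idea of separating the graph from the framework can be pictured with a toy example. To be clear, this is not nGraph’s API, nor XLA’s or TensorFlow’s. Every name below is hypothetical, and a real DSO is a compiled library loaded at run time rather than a Python function. The sketch only shows the shape of the design: a framework-side bridge emits a small graph, and a separately registered backend executes it.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Node:
    op: str                       # "input", "add" or "mul"
    inputs: Tuple[int, ...] = ()  # indices of earlier nodes in the graph
    name: str = ""                # label, used only by "input" nodes


Graph = List[Node]
Backend = Callable[[Graph, Dict[str, float]], float]

# Registry of backends, keyed by a hypothetical device name.
BACKENDS: Dict[str, Backend] = {}


def register_backend(device: str) -> Callable[[Backend], Backend]:
    """Decorator that adds a backend to the registry under a device name."""
    def decorator(fn: Backend) -> Backend:
        BACKENDS[device] = fn
        return fn
    return decorator


@register_backend("reference_cpu")
def run_on_reference_cpu(graph: Graph, feeds: Dict[str, float]) -> float:
    """Evaluate nodes in order; the last node is the graph output."""
    values: List[float] = []
    for node in graph:
        if node.op == "input":
            values.append(feeds[node.name])
        elif node.op == "add":
            values.append(values[node.inputs[0]] + values[node.inputs[1]])
        elif node.op == "mul":
            values.append(values[node.inputs[0]] * values[node.inputs[1]])
        else:
            raise ValueError(f"unsupported op: {node.op}")
    return values[-1]


def bridge_from_toy_framework() -> Graph:
    """Stand-in for a framework bridge: emits the graph for (x * w) + b."""
    return [
        Node("input", name="x"),
        Node("input", name="w"),
        Node("input", name="b"),
        Node("mul", inputs=(0, 1)),
        Node("add", inputs=(3, 2)),
    ]


graph = bridge_from_toy_framework()
result = BACKENDS["reference_cpu"](graph, {"x": 3.0, "w": 2.0, "b": 1.0})
print(result)  # 7.0
```

The bridge never needs to know which backend will run the graph, and a new backend can be registered without touching the bridge. That decoupling is the property the DSO approach is after.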
The real core of the nGraph announcement isn’t a headline benchmark. It is far more subtle, and it is all about the integration of complex software. Our earlier research into Nervana suggests that its silicon will improve on the “images per second” numbers from more generic, multipurpose Xeon silicon, although we have yet to see these systems released into the wild. The Flexpoint systems with 16-bit multipliers and adder trees should in theory provide the appropriate horsepower to run these new graph-based workloads at high speed, if Intel can get the power balance right on the final released systems. But for now, it’s not about the benchmark. It is all about the integration, preparing complex software for scale, and being open and transparent about it.