As a thought exercise, let’s consider neural networks as massive graphs and begin considering the CPU as a passive slave to some higher order processor—one that can sling itself across multiple points on an ever-expanding network of connections feeding into itself, training, inferencing, and splitting off into multiple models on the same architecture.
Plenty of technical objections can be raised against this concept, of course, and only a slice of them have to do with algorithmic complexity. For one, memory bandwidth is pushed to the limit even on specialized devices like GPUs and FPGAs, at least for a neural net problem. And second, even if keeping the memory beast fed weren’t a bottleneck, the software development undertaking scales with the problem. The only answer to speed at scale for neural networks is a custom architecture. That’s not ruling out general purpose accelerators, but it does take them to task.
Backing up for a moment, the real question is what that next generation processor will look like if it is going to upend GPUs or FPGAs and certainly the trusty CPU. For some, it looks like a memory-based device (often stacked) that can grip an entire model and reduce latency by minimizing, or at least reducing, off-chip memory access; several such designs have been proposed. For others, the future looks more like a specialized computational graph processor: one that is architected for speedy I/O with a custom interconnect, has numerous but simple cores, and can tuck all the messy bits of graph processing at massive scale behind a robust compiler.
One can make the argument Pascal GPUs are already capable of doing much of this—and will continue to do more since they excel at the kind of sparsity found in neural network algorithms. But there is performance and efficiency left on the table that can be snatched up with a tuned architecture. The punchline is that, if such a graph architecture existed and could be just as efficient at training as it was at inference, it could upset the whole apple cart. If there is any coveted prize at this early stage of deep learning, especially since the pool of users is still shallow enough to see it from top to bottom, it is having a single architecture that can perform well on both parts of that workload.
Like the many startup machine learning chip vendors and researchers we’ve spoken with, Graphcore thinks it has the bottlenecks broken, the scalability wall scaled, and the performance/power balance right. The company also thinks it can do all of these things via a graph processor and make that intelligent processing unit (or IPU, as they call it) do double-duty with both training and inference on the same architecture, eventually across multiple form factors (server and device). And guess what? They actually make a damned good case for all of the above. That is, if the claim that they can get a 100X memory bandwidth speedup over even new generation high-bandwidth/stacked memory is for real. More on that in a moment when we talk about the hardware, which is set for 16 nm FinFET at TSMC (like Pascal GPUs) with a slated date of early user delivery by the end of 2017.
It helps that CEO Nigel Toon, who gave The Next Platform more detail than anyone has seen to date about what Graphcore has developed, has experience with specialized architectures for specific workloads. He was a pre-IPO field applications engineer at FPGA maker Altera for well over a decade and has worked with a number of startups (acquired by Nvidia and Broadcom) that had laser-sharp focus on a single set of workloads for highly tuned hardware devices. Much of the engineering team at Graphcore has a similar background. And, as he tells us, what they know is how to target a specific application set and build to suit, something others are certainly doing, but without the same emphasis on the interconnect and underlying software stack.
“This same architecture can be designed to suit both training and inference. In some cases, you can design a piece of hardware that can be used for training then segment that up or virtualize it in a way to support many different users for inference or even different machine learning model deployments. There will be cases when everything is embedded, for instance, and you need a slightly different implementation, but it’s the same hardware architecture. That’s our thesis—one architecture—the IPU, for training, inference, and different implementations of that machine that can be used in servers, cloud, or at the edge of the network.”
At the core of Graphcore’s IPU is a graph processing base. “If you look at the underlying machine learning workload, you’re trying to capture a digest of data: a set of the features and relationships of those features that you learn from the data. This can be expressed as a neural network model, or more correctly and universally, as a computational graph with a vertex representing some compute function on a set of edges that are representing data with weights associated,” Toon explains. “You’re trying to build an understanding of those features and relationships between them on a graph. Building that graph can be complex and multilayered since you’re getting those different levels of relationships.” The problem does not end here, either. The master plan for neural networks is to create ever-larger networks that loop into one another, learning and improving. These must be recurrent, not just in the RNN way, but recurrent over many layers and many networks. In short, the end game is deep, wide reinforcement learning, or more simply, building networks that improve with use. And as one might imagine, the computational load of such a task is immense and far more nuanced than simply throwing CPU meat in to feed it, even if that’s easiest to scale linearly (in theory).
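To make Toon's description concrete, here is a minimal sketch of a computational graph in which each vertex applies a compute function to the weighted data arriving on its incoming edges. It is an illustration only, not Graphcore's representation: the names `evaluate`, `relu`, and `identity`, the weighted-sum rule, and the tiny example network are all assumptions, and the sketch handles only acyclic graphs, not the recurrent networks mentioned above.

```python
from collections import defaultdict, deque


def relu(x):
    return max(0.0, x)


def identity(x):
    return x


def evaluate(vertices, edges, inputs):
    """Evaluate an acyclic computational graph.

    vertices: dict mapping vertex name -> compute function (hypothetical layout)
    edges:    list of (source, destination, weight) tuples
    inputs:   dict of values supplied directly to source vertices
    """
    incoming = defaultdict(list)
    outgoing = defaultdict(list)
    indegree = {name: 0 for name in vertices}
    for src, dst, weight in edges:
        incoming[dst].append((src, weight))
        outgoing[src].append(dst)
        indegree[dst] += 1

    values = dict(inputs)
    # Visit vertices in topological order: a vertex runs once all of its
    # incoming edges carry data.
    ready = deque(name for name, degree in indegree.items() if degree == 0)
    while ready:
        name = ready.popleft()
        if name not in values:
            total = sum(values[src] * weight for src, weight in incoming[name])
            values[name] = vertices[name](total)
        for dst in outgoing[name]:
            indegree[dst] -= 1
            if indegree[dst] == 0:
                ready.append(dst)
    return values


# Two inputs feed one hidden vertex, which feeds one output vertex.
vertices = {"x1": identity, "x2": identity, "h": relu, "y": identity}
edges = [("x1", "h", 0.5), ("x2", "h", -1.0), ("h", "y", 2.0)]
result = evaluate(vertices, edges, {"x1": 4.0, "x2": 1.0})
print(result["h"], result["y"])
```

Every vertex whose inputs are ready could in principle run at the same time, which is the parallelism a graph processor would try to exploit.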
And speaking of theory, building such a self-feeding, self-learning network will bring major challenges that push current limits of even the most specialized devices.
The Graphcore IPU is still technically under wraps until later this year, with no details about the architecture appearing anywhere yet, but we have discovered a few new, interesting things. First, the entire neural network model fits in the processor (not in external memory). Instead of relying on the narrow, long-latency path to external memory, the thousand simple cores can keep everything inside, avoiding the hop. Of greatest interest is Toon’s statement that “even with the efforts of adding HBM and 3D stacking, you’re talking about having something on the order of 700 GB/s access to external memory where we have on the order of 100X that bandwidth to memory by keeping the model inside the processor.” This is aggregate memory bandwidth on the device, of course.
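As a rough back-of-the-envelope check on what those numbers would mean, the sketch below takes the two figures from Toon's quote (about 700 GB/s to external memory and about 100X that on chip) and estimates how long one pass over a model's weights would take at each rate. The model size (25 million parameters at 16 bits each) and the helper name `sweep_time_us` are assumptions chosen for illustration, not Graphcore figures.

```python
EXTERNAL_GB_PER_S = 700      # order-of-magnitude figure from the quote
ON_CHIP_MULTIPLIER = 100     # "on the order of 100X" from the quote
ON_CHIP_GB_PER_S = EXTERNAL_GB_PER_S * ON_CHIP_MULTIPLIER  # 70,000 GB/s, i.e. 70 TB/s

# Assumed model size for illustration: 25 million parameters, 2 bytes each.
MODEL_BYTES = 25_000_000 * 2


def sweep_time_us(model_bytes, bandwidth_gb_per_s):
    """Microseconds needed to read every byte of the model once."""
    seconds = model_bytes / (bandwidth_gb_per_s * 1e9)
    return seconds * 1e6


external_us = sweep_time_us(MODEL_BYTES, EXTERNAL_GB_PER_S)
on_chip_us = sweep_time_us(MODEL_BYTES, ON_CHIP_GB_PER_S)
print(f"external: {external_us:.1f} us, on chip: {on_chip_us:.2f} us")
```

Under these assumptions a full read of the weights drops from roughly 71 microseconds to well under one, which is why keeping the model inside the processor matters if the claim holds.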
“What we are trying to do is map a graph to what is in effect a graph processor—the IPU. The key to that is in the software that allows us to take these complex structures and map to a highly parallel processor with all the memory we need to hold the model inside the processor. We expand neural networks into a graph, our software maps that to a relatively simple processor with some interesting attributes like a very rich interconnect system that is controlled entirely by the compiler,” Toon says. “There is a lot of innovation in the interconnect and cores themselves—we’re not using standard cores. There are over a thousand on a chip.” He notes that this is a true thousand cores versus the way Nvidia segments its 56 processor blocks that create a massive number of cores. “This is part of the issue when it comes to Nvidia and what we do—if I’m trying to share data between 56 blocks and in the case of the GPU, the only way I can do this is to write to external memory and read it back again, that’s a problem.” He says that architecting for thousands of cores requires a different architecture—one where you’re not relying on pushing data to external memory; it is shared internally and kept in the processor and load balanced across all of the cores.
So, we can assume that there might be a custom interconnect for this architecture and some non-standard cores. The next question is about precision, one area where Nvidia stands out for these workloads with Pascal. We had a hard time extracting any info about what these chips can do, but in the way of a hint, Toon says, “the reality is that you need to provide support for 32, 16, and quantized integer too, all of which will be needed at a certain point. You don’t need double-precision, that’s just a waste of silicon here.”
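For readers less familiar with the precision options Toon lists, the following sketch shows what moving a value to 16-bit floating point and to a quantized integer looks like in practice. The 8-bit width and the symmetric scaling scheme are assumptions (one common approach); the quote does not say which integer formats the IPU supports, and the function names are hypothetical.

```python
import struct


def to_float16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def quantize_int8(values):
    """Symmetric linear quantization to signed 8-bit codes (assumed scheme)."""
    peak = max(abs(v) for v in values)
    scale = peak / 127 if peak else 1.0
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate real values."""
    return [c * scale for c in codes]


weights = [0.1, -0.75, 0.5, 1.0]
half = [to_float16(w) for w in weights]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
print(half)
print(codes, restored)
```

Each step down in width saves memory and bandwidth at the cost of a small rounding error, which many neural network workloads tolerate.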
“The whole model is held inside the processor, so it’s not correct to say the memory is doing the processing, but the processor has a level of memory capability never before seen that allows us to hold these complex models inside, and that lets us have the compute much more efficiently deployed to manipulate highly sparse data structures in these models…We have complex instruction sets to let compilers be simple; if you make a processor simple and easy to compile to, it’s possible to build more complexity into the compiler. If the instruction sets are the same, they all run in one cycle, and I can run more operations, creating yet more complexity in the compiler.”
A graph approach leaves machine learning users with a structure that can expose a huge amount of parallelism (each of the vertices might have, for example, 25 million parameters), and that is a lot of parallel compute that can be applied to a hugely parallel machine. But there are also all of those pesky layers and levels of relationships in that data. One vertex might be connected to tens or even thousands of others, which are themselves connected to many others. If one tries to map the graph into traditional linearly addressed memory, each item has only two nearest neighbors, when what is really needed is to reach data spread across a massive well of memory (which also means doing the operation, writing the result back, and so on). The point is, what is needed are machines that can collect the data and write it back with nearly unlimited memory bandwidth. It is this sparsity problem that is causing consternation, but Toon says they’ve tackled that challenge head on.
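The mismatch between graph connectivity and linear memory can be seen in a small sketch. It packs a graph into compressed sparse row (CSR) arrays, a standard way to store sparse graphs in flat memory, and then measures how far apart a vertex and its neighbors sit in a linear layout. The 1,000-vertex example and the function names are hypothetical.

```python
def to_csr(num_vertices, edges):
    """Pack an undirected edge list into CSR arrays (offsets, neighbors)."""
    adjacency = [[] for _ in range(num_vertices)]
    for a, b in edges:
        adjacency[a].append(b)
        adjacency[b].append(a)
    offsets, neighbors = [0], []
    for row in adjacency:
        neighbors.extend(sorted(row))
        offsets.append(len(neighbors))
    return offsets, neighbors


def address_strides(vertex, offsets, neighbors):
    """Distance, in vertex slots, from `vertex` to each of its neighbors."""
    start, end = offsets[vertex], offsets[vertex + 1]
    return [abs(n - vertex) for n in neighbors[start:end]]


# Vertex 0 is connected to four others scattered across 1,000 slots.
offsets, neighbors = to_csr(1000, [(0, 1), (0, 250), (0, 500), (0, 999)])
strides = address_strides(0, offsets, neighbors)
print(strides)
```

Only one of the four neighbors is adjacent in memory; the rest require jumps across the address space, and a vertex with thousands of connections makes the access pattern correspondingly more scattered.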
When it comes to that sparsity problem, Toon says users are wasting compute elements in large vectors. “If you go back to training versus inference, in training, what you do at the expense of memory size, is induce some amount of data parallelism that will fit well on a vector (parallelizing into mini-batches for image training, for example), and using these mini-batches to fill the vectors on a wide-vector GPU. But there’s the multiplication of the memory at each compute stage of the problem. So if you’re doing inference, it’s not possible because there’s one new piece of data to understand; there isn’t a set of data to parallel up and feed into the machine, which is why people are seeing how GPUs are inefficient for inference.”
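The trade-off Toon describes can be put into numbers with a simplified model. The sketch below assumes a vector unit with a fixed number of lanes and one sample per lane; the 32-lane width, the activation count, and the function names are all hypothetical and do not describe any particular GPU.

```python
import math


def lane_utilization(batch_size, vector_width):
    """Fraction of vector lanes doing useful work, one sample per lane."""
    passes = math.ceil(batch_size / vector_width)
    return batch_size / (passes * vector_width)


def activation_bytes(batch_size, activations_per_sample, bytes_per_value=2):
    """Memory for intermediate values, which grows linearly with the batch."""
    return batch_size * activations_per_sample * bytes_per_value


VECTOR_WIDTH = 32  # assumed lane count, for illustration only

for batch in (1, 32):
    print(
        batch,
        lane_utilization(batch, VECTOR_WIDTH),
        activation_bytes(batch, activations_per_sample=1000),
    )
```

In this model a mini-batch of 32 keeps every lane busy but needs 32 times the intermediate memory, while a single inference request uses one lane in 32, which is the inefficiency Toon is pointing at.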
Toon says that training and inference are just manipulations on a graph. Training is more compute intensive because the graph has to be built up through many layers and iterations to work out the right features and weights, while inference, once deployed as a service, might require many thousands of users each needing their own volume of compute. “If I build a machine correctly, it can be efficient on graph structures and use that same machine for training and inference, which lets me build future networks that will get better over time and learn more.”
We will be able to get a full briefing in some detail later this year to help us place this squarely in the game and fully understand what the interconnect and core stories are, not to mention where the magical ball of massive memory is wound for these devices. Toon, in the meantime, is counting down the days until the CPU no longer owns the server market, and until GPUs prove too general purpose for an endless paring down of the hardware in favor of interconnect/bandwidth (in the face of growing model complexity).
We have a follow-up article on the way that provides more hints via a description of the C++ and Python based “Poplar” software framework for the Graphcore IPU.