Graphcore Builds Momentum with Early Silicon

There has been a great deal of interest in deep learning chip startup Graphcore since we first got the limited technical details of the company’s first-generation chip last year. That was followed by revelations about how its custom software stack can run a range of convolutional, recurrent, and generative adversarial neural network jobs.

In our conversations with those currently using GPUs for large-scale training (often with separate CPU-only inference clusters), we have generally found great interest in all new architectures for deep learning workloads. But what would really seal the deal is something that could do both training and inference at a lower price point than high-end graphics cards. It is for these reasons that early interest was piqued around Graphcore, and they could very well explain why big-name investors have approached the company (instead of the other way around).

The company’s own benchmarked performance (grain of salt warning) is a reported 10X to 100X boost over CPU and GPU. It is a striking claim, shared via the company’s own results, which compare against Haswell Xeons and (important distinction) the GeForce TitanX with variable precision, a lower-end part compared to the more recent Pascal/Volta GPUs. Internal benchmarks are interesting, but the real proof will come from what the company’s CEO describes as some “household names” that are testing the chip on their own workloads. That should become clear in the near future, since the first silicon will be in the hands of select early-stage users this quarter.

These early users have been using the company’s Poplar software environment as an interface to test against a simulation of the hardware platform that allows for comprehensive benchmarking of performance. That same software will run on the IPUs, which means we should hopefully see independent benchmarks from leading companies in the very near future.

Plans to press ahead with future generations have been accelerated thanks to a $50 million injection from Sequoia Capital, which sought out Graphcore (versus the company seeking a fresh round) as a contender in the emerging custom AI hardware market.

In an update on current progress, Graphcore’s CEO, Nigel Toon, tells The Next Platform that his team is mapping out a plan to scale their deep learning chip, called an Intelligence Processing Unit, or IPU, to the next process node, going from the current 16 nanometer to the projected TSMC 7 nanometer by the beginning of 2019.

“As a wider industry, we are moving away from this idea of programming computers to do exactly what we say, step-by-step in program and moving to a new world where we can learn from data and build systems based on that approach instead. It is a new way to compute. CPUs will still be important as I/O processors to bring data in and out like a spinal cord but what we really need is a brain—something that learns the structure of the model rather than being explicitly told what the model should be.”

The first product shipping to early customers is a 300-watt part that can plug into PCIe and is designed so several can be connected together into a cluster of IPUs. These connections allow communication between the IPUs. Training or inference data comes in through PCIe, but the chip-to-chip links do not use PCIe in this generation.

Toon points to early user interest focusing on image, video, and speech recognition, all areas that fit well within the range of the TensorFlow focus the company talked about this past year at the AI-centric NIPS conference. Without naming early users, he said that in video there are numerous analysis features that customers have shown interest in, and that for speech and translation, more ambitious goals that go beyond mere word recognition to context are key.

While Google already has its TPUs to do image, video, and speech, as well as other GAN-based workloads, companies like Facebook, Amazon, and others that need to do AI at scale are all candidates to try out Graphcore’s approach and see if there is truly any match for the mighty GPU on the training side of the datacenter (and for FPGAs and lower-power CPUs on the inference side).

Graphcore is on our watchlist for 2018 because the company appears to have an efficient and high-performance way of doing both parts of the deep learning workflow, training and inference, on the same piece of hardware. As Toon says, “This same architecture can be designed to suit both training and inference. In some cases, you can design a piece of hardware that can be used for training then segment that up or virtualize it in a way to support many different users for inference or even different machine learning model deployments. There will be cases when everything is embedded, for instance, and you need a slightly different implementation, but it’s the same hardware architecture. That’s our thesis: one architecture, the IPU, for training, inference, and different implementations of that machine that can be used in servers, cloud, or at the edge of the network.”

With that in mind, Toon tells us that the IPU will not cost more on a per-unit basis than a GPU when it comes to building systems out of the first-generation devices. It is our guess that this is one time the price is being compared to a high-end Volta GPU part instead of the $1,000 TitanX parts.

“The customers we have spoken with are finding GPUs too expensive at scale; they’re good at training. Someone comes along with a programmable approach that can train and deploy on the same hardware and replace 10 GPUs for every IPU—this changes the economics for deployments,” he adds.

We will certainly be keeping an eye open for third-party benchmarks.


1 Comment

  1. Looks like common practice cherry picking when they:

    -> Use P100 instead of V100 (V100 = 6x P100, tensor cores!)
    -> Use the P100's FP16 instead of its int8 capacity for inferencing (int8 = 2x FP16 = 2x FP32).
    -> Set specific latency cutoffs (how does it do on more general inferencing loads?)

    This is so common, everybody does it, more or less.
