In the deep learning inferencing game, there are plenty of chipmakers, large and small, developing custom-built ASICs aimed at this application set. But one obscure company appears to have beaten them to the punch.
Habana Labs, a fabless semiconductor startup, began sampling its purpose-built inference processor to select customers back in September 2018, coinciding with the company’s emergence from stealth mode. Eitan Medina, Habana’s Chief Business Officer, claims its HL-1000 chip is now “the industry’s highest performance inference processor.” It’s being offered to customers in a PCIe card that goes by the name of Goya.
According to Habana’s internal testing, Goya can run inference on 15,012 images/second on the ResNet-50 image recognition benchmark, which would certainly qualify it for the world record. And the results are being spit out with a latency of just 1.3ms, which is more than adequate for real-time interaction.
Nvidia’s finest AI chip, the V100 GPU, manages just over 3,247 images/second at 2.5ms latency (which drops to 1,548 images/second if it has to match Goya’s self-reported 1.3ms latency). For FPGAs, the best result so far for ResNet-50 inferencing appears to be that of the Xilinx Alveo U250, which clocks in at 3,700 images/second. CPUs are further behind at 1,225 images/second, and that’s for a dual-socket Intel Xeon (Skylake) Platinum 8180 server. The upcoming Cascade Lake-SP Xeon, which will benefit from the addition of Vector Neural Network Instructions (VNNI), is expected to deliver about twice the inferencing throughput of its Skylake predecessor.
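To put these throughput figures side by side, the short Python sketch below computes how many times faster Goya's reported result is than each alternative. The numbers are the ones quoted above (Goya's is self-reported by Habana); the dictionary keys and the function name are illustrative only.

```python
# ResNet-50 inference throughput in images/second, as quoted in the article.
# The Goya figure comes from Habana's own internal testing.
THROUGHPUT = {
    "Goya (HL-1000)": 15012,
    "Nvidia V100 (2.5ms latency)": 3247,
    "Nvidia V100 (1.3ms latency)": 1548,
    "Xilinx Alveo U250": 3700,
    "2x Xeon Platinum 8180": 1225,
}


def speedup_over(baseline: str, reference: str = "Goya (HL-1000)") -> float:
    """Return how many times faster `reference` is than `baseline`."""
    return THROUGHPUT[reference] / THROUGHPUT[baseline]


if __name__ == "__main__":
    for name in THROUGHPUT:
        if name != "Goya (HL-1000)":
            print(f"{name}: Goya is {speedup_over(name):.1f}x faster")
```

On these figures, Goya's lead ranges from roughly 4x over the V100 (at 2.5ms latency) and the Alveo U250 to about 12x over the dual-socket Xeon server.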
Since the Goya card only sucks up about 100 watts while crunching through ResNet-50 models, the energy efficiency for this operation is on the order of 150 images/second/watt, which would appear to be another high-water mark for the benchmark. The top-of-the-line V100 GPU delivers about 18 images/second/watt, while the dual-socket Xeon setup comes in at about 6 images/second/watt.
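The efficiency figure is simple division, and a minimal Python sketch makes the arithmetic explicit. The Goya inputs are the 15,012 images/second and roughly 100 watts quoted above; the V100 and Xeon efficiencies are taken directly from the text rather than derived, and the names used here are illustrative.

```python
def images_per_second_per_watt(images_per_second: float, watts: float) -> float:
    """Energy efficiency: throughput divided by power draw."""
    return images_per_second / watts


# Goya: reported throughput and approximate power draw on ResNet-50.
GOYA_EFFICIENCY = images_per_second_per_watt(15012, 100)

# Efficiencies quoted in the article (images/second/watt), not derived here.
V100_EFFICIENCY = 18
XEON_EFFICIENCY = 6

if __name__ == "__main__":
    print(f"Goya: {GOYA_EFFICIENCY:.0f} images/second/watt")
    print(f"vs V100: {GOYA_EFFICIENCY / V100_EFFICIENCY:.1f}x")
    print(f"vs dual-socket Xeon: {GOYA_EFFICIENCY / XEON_EFFICIENCY:.1f}x")
```

Taken at face value, that is a bit more than 8 times the quoted V100 efficiency and about 25 times that of the Xeon setup.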
Despite the uninspiring performance numbers for CPUs, that’s where most inferencing is performed these days. That’s primarily because inferencing is a high-volume activity, which generally relies on large-scale distribution across a cloud-sized infrastructure. Plus, inferencing neural network models doesn’t require nearly the same compute-intensity as training those models, most of which is done on elite, power-sucking GPUs like the V100. These high-end GPUs are very good at inferencing as well but tend to be less than optimal for hyperscale environments, which is why so many chip vendors (including Nvidia) are treating inferencing as a distinct opportunity.
While the Goya was strictly designed for datacenter inferencing, it’s meant to be suitable for a wide array of deep learning models (and certainly more than ResNet-50). Medina says the design is general-purpose enough to address the most popular deep learning application categories, including recommendation systems, sentiment analysis, and neural machine translation, as well as image recognition. Those encompass most of the commonly used neural network topologies – convolutional, recurrent, and so on.
It’s worth noting that the HL-1000 processor’s impressive performance is not a result of cutting-edge semiconductor manufacturing. Medina says the chip is built on the 16nm process node, which suggests even better results are possible if the company moved to the current state of the art, say TSMC’s 7nm technology. That makes the results for this first-generation chip all the more impressive.
The HL-1000 is powered by a GEneral Matrix to Matrix Multiplication (GEMM) engine and 8 Tensor Processor Cores (TPCs). Each TPC has its own local memory, along with access to on-chip shared memory. Supported data types include FP32, INT32, INT16, INT8, UINT32, UINT16, and UINT8. Unlike some inference hardware implementations, Habana opted not to support data types under 8 bits. Access to off-chip memory is provided via a DDR4 interface.
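To illustrate why low-precision integer types such as INT8 matter for inference, here is a generic sketch of a quantized matrix multiply in plain Python. It is not a description of Habana's GEMM engine, whose internals the company has not detailed here: the symmetric per-tensor quantization scheme, the wide integer accumulator, and all function names are assumptions chosen for illustration.

```python
from typing import List, Tuple

Matrix = List[List[float]]
IntMatrix = List[List[int]]


def quantize_int8(m: Matrix) -> Tuple[IntMatrix, float]:
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(x) for row in m for x in row)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [[max(-127, min(127, round(x / scale))) for x in row] for row in m]
    return q, scale


def gemm_int(a_q: IntMatrix, b_q: IntMatrix) -> IntMatrix:
    """Integer matrix multiply; hardware would typically accumulate in INT32."""
    rows, inner, cols = len(a_q), len(b_q), len(b_q[0])
    return [
        [sum(a_q[i][k] * b_q[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]


def quantized_matmul(a: Matrix, b: Matrix) -> Matrix:
    """Quantize both operands, multiply as integers, then rescale to floats."""
    a_q, a_scale = quantize_int8(a)
    b_q, b_scale = quantize_int8(b)
    acc = gemm_int(a_q, b_q)
    return [[v * a_scale * b_scale for v in row] for row in acc]


if __name__ == "__main__":
    a = [[1.0, -2.0], [0.5, 4.0]]
    b = [[0.25, 1.0], [-1.0, 2.0]]
    print(quantized_matmul(a, b))  # close to the exact product, with small rounding error
```

The result only approximates the floating-point product, which is the usual trade-off: 8-bit operands cut memory traffic and multiplier cost at the price of a small quantization error.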
The Goya PCIe card comes in a few different flavors: single-slot or dual-slot form factors, and memory capacities of 4, 8, or 16 GB. Although power required is not expected to exceed 100 to 140 watts for most inferencing work, the card can draw up to 200 watts.
Building a usable software stack for custom chips is often the most challenging task, especially for startups. Here Habana opted to stick with standards as much as possible. That starts with ONNX, an open format for deep learning models that is supported in some of the most popular frameworks, including Caffe2, Microsoft Cognitive Toolkit, MXNet, and PyTorch. It enables models that were trained in one framework to be transferred to another for inferencing. Habana-supported frameworks include TensorFlow, MXNet, Caffe2, PyTorch, and Microsoft Cognitive Toolkit, with more on the way. As a result, Goya can use models trained on GPUs, CPUs or even Google TPUs.
The custom software tools include Habana’s SynapseAI compiler and runtime for the HL-1000/Goya platform. The compiler creates a graph, delivering a “recipe” that is used by the SynapseAI runtime at execution time. A SynapseAI API provides the glue between ONNX and all the standard frameworks.
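The compile-then-execute split described above is a common pattern, and a toy version can be sketched in a few lines of Python. This does not reflect SynapseAI's actual API or recipe format, neither of which is described here: the graph representation, the `compile_graph` and `run_recipe` functions, and the idea of a recipe as an ordered list of operations are all hypothetical, and the sketch assumes the graph has no cycles.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical graph format: output name -> (operation, names of its inputs).
Graph = Dict[str, Tuple[Callable[..., float], List[str]]]
# Hypothetical "recipe": operations in an order that is safe to execute.
Recipe = List[Tuple[str, Callable[..., float], List[str]]]


def compile_graph(graph: Graph) -> Recipe:
    """Ahead-of-time step: order nodes so each one runs after its inputs."""
    recipe: Recipe = []
    done = set()

    def visit(name: str) -> None:
        # Names missing from the graph are external inputs, not nodes.
        if name in done or name not in graph:
            return
        op, inputs = graph[name]
        for dep in inputs:
            visit(dep)
        done.add(name)
        recipe.append((name, op, inputs))

    for name in graph:
        visit(name)
    return recipe


def run_recipe(recipe: Recipe, inputs: Dict[str, float]) -> Dict[str, float]:
    """Execution step: walk the precomputed recipe without re-analyzing the graph."""
    values = dict(inputs)
    for name, op, arg_names in recipe:
        values[name] = op(*(values[a] for a in arg_names))
    return values


if __name__ == "__main__":
    graph: Graph = {
        "out": (lambda h: max(h, 0.0), ["hidden"]),   # ReLU-style activation
        "hidden": (lambda x, w: x * w, ["x", "w"]),   # scalar "layer"
    }
    recipe = compile_graph(graph)
    print([name for name, _, _ in recipe])
    print(run_recipe(recipe, {"x": 2.0, "w": 3.0})["out"])
```

The point of the split is that scheduling decisions are made once, at compile time, so the runtime only has to replay a fixed sequence of operations for each inference request.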
Medina says their current set of customers (who will remain nameless and numberless) represent a variety of different AI market applications. At this point, all the work appears to be in the trial phase. Logic would suggest that at least some of these customers are in the hyperscale business.
If Habana’s initial inferencing solution proves itself in the field, the company will have leapfrogged the competition by a comfortable margin. However, that won’t deter its rivals, especially the more established chipmakers. Nvidia already produces inference-specific GPUs, the newest being the T4 Tensor Core GPU. While that one looks to be a good deal less performant than Habana’s solution, it won’t be the GPU maker’s last shot at this burgeoning market. Intel, meanwhile, is getting set to deliver its Nervana Neural Network Processor (NNP-I), a chip it co-designed with Facebook specifically for the hyperscale inferencing space. It’s expected to launch in the second half of 2019.
And then there are FPGA-based solutions, using either Intel or Xilinx parts. These can only get more powerful, since both companies have their sights set on the inferencing market for their respective technologies. And with Microsoft heavily invested in FPGA-powered inferencing, the state of the art in programmability should advance rapidly.
Habana intends to pursue the deep learning training market as well, but with a different product. Its upcoming solution, codenamed Gaudi, is already in the works, with the company promising that it will scale linearly to thousands of devices. Although there is little information on its design, it’s apt to be a very different type of product from Goya.
“Training and inferencing have different requirements,” says Medina. “This requires very different compute, very different memory, very different latency and power envelope. So, designing one processor for both would have meant asking our customers to compromise on performance and power consumption and cost.”
Gaudi is slated for its debut in the second quarter of 2019.