OpenCL Opens Doors to Deep Learning Training on FPGA

Hardware and device makers are in a mad dash to create or acquire the perfect chip for performing deep learning training and inference. While we have yet to see anything that can handle both parts of the workload on a single chip with spectacular results (Nvidia's general-purpose Pascal GPUs are the closest thing yet, with threats coming from Intel/Nervana in the future), FPGAs show promise of making inroads.

So far, most of the FPGA and deep learning work we have covered has centered on accelerating inference rather than on boosting training times and accuracy figures. For instance, earlier this week, we pointed to how binarized neural network inference can be dramatically sped up by FPGAs on the backend. Recent work out of Intel, which is using the Altera assets acquired last year, is focused on bolstering deep learning training (in this case, convolutional neural networks, which are useful in computer vision and classification) using the OpenCL framework, a higher-level abstraction than other approaches to FPGA programming.

The research team at Intel behind the effort is well aware of the barriers FPGAs have historically faced in entering the larger training race. “FPGAs are well known to be able to perform convolutions efficiently, however, most recent efforts to run CNNs on FPGAs have shown limited advantages over other devices, such as GPUs. Previous approaches on FPGAs have often been memory bound due to the limited external memory bandwidth on the FPGA device.” However, using their Deep Learning Accelerator (DLA) approach, written in OpenCL, they were able to show some impressive results on the standard AlexNet benchmark.

While they do not blow GPUs out of the water by any means, this work does highlight how efforts to tune FPGAs for this workload could continue to pay off, providing more in the way of competition to GPUs on training. If this effort succeeds, FPGAs might offer another recipe for efficiently doing high-accuracy training and inference on the same hardware devices, a goal for shops with deep learning workloads at scale. Similar work focused on optimizing FPGA use with OpenCL has been ongoing. For instance, a team at Arizona State took a slightly different approach to AlexNet with an Altera Stratix V, and others have taken a dataflow approach with a similar goal of cutting down the memory bandwidth requirements of CNNs. For smaller-scale neural networks, other FPGA approaches hinge on compression and could eventually find their way into large-scale implementations.

Using their own OpenCL-based DLA approach, the Intel/Altera team was able to extract performance of 1020 images per second (23 images per second per watt) on AlexNet. The performance still has a ways to go, but in terms of efficiency, that 23 images per second per watt figure is roughly on par with the TitanX GPU. Of course, the question is, what matters most to deep learning shops with relatively high node counts? Does efficiency, accuracy, or speed win? Ideally, it’s all three, of course. It is now on the developers of FPGA hardware and tooling to rush to keep pace with GPUs, especially Pascal, which is designed for both HPC and machine learning workloads. The AlexNet results the team describes are based on the Arria 10 FPGA. They compared the results to a competing FPGA, the UltraScale architecture from Xilinx, and argue that the DLA delivers 10X better throughput and 8.4X more GFLOPS.
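The two headline numbers above can be combined to see what they imply. The short sketch below divides the quoted throughput by the quoted efficiency to estimate power draw and energy per image. The function names are illustrative, and the resulting wattage is an inference from the article's figures, not a number reported here.

```python
# Figures quoted in the article for the DLA running AlexNet on Arria 10.
THROUGHPUT_IMG_PER_S = 1020.0
EFFICIENCY_IMG_PER_S_PER_W = 23.0


def implied_power_watts(throughput, efficiency):
    """Power in watts implied by throughput (img/s) and efficiency (img/s/W)."""
    return throughput / efficiency


def energy_per_image_joules(efficiency):
    """Energy per image in joules: 1 / (img/s/W) = W*s/img = J/img."""
    return 1.0 / efficiency


power = implied_power_watts(THROUGHPUT_IMG_PER_S, EFFICIENCY_IMG_PER_S_PER_W)
energy = energy_per_image_joules(EFFICIENCY_IMG_PER_S_PER_W)

print(f"Implied power draw: {power:.1f} W")
print(f"Energy per image:   {energy * 1000:.1f} mJ")
```

Under these figures the implied draw is roughly 44 W, which is the context for comparing efficiency rather than raw throughput against a much higher-power GPU board.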

There are far more details in the full paper describing benchmark and OpenCL programming techniques, but from a high level, the team created a workaround that reduces memory bandwidth requirements by an order of magnitude by using an on-chip stream buffer. This relatively simple addition allows for input and output feature map storage, cutting down on the I/O requirements. They also leveraged the vectorization capabilities, which allowed for 60% DSP efficiency and used other techniques to reduce the number of DSPs required to perform the convolutional layers.
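To get a feel for why holding feature maps on chip is plausible, the sketch below estimates the FP16 footprint of each AlexNet activation tensor. The layer shapes are taken from the commonly published AlexNet definition rather than from this article, and the code does not model the paper's stream buffer or its reported bandwidth savings; it only shows how small the feature maps are.

```python
BYTES_PER_FP16 = 2

# (name, height, width, channels) of AlexNet activations.
# Assumption: shapes follow the commonly published AlexNet definition;
# they are not stated in the article.
FEATURE_MAPS = [
    ("input", 227, 227, 3),
    ("conv1", 55, 55, 96),
    ("conv2", 27, 27, 256),
    ("conv3", 13, 13, 384),
    ("conv4", 13, 13, 384),
    ("conv5", 13, 13, 256),
]


def feature_map_bytes(height, width, channels, bytes_per_value=BYTES_PER_FP16):
    """Storage needed for one feature map at the given precision."""
    return height * width * channels * bytes_per_value


sizes = {name: feature_map_bytes(h, w, c) for name, h, w, c in FEATURE_MAPS}
largest_name = max(sizes, key=sizes.get)

for name, size in sizes.items():
    print(f"{name:>5}: {size / 1024:8.1f} KiB")
print(f"Largest feature map: {largest_name} ({sizes[largest_name]} bytes)")
```

Every activation tensor comes in well under a megabyte at half precision, which is why a buffer built from on-chip memory can hold a layer's input and output instead of sending them through external DRAM between layers.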

“Our DLA is targeted for high performance. In most CNN topologies, the total amount of floating point computation is dominated by the convolution layers. For instance, in AlexNet, convolutions are 92% of the total floating point operations. Hence, the DLA hardware is optimized to maximize the throughput of the convolution layers by exploiting parallelism in computations.”
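The 92% figure in the quote can be sanity-checked with a quick operation count. The sketch below tallies multiply-accumulates for the convolution and fully connected layers of AlexNet, assuming the commonly published layer shapes (including the two-way grouping in conv2, conv4, and conv5); those shapes are not given in the article, and pooling, normalization, and activation costs are ignored.

```python
# (name, out_height, out_width, out_channels, kernel_h, kernel_w, in_channels_per_filter)
# Assumption: standard AlexNet shapes; grouped layers see half the input channels.
CONV_LAYERS = [
    ("conv1", 55, 55, 96, 11, 11, 3),
    ("conv2", 27, 27, 256, 5, 5, 48),
    ("conv3", 13, 13, 384, 3, 3, 256),
    ("conv4", 13, 13, 384, 3, 3, 192),
    ("conv5", 13, 13, 256, 3, 3, 192),
]

# (name, inputs, outputs)
FC_LAYERS = [
    ("fc6", 6 * 6 * 256, 4096),
    ("fc7", 4096, 4096),
    ("fc8", 4096, 1000),
]


def conv_macs(out_h, out_w, out_c, k_h, k_w, in_c):
    """Multiply-accumulates for one convolution layer."""
    return out_h * out_w * out_c * k_h * k_w * in_c


def fc_macs(inputs, outputs):
    """Multiply-accumulates for one fully connected layer."""
    return inputs * outputs


total_conv = sum(conv_macs(*layer[1:]) for layer in CONV_LAYERS)
total_fc = sum(fc_macs(*layer[1:]) for layer in FC_LAYERS)
conv_share = total_conv / (total_conv + total_fc)

print(f"Convolution MACs:     {total_conv:,}")
print(f"Fully connected MACs: {total_fc:,}")
print(f"Convolution share:    {conv_share:.1%}")
```

With these shapes the convolution layers account for about 92% of the multiply-accumulates, in line with the figure the authors cite.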

Also of interest, the team used half precision (FP16) rather than single precision to reduce the requirements on the DSPs. FP16 is not currently supported on the Arria 10’s DSP blocks, but they found a workaround: a shared exponent technique that allows the multiplications to be done in fixed point, which reduced the overhead of the half-precision operations. The DLA is able to implement all layers of AlexNet (an achievement) using the FPGA SDK for OpenCL. The team says it expects to apply the DLA approach to other CNN topologies. “Adapting our DLA for a different CNN topology will not require vectorizing different loops, but will just require changing the vectorization factors according to the dimensions of that topology.”
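The shared exponent idea can be illustrated with a generic block floating point dot product. The sketch below is not the paper's implementation: the function names, the 10-bit mantissa width, and the per-vector block size are all assumptions chosen for illustration. Each vector is scaled to integer mantissas that share one exponent, the multiply-accumulate runs entirely on integers, and the exponents are applied once at the end.

```python
import math


def to_shared_exponent(values, mantissa_bits=10):
    """Quantize a block of floats to integer mantissas with one shared exponent.

    Returns (mantissas, shared_exp) such that value ~= mantissa * 2**shared_exp.
    The mantissa width is an illustrative assumption.
    """
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return [0] * len(values), 0
    # frexp returns (m, e) with max_abs == m * 2**e and 0.5 <= m < 1.
    _, exponent = math.frexp(max_abs)
    shared_exp = exponent - mantissa_bits
    mantissas = [round(math.ldexp(v, -shared_exp)) for v in values]
    return mantissas, shared_exp


def block_dot(a, b, mantissa_bits=10):
    """Dot product using integer multiplies and a single final rescale."""
    mant_a, exp_a = to_shared_exponent(a, mantissa_bits)
    mant_b, exp_b = to_shared_exponent(b, mantissa_bits)
    accumulator = sum(x * y for x, y in zip(mant_a, mant_b))  # integer-only math
    return math.ldexp(accumulator, exp_a + exp_b)


activations = [0.5, -1.25, 3.0, 0.125]
weights = [2.0, 0.75, -0.5, 4.0]

exact = sum(x * y for x, y in zip(activations, weights))
approx = block_dot(activations, weights)
print(f"floating point: {exact}")
print(f"shared exponent: {approx}")
```

The appeal for hardware is that the inner loop needs only fixed-point multipliers, which map directly onto DSP blocks, while the exponent handling happens once per block instead of once per multiplication.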

FPGAs still have a long climb to the top of the training accelerator stack, but efforts like this show there is clear interest. The goal for Intel with its Altera assets will be to build that much-heralded integrated device and make it hum on both sides of the deep learning workload—training and inference alike.

4 Comments

    • But that one is based on a 16nm process node. But thanks for mentioning. So Intel is (quite) a bit behind since they won’t have 14nm inference card in 2017, only Arria 10.

    • Don’t believe the hype of either, as no one knows what they really measured: was it the full in-to-out classification result, or was it just the steps on the GPU or whatever?

      • Without any strong evidence of falsehood, dismissing results of a peer-reviewed research paper is unhelpful (to put it mildly). Please go ahead and prove the authors are wrong, made a mistake, claimed falsehoods, etc. While I’m not familiar with the conference where the work was accepted, AFAICT it is a legit one.

        More to the point, the authors of this paper have compared to an older version of the TensorRT benchmarks. The most recent ones linked by jimmy claim a 65% higher performance even on the M4 than the numbers the paper’s authors cite and compare against. So it seems that even the 28 nm NVIDIA chip (+TensorRT) is 40% faster than the published Intel implementation on the 20 nm Arria 10.
