Wrenching Efficiency Out of Custom Deep Learning Accelerators

Custom accelerators for neural network training have garnered plenty of attention in the last couple of years, but without significant software footwork, many remain difficult to program and can leave efficiency on the table. Various model optimizations can help, but some argue that the efficiency and utilization gaps can also be closed with a tailored compiler.

Eugenio Culurciello, an electrical engineer at Purdue University, argues that getting full computational efficiency out of custom deep learning accelerators is difficult. This prompted his team at Purdue to build an FPGA-based accelerator that is agnostic to CNN workloads and can eke out maximum utilization and efficiency on a range of deep learning tasks, including ResNet and AlexNet.

Snowflake is a scalable, programmable, low-power accelerator for deep learning with a RISC-based custom instruction set. It implements a control pipeline for that instruction set, and a custom compiler generates the instructions and handles data in main memory. The software stack starts from the high-level model representation in Torch7 and turns it into an instruction stream that runs on Snowflake. The Snowball software framework is what makes the hardware usable and keeps its utilization high.

The hardware architecture for Snowflake is based on 16-bit multiply-and-accumulate units (MACs). A vector MAC (vMAC) is composed of 16 MACs that together process 256 bits in one cycle. A compute unit (CU) has four vMACs, each with its own kernel scratchpad buffer. Data transfer time is hidden behind MAC compute time by using a double-buffering strategy. The compute units are grouped into clusters, each with its own RISC-based pipeline. Two ARM Cortex-A9 processors serve as hosts.
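To make the double-buffering point concrete, here is a minimal Python sketch. The MAC counts come from the description above; the timing model, the function names, and the per-tile load and compute costs are illustrative assumptions, not measurements or details of the Snowflake design.

```python
# Figures stated in the article.
MAC_BITS = 16        # each MAC operates on 16-bit values
MACS_PER_VMAC = 16   # one vMAC groups 16 MACs
VMACS_PER_CU = 4     # one compute unit holds four vMACs

bits_per_vmac_cycle = MAC_BITS * MACS_PER_VMAC  # 256 bits per cycle
macs_per_cu = MACS_PER_VMAC * VMACS_PER_CU      # 64 MACs per compute unit


def total_time(tiles, load, compute, double_buffered):
    """Time units needed to process `tiles` tiles of work.

    `load` and `compute` are hypothetical per-tile costs in arbitrary units.
    """
    if tiles == 0:
        return 0
    if not double_buffered:
        # Single buffer: every load must finish before its compute starts.
        return tiles * (load + compute)
    # Two buffers: the first load cannot be hidden, then the load of tile
    # i + 1 overlaps the compute of tile i, and the last compute runs alone.
    return load + (tiles - 1) * max(load, compute) + compute


def mac_busy_fraction(tiles, load, compute, double_buffered):
    """Fraction of the total time the MACs spend computing."""
    return tiles * compute / total_time(tiles, load, compute, double_buffered)


if __name__ == "__main__":
    # Hypothetical workload: 10 tiles, 2 units to load, 5 units to compute.
    for mode in (False, True):
        print(mode, total_time(10, 2, 5, mode), mac_busy_fraction(10, 2, 5, mode))
```

In this toy model the transfers disappear behind compute whenever a tile takes longer to process than to load, which is the situation a compiler would presumably try to arrange when it sizes tiles.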

“The architecture was designed to provide high performance given optimal sequence of instructions. But, manually crafting assembly-like instructions can be cumbersome and error prone, especially when a model is comprised of several layers like in ResNet,” the creators explain. “Even if one was patient enough to manually write code for some state of the art deep learning models, further customization on both sides: the hardware and software would require modifying thousands of lines of assembly code, preventing experimentation on custom systems for deep learning.”

“Snowflake is able to achieve a computational efficiency of over 91% on entire modern CNN models, and 99% on some individual layers,” Culurciello says. “[Snowflake] implemented on a Xilinx Zynq XC7Z045 SoC is capable of achieving a peak throughput of 128 G-ops/s and a measured throughput of 100 frames per second and 120 G-ops/s on the AlexNet CNN model, 36 frames per second and 116 G-ops/s on the GoogLeNet CNN model, and 17 frames per second and 122 G-ops/s on the ResNet-50 CNN model. To the best of our knowledge, Snowflake is the only implemented system capable of achieving over 91% efficiency on modern CNNs and the only implemented system with GoogLeNet and ResNet as part of the benchmark suite.”
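One way to read those figures is as a ratio of measured to peak throughput. The short Python sketch below does that arithmetic with the numbers quoted above; treating efficiency as measured divided by peak is an assumption about how the team defines it, and because the published figures are rounded, the ratios need not match the quoted 91% exactly.

```python
# Throughput figures as quoted in the article (G-ops/s and frames per second).
PEAK_GOPS = 128.0
MEASURED = {
    "AlexNet": (100, 120.0),
    "GoogLeNet": (36, 116.0),
    "ResNet-50": (17, 122.0),
}


def efficiency(measured_gops, peak_gops=PEAK_GOPS):
    """Measured throughput as a fraction of peak (assumed definition)."""
    return measured_gops / peak_gops


def gops_per_frame(measured_gops, fps):
    """Implied work per frame, in G-ops, from throughput and frame rate."""
    return measured_gops / fps


if __name__ == "__main__":
    for name, (fps, gops) in MEASURED.items():
        print(f"{name}: {efficiency(gops):.1%} of peak, "
              f"about {gops_per_frame(gops, fps):.2f} G-ops per frame")
```

The per-frame figure is a useful sanity check: deeper models such as ResNet-50 need several times more operations per image than AlexNet, which is why the frame rate drops even though the G-ops/s stays nearly flat.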

The team also draws an interesting performance-per-watt comparison with Google’s TPU. The team does not explain the results for the TPU but says they “kept an optimistic number for the TPU, which performs much lower than 100% utilization when running models it is not optimized for.” Culurciello points out that the Snowflake SoC 1k will have the same computational power as the latest Nvidia Titan X GPU. See the bottom of the article for notes on the comparison data behind these projections of the chip in FPGA emulation.

Culurciello concludes that Snowflake and the Snowball framework address some of the key software design points that lead many hardware accelerators to leave efficiency gaps. These include model structure parsing, workload breakdown, loop rearrangement, and memory access balancing. The techniques have been tested with positive results on the team's own Snowflake accelerator, and the team is confident they can apply to other custom accelerators. An in-depth description of both the Snowflake architecture and Snowball software framework can be found here.
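As a rough illustration of what "workload breakdown" can mean, the Python sketch below splits the output rows of a convolution layer into tiles whose input data fits in a fixed-size on-chip buffer. This is a generic tiling scheme written for this article, not Snowball's actual algorithm; the function name `split_rows`, its parameters, and the buffer size are all hypothetical.

```python
def split_rows(out_rows, in_width, in_channels, kernel, stride, buffer_words):
    """Split a conv layer's output rows into tiles that fit a buffer.

    Returns a list of (first_output_row, number_of_output_rows) pairs.
    All sizes are in abstract "words"; nothing here models real hardware.
    """
    words_per_in_row = in_width * in_channels
    max_in_rows = buffer_words // words_per_in_row
    if max_in_rows < kernel:
        raise ValueError("buffer too small for even one output row")
    # n output rows need (n - 1) * stride + kernel input rows.
    rows_per_tile = (max_in_rows - kernel) // stride + 1
    tiles = []
    row = 0
    while row < out_rows:
        n = min(rows_per_tile, out_rows - row)
        tiles.append((row, n))
        row += n
    return tiles


if __name__ == "__main__":
    # Hypothetical layer: 8 output rows, 10-wide input, 4 channels, 3x3 kernel.
    print(split_rows(out_rows=8, in_width=10, in_channels=4,
                     kernel=3, stride=1, buffer_words=200))
```

A compiler that knows the buffer sizes can make this kind of decision per layer, which is the sort of bookkeeping that becomes tedious and error prone when done by hand in assembly.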

The technical issues with custom accelerators represent one challenge; the other, larger issue on the horizon is how much traction these custom chips will get as larger companies work to integrate valuable features for deep learning frameworks on standard chips (as Intel is doing with its Knights family of processors, for instance). With wider-ranging deep learning features in familiar software environments, the need for custom accelerators for these workloads might begin to disappear. However, work to move FPGAs closer to that starting line is interesting in that it adds to the architectural mix.

A note about the comparisons: they use the highest available performance numbers and the lowest power numbers from these sources:

Google TPU: https://arxiv.org/abs/1704.04760

NVIDIA V100: https://www.nvidia.com/en-us/data-center/tesla-v100/, power: 300 W, deep learning performance: 120 Tflops

NVIDIA Titan Xp: https://www.nvidia.com/en-us/geforce/products/10series/titan-xp/, power: 250 W, single-precision FP32 performance: 12.1 Tflops (FP16 performs worse than FP32, so it was not used)

NVIDIA Tegra X2: https://en.wikipedia.org/wiki/Tegra#Tegra_X2, power: 7.5 W, performance: 1.5 Tflops
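For readers who want to turn the GPU figures above into performance per watt, the arithmetic is a single division, sketched here in Python. The numbers are the ones listed in the notes; the helper name is an assumption, and the results are not like-for-like because the V100 figure is a deep learning number while the Titan Xp figure is FP32.

```python
# (Tflops, watts) as listed in the notes above.
SOURCES = {
    "NVIDIA V100": (120.0, 300.0),
    "NVIDIA Titan Xp": (12.1, 250.0),
    "NVIDIA Tegra X2": (1.5, 7.5),
}


def tflops_per_watt(tflops, watts):
    """Performance per watt from a peak figure and a power figure."""
    return tflops / watts


if __name__ == "__main__":
    for name, (tflops, watts) in SOURCES.items():
        print(f"{name}: {tflops_per_watt(tflops, watts):.3f} Tflops/W")
```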

OranjeeGeneral says:

August 7, 2017 at 12:39 pm

There is a huge market here for custom accelerators, especially in the automotive market if we are transitioning away from internal combustion engines to electric at the same time. A 100W compute board like Tegra X1 is not going to fly, as it will seriously limit your range.

Also IoT edge devices, where currently any nVidia solution is like 100-1000 too power hungry.

- jimmy says:
  
  August 8, 2017 at 3:28 am
  
  The Tegra X1 pulls around 10 watts under high load; if you really stress all the components it'll do around 15 watts.
  
  The Tegra X2 runs at 7.5 watts peak at similar loads.
  
  - Matt says:
    
    August 9, 2017 at 11:20 am
    
    I don’t think he meant to be discussing the Tegra X1. I think he meant the full Drive PX2 system (which incidentally uses a Tegra X2 SoC not a Tegra X1) with the discrete GPUs. I’m not sure how he got from 100 watts to 1000 watts though.
    
    That Drive PX2 system, however, is a development platform. It is to be replaced in production cars with the 20 W to 30 W Xavier. If an electric car has a 75 kWh battery then it would take about 104 days of continuous (24 hours a day) usage to run it down using only the 30 W Xavier. The Tesla Model 3 uses about 24,000 Wh for every 100 miles driven. During that time the Xavier will generally have used under 100 Wh, or less than one half of one percent of the energy. I don’t think that’s much of a concern.
    
m says:

October 26, 2017 at 7:30 am

Once the OpenCL to FPGA compilers begin to approach even 40% of the sophistication of the current compiler state of the art (it’s really just a matter of putting the effort in, and maybe a bit of creativity ;), the solution they produce to the multiobjective constrained optimization problem of mapping a skilled and aware programmer’s OpenCL to the hardware should improve rapidly.

We’ll know we’ve arrived when we can smash out code that looks like a (skilled and aware!) python one-liner that’s using zeromq, which instantiates some arbitrary network of kernels communicating asynchronously over infra and intER device pipes, and finishing in an hour or few our previous 72 hour runtime stochastic global nonconvex nonlinear constrained optim problem with ad-box ever changing by the day spaghetti-code objective function, which is run weekly now cuz it’s a fire hazard to bystanders, but which nevertheless brings home the bacon.
