Around this time last year, we delved into a new FPGA-based architecture from startup DeePhi Tech that targeted efficient, scalable machine learning inference. The company just rounded out its first funding effort with an undisclosed sum from major investors, including Banyan Capital and, as we learned this week, FPGA maker Xilinx.
As that initial article details, the Stanford and Tsinghua University-fed research focused on network pruning and compression at low precision with a device that could be structured for low latency and custom memory allocations. These efforts were originally built on Xilinx FPGA hardware and given this first round of investments, we can imagine this will continue to be the case.
What Xilinx sees in DeePhi, however, goes beyond a hardware partnership; if their views about the future of inference workloads are spot-on, the two companies could be in for some serious interest among hyperscale companies on the datacenter front and at the edge of the network for a range of devices.
As Xilinx distinguished engineer Ashish Sirasao tells The Next Platform, the hard-wired network pruning and compression techniques central to DeePhi’s approach have already been tuned for convolutional neural networks and long short-term memory (LSTM) models, but there is an increased drive toward meshing these two network types to do things like near real-time video translation and captioning. This kind of inference is computationally intensive, but Sirasao says DeePhi’s ability to do it in the 8-bit precision range across drastically reduced model sizes will be increasingly important, and can be very competitive with CPUs and GPUs for the same dual-tasking.
“DeePhi has the technology to prune LSTM and convolutional neural networks in a multilayered way, making it possible to do image classification with natural language processing at the same time. We see a lot of momentum with people trying to merge these technologies and we want to make sure there is an absolutely optimized implementation of CNNs and LSTMs for these multilayer problems.” He adds that they have proven internally that going to 8-bit for these merged problems is where the efficiency is since the computation can be doubled. “This new wave in inference is what DeePhi is doing now; we are helping to create more proof points and engagements to drive research.”
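To make the 8-bit idea concrete, here is a minimal sketch of symmetric linear quantization, one common way to map floating-point weights onto 8-bit integers. It is an illustration only: the article does not describe DeePhi's actual quantization scheme, and the function names and sample weights below are hypothetical.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero input, where any scale would do.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the 8-bit representation."""
    return [v * scale for v in quantized]


# Hypothetical weights, chosen only for illustration.
weights = [0.82, -0.41, 0.05, -1.27, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding to the nearest step bounds the error by half a step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each weight now fits in one byte instead of four, at the cost of a rounding error no larger than half of one quantization step.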
As Ravi Sunkavalli, senior director of IP engineering at Xilinx, adds, “if you have looked at the original Google TPU paper, one-third of the workloads they have are in the same area. Here also, the computations are very irregular. Some of the challenges DeePhi has addressed (and what FPGAs are good for) could transform this by allowing things like online deep compression while computing with special purpose logic.” He notes that when computations are irregular, DeePhi on an FPGA can take advantage of sparsity by using custom sparse matrix multiplication techniques. “When parallelization is challenging in CPUs or GPUs with their notions of threads, they can get hard to use. With the DeePhi approach, there is a hardware load balancer, a custom memory data path with custom hardware and scheduling, so it is instead possible to use a parallelized pipeline architecture.” He says this is how DeePhi is proving its performance and energy efficiency advantages over CPUs and GPUs.
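The sparsity point can be illustrated in software with a matrix-vector product over the compressed sparse row (CSR) format, which stores and multiplies only the nonzero weights. This is a generic textbook sketch, not DeePhi's hardware data path; the helper names and the sample matrix are assumptions made for the example.

```python
def to_csr(dense):
    """Convert a dense matrix (list of rows) to CSR arrays."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        # row_ptr[i + 1] marks where row i ends inside `values`.
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = A @ x, touching only the stored nonzero entries."""
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


# Hypothetical 4x4 weight matrix with 4 nonzeros out of 16 entries.
dense = [
    [0, 2, 0, 0],
    [1, 0, 0, 3],
    [0, 0, 0, 0],
    [0, 0, 4, 0],
]
x = [1.0, 2.0, 3.0, 4.0]

values, col_idx, row_ptr = to_csr(dense)
y = csr_matvec(values, col_idx, row_ptr, x)
print(y, len(values))
```

Here the product needs 4 multiplications rather than 16. Note that the rows hold different numbers of nonzeros, which is the kind of irregularity that makes even work distribution difficult and that a load balancer is meant to address.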
Overall, DeePhi has developed a complete automation flow of compression, compiling and acceleration that jointly optimizes algorithm, software and hardware. With server workloads on the horizon, DeePhi has already been collaborating with leading companies in the drone, security surveillance and cloud service fields. CEO and co-founder Song Yao (Tsinghua University) says, “The FPGA based DPU platform achieves an order of magnitude higher energy efficiency over GPU on image recognition and speech detection.” DeePhi believes this co-design approach represents the future of deep learning.
Model size reduction and network pruning mean that it is possible to fit more into local memory, which is part of where the FPGA advantage lies, given the ability to build a custom cache. On the network pruning side, DeePhi has figured out how to reduce the total number of computations required to run a network: doing more inference with less, essentially. As Sirasao explains, “This network pruning and approach does not work well on CPU and GPU. You can do it, but you don’t get an absolute advantage. When you reduce the model size and can do effective compute better, it translates into a lower latency implementation, which is key in speech recognition and real-world environments and some of the edge devices.”
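As a rough illustration of how pruning cuts the computation count, the sketch below applies magnitude-based pruning, a widely used criterion that zeroes the smallest weights. The article does not say which criterion DeePhi applies, so treat this as an assumption; the function names, the sample layer and the 50 percent sparsity target are all hypothetical.

```python
def prune_by_magnitude(weights, sparsity):
    """Zero out roughly the `sparsity` fraction of smallest-magnitude weights."""
    magnitudes = sorted(abs(w) for row in weights for w in row)
    n_prune = int(len(magnitudes) * sparsity)
    if n_prune == 0:
        return [list(row) for row in weights]
    # Everything at or below this magnitude is removed; ties at the
    # threshold can prune slightly more than the requested fraction.
    threshold = magnitudes[n_prune - 1]
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]


def count_multiplies(weights):
    """Multiplications needed for one matrix-vector product, skipping zeros."""
    return sum(1 for row in weights for w in row if w != 0)


# Hypothetical 2x3 layer, chosen only for illustration.
layer = [
    [0.9, -0.05, 0.3],
    [0.01, -0.7, 0.2],
]
pruned = prune_by_magnitude(layer, sparsity=0.5)

before = count_multiplies(layer)
after = count_multiplies(pruned)
print(pruned, before, after)
```

Halving the nonzero weights halves both the storage and the multiplications, but only if the hardware can skip the zeros, which is where a sparse-aware design comes in.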
In the last couple of years, much of the attention on deep neural networks has been focused on training at scale. This was an exciting area to cover because it brought a new range of architectures to the fore and pushed general purpose processor companies like Intel and Nvidia to do some interesting things they might not have done otherwise. However, training is taking a backseat lately to inference, which is where the performance and efficiency will definitely matter more as time goes on. It is no surprise to see FPGAs finding a fit here or to see companies like DeePhi rolling in the funding and pushing new research into paring down trained networks for speedy, cool inference.
For the technically inclined, here are some slides highlighting the LSTM work on FPGA that has been keeping DeePhi occupied.
Where do you guys think this will be offered? AWS F1 offers some inference solutions but not from these guys
Embedded and mobile environments, which are mostly off-grid and off-cloud services, or anything else where latency matters. This is where the real big market is anyway. Don’t believe the hype of Google, Facebook, Microsoft, Baidu and the other internet giants, as they want you to do everything in their cloud!