It’s possible to count the number of major FPGA vendors on one hand, and despite a desire to differentiate, they are flocking together around some key areas where FPGA market growth seems most promising.
Despite the lingering drawbacks of poor double-precision performance, high prices, and what is arguably a much trickier (albeit more exact) programming approach, FPGAs might have something going for them that discrete GPUs do not: a power profile that is primed for the cloud. If the FPGA vendors can convince end users that their accelerators offer a considerable performance boost for key workloads (which they do in some cases), offer a programming environment on par, complexity-wise, with other accelerator frameworks (CUDA, for instance) by encouraging OpenCL development, and wick away the price concerns by offering FPGAs in the cloud, there could be hope on the horizon.
Of course, that hope is bolstered by the urge to snap FPGAs into ultra-dense servers inside cloud infrastructure rather than selling them as accelerated bare metal. This already happened for FPGAs in financial services, where their solid integer performance makes them a prime fit. But since they haven’t seen mass adoption elsewhere, it’s time to look to a new delivery box, not to mention some new application goodies to put inside.
Just as their GPU accelerator cousins are rallying around deep learning to make the leap to a broader set of users, particularly in the web-scale and cloud space, the FPGA set sees a real chance to invade the marketplace by tackling neural network and deep learning workloads. These new classes of applications mean new markets, and with the cloud removing some of the management overhead, they could mean broader adoption. Efforts to move this along are already working in some key machine learning, neural network, and search applications. FPGAs are becoming more commonplace outside of the hyperscale context in areas like natural language processing (useful for a growing array of use cases from clinical settings to consumer services), medical imaging, deep packet inspection, and beyond.
Over the last year there have been a few highly publicized use cases highlighting the role of FPGAs for specific workloads, particularly in the deep learning and neural network spaces, as well as image recognition and natural language processing. For instance, Microsoft used FPGAs to give its Bing search service a 2X boost across 1,632 nodes, employing a creative high-throughput 2D torus network to support the Altera FPGA-driven work. China’s search engine giant, Baidu, which is also a heavy user of GPUs for many of its deep learning and neural network tasks, is using FPGAs for the storage controller on a 2,000 petabyte array that ingests between 100 terabytes and a petabyte per day. These and other prominent cases of large-scale datacenters adopting FPGAs, especially when they choose them over GPUs, are bringing more attention to the single-precision floating point performance per watt that FPGAs bring to the table.
While some use cases, including the Baidu example, featured GPUs as the compute accelerator and FPGAs on the storage end, Altera, Xilinx, Nallatech, and researchers from IBM on the OpenPower front were showcasing where FPGAs will shine for deep learning in the cloud. The takeaway from these use cases is that the speedups for key applications came from ultra-dense machines that would run far too hot if a GPU were placed alongside the Xeons. For tightly packed systems, FPGAs are a viable choice on the thermal front, and even though there might not be as many algorithms where FPGAs can show off (compared to GPUs), this could be the beginning of a golden era for the coprocessors, especially now that there are CAPI and QPI hooks for taking advantage of shared memory on FPGA-boosted systems.
If you ask Altera, Xilinx, and others, this is happening because of what we can call the “three P’s of FPGA adoption”: performance, power, and price. We were able to sync up with several of the main FPGA vendors at the GPU Technology Conference (the irony) and the co-located OpenPower Summit, where we heard quite a bit about the golden age of the FPGA, all brought about by the cloud. With an estimated 75 percent of all servers being sold to live a virtualized life, the market rationale is not difficult to see, but performance per watt is the real story, especially when compared to GPUs, says Mike Strickland, who directs the compute and storage group at Altera. That role puts Strickland in direct contact with HPC and hyperscale shops and gives him an understanding of their architectural considerations.
Although FPGAs have a reputation for being expensive, at high volume they are on par with other accelerators, Strickland explained, pointing to Microsoft as a key example. He also says that for neural algorithms the efficiency of the performance boost far outstrips GPUs, which leads to additional savings. There are numerous charts highlighting the price/performance potential of FPGAs in both bare metal and virtual environments, but the real question is that stubborn fourth “P”: programming.
There are programming parallels that make the possibility of an FPGA boom more practical. Strickland estimates there are around 20,000 CUDA programmers in the world, which he says demonstrates the size of the potential developer pool for an OpenCL-based approach to coding for FPGAs. The CUDA and OpenCL models are quite a bit more similar than they used to be, but both accelerator programming frameworks come with a reasonably steep learning curve. For developers to branch out to either GPUs or FPGAs means they must see the potential for big performance, efficiency, and other gains, and that’s the message the FPGA world is trying to push with its focus on deep learning and neural networks.
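The syntactic overlap between the two models is easy to see in the canonical vector-add example. The sketch below is illustrative only: it shows an OpenCL C kernel of the kind an FPGA-targeting OpenCL compiler (such as Altera's SDK for OpenCL) would synthesize, with the CUDA equivalent shown in a comment for comparison. The kernel name `vadd` and its parameters are hypothetical.

```c
/* CUDA equivalent, for comparison; the differences are mostly
 * qualifiers and how the per-element index is obtained:
 *
 *   __global__ void vadd(const float *a, const float *b, float *c, int n) {
 *       int i = blockIdx.x * blockDim.x + threadIdx.x;
 *       if (i < n) c[i] = a[i] + b[i];
 *   }
 */
__kernel void vadd(__global const float *a,
                   __global const float *b,
                   __global float *c,
                   const int n)
{
    int i = get_global_id(0);   /* one work-item per array element */
    if (i < n)
        c[i] = a[i] + b[i];
}
```

A developer who can already reason in CUDA's grid-of-threads model has most of the mental furniture needed for OpenCL's work-items and work-groups, which is the crux of Strickland's argument.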
It is not unreasonable to see how key advancements might lead to FPGAs as a service in, for instance, the Amazon cloud. There are already GPU instance types available, which one could argue might lead to more testing and development with CUDA code for new workloads. For Altera or Xilinx to find their FPGAs offered on IaaS clouds could encourage more OpenCL and programming progress, and might prove an ultra-efficient accelerator addition for cloud providers hoping to give users a boost without high power and heat complications.
Without a simpler way to run complex deep learning and neural network code, all the potential power and acceleration boosts are lost on the market. During a presentation at GTC and the OpenPower Summit last week, Manoj Roge, who directs the datacenter division of Xilinx, said that FPGAs stand to make gains in the near future for specialized workloads. This has always been the case (there are a few areas where FPGAs do really well, Monte Carlo simulations being one), but the cloud is making access more practical and helping users onboard faster.
“We are in the age of software defined everything—to virtualize all elements of the datacenter from compute, storage and networking and deliver it as a service or cloud,” Roge said. “A lot has gone into virtualizing compute and storage, not as much on the networking side, but that’s where standards and a robust ecosystem come into play. There’s a need to build things with standards but some, including standardizing on X86, are not good for all workloads. Some are seeing how they can get speedups for specific workloads with FPGAs and GPUs.”
The challenges, even for software-defined datacenters, still boil down to power and thermal density, something that multicore processors sought to tackle. “We need to rethink datacenter architecture so we can boost performance and reduce latency. The answer is heterogeneous architecture for specialized workloads.”