Among users in high performance computing, as well as a growing range of deep learning and data-intensive application segments, the shift toward mixed (heterogeneous) architectures is becoming more common.
This has been the case with supercomputers, which over the last five years have shifted from a CPU-based model to a growing set of large systems outfitted with GPUs for added number-crunching capability. But as both the hardware and software ecosystems around other accelerators gather steam, particularly on the FPGA front, there is momentum around building systems that incorporate a range of accelerators in the same machine, each handling different parts of the application.
As one might imagine, none of this is simple from a programming point of view; the FPGA piece alone is riddled with complexity landmines. This opens the door for approaches that can automatically discover and assign the right processor, be it GPU, FPGA, or the good old fashioned CPU, preferably at runtime. That might sound like a mythical software layer, one that can “automagically” mesh these multi-accelerator approaches together and auto-select the right tool for the right job. But one HPC startup says it has stitched together such a layer and can provide it as a software-only addition to the heterogeneous system stack, as an appliance, or via a cloud pulled together on Rackspace (and other providers) where FPGAs are offered and can be used alongside a CPU application, in the same way from the application’s point of view.
And for a startup rolling out of the supercomputing world, the fact that it has secured funding to do this is in itself noteworthy. It is (unfortunately) quite rare that true high performance computing startups emerge with significant funding, but Bitfusion, the company with this multi-accelerator approach, sees a path ahead that will extend beyond the hallowed halls of supercomputing and find a fit in the areas where GPU and FPGA-based systems will thrive.
The key, according to the company’s CEO, ex-Intel product developer Subbu Rama, is being able to do all of this meshing seamlessly, which means that code changes and other complexity barriers need to be removed. After all, he agrees, FPGAs are notoriously difficult to program, a fact that limits their market even with all of the progress that has been made with OpenCL by companies like Altera and Xilinx. The goal of Bitfusion is to provide this software layer and ensure that no code changes are required, even with such smart doling out of accelerator and CPU resources happening at runtime.
“When you have a system that has different architectures, including these three processors, our software layer automatically discovers what device is available, what part of the application is right for what device, and then it automatically offloads that part to the right device—again, without code changes,” Rama explains.
Specifically, the code work they’ve done sits on top of OpenCL to talk to the various processors, and works with Nvidia GPUs and both Altera and Xilinx FPGAs. “We believe OpenCL will be the basis for multiple platforms, but it is not enough. Generally, if you have an application that you want to move from CPU to run on an FPGA now, there is a lot involved. We are building the software libraries in OpenCL, which are performance portable and highly parameterized, meaning it’s possible to run those same libraries on any of those three processor types. We discover what hardware is available and dynamically swap our libraries, and that’s what makes us different than other approaches that have tried to do similar things in the past.”
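The discover-then-swap pattern Rama describes can be sketched in a few lines. The following is an illustrative Python sketch, not Bitfusion's actual API: the device names, priority order, registry, and `saxpy` building block are all assumptions, and `discover_devices` is a stand-in for real OpenCL platform/device enumeration (`clGetPlatformIDs`/`clGetDeviceIDs`).

```python
# Hypothetical sketch of a runtime dispatch layer: discover available
# devices, then route a library call to the best implementation without
# any change to the calling code. All names here are illustrative.

from typing import Callable, Dict, List

# Preference order when several devices are present (assumption).
DEVICE_PRIORITY = ["fpga", "gpu", "cpu"]

def discover_devices() -> List[str]:
    """Stand-in for OpenCL device enumeration.

    A real layer would call clGetPlatformIDs/clGetDeviceIDs; here we
    pretend discovery found only a CPU and a GPU.
    """
    return ["cpu", "gpu"]

# Per-device implementations of one "library building block".
def saxpy_cpu(a, x, y):
    return [a * xi + yi for xi, yi in zip(x, y)]

def saxpy_gpu(a, x, y):
    # Placeholder: a real version would launch an OpenCL kernel.
    return [a * xi + yi for xi, yi in zip(x, y)]

REGISTRY: Dict[str, Dict[str, Callable]] = {
    "saxpy": {"cpu": saxpy_cpu, "gpu": saxpy_gpu},
}

def dispatch(op: str, *args):
    """Pick the highest-priority available device that implements op."""
    available = set(discover_devices())
    for device in DEVICE_PRIORITY:
        if device in available and device in REGISTRY[op]:
            return REGISTRY[op][device](*args)
    raise RuntimeError(f"no implementation of {op} for {available}")

# The caller never names a device; the layer chooses at runtime.
result = dispatch("saxpy", 2.0, [1.0, 2.0], [10.0, 20.0])
```

The point of the sketch is the calling convention: application code invokes `dispatch("saxpy", ...)` identically whether an FPGA is present or not, which is what "no code changes" amounts to in practice.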
Although this may sound like it would have to add some overhead, Rama says that the majority of applications rely on a finite set of fundamental building blocks and libraries. What Bitfusion has done is take those swaths of open source libraries and move them into a common OpenCL framework where they can be performance portable across CPU, GPU, and FPGA in a way that is transparent to the user. Internal benchmarks in a few key areas Rama and his team think will be viable markets for the young company include bioinformatics (where they reported a 10x speedup with Smith-Waterman), scientific computing (where they claim an 11x improvement for NNLS), and data analytics (based on R, NumPy, and Octave, for a 31x boost).
There are three ways to make use of the company’s work to tie together these multiple processing types. First, as a standalone software layer, called Bitfusion Boost (BT), which can be used either as an agent that sits on top of the OS or as an accelerated Linux image users install on their machines. In either case, pricing is available on a per-node basis or via a license fee. While exact licensing terms were not disclosed, Rama says Boost is currently priced somewhere between $1,000 and $2,000 per node, a range that is variable as the company gets an early handle on user adoption and on how the first sets of use cases take shape. One can imagine the market for this is going to be rather limited, since only a certain set of users will have machines outfitted with both FPGA and GPU accelerators.
In addition to this software-only model, the company is making an appliance featuring a low-end Xeon, which Rama says is used mostly for orchestration. Configurations can include four to six FPGAs from either Altera or Xilinx, and on the GPU front the system can support either an Nvidia K40 or K80, although what is most interesting to Bitfusion is the Titan, for its high performance and far lower price point. These come with Bitfusion’s software pre-installed in 1U, 2U, and 4U form factors. Further, as noted previously, the company is also working with a few key cloud providers on what it calls the BF Supercloud, which lets users tap into an FPGA-enabled cloud provider’s boxes for automated acceleration.
Although this might sound compelling for some early stage users who have already seen the performance of the trio of accelerators in action, Bitfusion does not want to be in the hardware business. “We may end up going away from the appliance model at some point,” Rama explains. “We are experimenting with the model and cost and eventually we won’t want to be selling hardware, we would rather channel through OEMs and sell our software, which is the key piece.”
Aside from the range of applications and use cases this opens for users with existing FPGA and GPU combination systems (or those who might consider an appliance already equipped with such a setup), this could be a leap forward in expanding the FPGA in particular into new arenas. Rama says the use cases for FPGAs are broadening, noting that by making it seamless to move from the CPU to the FPGA, Bitfusion is allowing companies to cut the potential costs of FPGA experimentation. “Users are not going to be moving everything to the GPU, and certainly not to the FPGA. It might be 10% of an application that they are moving to a new device. They want to see how it works, how it performs, and that can assist with making a decision about whether or not to use a device.”