Another Step Toward FPGAs in Supercomputing

There has been plenty of talk about where FPGA acceleration might fit into high performance computing, but only a few testbeds and purpose-built clusters are pushing this vision forward for scientific applications.

We do not expect supercomputing centers to turn their backs on GPUs as the accelerator of choice in favor of FPGAs anytime in the foreseeable future. Still, there is some smart investment happening in Europe and, to a lesser extent, in the U.S. that takes advantage of recent hardware additions and high-level tool development that put field programmable devices within closer reach, even for centers whose users want to focus on their science rather than on the underlying hardware.

The most obvious reason progress has been stunted here is that most HPC applications require dense double-precision arithmetic, which FPGAs are not associated with (even if it’s possible). A further reason is that scaling beyond a single node or FPGA is still not a mainstream capability outside of major hyperscale operators such as Microsoft, with its massive FPGA deployment supporting Bing and various AI-focused initiatives.

The other elephant in the room when it comes to questioning where FPGAs fit in supercomputing clusters is programmability, something that is obviously connected to the other two areas (precision and multi-node/multi-FPGA scalability).

All of these issues are set to be addressed at Germany’s Paderborn University, where the first phase of the ten million euro Noctua cluster project is underway. Coming online in 2018, this initial phase will feature 32 Intel Stratix 10 FPGAs for early-stage experiments in porting, programming, scaling, and understanding how some traditional HPC applications (and non-traditional ones, including k-means, image processing, and machine learning) respond to an FPGA boost. That may not sound like many FPGAs to cram into a system, but recall that most FPGA work on these applications is limited to one device or node and rarely focuses on scalability.

More standard HPC applications, particularly in bioinformatics, are set to be another target, according to Dr. Christian Plessl, who tells The Next Platform that more high-level tools to support FPGAs are paving the way for more potential than ever before, especially with an FPGA like the Stratix 10.

Once completed, the system will mark what we expect will be one of the largest FPGA clusters serving scientific computing users, although there are several sites with FPGAs in various stages and sizes of deployment worldwide, and of course the collaborative Catapult project that serves both Microsoft and research users at TACC with over 350 Stratix V FPGA-equipped nodes.

“The selected FPGAs, with 5,760 variable-precision DSP blocks each, are well suited to floating-point heavy scientific computations. They reached general availability just in time to be installed in the Noctua cluster. A first set of applications that benefit from FPGA acceleration is currently ported and was reengineered in close cooperation with computational scientists. This infrastructure will be used to study the potential of FPGAs for energy-efficient scientific computing, allowing the center to maintain its leading role in establishing this technology in HPC.”

Aside from the FPGAs, the system, a Cray CS500, will have 272 dual-socket compute nodes and roughly 11,000 cores, built from Skylake processors of the high-end 20-core variety. The machine will have a 100 Gbps Omni-Path interconnect, something Plessl says will be interesting in the context of multi-FPGA communication as the team rolls out research and benchmarking results.
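As a quick sanity check, the 11,000-core figure appears to be a rounded product of the node count, socket count, and per-socket core count quoted above. The sketch below simply multiplies those three numbers; the function name is ours, and it assumes every node carries two 20-core processors.

```python
# Core-count check using only the figures quoted in the article.
NODES = 272            # dual-socket compute nodes
SOCKETS_PER_NODE = 2   # "dual-socket"
CORES_PER_SOCKET = 20  # 20-core Skylake parts


def total_cores(nodes, sockets_per_node, cores_per_socket):
    """Return the total CPU core count, assuming identical nodes."""
    return nodes * sockets_per_node * cores_per_socket


if __name__ == "__main__":
    cores = total_cores(NODES, SOCKETS_PER_NODE, CORES_PER_SOCKET)
    print(f"{cores:,} cores")  # prints "10,880 cores"
```

Under that assumption the total is 10,880 cores, which rounds to the 11,000 quoted.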

“The single precision floating point is very good and there are thousands of floating point units that can use the full bandwidth internally. We also have high capacity and more logic resources and DSP logic blocks than we had with the testbed FPGAs,” Plessl explains. The testbed consisted of Arria 10 FPGAs, but the added DSP blocks in the Stratix 10 were what shifted his center’s thinking toward Altera/Intel in this case.
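To put the “thousands of floating point units” in rough perspective, a back-of-the-envelope peak estimate multiplies the 5,760 DSP blocks by the operations each completes per cycle and by the clock rate. The sketch below assumes one single-precision fused multiply-add (two floating point operations) per block per cycle. The clock rates are purely hypothetical placeholders, since achieved frequency depends on the design, and the function name is ours.

```python
# Rough peak single-precision estimate for one FPGA.
DSP_BLOCKS = 5760  # variable-precision DSP blocks per device (from the article)


def peak_sp_tflops(dsp_blocks, clock_mhz, flops_per_block_per_cycle=2):
    """Theoretical peak in TFLOPS.

    Assumes each DSP block retires one fused multiply-add (2 FLOPs) per
    cycle; clock_mhz is a hypothetical design clock, not a measured value.
    """
    # blocks * FLOPs/cycle * (MHz * 1e6 cycles/s) / 1e12 = ... / 1e6
    return dsp_blocks * flops_per_block_per_cycle * clock_mhz / 1e6


if __name__ == "__main__":
    for clock_mhz in (300, 500, 800):  # illustrative clock rates only
        print(f"{clock_mhz} MHz -> {peak_sp_tflops(DSP_BLOCKS, clock_mhz):.2f} TFLOPS")
```

This is an upper bound only: real kernels rarely keep every DSP block busy, and memory bandwidth often limits throughput first.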

As with the early days of GPUs in HPC centers, the question has always been what percentage of domain-expert HPC users would care enough about the underlying hardware to work hard at adapting their codes. Early on it was not many, but as the ecosystem of support and tools grew, adoption became easier and the libraries available to different groups became richer. The FPGA ecosystem is not rich in that way for HPC by any means, but Plessl says there is interest from a subset of their users in higher-level tools that deliver high-performance results.

“FPGAs are only a small part of the overall cluster and most users will use the CPUs but the group of those interested in exploring new architectures is growing and includes internal users and some industrial users that want to develop the relevant pieces starting at the kernel level and others are porting complete applications to get started,” Plessl notes.

On the block for FPGA acceleration are specific solvers for electromagnetics and nanostructure materials, as well as computational chemistry for areas like electron structure analysis, where operations on large matrices are common and single precision is possible.

Plessl says he has seen the opportunities for FPGAs evolve over the years, starting with his early experiences working with embedded reconfigurable devices a couple of decades ago. He says the time is finally right from a programmability and usability perspective, and now that devices have the hardware required to be competitive in an HPC environment, the important work of scaling and porting can pave the way for FPGAs to become a force in HPC, eventually at least. This will be a center to watch over the next few years as performance, portability, programmability, and scalability lessons are learned and shared.

Steven J says:

April 5, 2018 at 12:16 am

FPGAs will not become the next big thing in general data processing for one simple reason: undocumented cores. FPGA vendors do not document the LUTs or bitstreams, so you’re always forced to use their tools. This means that tool makers and innovators cannot invent new tools or ways of doing things. It’s VHDL or Verilog and that’s it. They are great for hardware design, but for general data processing not so much. And the tools the vendors provide are usually clunky and arcane.

If the FPGA vendors opened their specifications and documented the bitstreams needed to program the chips, there would be a revolution, but that won’t happen because FPGA vendors love lock-in.

Andrei V says:

April 23, 2018 at 9:13 am

I have to respectfully disagree with Steven. HLS and OpenCL opened doors to great innovations in FPGAs. The scalability of FPGAs was always there too: FPGAs are programmable accelerators that are not bound by fixed silicon limitations. Just think of them as a programmable compute fabric. For example, one can build a cluster of FPGAs compliant with the OpenCL memory and execution model. OpenCL parallel codes are scalable via sub-NDRange execution as long as unified global memory is shared among multiple FPGAs.
On top of that, FPGAs have direct access to external data at 40-100GE or even 1TB+, bypassing CPU/host system memory. This, frankly, removes all limits to what users can do with FPGAs. Our company provides infrastructure and expertise for innovation in FPGA clusters to happen: http://www.sci-concepts-int.com
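For readers unfamiliar with the sub-NDRange idea mentioned in this comment, here is a minimal sketch of how a one-dimensional global work range could be divided into contiguous pieces, one per device. It illustrates the general partitioning idea only and is not the commenter's or any vendor's implementation; the function name `split_ndrange` is hypothetical. In a real OpenCL host program each (offset, size) pair would roughly correspond to the global work offset and global work size passed when enqueueing a kernel, and work-group size alignment, which this sketch ignores, would also need handling.

```python
def split_ndrange(global_size, num_devices):
    """Split a 1-D index range [0, global_size) into contiguous sub-ranges.

    Returns a list of (offset, size) tuples, one per device. Any remainder
    is spread over the first devices so sizes differ by at most one.
    Hypothetical helper for illustration; not part of any OpenCL API.
    """
    if global_size < 0 or num_devices <= 0:
        raise ValueError("global_size must be >= 0 and num_devices > 0")
    base, extra = divmod(global_size, num_devices)
    ranges = []
    offset = 0
    for device in range(num_devices):
        size = base + (1 if device < extra else 0)
        ranges.append((offset, size))
        offset += size
    return ranges


if __name__ == "__main__":
    # Ten work-items over three devices.
    for device, (offset, size) in enumerate(split_ndrange(10, 3)):
        print(f"device {device}: offset={offset}, size={size}")
```

Each device then processes only its own slice of the index space, which is why the comment stresses that all devices must see the same global memory.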

Naif Tarafdar says:

April 26, 2018 at 10:34 pm

I agree scalability of a multi-node cluster is a big problem. Abstractions exist because programming and managing one FPGA can be cumbersome; at data center scale this is a really difficult problem. The key to this is the appropriate abstraction layer.

HLS exists to allow the easy creation of IP cores. However, interfacing with I/O can be painful, and for this there are platform abstractions (Xilinx SDx, Intel FPGA OpenCL SDK). Now how about multi-FPGA? There are works that build a “middleware” layer on top of single-node FPGA abstractions.
Some of them are localized (not at data center scale), such as the Bee3, Bee4, and Maxeler MPC-X projects. Other projects, such as the one I am working on, look at creating multi-FPGA abstractions on top of network-connected FPGAs in the cloud. Refer to https://dl.acm.org/citation.cfm?id=3021742 and https://ieeexplore.ieee.org/abstract/document/8056790/ for more details.

Shameless plug aside, I think a new definition of a hardware stack is in order. This is the real full-stack. From digital circuitry all the way up to an abstraction layer on a heterogeneous cluster!
