Graphcore IPU Put Through the Supercomputing Paces

With performance comparable to that of the Nvidia V100 GPU, a common accelerator in HPC, along with better energy consumption numbers and memory bandwidth potential, Graphcore can turn heads in supercomputing. That is, if the software stack can be built to meet some tough portability and programmability requirements.

While the Graphcore IPU will not be a fit for all HPC workloads by any stretch, work out of the University of Bristol on stencil computations for structured grid operations proves the IPU's mettle, even if such testing takes some extra software footwork. These problems are high-value in HPC because they are the core solvers for differential equations used in areas including computational fluid dynamics, a compute-heavy workload frequently run at scale on supercomputing resources.
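To make the pattern concrete: a stencil computation updates every point of a grid from a fixed neighborhood of surrounding points, sweep after sweep. The snippet below is a minimal illustrative sketch in plain C++ (our own, not code from the Bristol study) of one Jacobi sweep for the 2D Laplace equation, the kind of structured-grid kernel these solvers apply repeatedly:

```cpp
#include <cstddef>
#include <vector>

// One Jacobi sweep derived from the five-point Laplacian stencil over an
// nx-by-ny grid (row-major layout, boundary cells held fixed). Each interior
// point becomes the average of its four neighbors. Illustrative only; the
// Bristol work expresses stencils like this in Poplar for the IPU.
void jacobi_sweep(const std::vector<float> &in, std::vector<float> &out,
                  std::size_t nx, std::size_t ny) {
  for (std::size_t i = 1; i + 1 < ny; ++i) {
    for (std::size_t j = 1; j + 1 < nx; ++j) {
      const std::size_t c = i * nx + j;
      out[c] = 0.25f * (in[c - 1] + in[c + 1] + in[c - nx] + in[c + nx]);
    }
  }
}
```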

The University of Bristol has had Graphcore hardware for several months and has been exploring its relevance across a broader array of scientific computing domains, including particle physics (for CERN in addition to Bristol). Although the latest MK2 IPU is out now (the main differences being a higher core count and 3X the SRAM), the HPC evaluation was done on the MK1 IPU (1,216 cores with 32-bit and 16-bit support).

One of the reasons we are not surprised to see the university as home to some of the early work in proving the IPU concept is that Graphcore is Bristol-based. Unlike Cerebras and SambaNova in the U.S., however, Graphcore devices haven't been publicly touted as being connected to any major supercomputers, at least not yet.

Part of what was attractive to the researchers, including Simon McIntosh-Smith, a well-known name in supercomputing on both the hardware and software fronts, was that low-level programming is possible via Poplar, Graphcore's C++ framework. Unlike some other ML accelerators, there is no need to retrofit HPC code into an overarching, higher-level ML framework. As McIntosh-Smith and team describe, “Poplar combines C++ with a tensor-based computational dataflow graph paradigm familiar from ML frameworks such as TensorFlow. The Poplar graph compiler lowers the graph representation of a program into optimized computation and communication primitives, while allowing custom code to access all of the IPU's functionality.”
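To give a flavor of that programming model, here is a minimal sketch along the lines of Graphcore's public Poplar tutorials (our own illustration, not the Bristol team's code). It builds a small graph against the simulated IPUModel target, so no hardware is assumed, and runs it through an Engine:

```cpp
#include <poplar/Engine.hpp>
#include <poplar/Graph.hpp>
#include <poplar/IPUModel.hpp>
#include <poplar/Program.hpp>
#include <popops/ElementWise.hpp>
#include <popops/codelets.hpp>

int main() {
  // Simulated IPU target so the sketch runs without hardware.
  poplar::IPUModel ipuModel;
  poplar::Device device = ipuModel.createDevice();
  poplar::Graph graph(device.getTarget());
  popops::addCodelets(graph); // register the popops library's device code

  // Tensors are graph variables, mapped explicitly onto tiles.
  poplar::Tensor a = graph.addVariable(poplar::FLOAT, {4}, "a");
  poplar::Tensor b = graph.addVariable(poplar::FLOAT, {4}, "b");
  graph.setTileMapping(a, 0);
  graph.setTileMapping(b, 0);

  // Programs are built as sequences of operations over the graph...
  poplar::program::Sequence prog;
  poplar::Tensor c = popops::add(graph, a, b, prog, "a_plus_b");
  (void)c;

  // ...then compiled and executed through an Engine.
  poplar::Engine engine(graph, prog);
  engine.load(device);
  engine.run(0);
  return 0;
}
```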

“We achieve very good performance, far exceeding that of our comparison implementation on 48 CPU cores, and comparable with the results we see on the Nvidia V100 GPU,” the team summarizes. The IPU's 150W power consumption, they add, is a rather dramatic improvement over the 250W TDP of the V100. In addition to the power consumption and comparable performance next to the latest GPU, McIntosh-Smith and team found scaling “relatively transparent” via the Poplar framework: the programmer simply selects a “multi-IPU” device, which presents a virtual device with the required number of tiles, and the compiler handles all the IPU-to-IPU communication.
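In practice, that selection is little more than a device query. A hedged sketch, following the DeviceManager pattern in Graphcore's documentation (the two-IPU count here is arbitrary):

```cpp
#include <poplar/DeviceManager.hpp>
#include <poplar/Graph.hpp>

int main() {
  // Ask for a virtual "multi-IPU" device spanning two physical IPUs;
  // Poplar's compiler then handles all IPU-to-IPU exchange.
  auto manager = poplar::DeviceManager::createDeviceManager();
  auto devices = manager.getDevices(poplar::TargetType::IPU, 2);

  for (auto &device : devices) {
    if (device.attach()) { // take the first free two-IPU device
      poplar::Graph graph(device.getTarget());
      // ... build and run the program exactly as for a single IPU ...
      device.detach();
      break;
    }
  }
  return 0;
}
```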

Another promising feature of the device is its memory bandwidth. Their findings show “an order of magnitude greater [memory bandwidth performance] than the HBM2 memory on the Nvidia A100,” the result of the on-chip SRAM that sits next to every core.

The stumbling block, as with all accelerators that haven't had the benefit of a huge community (HPC open source and CUDA, for example), always seems to be software. As McIntosh-Smith and team conclude, “expressing our chosen HPC problems in Poplar was not always straightforward compared with familiar HPC technologies such as OpenMP, MPI, and OpenCL. We found Poplar code to be more verbose than our OpenCL implementations (around 1.6x lines of code for the Gaussian Blur stencils).”

Further, they add that because graphs in Poplar are static, some common operations (dynamic grids, adaptive mesh refinement) are awkward to express. They also note that the expensive graph compile times are quite a bit higher than what an OpenCL compiler would incur, even though ahead-of-time compilation is available.
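On that compile-time point, Poplar does let the cost be paid once: a compiled graph can be serialized as an executable and reloaded on later runs. A rough sketch, assuming the compileGraph and Executable interfaces as we understand them from Graphcore's documentation (the file name is our own):

```cpp
#include <fstream>
#include <poplar/Executable.hpp>
#include <poplar/Graph.hpp>
#include <poplar/Program.hpp>

// Given an already-constructed graph and program (see the earlier sketch),
// compile once and serialize the executable so that later runs can skip the
// expensive graph-compilation step by deserializing it instead.
void compile_and_cache(poplar::Graph &graph, poplar::program::Sequence &prog) {
  poplar::Executable exe = poplar::compileGraph(graph, {prog});
  std::ofstream out("stencil.poplar_exe", std::ios::binary);
  exe.serialize(out);
}
```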

One of the most important problems, especially if these devices are to take hold in HPC, is that none of this code is portable. If there is anything the HPC software community takes seriously, it's portability. That is an important missing piece, although not an insurmountable problem if Graphcore works to make it a non-issue.

As the team writes: “Making use of the IPU's specialised hardware, as we did for the 2D convolutions in the Gaussian Blur application, can yield large performance benefits, especially for 16-bit precision computations. Furthermore, since applications such as the ones we have implemented here are often limited by memory bandwidth, we expect many HPC applications to benefit from the large amounts of low-latency, high-bandwidth on-chip memory that chips like the IPU offer.”

The team expects that with more experience, they can continue to optimize for the IPU, and that despite some of these early limitations, many HPC applications stand to benefit.

From our own perspective here at The Next Platform, making the switch to a new accelerator architecture for future HPC production, especially given some of the extra software work required, should come with a much higher performance advantage over easily accessible GPUs. After all, GPUs have all the software for AI/ML and HPC packaged together and well understood under the CUDA umbrella. Granted, this is just one slice of a broad HPC application set.

While other ML accelerator startups that have secured early footholds in academia and research environments have shown similar and sometimes far better results, the space is still emerging, and clearly not all applications will work well on these architectures, whether because of the structure of the problem (precision requirements, for example) or the difficulty of adapting codes.

Still, this is Graphcore's quiet “coming out party” in HPC, and they are one of three companies we are watching in particular to prove their might there. After all, a big university or national lab win for a functional environment (versus just a prototype or test) will signal much to the legions of companies who are interested in dipping a toe into custom ML hardware waters but are still wearing the Nvidia lifejacket.

Far more detail is available in the University of Bristol's full study of the Graphcore IPU.


