For the last few years, Graphcore has primarily been focused on slinging its IPU chips for training and inference systems of varying sizes, but that is changing now as the six-year-old British chip designer is joining the conversation about the convergence of AI and high-performance computing.
There are now 168 supercomputers in the Top500 and quite a few more outside of that list that use accelerators to power these increasingly converging workloads. Most of these systems are using Nvidia’s GPUs, but the appearance of seven new systems with AMD’s fresh Instinct MI250X GPUs (including Oak Ridge National Laboratory’s Frontier, the United States’ first exascale system) shows there is an appetite to consider alternative architectures when they can provide an advantage.
Graphcore hopes it can soon get a slice of this action with its massively parallel processors.
Phil Brown, a Cray veteran who returned to Graphcore in May as vice president of scaled systems after a four-month stint at chip startup NextSilicon, tells The Next Platform that the IPU maker has recently “seen significant, sustained interest” from organizations that are considering deploying Graphcore’s specialized silicon for these converged AI and HPC needs, and this includes large deployments.
“I think we’re now at the point where there is going to be significant interest in doing large-scale deployments with the systems. The technology space and machine learning capability has evolved sufficiently that it can deliver significant value to the scientific organizations, and so I’m expecting those to follow quite rapidly in the future,” he says.
Graphcore sees three key opportunities around the convergence of HPC and AI: using the IPU’s “class-leading” performance for 32-bit floating point math to tackle HPC applications, training large foundation models like DeepMind’s 280-billion-parameter language model, and “using AI to complement and accelerate traditional HPC workloads” to create a feedback loop of sorts.
It’s the last of these areas that Brown says is likely the largest opportunity for Graphcore in HPC.
“This may be having surrogate models, elements of a traditional HPC simulation, replaced by a machine learning kernel parameterization in a weather forecast, for example,” he says. Surrogate models are computationally expensive, he adds, so replacing them with machine learning models that are “much cheaper but equally accurate” can help reduce the overall cost of running simulations.
These opportunities are based on exploratory work Graphcore has conducted with partners that has yielded promising results. For instance, the company says its IPUs were used to train a gravity wave drag model for weather forecasting five times faster than Nvidia’s V100. In another example, Hewlett Packard Enterprise trained a deep learning model for protein folding using Graphcore’s IPU-M2000 system and found that the second-generation IPU was around three times faster than Nvidia’s A100.
To help move the conversation forward, several government labs are in different stages of trying out Graphcore’s IPUs to see if the processors hold promise for large systems in the future.
Most recently, this includes the US Department of Energy’s Sandia National Laboratories and Argonne National Laboratory. Both are adding Graphcore’s Bow IPU Pod systems to their AI hardware testbeds, and Argonne is doing so after reporting “impressive results” with Graphcore’s first-generation IPU systems. These Bow Pods will use the chip designer’s recently announced “Bow” IPU, which makes use of Taiwan Semiconductor Manufacturing Co’s wafer-on-wafer 3D stacking technology to provide more performance while using less power compared to its second-generation IPU.
Michael Papka, director of the Argonne Leadership Computing Facility, says the addition of Graphcore’s Bow IPU Pod supports the testbed’s goal of understanding “the role AI accelerators can play in advancing data-driven discoveries, and how these systems can be combined with supercomputers to scale to extremely large and complex science problems.”
The University of Edinburgh’s EPCC supercomputing center is also installing a Bow IPU Pod system, which it will use for a “broad range of use cases” as part of the multi-industry-supporting Data Driven Innovation Programme that is funded by the governments of Scotland and the United Kingdom. EPCC has expressed interest in Graphcore’s in-development “Good” computer, which the company has promised will deliver more than 10 exaflops of AI floating point compute with next-generation IPUs.
If we were to travel 226 miles south of EPCC, we’d find support for Graphcore from England’s Hartree Centre, which plans to access IPUs through cloud service provider G-Core Cloud to conduct research on fusion energy as part of a partnership with the UK Atomic Energy Authority.
While Graphcore is building its own exascale supercomputer for AI with the “Good” system, Brown says he believes the company’s IPUs will be well-suited for other exascale supercomputers in the future, ranging from those that are “very AI-focused” to those running traditional simulation software that could benefit from performing such calculations at a lower precision on IPUs.
This means that, in Brown’s mind, an exascale system could consist mostly of Graphcore IPUs or the processors could be a component of a larger heterogeneous system, a view he says is based on feedback he’s heard from people in the HPC community.
“The message that we’ve been getting from them is that they’re very interested in exploring exascale system architectures that include components of different types that give them a good balance of overall capability for their systems, because they recognize that the workloads are going to become more heterogeneous in terms of the space but also the performance and the value proposition you get from these heterogeneous processors is well worth the investment,” he says.