Here is a story you don’t hear very often: A supercomputing center was just given a blank check, up to the peak power consumption of its facility, to build a world-class AI/HPC supercomputer – not a sidecar partition with a few GPUs to play around with while its researchers wished they had a lot more capacity.
The downside is that somebody is going to lose some parking spots to make it all happen.
Back in May, we talked to Simon McIntosh-Smith, principal investigator for the Isambard project and a professor of HPC at the University of Bristol, about its next-generation Isambard 3 system, which is a largely CPU-only cluster based on superchip nodes that pair two of Nvidia’s 72-core “Grace” Arm CPUs, all lashed together with Hewlett Packard Enterprise’s Slingshot 11 interconnect. At the time, McIntosh-Smith mentioned in passing that around £100 million of the £900 million ($1.12 billion) in funding from the British government to build an exascale supercomputer in the United Kingdom by 2026 was earmarked for short-term, relatively quick deployment of AI infrastructure. So the GW4 collective in the United Kingdom – that’s the universities of Bath, Bristol, Cardiff, and Exeter – started thinking about how it might get some of that £100 million and build a beefier AI partition for Isambard 3.
Isambard 3 already has two partitions. One is based entirely on the Grace-Grace superchip and has 384 nodes across six racks of HPE Cray XD2000 machinery (that’s 64 Grace-Grace superchips per rack), bringing 55,296 Arm cores to bear on HPC workloads. The Grace chip uses the Neoverse “Demeter” V2 cores from Arm Ltd, which are based on the Armv9 architecture that we detailed back in August 2022. The other – and smaller – partition in the Isambard 3 system has 32 nodes with Nvidia’s “Hopper” H100 GPU accelerators; specifically, these nodes use Grace-Hopper superchips, which pair a CPU and a GPU in a one-to-one ratio. All of the nodes have 256 GB of LPDDR5 memory on them, and the GPUs have their own 80 GB of HBM3 memory, of course. By using the Grace CPUs, the GW4 collective ticks a few important boxes for a supercomputer in the UK, mainly that the machine is based on the homegrown Arm architecture and that the nodes are energy efficient and compact.
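As a quick sanity check on those figures, a few lines of Python reproduce the node and core counts from the per-rack and per-chip numbers quoted above; nothing here goes beyond what GW4 has said.

```python
# Back-of-the-envelope check of the Isambard 3 Grace-Grace partition,
# using only the figures quoted above.
racks = 6
superchips_per_rack = 64        # Grace-Grace superchips per Cray XD2000 rack
cpus_per_superchip = 2          # two Grace CPUs per superchip
cores_per_cpu = 72              # Neoverse "Demeter" V2 cores per Grace CPU

nodes = racks * superchips_per_rack                    # 384 superchip nodes
cores = nodes * cpus_per_superchip * cores_per_cpu     # 55,296 Arm cores
print(nodes, cores)
```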
In the early summer, the UK government put out a call to various HPC centers in the country asking how they might spend £100 million to build up some AI capability, and in July GW4 submitted its proposal.
“It was a bit later that month,” McIntosh-Smith tells The Next Platform, “that we got a call saying they really liked our proposal. And then they asked: What are your limits? How big could it go? And we were like, ‘Okay, that’s not the sort of question we’re used to.’ They asked if we were limited by space or are we limited by time, or are we limited by power? So we looked into it, and the first limit we would hit was power. We have five megawatts left over at the site where Isambard 3 is going, and that is just over 5,000 Grace-Hoppers worth of power. And the government basically said, ‘That’s great. You basically fill it up as much as you can and tell us how much that would cost.’”
As it turns out, that costs £225 million ($281 million), and it also means that the Isambard-AI machine is not some sidecar strapped onto the back end of Isambard 3, but a supercomputer in its own right – one that will be the most powerful machine in the United Kingdom when it is installed and running next summer.
The Butterfly Effect Of Weather Modeling In The Cloud
In a funny way, the decision by the UK Meteorological Office, which does weather forecasting for the country, to move its supercomputers from on-premises datacenters to the Microsoft Azure cloud paved the way for the GW4 collective to have its own datacenters and systems, located at the University of Bristol, which in turn gave GW4 the ability to bid on a massive AI supercomputer and a datacenter to house it in the first place.
The Isambard 1 and Isambard 2 Arm-based supercomputers run by GW4 were hosted at the Met Office in Exeter, just like machines for the University of Tennessee are hosted at Oak Ridge National Laboratory, for instance. So GW4 looked around its own facilities when it had to find a home for Isambard 3, and eventually decided to put the machine in a modular datacenter in the parking lot of the National Composites Center at the University of Bristol. The NCC facility at the Bristol & Bath Science Park, shown in the feature image above, does research and development on various kinds of composite materials used in manufacturing and has massive autoclaves for cooking up weird stuff, and could probably use a little access to some supercomputing. . . .
The Isambard 3 system did not have a big budget – £10 million ($12.3 million), including the cost of the Nvidia servers with Grace-Grace superchips and a modular datacenter from HPE with rear door liquid cooling on the racks inside. GW4 is breaking ground on the construction of the modular datacenter in the parking lot of the NCC facility right now, and that should be mostly done by Christmas, says McIntosh-Smith. The pods for Isambard 3 should arrive in January, the systems should be installed by February, and the whole shebang should be live by March.
With Isambard-AI, the site doesn’t even exist today, the proposal was written in July, and it will be up and running by next summer. “It is all going to happen within a year,” says McIntosh-Smith with a laugh. “If you crack on with it, it’s crazy. People can’t believe how quickly we can do this now.”
The Isambard-AI machine is a bit different from Isambard 3, and not just because it costs more than 20X as much dough. First, it is based on the liquid-cooled Cray EX4000 racks, which are made for denser compute than the Cray XD2000 racks used with Isambard 3 and which are used for all of the exascale-class machines that HPE is building.
For storage, the Isambard-AI machine will have 20 PB of all-flash Cray Sonexion storage running the Lustre parallel file system, as well as just under 4 PB of all-flash NFS file storage from Vast Data, which will handle all of the small I/O work and boost the IOPS into and out of the machine, and which will have an effective capacity of between 7 PB and 8 PB with data compression and deduplication software running.
By the way, there was competitive bidding for the Isambard-AI system, and GW4 looked at all kinds of machinery, including Intel and AMD GPUs, Intel Gaudi accelerators, and Graphcore accelerators. But given that AI models are changing so quickly and that this machine needs to support HPC workloads from time to time as well, it really came down to a GPU choice, and right now Nvidia arguably has the best platform with its Grace-Hopper design. (The MI300A is no slouch, of course, but it doesn’t have Arm CPUs on its chiplets for serial compute.)
All told, Isambard-AI will have 5,448 Grace-Hopper superchips across twelve cabinets, with 365 petaflops of double precision FP64 floating point math on the tensor cores in matrix mode and 185.2 petaflops of FP64 on the vector cores. That works out to 10.8 exaflops of FP8 quarter precision floating point with dense matrices and 21.6 exaflops at FP8 with sparse matrix crunching turned on to double up the throughput.
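Those peaks follow straight from Nvidia’s published per-GPU numbers for the H100 – roughly 34 teraflops FP64 on the vector units, 67 teraflops FP64 on the tensor cores, and 1,979 teraflops or 3,958 teraflops at FP8 for dense and sparse matrices, respectively. Here is the arithmetic as a minimal Python sketch; the per-chip figures are our own assumption taken from the spec sheets, not numbers GW4 supplied for this particular system.

```python
# Rough reconstruction of the Isambard-AI peak performance figures. The per-GPU
# peaks below are assumptions pulled from Nvidia's H100 spec sheets, not numbers
# confirmed for this particular system.
gpus = 5_448

fp64_vector_tf = 34        # teraflops per GPU on the vector units
fp64_tensor_tf = 67        # teraflops per GPU on the tensor cores
fp8_dense_tf = 1_979       # teraflops per GPU at FP8, dense matrices
fp8_sparse_tf = 3_958      # teraflops per GPU at FP8, with sparsity

print(gpus * fp64_vector_tf / 1e3, "petaflops FP64 vector")    # ~185.2 PF
print(gpus * fp64_tensor_tf / 1e3, "petaflops FP64 tensor")    # ~365.0 PF
print(gpus * fp8_dense_tf / 1e6, "exaflops FP8 dense")         # ~10.8 EF
print(gpus * fp8_sparse_tf / 1e6, "exaflops FP8 sparse")       # ~21.6 EF
```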
But that may not be the end of it with Isambard-AI.
The current power distribution ring at the Bristol & Bath Science Park goes up to 7 megawatts, and the NCC facility uses 2 megawatts for its giant Easy Bake ovens. That left the 5 megawatts that could be consumed by Isambard 3 (which burns hardly anything) and Isambard-AI. Next year, the power ring is getting boosted to 10.5 megawatts, so GW4 has another 3.5 megawatts to play with, and the Bristol & Bath Science Park can be extended to 30 megawatts over the next couple of years, according to McIntosh-Smith.
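Here is that power arithmetic in one place, as a small Python sketch using only the figures above; it also shows why 5 megawatts works out to “just over 5,000 Grace-Hoppers worth of power” at a little over 900 watts per superchip.

```python
# Power napkin math for the Bristol & Bath Science Park, using the figures above.
ring_today_mw = 7.0                 # current power distribution ring
ncc_ovens_mw = 2.0                  # the NCC's autoclaves ("Easy Bake ovens")
isambard_mw = ring_today_mw - ncc_ovens_mw      # 5 MW for Isambard 3 plus Isambard-AI

# 5 MW spread across 5,448 Grace-Hopper superchips is roughly 918 watts apiece,
# which squares with the "just over 5,000 Grace-Hoppers" remark.
watts_per_superchip = isambard_mw * 1e6 / 5_448

ring_next_year_mw = 10.5            # planned upgrade, so another 3.5 MW to play with
ring_future_mw = 30.0               # potential build-out over the next couple of years

print(isambard_mw, round(watts_per_superchip),
      ring_next_year_mw - ring_today_mw, ring_future_mw)
```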
You can certainly do a full-blown, real, FP64 machine at 1 exaflops sustained on the HPL benchmark in 28 megawatts. And several years from now, you should be able to do 1.5 exaflops or maybe even 2 exaflops in that power draw. So the question now is will Isambard 4 be the United Kingdom’s first true exascale machine? It is beginning to look like it.
Maybe with the other half of the UK budget Graphcore can build its proposed Good AI supercomputer, with which it hopes to lash together 8,192 of its next-generation IPUs to create a 10 exaflops system and give the United Kingdom some true computational independence. That Good performance is an FP16 precision rating, so it is about as powerful as the Isambard-AI machine at FP16 precision. Graphcore says it can build such a machine for around $120 million. But this machine will not do FP32 or FP64 math, so it won’t be much good for anything but AI workloads.
Still, there is budget for it as well as a big, fat Isambard 4 or Isambard AI-2 machine – whatever you want to call it.
Let’s do the math. Starting with the $1,123 million (£900 million) UK budget, take out the costs of the Isambard-AI and Good machines and you have $722 million left over, as well as 23 megawatts to spare by 2026 at the Bristol & Bath Science Park. Take out 1 megawatt for the Good system.
Now assume GPU performance can go up by 1.8X in 2024 and increase again by 1.6X in 2026 – big assumptions, but this is napkin math – and assume prices for GPUs stay about the same and thermals don’t go too wacky, perhaps hitting 1,200 watts per device two generations from now in 2026. Our back-of-the-napkin math then shows that a GPU-based machine from Nvidia fitting in a 22 megawatt envelope should weigh in at around 1.4 exaflops peak at FP64 precision on the vector units, twice that on the FP64 matrix cores, and around 160 exaflops at FP8 with sparsity support active. At those prices, such a hypothetical Isambard 4 machine would cost $721 million.
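To make the assumptions explicit, here is that napkin math as a tiny Python sketch. The 350 watts of per-node overhead on top of the 1,200 watt GPU is our own guess to make the power and performance figures square up – nobody at GW4 or Nvidia has said any such thing – and the baseline per-GPU peaks are today’s H100 spec sheet numbers.

```python
# Napkin math for a hypothetical Isambard 4 class machine in a 22 MW envelope.
# The node overhead and generational uplift factors are assumptions, not facts.
budget_mw = 22.0                       # power left for the hypothetical machine
gpu_watts = 1_200                      # assumed per-GPU draw two generations out
node_overhead_watts = 350              # assumption: Grace-class CPU, NIC, and sundry
perf_uplift = 1.8 * 1.6                # assumed gains in 2024 and again in 2026

nodes = budget_mw * 1e6 / (gpu_watts + node_overhead_watts)   # ~14,200 nodes

h100_fp64_vector_tf = 34               # today's per-GPU FP64 vector peak
h100_fp8_sparse_tf = 3_958             # today's per-GPU FP8 peak with sparsity

fp64_vector_ef = nodes * h100_fp64_vector_tf * perf_uplift / 1e6   # ~1.4 exaflops
fp8_sparse_ef = nodes * h100_fp8_sparse_tf * perf_uplift / 1e6     # ~160 exaflops

cost_per_node = 281e6 / 5_448          # Isambard-AI's $281 million over 5,448 superchips
cost_m = nodes * cost_per_node / 1e6   # ~$730 million, in the same ballpark as above

print(round(nodes), round(fp64_vector_ef, 2), round(fp8_sparse_ef), round(cost_m))
```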
There’s a million bucks leftover to spend on whiskey for the programmers and cigars for the politicians. Which seems only fair. Legendary civil engineer and shipbuilder Isambard Kingdom Brunel would certainly approve.
Update: After this story ran, we had this exchange on Twitter with McIntosh-Smith:
Simon McIntosh-Smith: To be clear, the rest of the £900 million that’s been announced is to build an exascale machine in Edinburgh in the not too distant future. At Bristol we’ve already got plenty on our hands delivering Isambard-AI and Isambard 3, both in 2024, thank you very much…
Timothy Prickett Morgan: Forgive my excitement on your behalf then, Simon.
The same math applies in Scotland….
As long as they don’t call it ACHiLLeS, we will be fine with that. And what precisely is wrong with two exascale machines in the United Kingdom? <Smile>
So, they take the parking lot out of the Park, replace it with modular Bristol board, inject that with inverse-Roman-geothermal liquid-Cooled Bath Science, sprinkle the result with plenty of GrassHoppers, and voilà, an extra-crispy Isambard-4AI-2 Easy-Bake Oven Exacruncher by Summer (if I understood well!)! Brilliant! 8^P
Update: 900 million pounds of delicious kilted-bagpipe Clocksin-and-Mellish highland-scotch ExaProlog … Genius! d^8
“check” Hmm
After an ultra-dry summer, there have now been major storms (Ciaran, Gordian, Domingos) and non-stop rain (for 30-40 days it seems), leading to Noah-style flooding deluges around the Channel (Northern France, Southern England), and even up to the Bristol/Bath neighborhood it seems (e.g. the River Tone level at the Currymoor pumping station is 7.53m and rising slightly).
Powerful HPC systems should definitely be applied to physically-based prediction of the hydrologic response of European watersheds, in a spatially-distributed-parameter framework, through Geographic-Information-System (GIS)-supported modeling systems (calibrated against historical measurements), possibly HPC-SWAT, or HPC-MIKE-SHE, to help prepare those populations likely to be affected, and also more importantly to plan for future land-use developments, and update existing ones (re-design, retrofit), such that this sort of way pre-medieval anti-deluvian kind of entirely preventable and shameful flooding no longer occurs in advanced European states (folks have now been flooded for more than a week in Northern France)!
The first priority of tax-payer-financed HPC must definitely be to enhance the public infrastructure so that we are no longer so fragile, as a people, in the face of changing natural forces!
“Next year, the power ring is getting boosted to 10.5 megawatts, so GW4 has another 3.5 megawatts to play with”
Not necessarily. Maybe the Easy Bake ovens will get upgraded instead, so they can toast three times as many crumpets and waffles.
As ever, Betteridge’s law of headlines seems to apply!
https://en.wikipedia.org/wiki/Betteridge%27s_law_of_headlines
I know, right!?!