We were away on vacation at a lakeside beach in northern Michigan when we caught the news that the UK government was pulling the plug on a plan for an exascale supercomputer to be installed at the EPCC at the University of Edinburgh. And then we caught COVID – because no good deed ever goes unpunished – and have been dealing with that, too. So forgive us for only now getting around to advocating on behalf of the British HPC community.
At a time when the richest companies and countries in the world are making absolutely ginormous investments in AI hardware, it is worth rethinking the potential architecture of the successor to the Archer2 supercomputer, which was installed in 2021 at EPCC and is the flagship HPC machine in the United Kingdom. But it most certainly is not the time to pull the plug on any AI investment that can also benefit the HPC community.
This is particularly true as the United Kingdom stands separate from both Europe and North America, economically speaking. Brexit means not having to pay to fund EuroHPC, but it also means not benefitting from EuroHPC, either. And if you stand alone, you still have to make the investments that are necessary for your own economy, your own educational institutions, and your own scientific progress.
Based on gross domestic product alone, we admit, it is hard to argue that the UK needs an exascale system. And given that the Sunak government made HPC and AI promises last October that the current Starmer government cannot cut checks for without political battles, we know there are some pretty severe budgetary issues and that something has to get cut.
The United Kingdom had a gross domestic product of $2.9 trillion last year at current exchange rates between the US dollar and the UK pound; the United States raked in $27.4 trillion in GDP and China generated $17.7 trillion in GDP last year, by comparison. If you argue for exaflops of AI/HPC supercomputing per GDP, the United States now has three exafloppers that cost in the neighborhood of $1.6 billion to build (but one of them was discounted because of huge delays and architectural changes). We don’t know what China paid for its two exascale machines, but it is probably on the order of $1 billion, and there is no reason to believe China doesn’t actually have three. Or more.
The point is, if there is a proportionality between economic activity and government-sponsored HPC/AI at large scale, the UK probably doesn’t warrant an exascale-class machine. (We know, we are supposed to be arguing for such a machine. Sit tight, we will get to that.)
The US will have around 4.74 peak exaflops of exascale-class iron installed when the “El Capitan” machine at Lawrence Livermore National Laboratory is counted in a few months, and that works out to a ratio of 0.173 peak exaflops per trillion of GDP dollars. (The “Frontier” machine at Oak Ridge National Laboratory and the “Aurora” machine at Argonne National Laboratory are in this mix of US exa-iron.)
China, with its two machines – “Tianhe-3” at NSC/Guangzhou and “OceanLight” at NSC/Wuxi – has a total of 3.55 peak exaflops of exa-iron against that $17.7 trillion in GDP, which is a ratio of 0.2 if you round to one decimal place.
If you take the average of the US and China ratios of Rpeak to GDP, that is 0.187, and if you work it backwards against that $2.92 trillion GDP for the UK last year, you get a 545 petaflops machine.
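Those ratios are easy to sanity check. Here is the arithmetic as a quick Python sketch, using the GDP and peak flops figures quoted above (small differences come down to rounding):

```python
# Rpeak-to-GDP proportionality, using the figures quoted in the text
us_exaflops, us_gdp = 4.74, 27.4          # peak exaflops, GDP in $trillions
china_exaflops, china_gdp = 3.55, 17.7
uk_gdp = 2.92

us_ratio = us_exaflops / us_gdp           # ~0.173 exaflops per $trillion of GDP
china_ratio = china_exaflops / china_gdp  # ~0.201 exaflops per $trillion of GDP
avg_ratio = (us_ratio + china_ratio) / 2  # ~0.187

uk_petaflops = avg_ratio * uk_gdp * 1_000  # convert exaflops to petaflops
print(f"UK 'fair share' machine: {uk_petaflops:.0f} petaflops")  # ~545
```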
Whoa, we’re halfway there. Livin’ on a prayer. . . .
Now, here’s the important thing. The US has four major hyperscalers and cloud builders – Amazon, Microsoft, Google, and Meta Platforms. China has four major hyperscalers and cloud builders – Alibaba, Baidu, Tencent, and ByteDance. These companies together account for around half of datacenter server spending and an even larger proportion of research in AI.
How many big cloud builders and hyperscalers does the United Kingdom have? Zero.
And so, if you believe in indigenous capability, the United Kingdom has to make up that huge gap in research and development and raw capacity available to industry in an indigenous fashion. And we might argue further that, given the strategic nature of AI for economic, political, and military reasons, each country has to have sovereign AI capability at massive scale – the kind that the hyperscalers and cloud builders offer. This seems dead obvious. Yes, we know about the 365 petaflops “Isambard-AI” machine at the University of Bristol (using a mix of Nvidia “Grace” Arm CPUs and “Hopper” H100 GPUs) and the 200 petaflops or so “Dawn” machine at the University of Cambridge (which is based on the same Intel CPUs and GPUs used in the Aurora machine at Argonne). Both of these escaped the budget cuts by Starmer because they had already been funded by Sunak.
EPCC has already paid £31 million to build the facility that would wrap around the “Exascale” kicker to Archer2. You can actually see a tour of the empty facility compliments of the BBC at this link on Xitter. (Say that strange word for what used to be Twitter with a Chinese accent and it’s funnier. . . )
Now, let’s talk money and flops, starting with a history of the big machines installed in the 21st century at EPCC.
At the University of Edinburgh itself, our knowledge of HPC machines that were used by researchers comes from the Top500 lists of days gone by, and includes systems built by Meiko, Thinking Machines, Cray, and IBM from the early 1990s up through the middle 2000s. Then, with Hector – we are not going to use ransom note capitalization as EPCC does for HECToR, DiRAC, and ARCHER, which is either crazy or shouty and we don’t like either – things started to get serious, with a big jump in budget and performance by the UK government. Check it out:
We put in the HPCx machine, built by IBM as a federated NUMA cluster of forty 32-way “Regatta” Power p690 servers, with a homegrown federation switch called “Colony,” that a bunch of HPC centers bought as the last gasps of NUMA clusters in HPC before the massively distributed Linux wave utterly swamped the HPC centers of the world. (If you look at Colony, it was in essence a precursor to a PCI-Express switch, linking the NUMA machines by their I/O buses and running something akin to the CXL protocol. There are few new ideas in IT, only new implementations. . . . )
Anyway, HPCx was a neat machine in a lot of ways, but its chief lesson is how not to build a supercomputer, even if the programming model is easy. You change the iron and change the programming model to make things cheaper so you can scale further. Look at how expensive per teraflops HPCx was. Egads! Ach! It’s enough to make you choke on your haggis. Over $7 million per teraflops.
Look at the step function in performance and affordability that came with the Hector systems, installed in 2007 and representing the first UK national supercomputer at reasonably large scale. There were four phases of Hector, based on Cray XT4, XT6, and XE6 CPU-only nodes and a Cray X2 vector sidecar. The final XE6 version of Hector offered a factor of 69X improvement in peak performance and a factor of 30.6X improvement in cost per unit of performance compared to HPCx. Which is why we don’t have 10,000-node NUMA supercomputers.
With the Archer supercomputer installed in 2014, performance went up by 3.1X over Hector, and the cost of a unit of performance dropped by 8.4X compared to Hector. And with Archer2, the performance increased by 10.1X and the price/performance improved by 6.9X compared to Archer.
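Chaining those generational ratios together shows just how far the cost of a unit of compute has fallen at EPCC. Here is a quick Python sketch, taking roughly $7 million per teraflops for HPCx as the starting point (an approximation of the figure above):

```python
# Compound the cited price/performance improvements, generation by generation,
# starting from HPCx at roughly $7 million per teraflops (an approximation)
cost_per_teraflops = 7_000_000.0
for system, improvement in [("Hector", 30.6), ("Archer", 8.4), ("Archer2", 6.9)]:
    cost_per_teraflops /= improvement
    print(f"{system:8s} ~${cost_per_teraflops:,.0f} per teraflops")
```

That lands Archer2 at roughly $3,950 per teraflops, which is within a few percent of what the Isambard 3 comparison further down in the story implies for Archer2’s unit of compute.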
This is the kind of progression in numbers that we like to see in HPC and now AI. Yes, machines are getting more expensive with each generation, but performance is climbing faster than cost, so a unit of performance still gets cheaper over time, even if the rate of improvement is diminishing.
Which brings us to the “game-changing exascale computer” that was proposed and presumably budgeted for last October as part of a £900 million investment by the British government.
The Isambard-AI machine ate £225 million of that, and then you take off £31 million for the EPCC facility expansion. That leaves £644 million, or about $827 million, to build an exascale supercomputer. We think it could be done for less than that, depending on the architecture chosen. But at that price, and assuming you wanted to hit an exaflops on the High Performance LINPACK benchmark commonly used to rank supercomputers (which takes something like 1.3 exaflops of peak compute, given typical HPL efficiencies), that works out to $636 per teraflops for a 1.3 exaflops machine.
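The budget math, sketched in Python. (The $1.284 per pound exchange rate is our inference from the £644 million to $827 million conversion above, not a figure from the story.)

```python
# What is left of the £900 million pot, and what it implies per teraflops
remaining_gbp = 900e6 - 225e6 - 31e6   # minus Isambard-AI, minus EPCC facility
remaining_usd = remaining_gbp * 1.284  # inferred exchange rate, ~$827 million
peak_teraflops = 1.3e6                 # 1.3 exaflops of peak, in teraflops
print(f"Remaining: £{remaining_gbp / 1e6:.0f}M, about ${remaining_usd / 1e6:.0f}M")
print(f"Implied budget: ${remaining_usd / peak_teraflops:.0f} per teraflops")
```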
This all sounds perfectly doable. And, as we have shown above, the UK needs a half-exaflopper just based on GDP and then a big wonking piece of homegrown lower precision flops to satisfy its need for indigenous and sovereign AI.
The easiest thing to do, from the point of view of the applications that currently run on Archer2, is to build a massive CPU-only cluster that burns up the £644 million. We do not think that this would be an exaflops machine. We will use the Isambard 3 machine as a benchmark. This is a cluster of Grace-Grace superchips from Nvidia, and its 2.7 petaflops cost £10 million (about $12.3 million at exchange rates last year). That included the cost of a modular datacenter and the Cray XD2000 enclosures from Hewlett Packard Enterprise. A 1.3 exaflops CPU-only Grace-Grace cluster would therefore cost $5.92 billion, so you see why that doesn’t work anymore. . . . Anyway, $827 million gets you about 181.5 petaflops of CPU-only compute with a Grace-Grace system, which is 7X more oomph than Archer2 at 8.1X the cost. That works out to $4,557 per teraflops, which is 15 percent more expensive than Archer2’s unit of compute.
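Scaling the Isambard 3 pricing up, as a Python sketch using the figures above:

```python
# Scale Isambard 3 Grace-Grace pricing ($12.3M for 2.7 petaflops) to see
# what a CPU-only exascale machine would cost and what $827M actually buys
cost_per_teraflops = 12.3e6 / 2_700          # ~$4,556 per teraflops
exascale_cost = cost_per_teraflops * 1.3e6   # 1.3 exaflops, in teraflops
budget_petaflops = 827e6 / cost_per_teraflops / 1_000
print(f"CPU-only 1.3 exaflops: ${exascale_cost / 1e9:.2f} billion")  # ~$5.92B
print(f"$827M buys about {budget_petaflops:.1f} petaflops")          # ~181.5
```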
This seems to argue for a machine that has enough CPUs to do Archer2’s CPU-only application work but also allows for very large, lower precision AI models to be run. That sounds like a hybrid CPU-GPU system, or now that SoftBank owns both Arm and Graphcore, maybe a hybrid of Arm CPUs and Graphcore “Bow” IPUs.
Graphcore said it could deliver the “Good” supercomputer with 10 exaflops of FP16 precision compute for $120 million, but its devices do not support FP64 precision. So maybe spend $240 million on 20 exaflops of FP16 AI compute, which is in the same realm as the machines that are being used to train the latest GenAI models – and which, we might add, would cost around $1 billion using the GPUs that are more widely available today. And then spend the rest on Arm CPUs that can do the work that Archer2 is currently doing, but faster or at greater scale. The remaining $587 million will get you just shy of 130 petaflops, or 5X the compute of Archer2. Despite SoftBank being Japanese, this is a British-ish solution. And that might make it politically more salable.
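Here is how that hypothetical Graphcore-plus-Arm split pencils out in Python, reusing the Grace-Grace cost per teraflops from the Isambard 3 figures above as a stand-in price for the Arm CPU partition:

```python
# Split the ~$827M budget: Graphcore "Bow" IPUs for AI, Arm CPUs for HPC
budget_usd = 827e6
graphcore_usd = 2 * 120e6                  # 20 exaflops FP16 at $120M per 10 exaflops
cpu_usd = budget_usd - graphcore_usd       # $587 million left for Arm CPUs
grace_cost_per_teraflops = 12.3e6 / 2_700  # from the Isambard 3 pricing
cpu_petaflops = cpu_usd / grace_cost_per_teraflops / 1_000
print(f"CPU partition: ${cpu_usd / 1e6:.0f}M buys ~{cpu_petaflops:.0f} petaflops FP64")
```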
Then again, if you just want to make the most general purpose machine you can get with no code porting issues from Archer2, get a clone of El Capitan at 5/6ths scale – with somewhere around 32,000 AMD Instinct MI300As, which are hybrid CPU-GPU compute engines – and get 2 exaflops of peak FP64 precision and around 32 exaflops of peak FP16 precision on the GPU sections of the MI300A, pay $500 million, and give the rest of that $827 million back to UK taxpayers. HPE and AMD would no doubt like to do such a deal, and they are the incumbents at EPCC, too.
We do not know the aggregate FP64 performance of the 24 cores per MI300A, but it is probably not far off the performance of the Epyc cores used in Archer2. There would be 768,000 Epyc cores in this proposed Exascale kicker compared to 716,800 cores in Archer2. Call it a wash. Any Archer2 code that fits into 128 GB of main memory per node would run on the CPU cores, and with a hell of a lot more memory bandwidth, since the MI300A uses HBM rather than DDR5 memory.
One last idea: Maybe instead of trying to build a machine all at once, you build it over time? Do a fifth of the machine each year, or a tenth, until you get there? Don’t change any of the technology – just don’t try to do it all at once. And then maybe leave it in the field for a decade and use the heck out of it, despite the electricity bill. Get science done, get AI research done, and don’t keep fighting for a new machine every three years. Get one big machine and beat it to death.
Well, one thing’s definitely for sure in this unfortunate pulling of the bowstring’s plug, and it is that James II, King of Scots, was most correct in declaring as early as 6 March 1457, that “football and golf should be utterly condemned and stopped”, replacing them at once with “ARCHERy displays” of supercomputational oomph, at a minimum of “four times in the year” (twice as often as Top500)! ( https://archery360.com/2020/02/27/scotland-hosts-worlds-oldest-archery-competition/ ). What an accurate setting of foresight! Today, £billions spent to hit balls with foot and club, and only pinched pennies for the arrowsmiths of strategic exaflopping … Edinburgh’s new Scotichronicon needs a Robin Hood of HPC (IMHO)! 8^p
Archer2 is a high node number NUMA machine.
Not by any definition of NUMA that I know. It uses two-socket nodes with 64-core AMD Epycs, with the nodes linked by two-port Slingshot Ethernet adapters over a Slingshot fabric. Slingshot is Ethernet, not NUMA. There is RDMA for sure, but no memory coherency across its 5,860 nodes.
Surely you are talking UMA, rather than NUMA? The Cray EX nodes each have 128 cores and 8 NUMA regions (so 16 cores sharing an L3 cache per NUMA region).
I’m talking Non-Uniform Memory Access across multiple nodes.
That’s the on-CPU NUMA region–totally different.
The UK doesn’t need yet another government IT farce and the taxpayer can’t afford this vanity project for the scientific elite.
The UK’s national supercomputing facilities are far removed from the kind of systems that end up as “government IT farce”; perhaps you’re thinking of things like Post Office Horizon, 2000s-era NHS spine, and various other government/local government business systems.
Successive generations of facilities have largely been successful, productive, and cost-effective in the forty-plus years with which I’m familiar. As for cost to the taxpayer, the return on investment has been demonstrated to be many times the cost of building and operating the systems — see, for example, the London Economics study: https://www.ukri.org/wp-content/uploads/2022/07/EPSRC-050722-ImpactEPSRCInvestmentsHighPerformanceComputingInfrastructure.pdf
The UK can’t afford not to remain competitive in this capability.
It’s not a “government it farce”, it’s educational institutions and the science community. And we cannot afford *not* to do it.
The taxpayer doesn’t pay for anything except government debt, all government spending is newly created money, the only limit on government spending is what the markets think we can bear, capex doesn’t upset markets, free money in tax breaks does.
Your argument is about what size of supercomputer we deserve rather than what we need. UK research consistently punches above its weight, and national supercomputers primarily support that need. If you look at the H-index of the types of research that already benefit from computational infrastructure, the UK is consistently in the top four countries. Our proportion of the Top500 list isn’t.
Some of that comparative efficiency is probably due to research being cross-funded by student fees (remember international student fees?)
Whether having a large supercomputer should lead to increased GDP would be for the Industrial Strategy to determine.
An interesting matching story comparing the G8 countries (or is that G7 now?) might be illuminating.
Point taken.