If you want to test out an idea in HPC simulation and modeling and see how it affects a broad array of scientific applications, there is probably not a better place than the Texas Advanced Computing Center at the University of Texas. This is where the flagship systems of the National Science Foundation go, and therefore it is where a diversity of applications is plugging away, 24 by 7 by 365 and a quarter.
The existing “Frontera” system, which went operational in September 2019, is an all-CPU machine comprised of 8,368 two-socket Xeon nodes with a total of 468,608 cores and a peak performance of 38.75 petaflops. There was a partition on Frontera that had liquid-submerged Nvidia GPU nodes, testing out the idea of GPU acceleration for the NSF codes on a 3.5 petaflops box. Those NSF codes span from celestial physics to quantum mechanics to material science to drug design to climate modeling. In any given year, more than a hundred large scale, computationally intensive applications have been run on Frontera, many of which have been able to access the full oomph of the machine.
In November 2023, TACC made a deliberate move to GPU acceleration, with the “Vista” supercomputer that was meant to bridge between a Frontera that was getting pretty long in the tooth and a “Horizon” system that would radically improve the performance of the flagship NSF supercomputer, which was initially expected sometime in 2026. The Vista system was commissioned in late 2023 and delivered in the summer of 2024, and as we expected, it did indeed point to the hybrid CPU and GPU future that TACC was banking on with Horizon. Vista has 256 all-CPU nodes and 600 hybrid CPU-GPU nodes, and it is no slouch when it comes to performance across those CPUs and GPUs with various numerical precisions:
Interestingly, the Horizon system deal with Nvidia and Dell was struck back in 2021, Dan Stanzione, TACC executive director, tells The Next Platform, so the die was already cast, so to speak: Horizon would use the GPU after the “Hopper” H100 and H200 – which is of course the “Blackwell” datacenter GPU – and not the one after that, the “Rubin” GPU accelerator, even though Rubin will be shipping at the end of this year if all goes according to Nvidia’s plan. (The evidence suggests that both the “Vera” CV100 Arm processor and the Rubin R200 GPU accelerator are right on time for 2H 2026 delivery.)
The way the NSF budget cycles and the Nvidia product cycles line up, it wasn’t possible to wait for the Rubin GPU to get into the Horizon system, which is why the CPU-GPU half of the Horizon machine has 2,016 nodes, each comprised of one 72-core “Grace” CG100 Arm processor coupled to a pair of Blackwell B200 GPUs. We calculate that the Grace CPUs have 6.9 petaflops of FP64 performance and the Blackwell GPUs have 161.3 petaflops of FP64 performance.
Significantly for the research that TACC plans to do using mixed-precision floating point – work that started on Frontera and progressed with Vista – those Blackwell GPUs have slightly more than 20 exaflops at FP16 precision, slightly more than 40 exaflops at FP8 precision, and nearly 81 exaflops at FP4 precision. (More on that in a minute.)
Because so many of the NSF workloads continue to be run on CPUs, having a CPU-only partition is also necessary. So there is a partition based on the Vera CV100 processors, which have 88 cores. We did a little backwards math and figure this partition is comprised of 4,752 Vera-Vera superchip nodes, and we further speculate that these run at 3.64 GHz and deliver a total of 131.8 petaflops at FP64 precision.
The good thing is that Nvidia is honoring the financial deal it pitched to TACC way back in 2021, even though the street prices of Nvidia GPUs have obviously accelerated upwards since then thanks to the GenAI boom.
“It worked out for us,” says Stanzione. “We didn’t know this level of explosion was coming with GenAI, but to Nvidia’s credit, they said this many parts at this much money. For Dell and other parts of the system, costs have gone up pretty dramatically – we didn’t have the same deal, but five years ago we did have a deal for the 4,000 GPUs at a fixed price in the Blackwell generation. And Nvidia has, despite what has happened to the retail price of the GPUs, held to the deal they made with us. I doubt our cost is substantially better than the deals that the DOE labs are getting, but it’s certainly better than the deals that OpenAI is getting.”
So there’s that. Hooray TACC!
But the neat bit with the Horizon system being installed this year is that TACC will be using the low-precision floating point in the tensor cores on the Blackwell GPU to emulate FP64 computations at the heart of the HPC simulation and modeling workloads that scientists run on TACC iron today.
“The way the world is changing, and given all the circuit emphasis that is going into low precision, we have to work up the numbers,” Stanzione explains. “We don’t have access to everything yet, but we will do native FP64 and then we are going to do emulated FP64 with the Ozaki scheme most likely, unless something better comes along. And for all of our acceptance apps, we are going to go through a verification process to look at the quality of the answers in native FP64 versus the not-quite-IEEE-compliant FP64 you get from the Ozaki scheme. But we are anticipating a 2X to 3X performance boost with emulated FP64 over native FP64 just given how many more circuits are going into lower precision right now.”
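The core idea behind the Ozaki scheme can be sketched in a few lines of NumPy. The toy below is our illustration, not Nvidia’s reference implementation: it splits each FP64 operand into fixed-point slices small enough that every slice-by-slice product is exact in low-precision arithmetic – here plain integer GEMMs stand in for the tensor cores’ low-precision math – and then sums the scaled partial products back in FP64:

```python
import numpy as np

def ozaki_like_dgemm(A, B, slices=8, bits=7):
    """Toy Ozaki-style FP64 GEMM emulation (illustrative only)."""
    # Scale rows of A and columns of B so entries land in [-1, 1].
    sa = 2.0 ** np.ceil(np.log2(np.maximum(np.abs(A).max(axis=1, keepdims=True), 1e-300)))
    sb = 2.0 ** np.ceil(np.log2(np.maximum(np.abs(B).max(axis=0, keepdims=True), 1e-300)))
    Ra, Rb = A / sa, B / sb
    QA, QB = [], []
    for _ in range(slices):
        # Peel off the top `bits` bits of each entry; the slices are
        # small integers, so slice-by-slice products are error-free.
        qa = np.trunc(Ra * 2.0**bits); Ra = Ra * 2.0**bits - qa
        qb = np.trunc(Rb * 2.0**bits); Rb = Rb * 2.0**bits - qb
        QA.append(qa.astype(np.int64)); QB.append(qb.astype(np.int64))
    C = np.zeros((A.shape[0], B.shape[1]))
    for j in range(slices):
        for k in range(slices):
            # Each integer GEMM here is exact; on Blackwell, this is
            # the part that would run on the low-precision tensor cores.
            C += (QA[j] @ QB[k]).astype(np.float64) * 2.0 ** (-bits * (j + k + 2))
    return C * (sa * sb)
```

With `bits=7` (roughly the payload of a signed INT8 slice) and eight slices, each operand carries about 56 mantissa bits, which is more than FP64’s 53 – hence results close to, but not bit-identical with, IEEE FP64, which is exactly the “not-quite-IEEE-compliant” behavior Stanzione describes.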
Stanzione adds that TACC will do High Performance LINPACK results for the Top500 supercomputer rankings in both native and emulated FP64 for the Horizon system. This is not the same thing as the mixed precision HPL-MxP solver for the LINPACK code that was developed by Jack Dongarra and others at the University of Tennessee, which gives a 10X performance boost.
We are keen to see how real world applications make do with the Ozaki scheme on Blackwell and Rubin GPUs, and hope that data is also made available to compare “Ampere” and “Hopper” GPUs running the same Ozaki scheme emulation. (Here is the original paper on the Ozaki scheme from 2012, and this is a new paper on an updated Ozaki FP64 emulation algorithm that was put out last April.)
TACC has done early testing on its GPUs using an internal implementation of the Ozaki scheme, but with Blackwell, Stanzione says it will shift to the reference implementation of the FP64 emulator from Nvidia. The actual performance will depend, he says, on the bandwidth in the Blackwell GPUs, but using the Ozaki scheme will be as simple as changing the DGEMM call – native or emulated – that is used in the code.
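As a sketch of what “changing the DGEMM call” might look like at the application level – the names here are ours, not Nvidia’s API – a thin dispatch wrapper lets a site flip whole applications between the native and emulated paths with an environment variable, without touching application code:

```python
import os
import numpy as np

def dgemm(A, B, emulate=None):
    """Hypothetical wrapper: one call site, two backends. A real
    deployment would route the emulated branch to a vendor library,
    such as Nvidia's reference Ozaki-scheme implementation."""
    if emulate is None:
        # Site-wide switch, so application code never changes.
        emulate = os.environ.get("GEMM_FP64_EMULATION", "0") == "1"
    if emulate:
        # Placeholder that deliberately degrades to FP32 so the two
        # paths are distinguishable in testing; the real emulated path
        # is expected to be 2X to 3X faster than native FP64 while
        # staying near FP64 accuracy.
        return (A.astype(np.float32) @ B.astype(np.float32)).astype(np.float64)
    return A @ B  # native FP64 GEMM
```

A verification harness like the one TACC describes for its acceptance apps would then run every code twice, once per backend, and diff the answers.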
“If there’s a case that this emulation is producing scientifically acceptable results, then suddenly our FP64 cost is now divided by 2X to 3X without actually changing the chip,” says Stanzione hopefully. “My presumption is that over time there will be Ozaki and other schemes like it, ones that are probably slightly lower performance but hit all the corner conditions for IEEE 64, and at some point, you just hide it in the microcode and nobody knows anymore.”
None of this was on the table when TACC was doing its deal with Nvidia five years ago for those Blackwell GPUs. So if this happens, it will be another bonus that can help advance science, like locking in the price of the GPUs with Nvidia early.
The upshot of that price lock from Nvidia is not to be underestimated. It only takes about a third of the $457 million budget for the Horizon system to cover the cost of its compute and networking, which comes from Nvidia. (The 400 PB of flash storage from VAST Data that will link to Horizon was bid on separately, according to Stanzione.) Despite this being a fairly modest machine at around 300 petaflops aggregate at FP64 precision (and lower than the 400 petaflops we heard about in the early planning cycle for Horizon), $160 million or so for this machine ain’t a bad price at all. That is 2.7X as much money as was spent on Frontera for 7.1X more FP64 oomph.
The gap is even larger between Frontera and Horizon if you isolate to the GPU engines. Frontera had 3.5 petaflops at FP64 on its 448 “Volta” V100 GPUs, but Horizon will weigh in at 4,032 Blackwell GPUs that deliver 161.3 petaflops at FP64. That is a factor of 46.1X more FP64 throughput on the vector cores in the GPUs. And on the tensor cores, there is 20.2 exaflops at FP16, and you double the calculation throughput every time you cut the precision in half all the way down to 80.6 exaflops at FP4.
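Those ratios can be sanity-checked in a few lines, using only the article’s own figures (the 5 petaflops of FP16 per GPU is our figure, back-solved from the 20.2 exaflops aggregate):

```python
# Back-of-envelope check of the Frontera-to-Horizon GPU comparison.
frontera_gpu_pf = 3.5          # Frontera's GPU partition, FP64 petaflops
horizon_gpu_pf = 161.3         # 4,032 Blackwells, FP64 vector petaflops
print(f"{horizon_gpu_pf / frontera_gpu_pf:.1f}X")  # FP64 uplift

ef = 4032 * 5e-3               # ~5 PF of FP16 tensor math per GPU, in exaflops
for prec in ("FP16", "FP8", "FP4"):
    print(f"{prec}: {ef:.1f} EF")
    ef *= 2                    # throughput doubles as precision halves
```

Running it prints the 46.1X FP64 uplift and the 20.2 EF, 40.3 EF, and 80.6 EF ladder from the text.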
We await the FP64 emulation data, which might show the same machine getting 320 petaflops to 484 petaflops of emulated FP64 performance – and perhaps even more with tuning and less with higher precision fidelity.
This year won’t be boring in HPC. Mixed precision might just go mainstream as an emulator for high precision floating point.