It has taken three decades for HPC to move to the cloud, and the truth is that a lot of simulation and modeling applications are still coded to run on CPUs. The good news is that the major clouds and some of the smaller ones have tuned up their CPU estates with fast networking and proper HPC software stacks, so they can be useful for HPC centers that, for one reason or another, desire to rent rather than to buy some compute capacity.
This week, Amazon Web Services trotted out the latest of its HPC-tuned virtual servers, the HPC8a instances, which are based on a customized version of AMD’s “Turin” Epyc 9005 series processors.
The Turin CPUs were launched in October 2024 and come in two flavors: one set based on the regular Zen 5 cores, spanning 8 to 128 cores and 16 to 256 threads per socket, and another set based on the L3 cache-halved Zen 5c cores, spanning 96 to 192 cores and 192 to 384 threads. We think the new HPC8a instances – there is really only one at the moment, but that could change – are based on a custom Epyc 9R15 processor, much as the prior HPC6a and HPC7a instances were based on the custom “Milan” Epyc 7R13 from March 2021 and the custom “Genoa” Epyc 9R14 from November 2022. The Epyc 9R15 chip seems to be based on the Epyc 9655 processor, which has the same 2.6 GHz base clock speed and the same 4.5 GHz turbo clock speed that the Epyc 9R15 is said to have.
We know that these HPC instances, with the exception of those based on the Graviton3E, are two-socket machines because AWS turns simultaneous multithreading (SMT) off on all of its HPC instances, which means each vCPU maps to a single physical core. On HPC workloads that use the Message Passing Interface (MPI) protocol to share work across the loosely coupled memory pools of the HPC cluster nodes, MPI communication is very sensitive to cache latencies, which in turn are adversely affected by SMT. When you turn SMT off, each core runs only one thread, the caches behave more predictably, and performance doesn’t degrade.
So, the new hpc8a.96xlarge is not a single 96-core processor, but rather a pair of them, delivering 192 physical cores that are in turn virtualized into 192 vCPUs.
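That deduction can be sketched as a bit of arithmetic. This is our own illustrative helper, not anything AWS publishes:

```python
# Back-of-envelope check of the HPC8a core math described above.
# With SMT off, one vCPU maps to one physical core, so the vCPU
# count divided by cores per socket gives the socket count.
# (Instance figures come from the article; the function is ours.)

def sockets_needed(vcpus: int, cores_per_socket: int, smt_on: bool) -> float:
    threads_per_core = 2 if smt_on else 1
    return vcpus / (cores_per_socket * threads_per_core)

# hpc8a.96xlarge: 192 vCPUs on custom 96-core Epyc 9R15 chips, SMT off
print(sockets_needed(192, 96, smt_on=False))  # -> 2.0
```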
The important thing about the HPC8a is that by moving to the Turin design, the dozen DDR5 memory controllers on each Epyc 9R15 socket feed its 96 cores (and therefore 96 vCPUs) with faster DDR5 memory than the Genoa chips used in the HPC7a instances could drive.
That faster DDR5 memory means that on workloads constrained by memory bandwidth, the Turin instances will do up to 40 percent more work than the Genoa instances with the same vCPU count. This is despite the fact that the peak theoretical FP64 floating point performance of the Epyc 9R15 used in the HPC8a instances, at its 2.6 GHz base clock speed, is almost exactly the same as the peak FP64 oomph of the Epyc 9R14 used in the HPC7a instances.
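As a back-of-envelope check, peak per-socket memory bandwidth is just channels times transfer rate times eight bytes per transfer. AWS does not publish the DDR5 speeds of its custom Epyc parts, so the transfer rates below are illustrative assumptions, not confirmed figures:

```python
# Peak per-socket DDR5 bandwidth: channels x MT/s x 8 bytes per transfer.
# The transfer rates here are assumptions for illustration; AWS has not
# published the memory speeds used on its custom Epyc parts.

def peak_bw_gbs(channels: int, megatransfers: int) -> float:
    return channels * megatransfers * 8 / 1000  # GB/sec

genoa = peak_bw_gbs(12, 4800)  # assumed DDR5-4800 -> 460.8 GB/sec
turin = peak_bw_gbs(12, 6000)  # assumed DDR5-6000 -> 576.0 GB/sec
print(genoa, turin, turin / genoa)
```

Measured gains on bandwidth-bound workloads can differ from this kind of peak ratio, which is why AWS’s up to 40 percent figure is a workload claim rather than a raw spec ratio.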
You can see this in the AWS HPC instance table we have created below:
If you do the math, which is what HPC centers do for a living, almost all of the performance gain, and nearly all of the 25 percent price/performance gain that AWS cites in its HPC8a announcement when compared to the top-end HPC7a instance, comes from that extra memory bandwidth from faster DDR5. Clearly, from the table, you can see the trouble with using peak theoretical performance alone to gauge real-world performance and therefore price/performance. If you based your buying decision on this table alone, you would go with the Genoa instance, not the Turin instance.
Both the HPC8a and HPC7a instances use the same 300 Gb/sec EFA-2 Ethernet adapters, so there is no network advantage driving performance for the latest Turin instances. We are surprised that AWS has not yet delivered a 400 Gb/sec, or better still an 800 Gb/sec, EFA-3 adapter for its formal HPC instances, to be honest, which might contribute to cluster throughput in a big way for Genoa and Turin instances used at scale.
You will notice a few odd things about the HPC instances that AWS has put together. The HPC6 instances came in two flavors: one based on an “Ice Lake” Xeon SP v3 processor from Intel and one based on the custom Milan Epyc 7R13 processor from AMD. The name of the instance tells you the core count and the vCPU number tells you the thread count. There is one instance per type, and it has the maximum core and thread count.
With the HPC7g instances, which are based on the HPC-tuned Graviton3E variant of the AWS-designed Graviton3 processor and which were launched in December 2022, and with the HPC7a instances based on the custom Genoa Epyc 9R14 processors, which launched in August 2023, AWS did something different. Rather than just have a single instance with maximum cores and a reasonably large memory, AWS made the memory capacity static but allowed for the core counts to be reduced so that customers could pick a ratio of memory capacity and memory bandwidth to cores as they often do when they configure server nodes in their on-premises CPU clusters.
So, with the Graviton3E-based HPC7g instances, customers could configure cores against that 128 GB of main memory with 2 GB, 4 GB, or 8 GB of memory per core. (As the core count shrank, the amount of memory bandwidth per core grew in the same proportion.) Interestingly, AWS didn’t change the cost of on demand instances based on the core count or these other metrics; the price was the same $1.68 per hour.
Ditto for the HPC7a instances based on Genoa. By reducing the core count through a smaller instance slice (but with a fixed memory capacity no matter what the core count is), customers could configure 4 GB, 8 GB, 16 GB, or 32 GB of main memory per core and, again, a proportionally larger amount of memory bandwidth per core as the core count goes down.
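The slicing arithmetic for both instance families can be sketched like this; the per-slice core counts below are our reading of the ratios above, not an AWS-published table:

```python
# Memory-per-core arithmetic for the HPC7g and HPC7a slices described
# above: memory capacity is fixed per instance family while the active
# core count shrinks with the instance slice. The core counts per slice
# are derived from the article's GB-per-core ratios, not AWS specs.

def gb_per_core(total_gb: int, cores: int) -> float:
    return total_gb / cores

# HPC7g: 128 GB fixed, Graviton3E cores sliced 64 / 32 / 16
for cores in (64, 32, 16):
    print(f"hpc7g slice with {cores} cores: {gb_per_core(128, cores):.0f} GB/core")

# HPC7a: 768 GB fixed, Epyc 9R14 cores sliced 192 / 96 / 48 / 24
for cores in (192, 96, 48, 24):
    print(f"hpc7a slice with {cores} cores: {gb_per_core(768, cores):.0f} GB/core")
```

Note that because the on demand price stays flat across slices, the effective cost per core-hour rises as you pick the smaller, memory-richer slices.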
With the HPC8a instance just announced, AWS is back to just selling the full, fat configuration, in this case with 96 cores against 768 GB of main memory, or the 4 GB per core that is often used with nodes in HPC clustered systems.
One other thing you will also note: The Elastic Block Store (EBS) bandwidth and throughput is the same across all of these instances, and it is not particularly high, at 87 Mb/sec and 500 I/O operations per second (IOPS). There is a special turbo mode for EBS that allows HPC customers to run at 2,085 Mb/sec of bandwidth and drive 11,000 IOPS for one 30-minute period in every 24 hours. It is not clear if this is meant for snapshotting the state of the virtual HPC clusters, which do not have local storage, with the exception of the HPC6id instance launched a few years ago.
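A quick calculation shows how much data fits through that daily turbo window, which hints at whether it is sized for snapshotting. We take the article’s figure at face value as megabits per second; if AWS actually means megabytes, multiply the result by eight:

```python
# How much data fits through the daily EBS turbo window described above:
# 30 minutes at 2,085 Mb/sec (read as megabits per second, per the article).

TURBO_MBITS_PER_SEC = 2085
WINDOW_SECONDS = 30 * 60

# megabits -> megabytes (divide by 8) -> gigabytes (divide by 1,000)
gigabytes = TURBO_MBITS_PER_SEC * WINDOW_SECONDS / 8 / 1000
print(f"{gigabytes:.0f} GB per daily turbo window")  # -> 469 GB
```

Call it roughly 470 GB a day, which is plausibly enough to checkpoint a chunk of a node’s 768 GB of main memory, but not all of it.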
Clearly, AWS does not want customers to set up parallel file systems atop its EBS block storage, although there is probably no technical reason it could not be done. And clearly, AWS very much wants HPC shops running their ModSim in the cloud to use its FSx for Lustre service, a fully managed implementation of the open source Lustre parallel file system. That said, there is nothing stopping you from setting up a VAST Data or WekaIO cluster on AWS iron and linking it to an HPC instance cluster if you want to.