Microsoft is intent on bending the supercomputing set in its direction. Instead of focusing only on competing with other public clouds, the company is aiming directly at on-prem HPC, showing, for example, performance comparable to or better than that of existing top 10 supercomputers.
“For years, HPC investments have been expensive up-front based on predictions, and then living with the consequences of those predictions,” says Microsoft Azure CTO Mark Russinovich. “But predicting the future is hard. Looking ahead, in a timeframe that a customer would buy and operate a cluster on-prem, the same level of investment in Azure would yield significantly higher performance and performance per dollar.”
With noteworthy wins, including the UK Met Office, which has decided to forgo traditional supercomputers for the Azure cloud and new configurations designed to replicate big HPC capabilities, Microsoft might start swaying some users away from on-prem. It might also sway them away from its main challenger, AWS, which was first and most ambitious in trying to move supercomputing sites from their own clusters to the cloud, with big-name use cases in commercial and research HPC stretching back well over a decade.
In those early days, when AWS was capturing cloud customers for HPC (then a tough sell due to a lack of network capability ripe for MPI, lack of ISV cloud readiness, etc.), Microsoft was busy trying to keep Windows Server alive in HPC. But what a difference a decade makes. With big wins and apparent ambition, Microsoft is showing it understands the needs of HPC in 2021, and perhaps at just the right time. Cloud adoption in HPC has never exploded per se, but Hyperion Research and others are showing cloud is making strong gains.
In an effort to bolster HPC capability, Microsoft has announced serious cloud-based supercomputing capabilities built on third-generation AMD Epyc 7003 processors, which provide over 2.5X higher virtual machine performance for lightly threaded workloads and almost double the performance for large-scale MPI workloads compared with the previous-generation Epyc processors available through Azure.
The company is focused on MPI workload performance in particular. Evan Burness, Principal Program Manager for Azure HPC, says that in a test scaling from 4,000 cores to over 33,000, the new VM showed a 1.2X to 1.8X performance jump over the HBv2-series VMs (based on the AMD Epyc 7002 series).
HBv3-series VMs feature up to 120 AMD Epyc 7003 series CPU cores, 448 GB of RAM, and no simultaneous multi-threading. They also provide up to 340 GB/s of memory bandwidth, up to 32 MB of L3 cache per core, up to 7 GB/s of block device SSD performance, and clock frequencies up to 3.675 GHz. All HBv3-series VMs feature NVIDIA Mellanox HDR 200 Gb/s InfiniBand to enable 80,000-core MPI workloads.
There is some flexibility when it comes to optimizing HBv3 VMs for cost and workload. RAM, L3 cache, memory bandwidth, InfiniBand, and local SSD are all constant, but it’s now possible to dial in core count and clock frequency. Users can pick a lower number of cores to expose to the VM while keeping all other assets constant.
“Doing so increases how those assets are allocated on a per-core basis. In HPC, common scenarios for which this is useful include providing more memory bandwidth per CPU core for CFD workloads, allocating more L3 cache per core for RTL simulation workloads, driving higher CPU frequencies to fewer cores in license-bound scenarios, or giving more memory or local SSD to each core,” Burness explains.
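The per-core arithmetic Burness describes can be sketched in a few lines of Python. This is only an illustration, not Azure's sizing logic: the totals (448 GB of RAM, up to 340 GB/s of memory bandwidth, 120 cores) come from the figures quoted above, while the function name and the list of reduced core counts are assumptions picked for the example.

```python
# Totals quoted in the article for the largest HBv3 VM. The bandwidth
# figure is an "up to" peak, so the per-core values are upper bounds.
TOTAL_RAM_GB = 448
TOTAL_MEM_BW_GBS = 340
MAX_CORES = 120


def per_core_share(exposed_cores):
    """Return (RAM in GB, memory bandwidth in GB/s) available per exposed core.

    Hypothetical helper: it simply divides the constant VM totals by the
    number of cores exposed to the VM.
    """
    if not 1 <= exposed_cores <= MAX_CORES:
        raise ValueError(f"exposed_cores must be between 1 and {MAX_CORES}")
    return TOTAL_RAM_GB / exposed_cores, TOTAL_MEM_BW_GBS / exposed_cores


if __name__ == "__main__":
    # Illustrative core counts only; check Azure's documentation for the
    # sizes that are actually offered.
    for cores in (120, 96, 64, 32, 16):
        ram, bw = per_core_share(cores)
        print(f"{cores:>3} cores: {ram:6.2f} GB RAM/core, {bw:5.2f} GB/s/core")
```

Because the totals stay fixed, halving the exposed cores doubles each core's share, which is the trade a bandwidth-bound CFD job or a per-core-licensed application would be making.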
Microsoft is seeing opportunity for large-scale users outside of the traditional HPC areas. Russinovich says that in addition to CFD, weather forecasting, and geoscience simulations, the company is also targeting physics-based machine learning and expanding workloads including financial risk analysis, RTL modeling for silicon design, and structural mechanics for product design and biological research.
“Our HPC customers are clear. They need to solve HPC-driven research and business problems at significantly higher velocity and resolution,” says Russinovich. “They also have a diverse set of workloads and need highly optimized solutions for each and they don’t want to be blocked into a single HPC hardware platform to capture efficiency and performance for years. They want a cloud partner who continuously innovates and offers new capabilities to meet their ongoing needs.”
The new virtual machine type, called HBv3, is available now in the U.S. (East US, South Central US) and Western European Azure regions, with APAC regions coming soon.
This makes Azure the first major public cloud to optimize for Epyc 7003 series processors, not to mention for HPC.