It is very rare indeed to get benchmark data on HPC applications that shows them scaling over a representative number of nodes, and it is never possible to get cost allocations that allow price/performance comparisons to be made across clusters of different physical sizes, along with the increase in throughput that more scale brings.
But the public cloud gives us an opportunity to test scale and count the cost, and that is precisely what some benchmark tests done by Microsoft on its new HC series all-CPU instances on the Azure public cloud have allowed us to do. The results are interesting and illustrate that customers need to test scale and do the math on what that scale costs before determining the optimal configuration that balances performance against price. And this holds true whether you are running applications on premises or on the public cloud.
In fact, we think there is a good case to be made to do performance tests on the cloud even if you are buying on premises HPC clusters so you have some sort of idea how the applications scale – something that is not possible to do before buying an on premises cluster. Vendors don’t generally let you install iron and then send some of it back, although they are always happy to get an order for more nodes to build out a cluster. The public cloud is a relatively affordable way to spend thousands of dollars to test the scale on hardware that might cost tens of millions of dollars to buy.
The HC series instances were announced and put out for beta testing last September. They are based on Intel’s 24 core “Skylake” Xeon SP-8168 Platinum processor, which runs at 2.7 GHz and which can turbo up to 3.7 GHz. The HC instances are an HPC and AI companion to the HB instances, which are based on the 32 core “Naples” Epyc 7551 processor from AMD. Both machines have two sockets, and both are populated with enough memory to be useful and can be configured with either 40 Gb/sec Ethernet or 100 Gb/sec InfiniBand interconnects. At the same time last fall, Microsoft also debuted the NV and ND instances, which are also two-socket Skylake servers but augmented by “Maxwell” Tesla M60 or “Volta” Tesla V100 GPU accelerators. But for the purposes of this story, Microsoft focused on the Skylake-based, CPU-only HC configuration, which exposes 44 cores to HPC workloads (four of the 48 cores are used up running the hypervisor and virtual I/O in the instances).
To test the HPC mettle of its virtual metal, Microsoft loaded up the HC Azure instances of various sizes with two test cases of the CP2K molecular dynamics application and pushed the scale up on each scenario. CP2K is one of thirteen applications that the European Union’s PRACE consortium uses to benchmark and acquire HPC systems, and it has been tuned to take advantage of the AVX-512 vector math units in the Skylake processors to boost the floating point performance of the application. The HC virtual machines on Azure expose 44 of the 48 cores in the system to applications, and HyperThreading is turned off because, as we know, doubling up threads with virtual threads often leads to lower rather than higher performance on HPC workloads. Each pair of the Skylake chips offers 3.5 teraflops of double precision floating point performance and 190 GB/sec of memory bandwidth. Each node also exposes 352 GB of main memory and has a 700 GB flash drive plus access to up to four additional disk or flash drives. It is not clear how much storage was used in the benchmark; we presume just the base flash drive.
The CP2K tests run by Microsoft used the ConnectX-5 network interface card, which supports 100 Gb/sec Ethernet and InfiniBand links, and carved up the bandwidth using SR-IOV network virtualization and the same OFED driver stack that InfiniBand shops employ. Evan Burness, principal program manager for HPC on Azure, said that the HC virtual machines deliver MPI latencies of around 1.7 microseconds, which he said was in line with InfiniBand networking on bare metal servers and anywhere from 8X to 16X lower than the MPI latency of Ethernet-based HPC setups that are available on other public clouds (and indeed, the portions of the Azure cloud using 40 Gb/sec Ethernet without RDMA no doubt fall into this category).
So that is the iron. Now, let’s get into the tests. On the first CP2K test, the simulation is a single-point energy calculation using linear scaling density functional theory (DFT-LS), and in this case on 2,048 water molecules. Microsoft scaled up the HC nodes, counted the time it would take to do each simulation – what is referred to as a case – and then figured out what the throughput of each setup would be in cases per day on each cluster size. Here is a graphical representation of the interplay of time to solution for a single case and the number of cases that could be batched up and run in a single day as the clusters scale:
That chart does not have all of the relevant data Microsoft gathered as part of the test, including the number of MPI ranks per node and the number of CPU threads per rank, which were determined by the MPI stack (we presume) and not hand tuned. It seems weird to us that the threads per rank (or cores per rank effectively with HyperThreading turned off) were not constant, but there you have it.
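The relationship between time to solution for one case and daily throughput described above is simple arithmetic, and a minimal Python sketch makes it concrete. It assumes cases are run back to back with no idle time between them; the function name and the timings in the loop are hypothetical placeholders for illustration, not Microsoft's measurements.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def cases_per_day(seconds_per_case: float) -> float:
    """Throughput in cases per day, assuming cases run back to back all day."""
    if seconds_per_case <= 0:
        raise ValueError("seconds_per_case must be positive")
    return SECONDS_PER_DAY / seconds_per_case


# Hypothetical time-to-solution figures (seconds) keyed by node count.
# These are placeholders for illustration, NOT measured results.
hypothetical_timings = {8: 600.0, 16: 320.0, 32: 180.0}

for nodes, seconds in hypothetical_timings.items():
    print(f"{nodes:>3} nodes: {cases_per_day(seconds):8.1f} cases/day")
```

Plugging measured times into a function like this is how a time-to-solution curve turns into the cases-per-day curve.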
We stared at this data for a little while, blinked a few times, and then realized that cost scales perfectly linearly with the addition of HC instances added to the cluster, but performance does not scale precisely linearly. Because time is money – Einstein went back in time and proved that in conjunction with Adam Smith – public cloud vendors price by the hour and then charge by the second. This distinction is important to remember, because when the public cloud first started, everyone priced and charged by the hour. But given this variability of scaling of performance (for a lot of different reasons) it would be more fair to charge per unit of work rather than per unit of capacity over time. This would be impossible to turn into a business, perhaps.
In any event, we sat down and added the cost of running each configuration for a day – so the distinction between secondly pricing, minutely pricing, or hourly pricing doesn’t matter – to get the full CP2K case load done in that day, and here is what that full data set looks like:
And this is where it gets really interesting. We then charted it out so we could see throughput for each cluster size on the CP2K H2O DFT-LS application and the total cost of running that cluster for a day to get that throughput. Take a look:
As you can see, the performance scaled up linearly from 8 to 192 nodes, but there is a big jump in performance going up to 256 nodes, and a corresponding decrease in the cost per case run, which is what you really care about with any HPC application. After that, as the node count goes up, the costs go up a little faster than the throughput, and the cost per case rises with them. The point is, the cost doesn’t just track along with the scale, and you have to think about that for any HPC cluster, whether it is on the cloud or in the datacenter. The lowest cost per case comes in at $5.18 a pop with a 256 node cluster on the CP2K H2O DFT-LS application. While $3.168 per hour sounds cheap for a server of such beastliness, running a cluster of them for a day at the largest scale Microsoft tested with this scenario – 392 nodes with 17,248 cores – incurred fees of $29,804.54 for the day. That still ain’t cheap, and if you did that every day, that would be $10.9 million a year. Multiply that by three or four or five to cover the typical life of a cluster, and at four years you are up to around $44 million. That’s real money and considerably more than what a real, on premises cluster of that scale costs. (Our guess is 3X to 7X, depending on how long you keep the iron for.)
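The money math in the paragraph above can be reproduced in a few lines of Python. This is a sketch rather than anyone's official pricing calculator: the hourly rate, node count, and cores per node come from the figures quoted in this story, while the function names are ours and the 365-day year is an assumption.

```python
HOURLY_RATE_USD = 3.168  # per HC instance per hour, as quoted in the story
CORES_PER_NODE = 44      # cores exposed to applications on each HC node
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365      # assumption: the cluster runs every day of the year


def daily_cost(nodes: int, hourly_rate: float = HOURLY_RATE_USD) -> float:
    """Cost of running `nodes` instances around the clock for one day."""
    return nodes * hourly_rate * HOURS_PER_DAY


def cost_per_case(nodes: int, cases_per_day: float) -> float:
    """Daily cluster cost divided by measured daily throughput."""
    return daily_cost(nodes) / cases_per_day


nodes = 392
print(nodes * CORES_PER_NODE)                  # 17248 cores
print(round(daily_cost(nodes), 2))             # 29804.54 dollars per day
annual = daily_cost(nodes) * DAYS_PER_YEAR
print(round(annual / 1e6, 1))                  # 10.9 million dollars per year
print(round(4 * annual / 1e6, 1))              # 43.5 million over four years
```

The `cost_per_case` helper needs a measured cases-per-day figure for each cluster size; feeding it the throughput for every tested node count is what exposes the cheapest configuration.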
Now, before you just go out and buy a 256-node cluster to run CP2K based on one test, check out the results from the LiHFX benchmark, which is a single point energy calculation to simulate a lithium hydride crystal with 216 atoms and 432 electrons. Microsoft pushed the scale a bit here, up to 512 nodes, and what you can see is that it isn’t worth pushing it that hard for this particular LiHFX part of the CP2K test.
Well, actually you can’t see that because all you are looking at is performance data in the Microsoft chart, and anyone looking at this would probably always want the extra performance that comes with scale. In this case, going from 256 nodes to 512 nodes adds 50 percent to the performance. But, it costs 100 percent more money to get that 50 percent if you do the money math, as we did in this table:
As it turns out, the cheapest cluster – meaning the lowest cost per unit of work – on which to run the CP2K LiHFX scenario is one with 44 nodes, at $15.28 per case. The performance wiggles around a bit as the node counts, ranks per node, and threads per rank all change, and we strongly suspect that further tuning could be done to balance out the scale of CP2K and indeed any HPC application running on clusters. Here is the same data in chart format:
If it were me, I would be picking the 256 node cluster, which is a bit more expensive per case but below the knee of that curve where it starts to rise fast again. If you need to do more work, then you have to pay with increasing inefficiency, and that is just the way all systems work after a certain point.
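The inefficiency of pushing LiHFX from 256 to 512 nodes can be put in a single number. The sketch below assumes, as this story does, that cost scales linearly with node count; the function name is ours, and the 50 percent throughput gain is the figure cited above for that doubling.

```python
def relative_cost_per_case(node_ratio: float, throughput_ratio: float) -> float:
    """Cost per case on the larger cluster divided by cost per case on the
    smaller one, assuming cost is proportional to node count."""
    if throughput_ratio <= 0:
        raise ValueError("throughput_ratio must be positive")
    return node_ratio / throughput_ratio


# Doubling from 256 to 512 nodes buys 50 percent more throughput.
ratio = relative_cost_per_case(node_ratio=512 / 256, throughput_ratio=1.5)
print(f"Each case costs {(ratio - 1) * 100:.0f}% more at 512 nodes")  # 33% more
```

Any ratio above 1.0 means the extra nodes are buying throughput at a worse price per case than the smaller cluster delivers.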
So maybe a 256 node cluster was the answer after all. At least for the HC nodes running these two workloads. And maybe once the “Rome” chips from AMD are in the HB series, those will do even better in terms of throughput and bang for the buck. But the point is, do the math to figure that out. And thanks to Microsoft for giving us data that helps us play with the numbers and learn.