Public Cloud Giants Fight For HPC Supremacy

Microsoft Azure has been able to put actual Cray XC series supercomputers and CS Storm clusters in the public cloud for more than two years now, and it is unclear how many companies have commissioned Cray, now part of Hewlett Packard Enterprise, to do so. It is far more likely that customers who want to run HPC and AI workloads on the cloud – particularly GPU accelerated ones – will instead cluster up existing instances to create a virtual supercomputer.

But Microsoft, which is very eager to build up an HPC business on Azure, is splitting the difference by putting HPC instances on Azure that build up into a cluster that looks and feels like one that customers might deploy in their own datacenters.

The new Azure instance, which is in tech preview this week as it is being announced at the SC19 supercomputer conference in Denver, is actually a single node of a 100-node cluster that Microsoft is putting in selected regions. (It is not clear which ones as yet, but we have asked.) That HPC instance, called the NDv2, is obviously just as suited to running machine learning training workloads as it is to GPU-accelerated HPC workloads, for those who want to do that. The NDv2 is based on a single HGX tray of eight Tesla “Volta” V100 GPU accelerators all lashed together to share data across NVLink. (Think of it as half of the GPU complex in the DGX-2 system from Nvidia with the NVSwitch pulled out and direct NVLinks across those GPUs so they can address each other’s 32 GB HBM2 memory chunks.) This GPU compute complex is linked to a host CPU system that is based on a pair of 20-core “Skylake” Xeon SP-8168 Platinum processors, which run at 2.7 GHz and which are housed in Microsoft’s homegrown “Project Olympus” systems. The server nodes have 672 GB of main memory, which suggests there is a hypervisor in there somewhere burning up some memory; 96 GB is our guess, which would mean the machine has 768 GB and is loaded up across its 24 memory slots with 32 GB memory sticks.

Each of the NDv2 nodes has a single 100 Gb/sec ConnectX-5 network interface card out to a 100 Gb/sec EDR InfiniBand interconnect, both obviously from Mellanox Technologies, which Nvidia is in the process of acquiring and which has sold both InfiniBand and Ethernet switching into Microsoft’s Azure public cloud over the years. It is not clear what topology Microsoft is using to link the NDv2 instances to each other, but we would guess that it is a fat tree topology, as is commonly used for HPC and AI workloads, and not the Clos topology commonly used by hyperscalers and cloud builders. Ian Buck, vice president and general manager of accelerated computing at Nvidia, tells The Next Platform that the machines in the NDv2 cluster will be sold in blocks of eight servers, for a total of 64 GPUs, which implies that as customers scale up their NDv2 clusters, they are buying adjacent branches of a fat tree. Knowing this, you would think that the top-end pod of NDv2s would be 96 nodes with 768 GPUs, but we are told that it is actually 100 nodes with a total of 800 GPUs. Go figure. Our money says it is actually the former and someone in the communication tree rounded up the data.
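The block arithmetic behind that suspicion is easy to check. The sketch below is only an illustration: the function names are ours, and the only inputs are the figures quoted above (eight GPUs per node, eight nodes per block).

```python
# Illustrative only: names are hypothetical, figures come from the text above.
GPUS_PER_NODE = 8     # one HGX tray of eight V100s per NDv2 node
NODES_PER_BLOCK = 8   # NDv2 capacity is said to be sold in blocks of eight servers


def cluster_size(blocks):
    """Return (nodes, gpus) for a cluster built from whole blocks."""
    nodes = blocks * NODES_PER_BLOCK
    return nodes, nodes * GPUS_PER_NODE


def split_into_blocks(nodes):
    """Return (whole_blocks, leftover_nodes) for a given node count."""
    return divmod(nodes, NODES_PER_BLOCK)


print(cluster_size(1))         # one block
print(cluster_size(12))        # twelve blocks
print(split_into_blocks(100))  # a 100-node pod does not divide evenly
```

Twelve whole blocks give 96 nodes and 768 GPUs, while a 100-node pod leaves four nodes outside any block of eight, which is why the 96-node figure looks more natural.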

The systems are obviously set up running some variant of Linux (it looks like CentOS or Ubuntu Server are the defaults, but Red Hat Enterprise Linux is an option, as is SUSE Linux Enterprise Server), and the full Nvidia software stack is available either through the Nvidia NGC cloud or the Azure Marketplace. Microsoft says that the Mellanox OFED network drivers are installed (as if there were any other option) and all MPI types and versions are supported. Obviously there is a hypervisor in there somewhere, and presumably it is Hyper-V, which Microsoft uses to carve up the Azure cloud. There is no indication of what performance penalty, if any, Hyper-V imposes when it is running. We are surprised that this is not a bare metal instance, to be honest.

Microsoft has not officially revealed pricing as yet, but we have heard on the street that it will be $26.44 per NDv2 instance per hour. This has got to be setting some kind of record, but look at all of that GPU performance and memory bandwidth that is being brought to bear. And the cost of that InfiniBand network has to be paid for even if customers do not make full use of it.

The interesting thing is that we can cost this thing out. Without any data storage services, running a 96-node cluster full out for three years would cost $66.75 million, and the machine would have 5.76 petaflops of aggregate peak double precision performance. A DGX-1V, which has eight Tesla V100s and two Xeon processors and is roughly analogous to the node that Microsoft has put together for the NDv2 instance, costs $119,000 at current pricing (down from $169,000 more than two years ago when it was launched). So 96 of these would cost $11.4 million, and that includes a fair amount of local flash storage and four times the network bandwidth coming out of the box. That number does not include power, cooling, real estate, systems management, or InfiniBand switching and cabling costs, but if you work it backwards and amortize the hardware alone over the same three years, the same 5.76 petaflops of performance works out to $4.53 per hour for a DGX-1 node that is roughly similar. It is up to everyone to look at their own costs to build a 96-node cluster and see how it compares, fully burdened. Or what ODM and OEM equivalents would cost, which are even less expensive than what Nvidia charges. Microsoft just set the ceiling for HPC on the cloud.
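For those who want to plug in their own numbers, here is a minimal sketch of that arithmetic. The function names are ours, and the 365.25-day year is our assumption, chosen because it reproduces the figures above. The $4.53 hourly figure comes out of amortizing one DGX-1V over three years of wall-clock hours.

```python
# Back-of-the-envelope cost model; names and the 365.25-day year are assumptions.
HOURS_PER_YEAR = 24 * 365.25


def cloud_cost(nodes, price_per_hour, years):
    """Total on-demand spend for running `nodes` instances flat out."""
    return nodes * price_per_hour * HOURS_PER_YEAR * years


def amortized_hourly(purchase_price, years):
    """Hardware-only cost per wall-clock hour over the amortization window."""
    return purchase_price / (HOURS_PER_YEAR * years)


ndv2_three_year = cloud_cost(nodes=96, price_per_hour=26.44, years=3)
dgx1_cluster = 96 * 119_000
dgx1_hourly = amortized_hourly(purchase_price=119_000, years=3)

print(f"NDv2, 96 nodes, 3 years: ${ndv2_three_year / 1e6:.2f} million")
print(f"96 x DGX-1V hardware:    ${dgx1_cluster / 1e6:.1f} million")
print(f"DGX-1V hardware per hour: ${dgx1_hourly:.2f}")
```

Swapping in a different street price, node count, or amortization window shows how sensitive the gap is to each assumption. Power, cooling, real estate, management, and networking are deliberately left out here, as they are in the text.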

The other thing to consider here is utilization. Let’s say for the sake of argument that an in-house DGX-1 cluster cost $10 an hour fully burdened for just the compute and networking, without a local flash storage array from Pure Storage or DataDirect Networks or a homegrown one using Excelero, Vast Data, or Lightbits Labs software defined storage. If you own your own hybrid CPU-GPU cluster and you only utilize it 50 percent of the time, you are really then paying $20 an hour to own that cluster. So the gap between cloud and on premises closes pretty fast. But you can also get the cost down by using an ODM or OEM machine – Inspur, Supermicro, Dell, and Hewlett Packard Enterprise will get you something that looks like the NDv2 node for a lot less than what Nvidia is charging. Probably something on the order of 40 percent less. So that cuts the overall cost back a little, but maybe not as much as you probably hoped. If you drive up the utilization, the cost per hour on premises comes down, too. What is clear here is that utilization is the deciding factor, and utilization patterns should probably drive your choice of what capacity to deploy on premises and what to deploy on the cloud.
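The utilization argument reduces to one division. The sketch below uses the hypothetical $10 per hour burdened cost from the example above and the rumored $26.44 NDv2 price. The function names are ours, and it assumes that cloud capacity is paid for only while it is in use.

```python
# Hypothetical helper names; the inputs are the example figures from the text.
def effective_hourly_cost(burdened_cost_per_hour, utilization):
    """Cost per *useful* hour of an owned node that sits idle part of the time."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in the range (0, 1]")
    return burdened_cost_per_hour / utilization


def breakeven_utilization(burdened_cost_per_hour, cloud_price_per_hour):
    """Utilization below which renting by the hour is cheaper than owning."""
    return burdened_cost_per_hour / cloud_price_per_hour


print(effective_hourly_cost(10.0, 0.5))
print(f"{breakeven_utilization(10.0, 26.44):.1%}")
```

At 50 percent utilization the $10 node costs $20 per useful hour, and under these assumptions owning only loses to the cloud price when utilization drops below roughly 38 percent.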

Or, just say the hell with managing any of this and move it all to the cloud. More than a few HPC and AI practitioners will do that because they will never operate at huge scale.

In addition to the NDv2 instances, Microsoft is also previewing its HBv2 virtual machines based on the 64-core “Rome” Epyc 7742 processors, with 60 of those cores exposed above the Hyper-V hypervisor. The underlying node has two of these processors; the cores have a base speed of 2.25 GHz and boost as high as 3.4 GHz. Microsoft says that a two-socket HBv2 node delivers 4 teraflops of aggregate peak floating point performance at double precision (twice that at single precision, obviously), and moreover, the networks it has set up can span over 80,000 cores using MPI to deliver 5.36 petaflops of peak capacity within Azure regions. By the way, these nodes are interlinked using 200 Gb/sec HDR InfiniBand from Mellanox, which is the first use of HDR InfiniBand on the public cloud. The HBv2 instance has 480 GB of main memory available for applications and delivers 350 GB/sec of memory bandwidth across the two sockets. It costs $3.96 per hour. At the full MPI scalability limit that Microsoft is offering with its HDR network on the HBv2 instances (we think it is 672 nodes), it only costs $2,661 per hour to rent that 5.36 petaflops cloud using on demand instances; reserved instances are not yet available, and they will drop the price quite a bit when they arrive.
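The core count and hourly bill at that scale follow directly from the per-node figures. In the sketch below the 672-node limit is our guess from the text rather than a confirmed number, and the variable names are ours.

```python
# Figures from the text; the 672-node MPI limit is a guess, not a confirmed spec.
SOCKETS_PER_NODE = 2
EXPOSED_CORES_PER_SOCKET = 60   # 60 of the 64 Rome cores sit above the hypervisor
PRICE_PER_NODE_HOUR = 3.96      # on-demand price in dollars
NODES = 672

total_cores = NODES * SOCKETS_PER_NODE * EXPOSED_CORES_PER_SOCKET
hourly_cost = round(NODES * PRICE_PER_NODE_HOUR, 2)

print(f"{total_cores:,} cores for ${hourly_cost:,.2f} per hour")
```

That comes to 80,640 exposed cores, which squares with the “over 80,000 cores” claim, for about $2,661 per hour.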

Microsoft wants to have the four workhorses of the data apocalypse on Azure, just as Intel wants to own all four and as AMD is covering them through partnerships as well as its own chips. (That’s CPUs, GPUs, FPGAs, and NNPs.) To that end, Microsoft is previewing its NDv3 instances, which will have the same basic Olympus server node with the pair of Skylake Xeon SP-8168 Platinum processors with 768 GB of memory and eight Graphcore accelerators, each with a pair of IPU chips, each of which delivers 1,216 IPU tiles with 7,296 threads, 300 MB of in-processor memory, and a stunning 45 TB/sec of memory bandwidth. The sixteen banks of IPU cores on the Graphcore processors are connected through a proprietary IPU-Exchange crossbar with 8 TB/sec of aggregate bandwidth, and up to eight of the Graphcore chips are glued together in the NDv3 instances using the proprietary IPU-Links interconnect. (This is roughly analogous to NVLink with GPUs.) The Graphcore chips hook into the CPU complex with a PCI-Express 4.0 x16 slot. These instances are equipped with Graphcore’s Poplar software development kit.

Additionally, Microsoft is promising that it will deliver an NP series instance on Azure that will expose from one to four of the Alveo U250 FPGA accelerators from Xilinx. These will be hosted by the same basic server node as the other instances mentioned above, and will come with the SDAccel 2019.1 runtime environment from Xilinx preloaded.

Pricing on the Graphcore NDv3 series and the Xilinx U250 NP series instances has not yet been released by Microsoft.

Over at AWS, which already sells its F1 FPGA instances and has not divulged any plans for any kind of NNP instances, the angle for SC19 is to talk about its new C5a and C5ad instances, which will be available in bare metal form with 192 virtual CPUs (vCPUs, which are threads across the cores activated) and 384 GB of memory. The C5a uses network storage and the C5ad has 7.6 TB of local NVM-Express flash storage. For virtualized instances, the CPU compute will be chopped up into eight different sized slices, and the Nitro SmartNIC will handle the vast majority of KVM hypervisor functions as well as network, storage, and accelerator virtualization, freeing up those Rome cores to do real host work. In bare metal mode, the C5a and C5ad Rome Epyc instances will have 100 Gb/sec Ethernet interfaces out to the network, and the Elastic Fabric Adapter will scale this up and down with the CPU compute. Pricing of these Rome CPU instances on AWS was not revealed.

1 Comment

  1. “It is far more likely that customers who want to run HPC and AI workloads on the cloud – particularly GPU accelerated ones – will instead cluster up existing instances to create a virtual supercomputer.”

    Ignoring for the moment GPUs and other accelerators, HPC involves raw CPU power, frequently with as many as possible cores sharing the same RAM in a single server.

    An ordinary cloud server vCPU is not even a core. It is a half an Intel core, split using hyperthreading, with the core switching back and forth between its two register sets, each for a different thread. When one thread has a cache miss, the other one can usually be run. The performance of each such thread is around 40 to 45% of the performance of the core concentrating solely on one thread. It doesn’t help to use only a number of vCPUs in the set which has been hired, since the hypervisor lets other customers’ threads run, including on the other half of the same cores you are running.

    This is one reason why ordinary cloud servers are far too slow at the best of times. Another is that the cores are invariably one of 18 or more in the same dual Xeon server, and the biggest bottleneck is access to main memory. Also, the cores in these large core count CPUs are not clocked very fast due to the difficulty of cooling them.

    A “dedicated” cloud server vCPU might give you access to a thread where the other thread of the core is inactive. Still, you are fighting other customers’ vCPUs for access to main memory and, to a certain extent, shared cache. It may not be possible to rent the full set of vCPUs in a physical server, so you will always be competing with other customers’ loads. With both such types of cloud servers (vCPUs available by the second or hour, with no setup fees) I have observed significant drops in performance when most people in that part of the world are awake and using web sites, since most other customers’ vCPUs are running web sites or some local business related workload.

    The next step is on-demand, bare-metal servers – which specialise in. These are pricey but you get a full multicore machine, perhaps with GPU, on demand for hourly rates. For a job lasting a few hours or days, where you can get them going quickly, and where gigabit ethernet is OK, these would be a good choice.

    However, for jobs taking weeks or months, if gigabit ethernet is OK, and it is not absolutely necessary to have a massive number of cores, Hetzner has 6 core i7-8700 machines in Helsinki at very low monthly rates, with a ~1 month setup fee. These cores are faster than almost any in modern Xeon machines and they share a 2 DIMM wide main memory path. The cores have about 4 times the throughput of an average cloud server’s VM. They have 8 core machines too, but the main memory bandwidth is the same.
