The Shape Of AMD HPC And AI Iron To Come
August 8, 2017 Timothy Prickett Morgan
In the IT business, just like any other business, you have to try to sell what is on the truck, not what is planned to be coming out of the factories in the coming months and years. AMD has put a very good X86 server processor into the market for the first time in nine years, and it also has a matching GPU that gives its OEM and ODM partners a credible alternative for HPC and AI workloads to the combination of Intel Xeons and Nvidia Teslas that dominate hybrid computing these days.
There are some pretty important caveats to this statement, of course. While everyone is excited to see competition returning to the CPU and GPU compute arena, the “Naples” Epyc CPUs and “Vega” Radeon Instinct GPUs are new and it is not precisely clear how they will perform on actual applications. But the basic compute, memory, and I/O of the hardware that AMD is delivering makes it a contender in its own right, particularly with the help of high bandwidth, low latency InfiniBand networking from Mellanox Technologies.
AMD has been winding up its one-two punch in compute for years, and fired off the Epyc X86 server processors concurrent with the International Supercomputing 2017 event in Frankfurt in June, though not at ISC17 itself but at its own event in its Austin development labs. The Vega 10 GPUs, the first in a line of chips that it is pitting against Nvidia’s “Pascal” and “Volta” motors, were previewed in the Radeon Instinct Frontier Edition shortly thereafter and formally launched at the SIGGRAPH computer graphics show a week ago along with the commercial Radeon RX graphics cards using the same Vega 10 chips.
While the initial Vega 10 GPUs used in the Radeon Instinct GPU accelerators are going to raise a lot of eyebrows and definitely get some market traction in areas where lower-precision math is preferred, they lack the double precision math that many HPC and some AI workloads require. With subsequent Vega processors, AMD is expected to boost the performance of the Vega GPUs on double precision math, starting with the Vega 20 in 2018 if the rumors are right. Nvidia had similar issues with its “Maxwell” GPUs two generations ago, which had excellent single precision performance but which were never delivered in a version that had high double precision capabilities.
Still, AMD is able to put together a hybrid CPU-GPU compute complex with considerable capacity and technical capability – finally! – and is showing off a bit with ODM supplier Inventec and its AMAX division, which sells gear to end users as opposed to contract manufacturing. The Project 47 demonstration cluster that AMD built in conjunction with Inventec/AMAX is a showcase of sorts for AMD’s aspirations in the HPC and AI markets and for its contention that a lot of two-socket Intel Xeon servers fitted with Nvidia GPUs would be better replaced with less expensive and more balanced single-socket Epyc servers with Radeon GPUs.
It has been a long time since a petaflops of compute capacity seemed like a lot for a system, but it is still a lot of number-crunching capacity to put into a single rack. And that is precisely what the Project 47 Epyc-Radeon cluster does. Now, to be sure, when most people talk about HPC workloads, they gauge performance in double-precision floating point math, so you have to be careful comparing the Project 47 cluster to any of the machines on the Top 500 rankings of supercomputers, the last of which came out as the Naples Epyc chips were being launched. Even so, a petaflops of 32-bit math in a rack is a lot of compute.
Before we get into the feeds and speeds of the cluster, which will be available in the fourth quarter from AMAX, we need to unveil the specs of the Vega 10 GPU and the Radeon Instinct cards that use it. Some of the characteristics of these cards were unveiled back in July, but now they are out and we can make better comparisons to the Intel CPU and Nvidia GPU components.
The Vega GPU has 4,096 stream processors organized into 64 compute units, and delivers 484 GB/sec of bandwidth into and out of the double-stacked HBM2 memory packed on the interposer for the GPU. (Samsung is the supplier of the HBM2 memory subsystem.) The Vega GPUs used in the top-end Radeon Instinct MI25 card have 16 GB of HBM2 memory on the interposer, and also have dedicated NVM-Express ports that will allow for 2 TB of flash capacity to be directly attached to the GPU accelerator to extend its memory as a kind of fast cache for that GPU memory. That HBM2 memory is four times as much as was crammed onto the prior Radeon Fury X GPU accelerators based on the “Fiji” GPUs and is twice as much as the Radeon RX graphics cards that were announced at SIGGRAPH for gamers and workstation users. Just for fun, and for comparison, here are the specs on these Radeon RX cards, which we think will find use in HPC and AI as well as in gaming because some organizations need flops more than they need memory capacity – or at least they think they do.
The Radeon Instinct MI25 cards using the Vega GPU have support for half-precision FP16 floating point math, and deliver 24.6 teraflops of FP16 and 12.3 teraflops at FP32; double precision FP64 math units crank through 1/16th the level of FP32 oomph with the initial Vega 10 chip, which works out to a mere 768 gigaflops and that is not much performance at all these days. (The future Vega 20 is expected to do a proper 1/2 ratio for FP64 compared to FP32, and that will make it a very zippy device.) The clock speeds for the Radeon Instinct MI25 or Frontier Edition, which runs at a slightly peppier 26.2 teraflops at FP16 and 13.1 teraflops at FP32, were not divulged, but we can guess them based on the Radeon RX table above. We estimate the peak clocks on the MI25 to be 1,502 MHz and on the water-cooled Frontier Edition to be 1,604 MHz. The MI25 and Frontier Edition Radeon Instinct cards both eat two PCI-Express slots in the system and consume 300 watts of juice with air cooling; the Frontier Edition has a 375 watt water-cooled version that presumably can run more overclocked and deliver more flops. There is not a water-cooled variant of the MI25, but there should be.
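The relationship between the estimated clock and the quoted teraflops can be checked with a little arithmetic. The sketch below assumes, as is conventional for GPU peak ratings but not stated above, that each stream processor retires one fused multiply-add (two flops) per clock at FP32, that packed FP16 doubles that rate, and that FP64 runs at 1/16th of it; the function and constant names are ours.

```python
# Back-of-envelope peak throughput for the Radeon Instinct MI25.
# Assumption (not from AMD): 2 flops per stream processor per clock (one FMA).
STREAM_PROCESSORS = 4096          # per Vega 10 GPU, from the article
FLOPS_PER_SP_PER_CLOCK = 2        # assumed: one fused multiply-add
MI25_CLOCK_HZ = 1.502e9           # the article's estimated peak clock


def peak_flops(clock_hz, stream_processors=STREAM_PROCESSORS):
    """Theoretical FP32 peak in flops for a given clock speed."""
    return stream_processors * FLOPS_PER_SP_PER_CLOCK * clock_hz


mi25_fp32 = peak_flops(MI25_CLOCK_HZ)   # single precision
mi25_fp16 = mi25_fp32 * 2               # packed half precision, twice FP32
mi25_fp64 = mi25_fp32 / 16              # Vega 10 runs FP64 at 1/16th of FP32

print(f"FP16: {mi25_fp16 / 1e12:.1f} teraflops")
print(f"FP32: {mi25_fp32 / 1e12:.1f} teraflops")
print(f"FP64: {mi25_fp64 / 1e9:.0f} gigaflops")
```

Under those assumptions the 1,502 MHz estimate lands on the 12.3 teraflops and 24.6 teraflops ratings, and the FP64 figure comes out within a gigaflops or so of the 768 gigaflops cited above.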
The HBM2 memory in the MI25 card runs at 945 MHz and has a 2,048-bit interface to deliver that 484 GB/sec of bandwidth. This is lower than the expected 512 GB/sec the card was supposed to deliver, but all vendors using HBM2 memory – including Nvidia – have been surprised that the speed and bandwidth is not as high as it was planned to be. The MI25 has Error-Correcting Code (ECC) on the memory, as does the Frontier Edition, and both have the NVM-Express ports for directly attaching flash memory and maybe someday Micron’s 3D XPoint or some other persistent memory. These features – bigger HBM2 memory, ECC scrubbing, and direct-attached flash – are what make them enterprise products. The Radeon RX cards do not have these features, which is why they are going to be less expensive. And that lower price is why some organizations are going to use them for enterprise workloads even if they don’t have all the bells and whistles.
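To see where the 484 GB/sec and 512 GB/sec figures come from, here is a minimal sketch of the bandwidth arithmetic. It assumes the memory transfers data twice per clock (double data rate), which is our assumption and not something AMD spelled out; the function name is hypothetical.

```python
# HBM2 bandwidth from memory clock and interface width.
# Assumption: double data rate, i.e. two transfers per clock per pin.
def hbm2_bandwidth_gb_per_sec(clock_mhz, bus_bits=2048, transfers_per_clock=2):
    """Peak memory bandwidth in GB/sec (decimal gigabytes)."""
    bits_per_sec = clock_mhz * 1e6 * transfers_per_clock * bus_bits
    return bits_per_sec / 8 / 1e9


delivered = hbm2_bandwidth_gb_per_sec(945)    # the MI25 as shipped
planned = hbm2_bandwidth_gb_per_sec(1000)     # clock implied by a 512 GB/sec target

print(f"At 945 MHz:   {delivered:.1f} GB/sec")
print(f"At 1,000 MHz: {planned:.1f} GB/sec")
```

Read this way, the shortfall against the expected 512 GB/sec is simply the gap between a 945 MHz memory clock and a round 1 GHz.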
For the CPU side of the Project 47 cluster, AMD and Inventec chose the top bin Epyc 7601 processor, which has 32 cores running at 2.2 GHz. (You can see the feeds and speeds of the Epyc 7000 series chips here.) The Project 47 nodes were configured with only one processor, and instead of needing PCI-Express switches or NVLink interconnect to fan out to the Radeon Instinct GPUs, these just hooked into the system using PCI-Express 3.0 ports. The single-socket Epyc server has 128 PCI-Express lanes, and each Radeon Instinct MI25 card eats two ports of 16 lanes each, so there is no way to link more than four GPUs to the system because that is 128 lanes. The server nodes were configured with 512 GB of DDR4 memory from Samsung, running at 2.67 GHz, and because there were no spare PCI-Express ports, we presume that the flash SSD drives used in the machines – the capacities were not given – were linked to the GPUs, not the CPUs, and that the storage for the operating system and system software hung off of SATA ports on the Epyc server motherboards. This is what we would do. Let’s assume that each Project 47 node had 2 TB of flash on each Radeon Instinct card, because this is what is possible according to AMD.
Across those 20 nodes, then, the machine has a total of 640 “Zen” cores humming along at 2.2 GHz. With four Radeon Instinct cards per server, that is 80 cards for a total of 5,120 compute units with a total of 327,680 stream processors. Add all of that GPU compute capacity up, and at 12.3 teraflops at single precision, you actually get 984 teraflops peak, not 1 petaflops, but the Zen cores can do some floating point work, too. The Zen core has two floating point units that are 128 bits wide each and that create four pipes that can do two fused adds and two fused multiplies per cycle at 64-bit double precision. So call it eight floating point operations per clock at single precision peak, and at the 2.7 GHz turbo boost speed, that is 21.6 gigaflops per core; across 640 cores in the Project 47 cluster that works out to 13.8 teraflops peak at FP32. So the cluster, in theory, can do 997.8 teraflops. That is pretty damned close to a petaflops, so we will spot Lisa Su, AMD’s CEO during the company’s transformation, that one on the 1 petaflops claim.
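The single precision tally is easy to reproduce. The sketch below uses only the figures given in this article (20 nodes, four MI25 cards and one 32-core Epyc 7601 per node, 12.3 teraflops per card, eight FP32 operations per clock per Zen core at the 2.7 GHz turbo speed); the variable names are ours.

```python
# Peak FP32 tally for the Project 47 rack, using the article's figures.
NODES = 20
CARDS_PER_NODE = 4
CORES_PER_NODE = 32               # one Epyc 7601 per node
COMPUTE_UNITS_PER_GPU = 64
STREAM_PROCESSORS_PER_GPU = 4096
MI25_FP32_TFLOPS = 12.3
ZEN_FP32_FLOPS_PER_CLOCK = 8      # the article's single precision estimate
ZEN_TURBO_GHZ = 2.7

cards = NODES * CARDS_PER_NODE
cores = NODES * CORES_PER_NODE
compute_units = cards * COMPUTE_UNITS_PER_GPU
stream_processors = cards * STREAM_PROCESSORS_PER_GPU

gpu_tflops = cards * MI25_FP32_TFLOPS
cpu_gflops_per_core = ZEN_FP32_FLOPS_PER_CLOCK * ZEN_TURBO_GHZ
cpu_tflops = cores * cpu_gflops_per_core / 1000
total_tflops = gpu_tflops + cpu_tflops

print(f"{cards} cards, {compute_units:,} compute units, "
      f"{stream_processors:,} stream processors, {cores} cores")
print(f"GPU: {gpu_tflops:.1f} TF, CPU: {cpu_tflops:.1f} TF, "
      f"total: {total_tflops:.1f} TF")
```

The GPUs account for nearly 99 percent of the total, which is why the CPU contribution only nudges the rack toward the petaflops mark.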
That said, the performance on double precision is pretty thin: About 68.4 teraflops. So this is not going to be a very good machine for FP64 work. But it was not designed for that work, and it is very good at what AMD intended it to be good at, which is for remote virtual workstations, render farms, HPC simulations where single precision works (life sciences, seismic analysis, and signal processing are key workloads here), and half precision machine learning training. There is also the possibility of doing 8-bit operations in the new Vega cores, which may prove useful in machine learning and other workloads.
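The double precision figure follows from the same kind of tally. This sketch assumes the GPUs run FP64 at 1/16th of their 12.3 teraflops FP32 rate, and that each Zen core does half as many FP64 operations per clock as FP32 (four rather than eight); the second assumption is our reading of the core description, not an AMD specification.

```python
# Peak FP64 tally for the Project 47 rack.
# Assumptions: GPU FP64 = FP32 / 16; Zen FP64 = 4 flops per clock per core.
CARDS = 80
CORES = 640
MI25_FP32_TFLOPS = 12.3
GPU_FP64_RATIO = 1 / 16
ZEN_FP64_FLOPS_PER_CLOCK = 4      # assumed: half the FP32 rate of 8
ZEN_TURBO_GHZ = 2.7

gpu_fp64_tflops = CARDS * MI25_FP32_TFLOPS * GPU_FP64_RATIO
cpu_fp64_tflops = CORES * ZEN_FP64_FLOPS_PER_CLOCK * ZEN_TURBO_GHZ / 1000
total_fp64_tflops = gpu_fp64_tflops + cpu_fp64_tflops

print(f"GPU: {gpu_fp64_tflops:.1f} TF, CPU: {cpu_fp64_tflops:.1f} TF, "
      f"total: {total_fp64_tflops:.1f} TF")
```

At double precision the CPUs contribute about a tenth of the rack's throughput, a far larger share than at single precision.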
All of the nodes were equipped with 100 Gb/sec ConnectX-5 InfiniBand adapters, and the nodes were linked to one another in a cluster using a 100 Gb/sec Switch-IB InfiniBand switch, both of which come from Mellanox Technologies.
Add it all up and the Project 47 system delivers 30 gigaflops per watt of power efficiency, according to Su, and if you work that backwards, that means the Epyc-Radeon Instinct rack built by Inventec burned about 33.3 kilowatts of juice. That is about as much as you can expect an air-cooled system to cram into a rack and not melt.
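Working the efficiency claim backwards looks like this. The sanity check against the 300 watt rating of the 80 Radeon Instinct cards is our own addition, and the leftover budget per node is only a rough indication, not a measured figure.

```python
# Rack power implied by the 30 gigaflops per watt claim.
CLAIMED_PEAK_FLOPS = 1e15         # the 1 petaflops claim
CLAIMED_FLOPS_PER_WATT = 30e9     # 30 gigaflops per watt, per Su
CARDS = 80
CARD_WATTS = 300                  # air-cooled MI25 rating cited in the article
NODES = 20

rack_watts = CLAIMED_PEAK_FLOPS / CLAIMED_FLOPS_PER_WATT
gpu_watts = CARDS * CARD_WATTS
remaining_watts = rack_watts - gpu_watts      # CPUs, memory, flash, network, fans
remaining_per_node = remaining_watts / NODES

print(f"Rack power:      {rack_watts / 1000:.1f} kW")
print(f"GPU share:       {gpu_watts / 1000:.1f} kW")
print(f"Left per node:   {remaining_per_node:.0f} W")
```

The GPUs alone would account for 24 kilowatts at their rated draw, leaving a few hundred watts per node for everything else, so the implied rack power is at least plausible.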
For a frame of reference, running double precision Linpack, the most power efficient machine on the planet is the Tsubame 3.0 system at the Tokyo Institute of Technology, which is a cluster of Intel “Broadwell” Xeon E5 processors and Nvidia “Pascal” Tesla P100 coprocessors networked with Intel’s 100 Gb/sec Omni-Path interconnect. This machine encapsulates 3.2 petaflops of peak theoretical performance and burns 142 kilowatts of juice, so its comparable peak efficiency number – the one most like the calculation that Su did – is 22.6 gigaflops per watt. But you have to be careful of these peak numbers since they are, in fact, theoretical. When the Linpack parallel Fortran test was run on Tsubame 3.0, it was able to do just under 2 petaflops at double precision, and its actual power efficiency was 14.1 gigaflops per watt. At single precision, Tsubame 3.0 would in theory be able to deliver 45.2 gigaflops per watt, which is 50 percent better than the Project 47 cluster, but the bill of materials is smaller for the AMD-based system and therefore the cost should also be smaller. Our math suggests that it had better be somewhere around 35 percent cheaper to build a rack of CPU-GPU hybrids to at least be on par, from a dollars per watt per flops perspective, and it will have to be cheaper still to win deals against Xeon-Tesla hybrids. We happen to think Epyc and Radeon Instinct will be a lot cheaper at street prices, and that server makers using these parts will win deals.
At the SIGGRAPH event, AMD showed off the Project 47 cluster being used as a virtual workstation cloud, and with each Radeon Instinct GPU and each Epyc CPU supporting full hardware virtualization, each node could support 16 users for a total of 1,280 simultaneous sessions. Granted, that was only two CPU cores and four GPU stream processors per user, which is not a lot of oomph. More interesting was Koduri’s demo that showed all of the 20 nodes being coupled together as a giant rendering workstation, and the only jitter on the screen of the rendering of a motorcycle was that which came from the network, not from the simulation. All designers will want one of the Project 47 machines as their workstations. . . .
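The comparison can be sketched from the efficiency figures quoted above. This is one reading of the math, assuming that Tsubame 3.0's single precision peak is twice its double precision peak and that price parity means equal cost per unit of flops per watt; the variable names are ours.

```python
# Efficiency gap and break-even discount, from the article's figures.
TSUBAME_FP64_PEAK_GF_PER_WATT = 22.6
FP32_TO_FP64_RATIO = 2            # assumed: single precision peak is double FP64
PROJECT47_FP32_GF_PER_WATT = 30.0

tsubame_fp32_gf_per_watt = TSUBAME_FP64_PEAK_GF_PER_WATT * FP32_TO_FP64_RATIO

# How much better Tsubame 3.0 is per watt at single precision.
tsubame_advantage = tsubame_fp32_gf_per_watt / PROJECT47_FP32_GF_PER_WATT - 1

# How much cheaper the AMD rack must be for equal cost per (flops per watt).
break_even_discount = 1 - PROJECT47_FP32_GF_PER_WATT / tsubame_fp32_gf_per_watt

print(f"Tsubame 3.0 FP32 peak: {tsubame_fp32_gf_per_watt:.1f} GF/W")
print(f"Advantage over Project 47: {tsubame_advantage:.0%}")
print(f"Break-even discount: {break_even_discount:.0%}")
```

On these assumptions the break-even discount comes out at roughly a third, in line with the figure of around 35 percent given above.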
AMD did not show off benchmarks for HPC or AI workloads, but obviously, if the price is right and the flops are precise enough, AMD and its server partners have a chance to win some deals. And next year, it will get even easier as new chippery comes into the field from AMD.
As we have previously pointed out, the “Navi” Vega 20 kicker GPU is expected to be shrunk to a 7 nanometer process and offer double precision at half the single precision rate when it comes to market in 2018. The Vega 20 is anticipated to have the same 4,096 stream processors and 64 compute units, and the shrink will increase performance by around 20 percent or so, which means single precision will be around 15.7 teraflops and double precision will be around 7.9 teraflops. AMD could shift to four stacks of HBM2 memory for a total capacity of 32 GB, and bandwidth could double or more, depending on a lot of issues with regard to memory speed, power draw, and heat dissipation. The Vega 20 cards are also expected to support PCI-Express 4.0 links, which have twice the bandwidth of PCI-Express 3.0 and will rival the NVLink 2.0 ports that Nvidia has on its “Volta” V100 GPU accelerators in speed.
That also almost certainly means that the future “Rome” Epyc processors, also due next year, will sport PCI-Express 4.0 controllers as well as microarchitecture improvements and a shrink to 7 nanometer processes. So the CPU compute might rise by another 20 percent to 25 percent, perhaps with a mix of IPC improvements, more cores, and higher clocks. And therefore a Project 47 machine using the Rome Epyc and Navi Vega 20 GPUs would deliver around 1.25 petaflops peak at single precision and about 625 teraflops peak at double precision, and be suitable for all HPC and AI workloads to boot.
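These projections can be roughed out as follows. The sketch assumes the 20 percent uplift is applied to the 13.1 teraflops Frontier Edition rating (our inference, since that is the base that yields 15.7 teraflops) and counts only the 80 GPUs, ignoring the CPU contribution; all figures are speculative, as the projections themselves are.

```python
# Rough projection for a Project 47 rack built on Vega 20 GPUs.
# Assumptions: 20% uplift over a 13.1 TF base, FP64 at half the FP32 rate,
# 80 cards per rack, CPU flops ignored.
BASE_FP32_TFLOPS = 13.1
SHRINK_UPLIFT = 1.20
FP64_RATIO = 0.5
CARDS = 80

vega20_fp32_tflops = BASE_FP32_TFLOPS * SHRINK_UPLIFT
vega20_fp64_tflops = vega20_fp32_tflops * FP64_RATIO

rack_fp32_pflops = CARDS * vega20_fp32_tflops / 1000
rack_fp64_tflops = CARDS * vega20_fp64_tflops

print(f"Per card: {vega20_fp32_tflops:.1f} TF FP32, "
      f"{vega20_fp64_tflops:.1f} TF FP64")
print(f"Per rack: {rack_fp32_pflops:.2f} PF FP32, "
      f"{rack_fp64_tflops:.0f} TF FP64")
```

The GPU-only totals land within about one percent of the 1.25 petaflops and 625 teraflops estimates, and a faster Rome Epyc would add a little more on top.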
These kinds of numbers are probably why AMD has been selected as a partner for one of three pre-exascale machines being developed by the Chinese government. That system is being built by Chinese system maker Sugon under the auspices of the Tianjin Haiguang Advanced Technology Investment Co, which is itself an investment consortium that is guided by the Chinese Academy of Sciences.