Cray’s Ever-Expanding Compute For HPC

With choice comes complexity, and the Cambrian explosion in compute options is only going to make this harder even if it is a much more satisfying intellectual and financial challenge. This added complexity is worth it because companies will be able to more closely align the hardware to the applications. This is why search engine giant Google has been driving compute diversity and why supercomputer maker Cray has been looking forward to it as well.

This expansion of the compute ecosystem is also necessary because big jumps in raw compute performance for general purpose processors are no longer possible as they were in past decades, thanks to the slowing of Moore’s Law advances in chip manufacturing, something that we talked about with Cray chief technology officer Steve Scott last summer. This was when Intel was putting the finishing touches on the “Skylake” Xeon SPs, AMD was readying its “Naples” Epyc 7000s and a return to the server arena, IBM was prepping its “Cumulus” and “Nimbus” Power9s, Qualcomm was gearing up its “Amberwing” Centriq 2400, and Cavium was preparing two lines of ThunderX2 processors. The Xeons and Epycs support the X86 instruction set, while the Centriq and ThunderX2 lines use the Arm instruction set.

Cray bet very heavily on the “Hammer” Opteron lines of X86-compatible server chips in the early 2000s, and benefitted from the architectural advantages that the Opterons had over the Xeons of the time – so much so that Cray integrated its supercomputer interconnects right into the HyperTransport bus of the Opterons. But Cray was eventually also burned by bugs in and delays of the final few generations of Opterons, and that hurt its sales of supercomputers for quite some time. So it has been understandably cautious about embracing the revitalized server chip lineup from AMD, but no more so than other server makers who also originally embraced Opterons and have been cautiously optimistic about and precisely opportunistic with the Epyc processors. This is not at all surprising, and neither is the practice of creating specific machines for specific use cases, using whatever processor or accelerator makes sense and closes the deal. This is, for instance, precisely what Dell has done with the three Epyc systems it is selling alongside dozens of different Xeon machines.

The Epyc 7000 series processors have been out in the market since June of last year, and after kicking the tires and soliciting feedback from current and prospective customers, Cray is now comfortable enough to add the Epyc chips as options in its machines. To start, the Epyc chips are being provided in two different server nodes that are part of its CS line of general purpose clusters that are based on InfiniBand, Omni-Path, or Ethernet interconnects, but not part of the higher end XC series machines that sport its own “Aries” interconnect.

The details of the new Epyc-based CS500 machines, which will start shipping later this summer, are a little thin, but here is what we learned from Chris Lindahl, director of product management at Cray. The first system in the CS500 line to sport the Epyc chips is a hyperscale-class system that crams four half-width compute nodes into a 2U enclosure. The machines have two Epyc processors per node, and Cray is being agnostic as to which Epyc chips go into the machines. But Lindahl tells The Next Platform that there is broad interest in two particular processors among its HPC customers. The first is the Epyc 7601, which has 32 cores running at 2.2 GHz with a 180 watt thermal envelope. The second is the Epyc 7501, which also has 32 cores but runs at a lower clock speed of 2 GHz, with a thermal envelope of 170 watts when the machines use 2.67 GHz DDR4 memory or 155 watts when slower 2.4 GHz DDR4 memory is used. Significantly, the Epyc 7601 has a list price of $4,200 each when bought in 1,000-unit trays, and the Epyc 7501 costs $1,850 a pop.
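To put that price gap in perspective, here is a rough back-of-the-envelope sketch in Python that works out list price per core and per core-GHz from the figures quoted above. The `Chip` class and its method names are our own illustrative inventions, and aggregate clock (cores times GHz) is only a crude proxy for throughput, not a benchmark result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chip:
    """Hypothetical container for the per-chip figures quoted in the article."""
    name: str
    cores: int
    clock_ghz: float
    list_price_usd: float  # list price each, in 1,000-unit trays

    def price_per_core(self) -> float:
        # Dollars of list price for each core on the die.
        return self.list_price_usd / self.cores

    def price_per_core_ghz(self) -> float:
        # Dollars per unit of aggregate clock (cores x GHz).
        # Assumption: this is a crude stand-in for throughput, nothing more.
        return self.list_price_usd / (self.cores * self.clock_ghz)


EPYC_7601 = Chip("Epyc 7601", cores=32, clock_ghz=2.2, list_price_usd=4200.0)
EPYC_7501 = Chip("Epyc 7501", cores=32, clock_ghz=2.0, list_price_usd=1850.0)

if __name__ == "__main__":
    for chip in (EPYC_7601, EPYC_7501):
        print(
            f"{chip.name}: ${chip.price_per_core():.2f} per core, "
            f"${chip.price_per_core_ghz():.2f} per core-GHz"
        )
```

By this simple measure, the Epyc 7601 works out to about $131 per core against about $58 per core for the Epyc 7501, so the extra 200 MHz comes at more than twice the list price per core. Street pricing and real application performance may well tell a different story.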

The CS500 sleds with AMD processors for this density-optimized Epyc machine have eight memory channels, which is more than enough to support the 128 GB or 256 GB of capacity per node that is typical among HPC customers using either 16 GB or 32 GB memory sticks. This is important for two reasons. First, denser memory sticks are far more expensive than skinnier ones, and second, getting the full memory bandwidth out of a node requires all of the memory slots to be populated. Each node has two PCI-Express 3.0 x16 slots, which is enough to drive two 100 Gb/sec Ethernet or InfiniBand ports; there are multiple options for flash and disk storage, and we are guessing that there are two slots per node. There is no room – either physically or thermally – on these AMD sleds to add GPU, FPGA, or DSP accelerators.
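The bandwidth claims in that paragraph can be sanity checked with a little arithmetic, sketched here in Python. The PCI-Express 3.0 constants (8 GT/sec per lane, 128b/130b encoding) are from the published spec. The memory side rests on our own assumptions: that the eight channels are counted per socket, that each channel is 64 bits (8 bytes) wide, and that the "2.67 GHz" and "2.4 GHz" memory speeds correspond to 2,666 MT/sec and 2,400 MT/sec transfer rates. The function names are illustrative, and all results are theoretical peaks, not measured numbers.

```python
PCIE3_GT_PER_SEC = 8.0        # PCI-Express 3.0 raw rate per lane, gigatransfers/sec
PCIE3_ENCODING = 128 / 130    # 128b/130b line encoding overhead


def pcie3_gbits_per_sec(lanes: int) -> float:
    """Peak usable bandwidth of a PCI-Express 3.0 link, per direction, in Gb/sec."""
    return PCIE3_GT_PER_SEC * PCIE3_ENCODING * lanes


def ddr4_peak_gbytes_per_sec(
    channels: int, mega_transfers_per_sec: float, bytes_per_transfer: int = 8
) -> float:
    """Theoretical peak DDR4 bandwidth in GB/sec with every channel populated.

    Assumption: 64-bit (8-byte) channels, one transfer per channel per tick.
    """
    return channels * mega_transfers_per_sec * bytes_per_transfer / 1000.0


if __name__ == "__main__":
    slot = pcie3_gbits_per_sec(16)
    print(f"PCIe 3.0 x16 slot: {slot:.1f} Gb/sec per direction")
    print(f"Headroom over a 100 Gb/sec port: {slot - 100.0:.1f} Gb/sec")
    for rate in (2666, 2400):
        peak = ddr4_peak_gbytes_per_sec(channels=8, mega_transfers_per_sec=rate)
        print(f"8 channels at {rate} MT/sec: {peak:.1f} GB/sec peak")
```

An x16 slot comes out at roughly 126 Gb/sec in each direction, which is why one slot per 100 Gb/sec port is sufficient. The memory figure scales linearly with the number of populated channels, which is the arithmetic behind filling every slot with smaller sticks rather than half of them with bigger ones.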

In addition to this dense compute machine, there is a more traditional 2U rack variant of the CS500 that uses Epyc processors. It has a single two-socket motherboard and is intended as a fat memory compute node or for visualization workloads with a few GPU cards thrown in. This machine is also intended to be used as a head node for managing the CS500 cluster.

Like Dell, Cray is seeing demand for AMD Epyc processors from HPC shops where memory bandwidth is a key deciding factor in the performance of applications, and, again like Dell, the big pull is among organizations that are running computational fluid dynamics. “With an X86 architecture and so much memory bandwidth, this has broad applications, including other areas in manufacturing as well as those who use third party codes and will therefore be able to take advantage of that bandwidth without having to recompile their own code,” explains Lindahl. The implication is that the software suppliers have already done their ports from Xeons to Epycs.

Both machines run the full Cray programming environment, which includes Red Hat Enterprise Linux or its CentOS clone as the foundation, with Bright Cluster Manager babysitting the nodes, a variety of file systems for storage (Lustre, Spectrum/GPFS, NFS, or local file systems such as ext3, ext4, and XFS) as well as a slew of compilers, debuggers, and communication libraries.

The CS500 family of clusters already supports Intel’s “Knights Landing” Xeon Phi processors and has options for adding Nvidia’s Tesla K40 GPU and Nallatech FPGA accelerators. At the moment, Cray is not offering a cluster that combines the CPU compute of AMD’s Epyc 7000s with the GPU compute of AMD’s Radeon Instinct cards, which would be an interesting option for areas that need only single precision floating point math, such as life sciences or seismic processing. (The Radeon Instincts have pretty poor double precision math, which is something that AMD needs to fix if it wants to take on HPC. Alternatively, it can focus on adding half precision or something like Nvidia’s Tensor Core units and just wait for the HPC community to port all of this code to half precision. We are only half kidding here.)

What would be truly interesting, and what might make even better use of that memory and I/O bandwidth that the Epyc chips deliver, is putting these chips inside of the XC series line. Lindahl did not have anything to say about this, but it is an obvious thing to do with the current XC50 machines. We suspect it will happen if enough HPC shops ask for it. Cray has, after all, already added the Cavium ThunderX2 processors, which are not even shipping yet, to the XC50. The demand pull for these ThunderX2 chips is similar in that customers want lots of memory bandwidth but also want to invest in the Arm architecture, which is more open by some measures than the X86 world. For other XC shops, maintaining X86 compatibility while boosting compute capacity and memory bandwidth will be more important, and thus Cray should also make Epyc chips an option for the XC line. A lot of homegrown and third party software has been tweaked and tuned to run on X86 chips, and porting to Arm is not going to be practical unless there is a huge advantage either for performance or economics – or both.

It is hard to say why Cray is not supporting the Epyc chips in the XC line at the same time it is delivering them in the CS line, particularly since Lindahl isn’t talking much about future XC products. “We are constantly evaluating technologies across all of our product lines, and want to bring forward the best in breed for all applications and use cases that we can find,” says Lindahl.

So, in other words, if you want an Epyc-based XC cluster, pay for one and Cray will very likely be happy to build it. And if enough of you ask for one, then it will become a standard product. We do think that by the time the future “Shasta” machines are launched, and provided AMD keeps to its Epyc roadmap, there is a fair chance that the Xeon, Epyc, and ThunderX2 processors will be peers in the system, interchangeable and available as customers see fit.

It would be very interesting to see a bake-off on actual HPC workloads running across all three chips with the CS and XC iron to show off all of the differences in compute, memory bandwidth, and interconnect, and then add in the pricing differences to show what architectures are the best fit for specific workloads and budgets.

ItsDualGPUDiesOnASinglePCIeCardTimeForRed says:

April 19, 2018 at 8:10 pm

Vega 20 will offer a 1/2 DP to SP ratio, so that’s incoming for the Pro/HPC GPU compute/AI markets at 7nm. What is still up in the air for Vega 20 is whether AMD will be increasing the shader core count or doing a dual die on a single PCIe card variant. Also unknown for any dual-die single PCIe card SKU is how Vega’s use of the Infinity Fabric IP will factor in, as previous dual-die/single PCIe card accelerator products from AMD relied on PCIe/XDMA for inter-GPU die communication on that single PCIe card. Infinity Fabric on a dual GPU die accelerator SKU would imply a greater level of GPU die to GPU die cache coherency, with the dual GPU dies, via the IF, appearing to the software/drivers as more of a single large logical GPU than is possible via PCIe. So will AMD choose to make use of the Infinity Fabric for inter-GPU die cache coherency on any possible Vega 20 dual GPU die variant rather than PCIe/XDMA?

Either way, Volta/GV100 (base die) tops out at a total of 5376 FP32 cores, 5376 INT32 cores, 2688 FP64 cores, 672 Tensor Cores, and 336 texture units, but the full shader complement of GV100 is not utilized on the Titan V or Quadro variants. The Vega 10 base die tape-out that is used for the Radeon Instinct MI25s and Radeon Pro WX 9100 only offers up 4096 shader cores, so Vega 20 will either have to offer more shader cores or maybe AMD will just double up the GPU dies on a single PCIe card to get a larger number of shader cores offered.

GV100 has a 1/2 DP FP to SP FP ratio and those Tensor Cores, so AMD really has to think about maybe a dual GPU on a single PCIe variant for Vega 20 in order to get more FP16 (Rapid Packed Math) Tflops to compete against the full GV100 and its extra complement of 672 Tensor Cores and 5376 CUDA/shader cores for AI/compute workloads.

I really wish that the online information portals like Wikipedia and others would focus on listing the proper shader core counts for Nvidia’s GV100 and earlier base die tape-outs instead of just listing the lesser shader core counts of the derived variants. Nvidia tapes out its GV100 and earlier base dies with an excess of CUDA/shader cores in order to increase the binning percentages, so the variants derived from the base die tape-out usually offer fewer available CUDA/shader cores.

Nvidia still tapes out its usual five base dies per GPU micro-arch generation, for example with its Pascal GPU micro-arch (GP100, GP102, GP104, GP106, GP108), compared to AMD, which at the time of Vega’s initial release had only one base die (Vega 10) that it has to make use of across all its Pro and consumer flagship gaming SKUs. And it takes millions (US) to create a single GPU base die tape-out, so Nvidia has a finer grained advantage across five different base die tape-outs compared to AMD, which could at the time only afford that one Vega 10 base die tape-out.
