Having the best compute engine – meaning the highest performance at a sustainable price/performance – is not enough to guarantee that it will be adopted in HPC, hyperscale, or enterprise settings.
It is arguable, for instance, that AMD has from time to time held the lead over rival Nvidia in raw GPU performance in the years since AMD acquired ATI Technologies almost a decade ago. But thanks in large part to the substantial investments that Nvidia has made in its CUDA programming environment and to its early adoption of error correction and network acceleration functions inside its GPU coprocessors, Nvidia’s Tesla line is by far the preferred GPU accelerator.
AMD has wanted to change this for years, and has been working on hybrid CPU-GPU chips, which it calls Accelerated Processing Units, or APUs, that marry the two compute engines. At this point, support for HSA, AMD’s Heterogeneous System Architecture programming framework, is pretty much limited to the “Kaveri” and “Carrizo” client chips, and while a compelling case could be made for using such hybrids for scientific simulations and models, the performance of their CPU cores is not sufficient for many workloads, so they have not taken off. This is despite the fact that the latter Carrizo APU supports a unified and virtualized memory architecture across the two types of compute engines, something that is not yet possible when X86, ARM, or Power processors link to Nvidia Tesla motors.
As we have discussed before, AMD has put together a much more conservative roadmap for its Opteron and ARM processors than the one it originally laid out last year, and is putting all of its efforts into getting its future Opterons based on the “Zen” cores to market, perhaps in late 2016 with volume shipments ramping into 2017 if the scuttlebutt is correct. After that come the ARM chips using the homegrown “K12” cores that AMD has been working on for several years.
What we have been waiting for is for AMD to pair a hefty CPU with a hefty GPU to create a large compute element, one that supports AMD’s Heterogeneous System Architecture (HSA) and that also provides a coherent link between the two, as is the case with APUs such as the Carrizo chip that put the CPU and GPU on the same die.
At the SC15 supercomputing conference in Austin last week, AMD was demonstrating the combination of “Haswell” Xeon E5 processors from Intel paired with its Radeon R9 Fury GPU cards, which are based on its “Fiji” GPUs. This card is the first to use high bandwidth memory (HBM), and is an example of the kind of memory architecture that we can expect across all manner of compute engines going forward. The card that AMD was showing off in a hybrid configuration was based on a Fiji Nano GPU running at 1 GHz that could deliver 8.19 teraflops of floating point performance at single precision and 512 gigaflops at double precision; it has 4 GB of on-package HBM. This is a screaming single precision device, and it is not even the fastest card AMD has put into the field.
At the moment, AMD does not offer a FirePro compute accelerator variant of the Fiji GPU, which is needed for hybrid CPU-GPU computing in the enterprise, but one is coming. Using the desktop part paired with a Haswell Xeon was simply a demonstration that the new programming tools and the peer-to-peer RDMA features that AMD has developed work over the PCI-Express 3.0 bus.
The links between the CPU and the GPU do not offer memory coherency, as is possible in the true single-chip APUs, and we have always thought that this is a problem for AMD’s datacenter strategy. Rather than manually shuttling data between the two devices, what is really needed is coherency, such as that provided on the Power8 chip from IBM through its Coherent Accelerator Processor Interface (CAPI) and through NVLink when it is added to the Power8+ chip next year. And AMD conceded as much to The Next Platform.
We asked Ben Sander, HSA software architect at AMD, about providing such coherency between a beefy CPU and a beefy GPU – something with a lot more oomph than Carrizo and perhaps even more than a rumored hybrid Opteron with sixteen Zen cores and an integrated GPU on the die and 16 GB of HBM on the package.
“I think to do it well, you really want a custom interconnect of some kind beyond PCI-Express,” Sander explains. “The software that we have demonstrated does not expose coherency through it. We are still using HSAIL, and C++, and single-source compilation and have taken some significant steps forward, but coherency is not something that we are going to support on a PCI-Express platform.” When pressed on whether such coherency between the compute elements, enabled by some kind of interconnect with higher bandwidth and lower latency than PCI-Express, was in the works, Sander said it was a “desirable feature” and then had this to say: “I think there has been a lot of FUD back and forth between those who have it and those who do not, who have tried to argue that it is not important at all. You can certainly do useful work without coherency. But programmers like it, and there are even performance advantages to it because if you don’t have coherency, you end up doing coarse-grained flushing. Not only is it easier, but there is a performance benefit because if you can maintain all of your state in your caches, you can have much finer and tighter integration between the CPU and the GPU. We have certainly seen examples of programs that are easier to develop and that benefit from the coherency. We are not going to promise to do it or give timelines, but it is a valuable feature.”
IBM would no doubt argue that PCI-Express is a good place to start with accelerator memory coherency, since CAPI does precisely this by running such coherency between the Power8 CPU and accelerators over the PCI-Express 3.0 fabric implemented by the Power8 chip itself. As we learned last week at SC15, IBM has plans to enhance CAPI in a way that makes it less dependent on PCI-Express in the future Power9 chips due in 2017, and it will also be offering enhanced NVLink ports on the Power9 that will offer improved coherency between Power CPUs and Nvidia Tesla GPUs.
The C++ Foundation For Hybrid
The announcements that AMD made at SC15 on the software front lay some of the groundwork to move towards that coherent future we think AMD needs for discrete CPUs and GPUs as well as for APUs that wrap the two together in a single package. This includes a new Linux driver for its GPUs that provides the peer-to-peer RDMA linking of GPU memory across InfiniBand networks, a feature that made Nvidia’s Tesla coprocessors much more efficient when GPUDirect was introduced many years ago.
AMD is also rolling out its Heterogeneous C++ Compiler, or HCC for short, which you can find out a lot more about at this link. HCC is based on the popular Clang/LLVM compiler framework, some HSAIL goodies, and bits of other open source projects, and it mixes them up so it can automatically generate code for both the CPU and the GPU in a hybrid programming environment. (HCC will work for APUs that have the two tightly integrated or for CPU-GPU combinations that use discrete components linked by PCI-Express.) HCC supports C++11, C++14, and some bits of the proposed C++17 standard, Gregory Stoner, senior director of GPU computing solutions at AMD, tells The Next Platform, and it can keep CPU and GPU code inside one source file while compiling and executing each part in its respective CPU or GPU binary. The support for discrete GPUs is what is really new here.
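For flavor, here is a rough sketch of what such single-source code looks like in HCC’s HC dialect. The names used here (`hc::parallel_for_each`, `hc::array_view`, the `[[hc]]` attribute) are recalled from AMD’s HCC documentation rather than verified against it, and the snippet requires the HCC toolchain to build, so treat it as pseudocode rather than a working program:

```cpp
#include <hc.hpp>  // HCC's HC-mode header; requires the HCC toolchain

void scale(float *data, int n, float factor) {
  hc::array_view<float, 1> av(n, data);   // wraps host memory for GPU access
  // The lambda below is compiled for the GPU, the surrounding function for
  // the CPU, yet both live in the same C++ source file.
  hc::parallel_for_each(av.get_extent(), [=](hc::index<1> i) [[hc]] {
    av[i] *= factor;
  });
  av.synchronize();                       // flush results back to host memory
}
```

The point is the absence of a separate kernel file or offload boilerplate: the compiler splits the one source into CPU and GPU binaries itself.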
We were a bit curious as to when AMD might make a Fortran variant of this compiler stack available, given the popularity of that language in the HPC environment.
“We get a lot of questions on Fortran,” says Stoner. “And we are working with a number of vendors for Fortran support for this stack. As you know, Fortran is a very specialized language, and you really have to work with a vendor that has been around for a while because the codebases that we work with are Fortran 66 and Fortran 77 and a little bit of Fortran 90 and a little bit of Fortran 95. With the accounts we are working with, they have less than 2 percent of their code in Fortran.”
This obviously does not refer just to the traditional HPC customers, but to a broader definition that includes simulation, modeling, machine learning, data analytics, and other applications. The Caffe and Torch deep neural network frameworks, for instance, are based on C++ and C. “What people do not realize is that in the national labs, there is a lot of C and C++ code out there,” says Sander. As an example, he says that Sandia National Laboratories now has 1.6 million lines of C and C++ code in its application base, and that across the HPC community, C and C++ are used in electrodynamics, bioinformatics, computer aided design, and fluid mechanics applications. (With the latter two, it is typically a mix of Fortran and C/C++, he explains, but with newer codes, software developers are shifting to C and C++.) “The demarcation tends to be whether the code was made in the 1980s, when it was all Fortran, or in the 1990s, when it was a mix of C++ or Fortran; in the 2000s, it was definitely C++ code.”
The final bit of the new software stack from AMD relating to hybrid computing is a tool called HIP, which helps port applications that have been coded in CUDA C++ over to AMD platforms that mix CPUs and its GPUs, whether they are discrete or embedded in an APU. Sander says that AMD analyzed 50 open source CUDA-enabled programs to create the HIP tool, which he says can convert about 90 percent of the code automatically to work using the HSA approach, with the remaining 10 percent to be done by hand. (Those ratios are general, and will vary by application, of course.)
The new software stack from AMD will be available in early access form in the first quarter of 2016, and the company will provide downloads of the drivers, compilers, and runtimes. When it will be generally available was not revealed, and the company is also not talking about when a FirePro S Series GPU accelerator that can support these features will come out, but presumably it will be soon.
Gabriel H. Loh, AMD’s memory guru if you will, has many HBM patents to his credit. One, just awarded, deals with memory coherency.
I would like to point you in this direction:
You might also google “Gabriel H. Loh + patents” and see what other cool stuff he has cooked up for HBM.
Considering that AMD’s new Arctic Islands GPU microarchitecture is scheduled to be online before Zen’s arrival, I would expect that Fiji will be supplanted completely by the new Arctic Islands microarchitecture, which is scheduled to be introduced with the Zen/Greenland GPU server SKUs. As for coherence, AMD has an even more robust coherent exascale design that includes FPGAs on the HBM stacks for some distributed in-HBM-memory compute. As for coherence over PCIe, that may come from some of the SeaMicro IP mixed with other IP that AMD retains from the shuttered SeaMicro division. So the SeaMicro (now AMD) Freedom Fabric has that over-PCI coherence in its DNA.
The importance of the interposer, with its ability to play host to tens of thousands of etched-in-silicon traces for APUs, has plenty of future implications for coherent connections and buses, just as much as for the uber-wide CPU/GPU traces to HBM memory. With the separate AMD CPU server cores (Zen, ARM K12) able to be fabricated on the process that best suits their use, and Arctic Islands based GPU/HSA accelerators able to be fabricated on a process that best suits theirs, these various dies can be wired up on the interposer with plenty of traces between CPU and GPU, as well as to HBM, DSPs, or other components. So I would expect there to be ample ability to etch onto the interposer’s substrate sufficient coherency traces and buses/fabrics to allow all the disparate dies on the interposer to be wired up as if they were all on a single monolithic die, sharing coherency between themselves and other components on die and off.
AMD has a shed-load of IP that can be brought online to address coherency for APUs, for APUs on an interposer, and for coherency over PCI (Freedom Fabric and other IP) for dedicated discrete GPU accelerators for HPC workloads. The interposer, specifically the APU on an interposer, will allow that Carrizo-style coherency to be delivered via thousands of parallel traces to multiple separate processor dies on future interposer based server/HPC SKUs from AMD, both x86 and ARM based.