AMD Researchers Eye APUs For Exascale
February 28, 2017 Jeffrey Burt
Exascale computing, which has been long talked about, is now – if everything remains on track – only a few years away. Billions of dollars are being spent worldwide to develop systems capable of an exaflops of computation, which is 50 times the performance of the most capacious systems on the current Top500 supercomputer rankings and will usher in the next generation of HPC workloads.
As we have talked about at The Next Platform, China is pushing ahead with three projects aimed at delivering exascale systems to the market, with a prototype – dubbed the Tianhe-3 – being prepped for next year. For its part, the National Strategic Computing Initiative in the United States is readying two exascale-capable systems for delivery in 2021.
The demand for exascale capabilities within the HPC community is growing almost as fast as the amount of data being accumulated in the scientific and commercial arenas. Exascale systems will drive the latest applications in everything from meteorological research and high-end healthcare to oil and gas exploration, national security, and emerging workloads like artificial intelligence. The work to design and develop these systems is ongoing, and the architecture for these systems – and what technologies they will use – is still evolving. A key driver is to find ways to get as much performance out of the exascale systems as possible while keeping a lid on the power consumption. Simply scaling current architectures won’t work.
AMD Research, the research division of the chip maker, recently took a look at using accelerated processing units (APUs, AMD’s name for processors with integrated CPUs and GPUs) combined with multiple memory technologies, advanced power-management techniques, and an architecture leveraging what they call “chiplets” to create a compute building block called the Exascale Node Architecture (ENA) that would form the foundation for a high performing and highly efficient exascale-capable system. In a recent paper outlining the ENA, the researchers admit that while “there are potentially many paths to realize exascale computing,” their effort that “spans from circuits to software enables a very efficient and tightly integrated processor architecture suitable for exascale systems.”
The challenges are substantial. The AMD researchers noted that an exascale system will consist of some 100,000 interconnected server nodes that will have a power envelope of about 20 megawatts, as well as the memory bandwidth and capacity to address the compute throughput demands and execute massive workloads, with each node delivering more than 10 teraflops while consuming less than 200 watts. The most energy-efficient supercomputer in the world today would have to improve its flops-per-watt ratio seven-fold to meet the target set for exascale systems. In their paper – titled Design and Analysis of an APU for Exascale Computing – the researchers say that the ENA could meet those demands.
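The node-level targets above follow from simple arithmetic, and a quick back-of-envelope check makes that visible. The sketch below uses only the figures quoted in this article; the function and constant names are our own, and the "implied current best" number is derived from the seven-fold statement rather than taken from any measured ranking.

```python
# Back-of-envelope check of the exascale targets quoted in the article.
# All names here are illustrative; the inputs are the article's figures.

NODES = 100_000        # "some 100,000 interconnected server nodes"
NODE_TFLOPS = 10.0     # "more than 10 teraflops" per node (a lower bound)
NODE_WATTS = 200.0     # "less than 200 watts" per node (an upper bound)
REQUIRED_GAIN = 7.0    # "seven-fold" improvement in flops per watt


def system_exaflops(nodes, node_tflops):
    """Aggregate peak throughput; 1 exaflops = 1e6 teraflops."""
    return nodes * node_tflops / 1e6


def system_megawatts(nodes, node_watts):
    """Aggregate node power; 1 megawatt = 1e6 watts."""
    return nodes * node_watts / 1e6


def gflops_per_watt(node_tflops, node_watts):
    """Energy efficiency of one node in gigaflops per watt."""
    return node_tflops * 1000.0 / node_watts


target_efficiency = gflops_per_watt(NODE_TFLOPS, NODE_WATTS)
# Efficiency implied for today's best machine if a 7x gain is needed.
implied_current_best = target_efficiency / REQUIRED_GAIN

print(system_exaflops(NODES, NODE_TFLOPS), "exaflops")
print(system_megawatts(NODES, NODE_WATTS), "megawatts")
print(target_efficiency, "gigaflops per watt (target)")
print(round(implied_current_best, 1), "gigaflops per watt (implied current best)")
```

Because the per-node figures are bounds ("more than" and "less than"), the 50 gigaflops per watt that falls out of this is best read as a floor on the efficiency each node has to reach.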
According to the study, at the core of the ENA is what the authors call the Exascale Heterogeneous Processor (EHP), a high-performance APU that includes not only the integrated CPU and GPU resources but also on-board 3D High Bandwidth Memory (HBM) like AMD currently uses on its graphics cards. At the same time, the EHP leverages a network of external memory devices interconnected with point-to-point links similar to the Hybrid Memory Cube (HMC) DRAM architecture. The chip uses a modular “chiplet” design that takes advantage of on-die stacking technologies. The EHP offers two CPU clusters in the center of the package, with each cluster holding four multi-core CPU chiplets that are stacked atop an active interposer base die. On each side of the CPU clusters are four GPU clusters, with each cluster comprising two GPU chiplets stacked on an active interposer. Atop each GPU chiplet is a 3D DRAM stack, which the researchers said could be a future generation of JEDEC HBM.
“The DRAM is directly stacked on the GPU chiplets to maximize bandwidth (the GPUs are expected to provide the peak computational throughput) while minimizing memory-related data movement energy and total package footprint,” the researchers wrote. “CPU computations tend to be more latency sensitive, and so the central placement of the CPU cores reduces NUMA-like effects by keeping the CPU-to-DRAM distance relatively uniform. The interposers underneath the chiplets provide the interconnection network between the chiplets (that is, network on chip or NOC) along with other common system functions (e.g., external I/O interfaces, power distribution, system management). Interposers maintain high-bandwidth connectivity among themselves by utilizing wide, short-distance, point-to-point paths.”
In addition to all that, to hit the small power budget of 200 watts, the EHP uses such power optimization technologies as near-threshold computing, asynchronous circuits, low-power interconnects, and data compression circuitry, along with such power management techniques as active DVFS and power gating controls.
The researchers used simulations to test how several scientific and security applications – including MaxFlops (for compute intensive workloads), LULESH, SNAP and XSBench (memory-intensive), and CoMD (balanced) – would run on ENA modules. More detail about the architecture can be found in the study, and the authors were encouraged by what they found. For example, they tested the multi-chiplet design against a monolithic exascale-level processor. In the chiplet design, messages between remote chiplets require additional steps – the message runs from one chiplet to the interposer layer by way of through-silicon vias (TSVs), across the interposer layer, and then to the destination chiplet through TSVs. This all adds potential latency to the communication. However, “the performance impact of the large out-of-chiplet traffic is relatively small. The largest performance degradation compared to the monolithic EHP is 13 percent, while some application kernels, such as SNAP, have a negligible impact,” they wrote.
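The extra steps a message takes between remote chiplets can be laid out as a simple list of stages. This is only an illustration of the path as described in this article, not a model from the paper: the function name, the stage labels, and the chiplet names in the example are hypothetical, and no latency values are assumed.

```python
# Illustrative only: the stages a message crosses in the chiplet design,
# following the description above. Names and labels are hypothetical.

def message_path(src_chiplet, dst_chiplet):
    """Return the ordered stages a message traverses between two chiplets."""
    if src_chiplet == dst_chiplet:
        # Traffic that stays on one chiplet never touches the interposer.
        return [src_chiplet]
    return [
        src_chiplet,
        "TSV down to interposer",
        "interposer network",
        "TSV up to chiplet",
        dst_chiplet,
    ]


def extra_stages(src_chiplet, dst_chiplet):
    """Count the stages added relative to staying on a single chiplet."""
    return len(message_path(src_chiplet, dst_chiplet)) - 1


print(message_path("gpu0", "gpu5"))
print(extra_stages("gpu0", "gpu5"))
```

Each added stage is a place where latency can accumulate, which is why the researchers compared the design against a monolithic processor in the first place.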
Other results include the importance of in-package memory for bandwidth-intensive workloads, with a key being efficient data management through software or hardware “to ensure that as many requests as possible can be serviced from the in-package memory for both performance and energy concerns.”
In addition, the researchers found that using high levels of die stacking should work even in an air-cooled system, though “more advanced cooling solutions may become necessary as the hit rate of the in-package DRAM improves, more power from the external memory is shifted to the EHP, or if a design point uses a greater per-node power budget.”
AMD’s research into its APU-based ENA compute and memory engine illustrates the dynamic nature of the development efforts across the globe. Most supercomputers today are built on Intel’s architecture, but as the exascale era gets closer, researchers, scientists, and supercomputer makers are investigating a broad array of options. Fujitsu officials last year confirmed that they were moving from Sparc64 chips to the low-power ARM architecture for the next generation of supercomputers, including an exascale-capable computer set for 2020. China, with its upcoming Tianhe-3, will use homegrown processors. We can now add AMD APUs to the list of possible exascale compute units.