Just by being the chief architect of IBM’s BlueGene massively parallel supercomputer, which was built as part of a protein folding simulation grand challenge effort undertaken by IBM in the late 1990s, Al Gara would be someone whom the HPC community would listen to whenever he spoke. But Gara is now an Intel Fellow and also chief exascale architect at Intel, which has emerged as the second dominant supplier of supercomputer architectures alongside Big Blue’s OpenPower partnership with founding members Nvidia, Mellanox Technologies, and Google.
It may seem ironic that Gara did not stay around IBM to help this hybrid computing effort, but to a certain extent the Xeon Phi parallel X86 processor and the hybrid Omni-Path InfiniBand/Cray Aries interconnect that Intel has married to it for HPC workloads are perhaps a more natural follow-on to the BlueGene/Q, with its sixteen-core PowerPC processors and proprietary torus interconnect, than is the mix of Power9 processors, Nvidia “Volta” GPU accelerators, and Mellanox InfiniBand interconnect that the OpenPower camp is pushing.
Two years ago, at the International Supercomputing Conference in Frankfurt, Germany, Gara spoke about the architectural possibilities for pre-exascale and then exascale machines, and gave everyone the impression that the best approach was to put a lot more high bandwidth memory against a lot fewer cores of compute and then lash them all together with very high speed, low latency networks – pitting skinny nodes against the fat and hybrid CPU-GPU nodes that the OpenPower camp has been crafting. And at ISC16 last year, Gara took the stage and talked about how we might hit 100 exaflops of performance in supercomputers by 2030 – a bold thought experiment when you consider that we are struggling to get to 1 exaflops by 2020, 2021, or 2022, depending on the project. This year, Gara was on hand at ISC17 to pass out grades on the efforts to reach exascale, and it is interesting for him to be speaking about this at a time when Argonne National Laboratory is in the process of renegotiating its contract with Intel to build the “Aurora” supercomputer. This machine was supposed to be based on the future “Knights Hill” Xeon Phi processor and a kicker 200 Gb/sec Omni-Path 200 series interconnect.
Gara did not talk about Aurora at ISC this year, but this time around he did hand out report cards to the suppliers of different components of supercomputers as the industry races to get exascale machines into the field in the next couple of years.
Generally speaking, Gara seems bullish about the exascale efforts under way, and he reminded everyone that the definition of exascale, as far as the US Department of Energy is concerned, is not 1,000 petaflops of peak double precision floating point operations in a parallel machine, but instead enough computing to do 50X the work of the 20 petaflops “Titan” hybrid CPU-GPU machine installed at Oak Ridge National Laboratory in 2012. (Bronis de Supinski, the chief technology officer for computing at Lawrence Livermore National Laboratory, correctly pointed out that, technically, an exascale machine by the DoE definition is 50X the performance on real applications versus Titan or the “Sequoia” BlueGene/Q machine at LLNL. We would point out that with AI becoming a much more important part of the HPC workflow, this definition is no longer sufficient.)
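As a rough check on the arithmetic behind that definition, the short Python sketch below multiplies the 20 petaflops Titan figure by the 50X factor quoted above. The function and constant names are ours and purely illustrative. The result is only a peak-equivalent number; the DoE definition is about delivered work on real applications, which this sketch does not model.

```python
# Illustrative arithmetic only; names are hypothetical, figures are those quoted in the article.
PETA = 10**15
EXA = 10**18

def scaled_target_flops(baseline_petaflops: int, factor: int) -> int:
    """Return the flops figure implied by scaling a baseline machine by `factor`."""
    return baseline_petaflops * PETA * factor

# Titan is cited at 20 petaflops, and the DoE target is 50X the work of Titan.
target = scaled_target_flops(20, 50)
print(f"{target / EXA} exaflops (peak-equivalent, not application performance)")
```

In other words, 50X of a 20 petaflops baseline lands at 1,000 petaflops on paper, which is why the two definitions are easy to conflate even though they measure different things.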
“In a public forum, I am not going to give you details about what exascale is going to be, but I can tell you that right now it is just around the corner,” Gara explains. “It takes us four to five years to develop from concept to system and so we expect to have exascale on that time scale. We feel we really know what it looks like at this point, and we know how hard and how easy it is going to be. All of those questions that we have had over the past decade, I think we know what an exascale system is going to look like. Really, we are beyond exascale. We have been focusing on exascale for so long, we really need to start getting past that and start thinking what we are going to do beyond that.”
It is hard to argue with these points, and there is a growing consensus that the future of supercomputing, like that of deep learning at hyperscale, is going to be hybrid, which explains in large part why Intel has bought FPGA maker Altera and upstart AI chip maker Nervana Systems in the years since Intel started laying out its exascale system plans with the formalization of the Knights family and the creation of the Omni-Path interconnect.
Gara says that a lot of the things that the HPC luminaries have been saying would be necessary for exascale systems have turned out to be true, and he reminded everyone that the tried and true Message Passing Interface (MPI) protocol for loosely coupling the memory in clusters of server nodes, which has utterly transformed the nature of simulation and modeling since the late 1990s, will continue to be a foundation going forward. “The mantra of threads for concurrency will also be a mechanism to exploit the parallelism that will be inevitable and the method by which we get to exascale.”
Gara rattled off a list of the things that the HPC industry and the component makers that it depends upon did well in terms of addressing the needs of exascale systems, and then talked about the things that the industry has not done so well. Here is where things have gone well, according to Gara:
Scaling up power consumption. Everyone aiming to have an exascale system knew that it was just a reality of physics that datacenter facilities would have to provide more power, and perhaps very high power density, to cram all the cores into a system to reach exascale-class performance. Over the past eight years, there has been a ramp in the power consumed, which means we can use the expanding power of a facility to reach exascale. “We find that when we respond to RFPs over the last decade, we haven’t really been limited by power because the facilities have actually been ahead of us. We have run out of dollars before we have run out of power. At least we frightened the facilities into ramping up so fast that we were able to meet those power targets.”
Memory makers have dealt with the memory wall. If you go back a decade, everyone was talking about how the performance of processors would outstrip the performance of memory, and therefore the additional CPU performance that, in theory, would be available would not be utilized in actual applications because the memory could not keep up. This was a very real fear with CPU throughput doubling three times over five years in the 2000s and memory speeds only doubling once in the same time. But the new memory technologies – including Hybrid Memory Cube (HMC) from Intel and Micron Technology and High Bandwidth Memory (HBM) from AMD, Nvidia, and Samsung – have saved the day. Well, at least in terms of providing more memory bandwidth than the DDR3 and DDR4 memory that is the main memory on server nodes. To be fair, the capacity of HBM and HMC is considerably lower than expected, and so is the bandwidth. But the bandwidth is still around 5X to 10X that of the CPU main memory bandwidth, and with most HPC applications being bound by memory bandwidth, as long as you can add more compute elements with HMC and HBM and spread the application concurrency across more nodes, you can still get something approaching balanced performance. It would be better to have fatter high-speed memories on either CPUs or accelerators, of course. “We need a balanced system, and we are not going to get there just by pouring in the flops,” says Gara. “We really need to add all of the system attributes, and so memory bandwidth is one that all of the fundamentals are there that we can hit the 50X memory bandwidth increase over what we had in 2012. This is somewhat remarkable, and it is certainly a testament to the memory manufacturers.”
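To illustrate why memory bandwidth rather than peak flops sets the pace for bandwidth-bound codes, here is a minimal sketch of the well-known roofline bound, where attainable performance is the lesser of peak compute and bandwidth times arithmetic intensity. Every number below (peak flops, the DDR bandwidth, the kernel's flops per byte) is a hypothetical placeholder and not a figure from Gara's talk; only the 5X multiplier comes from the low end of the range cited above.

```python
# Roofline-style bound. All hardware numbers are hypothetical placeholders.
def attainable_flops(peak_flops: float, bandwidth_bytes_per_s: float,
                     flops_per_byte: float) -> float:
    """Attainable performance is capped by compute or by memory traffic."""
    return min(peak_flops, bandwidth_bytes_per_s * flops_per_byte)

PEAK = 3.0e12            # assumed node peak, flops/s
DDR_BW = 100.0e9         # assumed main memory bandwidth, bytes/s
HBM_BW = 5 * DDR_BW      # low end of the 5X to 10X range cited in the article
INTENSITY = 0.25         # assumed kernel arithmetic intensity, flops per byte

ddr_bound = attainable_flops(PEAK, DDR_BW, INTENSITY)
hbm_bound = attainable_flops(PEAK, HBM_BW, INTENSITY)
print(f"DDR-bound: {ddr_bound / 1e9:.0f} gigaflops")
print(f"HBM-bound: {hbm_bound / 1e9:.0f} gigaflops")
print(f"speedup:   {hbm_bound / ddr_bound:.1f}X")
```

Under these assumptions the kernel never comes close to the node's peak in either case, so the speedup tracks the bandwidth ratio exactly. That is the sense in which pouring in more flops alone does not produce a balanced system.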
Flops matter again thanks to AI. Over the past decade, as HPC vendors argued about the relevance of various benchmarks, particularly the Linpack Fortran parallel test used to rank the relative performance of supercomputers since 1993, flops have started to matter again because deep learning algorithms need lots of floating point performance, and in mixed precision at that. We would point out, though, that at the moment deep learning workloads do not scale particularly well, and when the deep learning frameworks adopt MPI they will probably make everyone’s life a whole lot easier. Once that happens, the need for very fat nodes like hybrid CPU-GPU systems may ease somewhat, we think. Or it might not. It all depends on the money available and what things cost.
The exponential curve of concurrency. Because 10 GHz processors, never mind 100 GHz ones, are not a possibility, moving from petascale to exascale on the compute side implies adding at least 1,000X more concurrency into the system. A decade ago, says Gara, this just seemed to be insurmountable. “There are applications running with enormous amounts of concurrency, and there still seems like there is room for more. That is pretty encouraging.”
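The 1,000X figure follows directly from holding clock speed flat. A minimal sketch, assuming a hypothetical 2 GHz clock and the simplification that each unit of concurrency retires one floating point operation per cycle (both are our assumptions, not numbers from the talk):

```python
# Back-of-envelope concurrency estimate. Clock speed and the one-flop-per-cycle
# simplification are assumptions made for illustration.
PETAFLOPS = 10**15
EXAFLOPS = 10**18
CLOCK_HZ = 2 * 10**9  # hypothetical fixed clock; it does not rise with the target

def required_concurrency(target_flops: int, clock_hz: int) -> int:
    """Operations that must be in flight each cycle to sustain target_flops."""
    return target_flops // clock_hz

peta_ways = required_concurrency(PETAFLOPS, CLOCK_HZ)
exa_ways = required_concurrency(EXAFLOPS, CLOCK_HZ)
print(f"petascale: {peta_ways:,} operations per cycle")
print(f"exascale:  {exa_ways:,} operations per cycle")
print(f"ratio:     {exa_ways // peta_ways:,}X")
```

The ratio is independent of the clock chosen, as long as the clock stays the same between the two machines. Only a faster clock would reduce the extra concurrency needed, and that is the option the physics has ruled out.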
Without pointing any fingers, Gara then talked about where the HPC industry has not done as well as it could have. (It is important to note that he did not use the word “failed” or even imply that. These engineering problems are the toughest in the world.)
Interconnect fabric bandwidth. “This is going to impact real application performance, and certainly for performance at scale,” Gara says, adding that the issue is that too little investment is being dedicated to the problem and that system architects are working around fabrics that don’t provide enough bandwidth. Gara’s advice to buyers is to put their actual bandwidth requirements into supercomputing RFPs, backed by real bandwidth benchmarks, and to push the vendors to do more.
Memory capacity on compute. As many have pointed out, the memory capacity of nodes, particularly on that high bandwidth memory, needs to increase, and to do so in a way that is also affordable. New intermediary persistent memory technologies, like 3D XPoint, may help increase the capacity of some parts of the storage hierarchy to offset this, but in a Knights CPU or Nvidia GPU, the calculations are done there, not in this adjunct memory. We think it would be interesting to see such memories as 3D XPoint put on the Knights or GPU compute cards proper and not have to have the kind of coherency bus that NVLink provides between Power CPUs and Tesla GPUs. Alternatively, a more generic and industry standard coherency link between any CPU and any GPU or other kind of accelerator that has its own memory would be useful. (Don’t hold your breath waiting for that standard to come to pass and be adopted, although CCIX and Gen-Z are trying.)
Liquid cooling. In a flashback to the bipolar mainframe computing era of four and five decades ago, it is time to just make liquid cooling standard on HPC systems and to realize that air cooling, while easy, adds costs over the long haul and limits the scalability and performance of HPC systems. Because the majority of the market wants air cooling, systems are not optimized to take advantage of liquid cooling.
Not enough profit to sustain investment in HPC. This is something that we have been saying for years, and something that annoys a lot of people who pay a lot of money for HPC systems and wish they cost a lot less. The competitive nature of HPC has driven down the price of systems, to be sure. “We are not getting as much innovation as we could if we had a more vibrant market,” says Gara, and he is right.
We would add that the military and energy departments of the major governments of the world used to shell out a lot more dough for gigascale, terascale, and petascale systems and they are much more conservative and much more demanding as customers these days. DARPA, for instance, needs to have teraflops in a backpack and petaflops in a tank more than it needs exaflops in a datacenter, and that is one reason why investment has been curtailed here. These are very different architectural challenges, and they don’t mesh to help spread the cost of investment over the greatest number of buyers.
It has been far, far easier to expand the performance of an HPC system than it has been to bring down its price, and that, more than anything, is what has limited the market for HPC and therefore its broader expansion, which would have created a larger customer base and allowed lower profits to still drive innovation. There is a reason why Google is not powered by Intel Xeon server nodes linked by Aries interconnects. Google would not have invented a damned thing if it could have afforded to buy it. So you end up with an HPC market that is even more bifurcated – and less profitable – than it appears from Gara’s HPC-centric view. In fact, no one is making very much money in systems other than Intel, Microsoft, Red Hat, and some flash and main memory makers these days, and this is an even bigger problem for the entire IT sector.