When one looks at major milestones in supercomputer history, one of the more recent markers rests with IBM, which in 1999 began its work to create a massively parallel system for protein folding research. Armed with an initial $100 million, the IBM research team set out to build an HPC platform that would be more scalable and usable than existing cluster architectures, and to develop new interconnect, chip design, and systems software elements that would be relevant to other areas of scientific computing.
These efforts spawned BlueGene, which quietly reached the end of its line (there were three variants: BlueGene/L, /P, and /Q) last year as IBM turns to the OpenPower approach for its next generation of HPC systems. Over the course of its run, BlueGene systems consistently ranked high on the Top 500 supercomputer list (as well as being the first to break the petaflops computing barrier) and ranked at the top of other related benchmark lists, including the Graph 500 and Green 500.
The BlueGene architecture was novel in several ways, from its SoC design and 3D torus interconnect to its shift away from high-end processors toward low-power chips that were far better at working together than as large singular powerhouses. That shift put the emphasis on communication, which is where much of the development work went, according to Alan Gara, one of the lead designers of the BlueGene systems.
We talked last week with Gara, who moved to Intel after the end of the BlueGene era at IBM, about the upcoming Aurora system and the HPC scalable system framework. In the second part of our conversation, Gara told The Next Platform that the legacy of BlueGene can be keenly felt in future systems like Argonne's 2018 beast, though so much of what was novel about BlueGene has become commonplace that we may not see it.
“There are concepts that were developed and proven in BlueGene that were radical at the time. The thing is, at this point it may not seem that way since so many aspects of BlueGene that were revolutionary and successful have been adopted in other ways. That is really the litmus test for whether or not something was successful.”
At the time of BlueGene’s rise through the supercomputing ranks, more systems were adopting the offload model for large-scale HPC workloads by snapping GPU accelerators into massive clusters, a trend that continued for a number of years, producing top systems, including the top-ranked Titan supercomputer at Oak Ridge National Lab. As of the last Top 500 rankings, 46 of the systems on the list are using Nvidia GPUs (Kepler or Fermi generations), 21 are using a Xeon Phi coprocessor, and a few machines are using both GPUs and Xeon Phi across the same supercomputer.
The point is, there has been a steady climb toward accelerators among top-ranked machines, but with the self-hosted model of the upcoming Knights Landing architecture, the offload model, and with it the bottleneck of data movement between the GPU and other elements, will likely go away. The OpenPower efforts of IBM and Nvidia to use NVLink to speed that communication will be put to the test with the Power9-based systems coming to other centers in the next couple of years, including the future 150-petaflops “Sierra” machine coming to Lawrence Livermore National Laboratory, but Gara says that these still use what amounts to an offload model in that data has to be pushed between multiple components.
It is not clear how the Top 500 folks will choose to classify systems that have a GPU that is part of the compute, since the accelerators classification generally refers to a coprocessor that sits across a bus. The main question, however, is how long it will take for this classification to disappear entirely. As it stands, the new top-tier systems that will start to come online, possibly for the November rankings, will sport Knights Landing, wherein the accelerator is not a discrete unit. Gara says the shift away from the offload model is already starting to happen, and will continue with the introduction of Knights Landing into the full HPC market (right now, as far as we know, only the national labs are part of the early access program for these chips).
But processors aren’t the star of the show any longer for big systems anyway, Gara says. “When we look at trying to get more performance, it has to happen with higher levels of concurrency, so communication is actually the key. There are more things happening in a system at once, so we need to be able to handle that communication as efficiently as possible. That’s how we’re getting the speed and performance on these systems, where we are solving the same problem faster by working on it in parallel. The faster you go, the more important that communication latency becomes.”
The point is, “the offload model will not be the most efficient for highly scalable workloads. The communication and compute will be tightly coupled, and having those very closely coupled is critical to scalability. Having to go through any intermediate processor just adds latency that will potentially become a bottleneck.”
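Gara's latency argument can be illustrated with a toy timing model of our own (the numbers and function names below are hypothetical, not measurements of any real machine): under strong scaling, per-step compute shrinks while the fixed cost of staging data across a bus does not, so the offload copies come to dominate.

```python
# Toy model contrasting the offload approach (staging copies around every
# compute step) with a self-hosted design where compute runs in place.
# All latencies are illustrative assumptions, not benchmarks.

def offload_step_time(compute_s, copy_s):
    """One iteration: copy data to the accelerator, compute, copy back."""
    return copy_s + compute_s + copy_s

def self_hosted_step_time(compute_s):
    """One iteration: compute in place, with no staging copies."""
    return compute_s

def total_time(step_time_s, iterations):
    """Total wall time for a fixed number of tightly coupled steps."""
    return step_time_s * iterations

iterations = 1_000_000
copy = 5e-6  # hypothetical 5 microseconds per transfer direction

# As scaling out shrinks per-step compute, the fixed copy cost dominates.
for compute in (1e-3, 1e-5):
    offload = total_time(offload_step_time(compute, copy), iterations)
    hosted = total_time(self_hosted_step_time(compute), iterations)
    print(f"compute/step={compute:.0e}s  "
          f"offload={offload:.1f}s  self-hosted={hosted:.1f}s")
```

With a millisecond of compute per step the copies are roughly a 1 percent overhead; shrink the compute to 10 microseconds per step and the same copies double the runtime, which is the bottleneck Gara describes.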
The real difference after moving away from the offload approach will boil down to codes that can take advantage of the massive number of cores available on either the upcoming IBM/Nvidia machines or the Knights Hill-based Aurora system. Gara, who has worked closely over the years with Argonne on its cadre of BlueGene supercomputers, says he expects there to be a good story for codes on Aurora because the architectures are matched very closely to what the previous BlueGene machines had.
“They are both self-hosted, they both support the same programming models, and if Argonne can take that same level of threading they had with BlueGene and move that into Aurora, they will see tremendous improvements by just moving their code over, recompiling it, and running in the same way they did on BlueGene—and that’s just with a straightforward port.”
As we know from talking to system software folks at Argonne, there is a lot more involved for other codes to make the system operate at its stated peak of 180 petaflops or above. The real challenge on the end user application side will be optimizing for an architecture that no one has seen yet.