
Behind The UK’s Bleeding Edge Supercomputing Plan

Just as the fervor died down around the massive deals for forthcoming pre-exascale supercomputers in the United States following the CORAL procurements (most recently, with the announcement of Aurora, the only one of the three HPC deals that is not betting its future on IBM OpenPower systems), the supercomputing spark was stoked again, this time from across the pond.

Of the five hundred fastest supercomputers on the planet, the United Kingdom currently has 30 machines on the twice-yearly list, including its most powerful, a Cray XC30 called “Archer” housed at the University of Edinburgh. Archer is followed closely in performance by another Cray system at ECMWF, a weather forecasting and research center, and by a BlueGene supercomputer at Daresbury Lab, which will soon be the site of the upcoming machines in question.

The Science and Technology Facilities Council (STFC), the UK government-sponsored research organization that works with both national industrial and research partners on systems and software projects and that funds Daresbury Lab, announced a £313 million investment to support a grand OpenPower undertaking. The deal will bring three new systems into play (in stages, as with the CORAL machines in the U.S.) and, just as important, will push new architectural and application-level developments, ranging from new ways to harness FPGAs and ARM architectures to deep learning, cognitive computing, and other Watson-driven projects.

In this era of data stealing the show from floating point calculations, the emphasis in recent big procurements is being placed on what IBM is calling data-centric computing. Accordingly, as we have seen with some of the biggest system deals to date, including the Summit supercomputer coming to Oak Ridge National Lab and the Sierra national lab system, the key elements of the machines are focused on moving data more than on sheer number crunching. This is the successor to IBM's supercomputer push in the post-Blue Gene world, and one that has caught on for some very big upcoming machines that will rely on the combined force of future generation IBM Power architectures (Power8, Power8+, and Power9, as roadmaps indicate), Nvidia GPU coprocessors, Mellanox networks, and of course NVLink and CAPI to tie the elements together.

The Next Platform was able to get more details about how this system deal will shape up during a discussion with Cliff Brereton, director of the Hartree Centre, who told us that this is indeed a three-system deal culminating in a 2018 finale that will be far higher performing than anything the lab currently has (which is a tick over a petaflop), but nothing on the order of the 150 petaflop machines that will be installed at national labs in the United States. The differentiating factor is that the leadup to the 2018 system will be used to test out new architectures and approaches to computing, some of which Brereton says are not generally available but could find a way into eventual IBM product lines. This includes everything from finding ways to make use of the hooks OpenPower has in the pipeline for FPGAs (a topic the team at his center is already engaged with on an existing Maxeler test cluster) to far deeper exploration of how ARM will fit into the big picture of large-scale computing.

And as mentioned above, the workloads that will be addressed by the OpenPower machines will also differ from some of the large research and scientific projects slated for Summit and Sierra. Since the center provides computing assistance up and down the stack to many UK-based industries as well as to social projects, the architecture and application work that happens there (with a great deal of on-site staff from IBM, who will take up residence at the center as part of that multi-million-pound package over five years) will fine-tune existing industry applications for the coming machines. But just as important, as Brereton explains, they will be “taking on areas that are completely new in computer science in the realm of data science and cognitive computing.”

“If you think about a place like the Watson center in the United States, it is that type of environment we will have here with completely new ideas and applications we will explore, much of it that has not appeared yet.”

One of the big priorities (perhaps not surprising given the UK roots) is much greater exploration of ARM. As we have written in the past, the center has been keen for some time on showcasing the validity of ARM for demanding HPC and data-intensive computing, and this new focus will push those efforts further. It is not yet clear how this ties in with the Power8 and Power9 architectures in the forthcoming machines; we will have more information on that in a dedicated article in the near future.

At any rate, Brereton says, architecturally speaking, they are “placing the bets. It’s a time of incredible change. The architectures of the last fifteen or more years are clearly not going to take us past the next several years, so there needs to be a step change, and there are lots of options, but for the workloads we’re looking at ahead, we are confident about this architecture.”

On the OpenPower front, Brereton says what really has the team looking ahead is the addition of both NVLink and CAPI in coming systems. He says that although he sees opportunities in the Intel arena to solve the same data movement problems at the silicon level, the center's partnership with IBM runs deep (although it should be noted the center is also an Intel Parallel Computing Center, with staff on that side of the house as well).

In a chat to explore what architectural differences might be found in these UK systems, which follow the data-centric computing path along the lines of Summit and Sierra, IBM’s general manager of OpenPower Alliances, Ken King, gave The Next Platform a slightly closer look at where the Daresbury folks are heading with their roadmap, along with some previously unknown codenames for the future generation Power8+ and Power9 nodes that will be found in the Summit and Sierra machines (ostensibly the two unnamed architectural leaps in the OpenPower roadmap we discussed a few months back).

“What will be in place is going to be very similar to Summit and Sierra from an architectural perspective, incorporating the same elements on the Firestone, Garrison, and Witherspoon nodes we’ll be delivering for those CORAL machines,” said King. That at least sheds light on the names we can begin using for the future system designs that follow Firestone and feature CUDA 8 and CUDA 9 (it is not clear which codename is matched to which upcoming node).

Of these nodes, all King could share is that all three systems would include flash storage, Mellanox InfiniBand, Nvidia GPUs, and Power processors through the generations. “We are leveraging and reusing what we’re doing, so beginning in the 2015 up to 2018 timeframe, but the system grows significantly node-wise through that period.”

King continued, “It will be a very powerful system, I can’t compare it to Summit and Sierra, but do what is needed for the cognitive computing solutions running on top of it. It’s going to be leveraging Watson and other new data analytics capabilities for life sciences, oncology, and other business workloads. The ability to get that insight does take a powerful machine.”
