
The Final Frontier: Talking Exascale With Oak Ridge’s Jeff Nichols

Just ahead of the revelations about the feeds and speeds of the “Frontier” supercomputer at Oak Ridge National Laboratory, which coincided with the International Supercomputing conference in Hamburg, Germany and the concurrent publication of the summer Top500 rankings of supercomputers, we had a chat with Jeff Nichols, who has steered the creation of successive generations of supercomputers at Oak Ridge.

Nichols is associate laboratory director for Computing and Computational Sciences at Oak Ridge. Two decades ago he was running the Computer Science and Mathematics Division for the lab, and before that did a stint as deputy director of the Environmental Molecular Sciences Laboratory at the Department of Energy’s Pacific Northwest National Laboratory. He is also one of the key developers of the open source NWChem computational chemistry simulator, which spans both quantum chemical and molecular dynamics scale interactions.

We spoke frankly with Nichols about the importance of exascale and the difficulty of always designing for the future with uncertain technology roadmaps while still delivering 10X improvements in generation after generation of systems. And we talked about money and time, because these are also ever-present factors for any system that shape and enable the salient characteristics of that system, as much as throughput in flops, bisection and injection bandwidth, storage, energy consumption, and what have you.

One of the things we wanted to know is how the planning process for systems like the “Jaguar” and “Titan” and “Summit” and now “Frontier” supercomputers begins. Does Oak Ridge have a set monetary budget and then the system architects see what they can get? Do they start with the electricity and thermal budget first? Or do they have a performance goal and then just see where the money and electricity budgets will end up, and then wince as they throw the budget over the wall to the US Congress?

“Our target was to deliver a double precision exaflops of compute capability for 20 megawatts of power, and Frontier’s peak is two exaflops and our target is 29 megawatts of power when it’s running at full power,” Nichols tells The Next Platform, and by our math, that is one exaflops peak in 14.5 megawatts. Nichols adds that the goal way back when, and mostly for the poetry of it, was 10^18 flops in 2018. The reason that Frontier could exceed the power consumption goal is that it took four years longer to deliver than the original plan. “Frontier has met the challenge when you are talking about the boundary conditions that are out there. We hadn’t really talked about money because I think the thing is that we wanted that 10X performance. We wanted to be 2 exaflops and 20 megawatts. But we have to think about usability, we have to think about the fact that we can’t just be doing something stupid and build something that users can’t program. There are all of those kinds of boundary conditions as well. So I think we’ve done a good job of being a steward of the dollars that have had to go into the purchase of this machine by delivering a machine within those boundary conditions and that’s going to be a very usable machine.”

When we asked Nichols about the hardest thing about designing for the future – and that is what supercomputing has always been about, which is to build a machine with technologies that are four or five years out so the machine’s simulations can better help you see further into the future across myriad domains – Nichols had an answer that, quite frankly, surprised us. And to find out what he said, you are going to have to watch the interview.
