One way to characterize the challenges of achieving exascale is to look at how advances in compute, memory/storage, software, and fabric will lead to a future-generation balanced system. Recently, Al Gara of Intel, Jean-Philippe Nominé of the French Alternative Energies and Atomic Energy Commission (CEA), and Katherine Riley of Argonne National Laboratory sat on a panel that weighed in on these and a host of other interrelated challenges.
Exascale will represent a watershed achievement in computer science. More than just a nice, round number (“exa-” denotes a billion billion), exascale computing is also supposed1 by the Human Brain Project and others to be the point where machine models will reach parity with the complexity of the human brain. At once stunning and inevitable, exascale has captured the imaginations and budgets of research organizations worldwide.
China’s National Supercomputer Center has announced plans to release a prototype exascale supercomputer by 2018.2 Likewise, the US government’s National Strategic Computing Initiative has been charged with standing up at least two capable exascale supercomputers by 2021.3 The work these machines will be tasked with runs the gamut, from national security and climate research to more fundamental and theoretical domains such as astrophysics and quantum chromodynamics.
To no one’s surprise, the need to reduce power consumption is central to reaching this computational threshold. Indeed, the US Department of Energy suggested in 2016 that an exaflop computer built by simply scaling today’s technologies would require 200 megawatts4 of power, at a cost of hundreds of millions of dollars per year in energy alone.
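A back-of-envelope calculation shows how the DOE’s 200-megawatt figure translates into that annual bill. The $0.10/kWh electricity rate below is an illustrative assumption, not a number from the article; actual utility contracts vary widely.

```python
# Rough check of the annual energy cost for a 200 MW machine.
POWER_MW = 200
RATE_USD_PER_KWH = 0.10      # assumed rate for illustration only
HOURS_PER_YEAR = 24 * 365    # 8,760 hours

energy_kwh = POWER_MW * 1_000 * HOURS_PER_YEAR   # MW -> kW, then kWh/year
annual_cost_usd = energy_kwh * RATE_USD_PER_KWH

print(f"~${annual_cost_usd / 1e6:.0f} million per year")  # ~$175 million
```

Even at a modest assumed rate, the result lands squarely in the “hundreds of millions of dollars” range the DOE describes.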
The Need for Growing Compute, Concurrency, and Community
While the processor architectures that will usher in the era of exascale are unknown, several matters are clear. We can assume that clock frequencies will decrease as part of the drive to conserve power and, relatedly, that the number of execution units per socket and system will continue to increase. In large-scale scientific computing, billion-way concurrency is ultimately to be expected, which will present serious challenges in terms of programmability.
The considerations in reaching toward future computing, of course, are not limited to making everything bigger and faster. As Katherine points out, “it is irrelevant what we design and put on the floor, if science is not able to effectively use [it] or if it takes an incredibly long time to get science up to speed.” That truth is central to the value of exascale, particularly as hardware parallelism increases exponentially: programming models must abstract away the complexity associated with unprecedented concurrency, lest effective programming become an esoteric rarity.
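In miniature, that kind of abstraction looks like the sketch below: a high-level map hands scheduling to the runtime, so application code expresses what to compute rather than how to distribute it. The kernel and worker count here are hypothetical placeholders, not anything the panelists described.

```python
from concurrent.futures import ThreadPoolExecutor

def simulate_cell(i):
    # Stand-in for a per-element science kernel (illustrative only)
    return i * i

# The programming model hides thread creation, scheduling, and joining;
# changing the worker count does not change the application code.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(simulate_cell, range(8)))

print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

The same separation of concerns, scaled from four workers to billion-way concurrency, is what future exascale programming models will need to preserve.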
In the acceleration of a balanced system, processor speed is in many ways the most straightforward factor. In Al’s words, “we can scale compute very quickly [and] easily outrun the memory in both [capacity] and bandwidth.” To take advantage of that reality, he says, we must as an industry co-design software approaches along with hardware. Continuity across hardware generations and platforms plays a key role, as does a community-driven approach based on openness and shared capabilities.
Code must also span multiple platforms and problem domains, or as Jean-Philippe put it, “portability might matter more than mere performance. That is, you can sacrifice probably a few percent of extra performance if you get on to better portability and sustainability of your developments.” Openness is an important factor in that interoperability, and as such, community will play an increasing role in realizing the benefit of platform capabilities including compute, memory, storage, and fabric.
Memory and Storage Advances for Greater Size and Speed
Sheer capacity and bandwidth of memory are not going to wane in importance anytime soon. In Katherine’s words, “I will tell you, after 20 years of doing this, that very few … science problems come back to us and say, ‘we don’t need more memory; we’re good.’” That constant exists alongside the reality that the advances making system memory bigger, faster, and cheaper cannot keep pace with increases in compute FLOPS, let alone core counts. Ultimately, memory per execution unit is likely to fall, emphasizing the need for memory resources to be used more efficiently.
Both hardware and software can help extract more value from physical memory and storage. Hardware features throughout the platform, including in the memory itself, the processor and chipset, and beyond, are sure to emerge over time. Multi-tiered memory and storage topologies that combine DRAM, various types of non-volatile RAM, flash, and other media are enabling research that includes new object storage technologies to address increased processing speed and data capacity.
From a software perspective, how data is placed in the hierarchy, from the fastest tier of memory to the slowest, must play an important role. Jean-Philippe suggests that “we must work on the data locality, that is, to decrease movement of data, which we know is extremely costly in terms of time and energy.” Solving this challenge will require a reexamination of how software is architected.
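The locality principle Jean-Philippe describes can be sketched in a few lines: process data in tiles sized to fit the fastest tier, so each element moves up the hierarchy once and is reused while it is resident, rather than being fetched repeatedly. The tile size and workload here are hypothetical.

```python
def tiled_sum_of_squares(data, tile_size=4096):
    """Traverse data one tile at a time, reusing each tile while it is 'hot'."""
    total = 0.0
    for start in range(0, len(data), tile_size):
        tile = data[start:start + tile_size]   # one transfer into the fast tier
        total += sum(x * x for x in tile)      # all work done while resident
    return total

print(tiled_sum_of_squares([1.0, 2.0, 3.0]))  # 14.0
```

The same tiling idea underlies cache blocking in dense linear algebra and out-of-core algorithms that stage data between flash and DRAM.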
Hardware and Software Mechanisms to Mitigate the Cost of Fabric
Improvements in data movement continue to lag behind advances in other parts of balanced systems. In particular, the monetary cost of high-performance data fabrics is a challenge for designers of large-scale scientific computing infrastructures. There are issues to contend with in terms of power consumption as well, but as Al summarizes, “Right now when we look at architecting systems, it’s mostly about costs … it’s not a power problem yet, because we can’t afford to put as much fabric [in place] as we would like.”
Intel and the rest of the industry are working on pushing down the cost of fiber optics, for example, and there is some reason for optimism, even as bandwidth requirements go up. Beyond the technology advances themselves, there are inherent efficiencies built into the effort to build bigger pipes. As Al points out, it doesn’t cost twice as much to install a water pipe that’s twice as big in diameter in your house, and the same is largely true for compute fabric in the data center.
As new capabilities continue to emerge, however, the current state of the art calls for software and system architects to drive down the requirements for data movement as much as possible. Using advances in memory and storage technologies to keep data as close as possible to where it’s needed, with the fastest retrieval mechanism possible, is a best practice that isn’t likely to change, at whatever scale business applications run.
Balanced platforms continue to evolve, and the march to exascale continues. Even as per-operation costs decrease across the board, taking best advantage of those advances is beyond the reach of isolated efforts alone. Incremental advances in today’s approaches (as well as compilers, libraries, and other tools) may be hard-pressed to meet future needs, which makes openness all the more vital. Community efforts not only allow for open collaboration; they also foster interoperability and code portability.
Katherine stresses the value of evolving collaboration as a precursor to advancement in general: “People do come up with good solutions, whether it’s a software solution or an application or whatever, and then they’re told, I’m sorry, it’s not going to be able to work on here because you didn’t plug in early enough in the design process, or [for] other even policy reasons. So [to] really innovate and move forward as a community, I think, that’s one of the things we need to get past.”
Likewise, exascale itself is more a marker along the path than a goal to be reached. Jean-Philippe goes so far as to question the value of exascale as a concept: “We prefer to say ‘Extreme Scale’ [which] will be true in 10 years or 20 years … but more important is that we consider HPC as a whole thing, important at different scales and not just extreme scale.” That sentiment is an important reminder that even major achievements are composed of modest ingredients.
Matt Gillespie is a technology writer based in Chicago. He can be found at www.linkedin.com/in/mgillespie1.