On today’s podcast episode of “The Interview” with The Next Platform, we talk about exascale power and resiliency by way of a historical overview of architectures with long-time HPC researcher, Dr. Robert Fowler.
Fowler’s career in HPC began at his alma mater, Harvard, in the early seventies with scientific codes and expanded across the decades to include roles at several universities, including the University of Washington, the University of Rochester, and Rice University. Most recently he has been at RENCI at the University of North Carolina at Chapel Hill, where he spearheads high performance computing initiatives and projects, including one we will talk about today: a joint effort to look at the state of HPC system resiliency and power efficiency.
While the second half of the discussion focuses on findings from a Department of Energy/ASCR report Fowler was involved with that evaluated HPC system performance, energy, and resilience, the conversation more broadly covers the decades of system and software trends in HPC that led Fowler to his position today. We discuss the role of accelerators, non-standard architectures, and programming challenges on the road to exascale.
The report Fowler participated in can be found here and covers the following areas:
Performance portability: We extended performance measurement and autotuning technology to petascale and heterogeneous systems, thus permitting scientists to exploit a wide range of high-end systems from a common code base.
Energy efficiency: Relatively minor code changes can result in significant energy savings for some applications, with little or no impact on performance. We are investigating software energy efficiency techniques to help reduce DOE’s energy costs.
Resilience: Petascale calculations are pressing the limits of reliability both in hardware and system software. We explored strategies to enable petascale applications to be resilient in the face of faults.
Optimization: We extended tools from the mathematical optimization community to develop strategies that collectively optimize performance, energy efficiency, and resilience.