Architects are optimists. In the computer industry, particularly within the HPC community, there is a small cadre of technologists who have the intellectual horsepower, vision and decades of experience to build wholly new system designs. These are the rarefied few who live on the technology frontiers of the future, using the raw clay of emerging technologies to build a framework where future users can thrive. They see possibilities that are only glimpsed today and enable the discovery work of tomorrow to unfold.
Architects practice “the art of the possible.” And in HPC today, architecting a system design that can continually evolve to integrate the latest revolutionary technologies is the ultimate possibility.
Al Gara, Intel Fellow and Chief Exascale Architect for Intel’s Technical Computing Group, is one of the renowned architectural visionaries of his generation. Gara is well known as the lead architect of the group that produced the IBM Blue Gene system, awarded the 2009 National Medal of Technology and Innovation. Now, as Intel’s Chief HPC Solutions Architect, he has taken on a task that only a handful of architects in the computer industry are equipped to handle. His charge is to develop an integrated, scalable system design that:
- Continues to evolve, taking full advantage of the continual pipeline of new, revolutionary technologies that enter the market
- Scales to the extreme limits of human capability, supporting the technology and applications of today while making full use of the technology of the future
- Supports not only the evolving high performance computing demands of the coming decade but also the emerging needs for big data and high performance data analytics
- Preserves application and performance portability across generations
In short: a highly scalable, versatile and power-efficient system design beyond anything we’ve seen, one that until now existed only in the imagination.
The result is Intel’s HPC scalable system framework, the foundation of the Aurora supercomputer that Intel is building in collaboration with Cray for the Argonne Leadership Computing Facility (ALCF) as its response to the U.S. Department of Energy’s CORAL program.
“Over time, architectures, frameworks, and platforms have continuously adapted and evolved, driven by new technologies,” says Gara. “In just the past 10 years, computers have progressed by orders of magnitude and there are tremendous opportunities coming forward resulting in new architectural innovations. New developments in memory, in optical technology, in transistors are allowing us to do things that we couldn’t do before. We are driven by the need to evolve, to do things that we couldn’t imagine even five years ago.”
The challenges of integrating the many component-level architectures of next-generation extreme-scale machines, however, are profound.
“There are new technologies emerging on the scene, and Intel is driving many of them,” Gara says. “Whether it’s memory, optics, silicon technology, to name some, Intel is at the forefront. So, to create the next generation of highly efficient supercomputers, we need to make sure the right ingredients are combined in precisely the right way. To do that, we must look at things from a holistic view, from a total system perspective. And out of that comes the definition and the development of those ingredients.” That holistic approach is the basis of Intel’s HPC scalable system framework.
The HPC community is well aware that existing system designs are struggling to achieve effective balance while keeping up with the pace of component technology innovation. But the event that prompted several new bids to define the next-generation architectural direction for extreme-scale computing came when three U.S. Department of Energy laboratories issued a joint RFP, referred to as CORAL (Collaboration of Oak Ridge, Argonne and Lawrence Livermore), to acquire advanced supercomputers for delivery in the 2017 to 2019 time frame. All three labs had aging computing resources. The time had come for the next great leap forward in architectural design.
Aurora, a system built collaboratively by Intel and Cray, with significant requirements input from Argonne National Laboratory, will be the next entry on Gara’s long list of industry-changing accomplishments. The tripartite team combines Intel’s understanding of silicon integration, processors, fabric, memory and architectural innovation with Cray’s knowledge of the system software stack, systems integration, manufacturing and the delivery of many of the world’s most powerful and productive systems. The work of both companies is guided by scientists at Argonne, a leading end-user scientific research organization with decades of experience in the practical realities of deploying and using supercomputers to advance scientific discovery.
“As we look to the future of high-performance computing (HPC), the acquisition of Aurora ensures Argonne will continue to retain and build on the global leadership of U.S. supercomputing centers,” says Rick Stevens, Associate Laboratory Director for Argonne. “This will allow the ALCF to remain dedicated and highly focused on breakthrough science and engineering. We were seeking an architecture designed to deliver a well-balanced and adaptable system capable of supporting both compute-intensive and data-intensive workloads, and Intel’s is well-suited to our needs.”
Scheduled for production in 2019, Aurora, as defined today, would be the most powerful supercomputer in the world. Yet Gara says “Aurora is not a final destination for Intel. It’s a very important step in a journey that’s much longer for us, an exciting beginning of a new future. It’s a new turning point, the first introduction of some of the new concepts that we envision for the incredible insights this new scalable system design is going to enable.”
“As we look where system design is going, we know that the challenges of cost, power, and integration of these new technologies into a coherent, usable form factor results in an ever-increasing need to bring more functionality into silicon,” says Gara. “We integrate now onto a single chip what used to be a roomful of electronics. That really is the key direction we’re moving toward, the common thread for how we will continue to advance supercomputing.”
Silicon integration not only improves performance but also reduces power consumption, a critical requirement. As Gara explains, “We’ve been improving supercomputing performance almost 2X every year, so supercomputers are approximately 1,000 times faster than they were 10 years ago. At that rate, if we didn’t improve power efficiency, we would have computers that took a thousand times more power than today. So energy efficiency is enormously important.”
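The arithmetic in that quote is simple compounding: doubling every year for 10 years gives 2^10 = 1,024, which is roughly the 1,000x cited. The sketch below makes the reasoning explicit under one stated assumption of ours: if energy per operation stays fixed (no efficiency gain), power draw rises in step with performance. It uses an exact 2x rate rather than "almost 2X", and the function names are illustrative only.

```python
# Illustrative arithmetic only: the 2x-per-year rate and 10-year span come
# from the quote; holding energy per operation constant is an assumption.

def growth_factor(rate_per_year: float, years: int) -> float:
    """Compound growth: performance multiplies by rate_per_year each year."""
    return rate_per_year ** years


def power_multiple(perf_multiple: float, efficiency_gain: float = 1.0) -> float:
    """Power = (operations per second) / (operations per joule).

    With efficiency_gain == 1.0 (no efficiency improvement), power grows
    exactly as fast as performance.
    """
    return perf_multiple / efficiency_gain


perf = growth_factor(2.0, 10)
print(f"Performance after 10 years: {perf:.0f}x")
print(f"Power with no efficiency gain: {power_multiple(perf):.0f}x")
print(f"Power if efficiency also improved 1024x: {power_multiple(perf, 1024.0):.0f}x")
```

Run as written, this prints 1024x for performance, 1024x for power with no efficiency gain, and 1x when efficiency keeps pace with performance, which is why energy efficiency has to improve as fast as raw speed.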
A critical aspect of Gara’s task has been to integrate new technologies, including disruptive technologies, in a way that doesn’t result in a disruptive experience for the user. “We integrate these technologies so users have a comfortable, familiar feel. We seek a balance between allowing users a transitional path from where they are now to where we’re going, including application and performance portability.”
Preserving a comfortable user experience amid these revolutionary gains is yet another critical area in which Argonne provides guidance to its collaboration with Intel and Cray.
“We’re changing the architectural trajectory, so we absolutely must work closely with an organization that wants to use such new systems, to give them the confidence they will achieve their productivity, competitive, economic and scientific objectives,” says Gara. “Argonne has been instrumental in working closely with Intel and Cray at a collaborative design level bringing an approach to innovation that will meet the needs of their users. Working closely with Argonne in this collaborative fashion will help us make sure the trajectory ends up in the right place to enable their science applications from day one.”
“We regard Aurora as a turning point, and not just for HPC and technical computing,” Gara adds. “The highly scalable system-level design that we envision will truly be a living foundation with the capacity to evolve and grow, scaling from the world’s largest supercomputers down to small and mid-size systems. This has important, long-term implications for all of computing.”