Baseball legend and locker-room philosopher Yogi Berra once imparted this sage advice: “When you come to a fork in the road, take it.” The current generation of HPC architectures is approaching such an inflection point, and in this case, that fork in the road may be a bit daunting. The architectural frameworks now in place are showing signs of age. They are straining to support the broadening demands of HPC users and struggling to accommodate new, innovative building blocks, particularly as the industry enters a period of rapid innovation in which HPC, Big Data, and Cloud workloads converge and even compete for compute cycles.
But are we really talking about a ‘revolution’ in HPC architectures? Probably not. Rather, an evolution is in the offing, one that keeps the current multicore framework essentially intact, maintains investments in the software stack and preserves the cost efficiencies of distributed computing. Yes, the HPC industry has reached a fork in the road. It wants something new, but with a heavy dose of continuity with the advantages it has now, hoping to preserve the application investments of the past several decades.
While most industry luminaries agree that a change in HPC architectural direction is necessary, not everyone agrees on which path to take. “The overall architectural approach we have now is not really going to change radically,” said Steve Conway, Research Vice President in International Data Corp’s High Performance Computing Group. “We’re not going to jump to some new computing scheme, such as a single processor with access to central memory, like we used to have. A distributed architecture is going to stay. And that’s good, because otherwise we would have to change everything drastically, it would be far too disruptive and too costly.”
In fact, favorable price/performance is one of the greatest strengths of distributed computing.
“What we have now is basically clusters,” said Conway, “independent computers linked together with networking and software. This approach has brought on the major challenge we face now that needs to be improved upon, but this approach also has led to very good price/performance. It’s democratized the HPC market, and we don’t want to give up that price advantage.”
The challenge with today’s systems stems from an influx of raw computing power: systems composed of hundreds or thousands of increasingly powerful processors, without a concomitant development in the complementary technologies needed to create high-functioning, balanced systems.
It’s this imbalance toward raw processing that has brought on what one leading computer scientist calls “a GPU hangover.”
“We have systems that have evolved, a software stack that has evolved, we have programming models that have evolved,” said Rick Stevens, Associate Laboratory Director for Argonne National Laboratory, “in fact, we’re celebrating anniversaries of various pieces of the software stack, things that were innovative 20 years ago, and that’s a little scary.”
There is no question that important and valuable work continues to be done on these systems, with tremendous progress made in climate modeling, energy discovery and many areas of data analytics. But the building blocks that make up HPC systems, and the complicated fabric that ties them together, have developed unevenly, producing architectural frameworks that are out of balance and misshapen. Most pressing is the need to integrate powerful new memory, storage and data movement technologies coming online over the next few years.
“With the architectures we have now, there’s power and cooling issues, resiliency issues, there’s massive amounts of parallelism to deal with,” said Al Gara, Chief Exascale Architect within Intel’s Technical Computing Group. “It spans the gamut from the hardware to the systems software to the parallel file system. And most importantly, to the application and the productive use of these systems to meet users’ science objectives and mission objectives.”
“We’ve seen systems getting bigger and bigger,” said Gara. “We’ve seen them break the petaflop barrier, the 10 petaflop barrier, the 50 petaflop barrier. But what’s lacking is a balanced framework for compute- and data-intensive computing, both getting served by one common system architecture: a science platform and a research platform. We’ve had very large-scale systems built out of components. You had CPUs and interconnects and memory all living under a pile of software, in many cases, written by multiple people. Unfortunately, much of it evolved somewhat haphazardly. What we need is a new system approach that is exquisitely balanced to make trade-offs for a large number of workloads and their needs. We need an architectural framework that is designed system down rather than components out.”
The new architectural frameworks under development will emerge over the next two to three years and will deliver significant performance improvements, enabling major advances in support of both compute- and data-intensive workloads. These new frameworks will allow the adoption of innovative new technologies while attempting to preserve the investments previously made in hundreds of scientific and business applications. To move the industry in a new architectural direction, the first step was to clearly identify which current bottlenecks and performance barriers could be improved. These fall into four major categories:
- Memory Bandwidth
Possibly the single most important factor in improving HPC systems performance is enhancement in memory-related technology.
Since the 1970s, when HPC moved away from discrete component computers, users and programmers have been hampered by an ever-shrinking amount of memory bandwidth available to the processors relative to their computing speed. From an applications standpoint, this means HPC software must be increasingly arithmetically intensive, performing more calculations for each value it fetches. At its simplest level, the processor takes two numbers from memory, performs a math operation on them and writes a third number back to memory: two inputs for every output. How fast that can be completed depends on how fast the system can get those values out of and back into memory. This gap between processor and memory, which includes both the latency of each round trip and the limited bandwidth for moving data across it, is a significant limitation inherent in today’s architectures.
Cache architectures were developed as a work-around to the memory bandwidth issue, but many modern algorithms access data in patterns that strain cache bandwidth or defeat caching altogether. The better solution is to bring memory closer to processing.
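To make the two-inputs-per-output arithmetic concrete, the sketch below applies a roofline-style bound to the vector add c[i] = a[i] + b[i]. The roofline is a standard performance model that the article does not itself present. The sketch assumes 8-byte double-precision values, and the 1 TFLOP/s peak and 100 GB/s memory bandwidth figures are hypothetical, chosen only for illustration.

```python
# Roofline-style estimate for the "two inputs, one output" kernel
# c[i] = a[i] + b[i]. Hardware figures below are hypothetical.

BYTES_PER_DOUBLE = 8

def arithmetic_intensity(flops_per_element, doubles_moved_per_element):
    """FLOPs performed per byte moved between memory and processor."""
    return flops_per_element / (doubles_moved_per_element * BYTES_PER_DOUBLE)

def attainable_gflops(peak_gflops, bandwidth_gb_s, intensity):
    """Performance is capped by compute peak or by memory traffic."""
    return min(peak_gflops, bandwidth_gb_s * intensity)

# Vector add: 1 FLOP per element, 2 loads + 1 store = 3 doubles moved.
ai = arithmetic_intensity(1, 3)   # 1/24 FLOP per byte
peak = 1000.0                     # hypothetical 1 TFLOP/s processor
bw = 100.0                        # hypothetical 100 GB/s memory system
print(f"intensity  = {ai:.4f} FLOP/byte")
print(f"attainable = {attainable_gflops(peak, bw, ai):.1f} GFLOP/s of {peak:.0f} peak")
```

Under these assumed numbers the kernel reaches well under 1% of peak, and only more memory bandwidth, not more arithmetic units, would make it faster.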
“The overriding need is for integration of DRAM memory inside of the processor package, as opposed to on the other side of an I/O bus or on the other side of a storage area network,” said Gara. “This would balance memory bandwidth with add-on accelerator throughput and dramatically reduce latency.”
“Memory underlies a lot of the problems we have today,” said Stevens. “But if you have an order of magnitude of more memory bandwidth that means a lot of software strategies that weren’t effective before become effective again. We need as much memory bandwidth as possible so the programmer doesn’t have to think about how to partition data, keeping small parts of it close to the processor. Programmers want to be able to write a natural form of the algorithm and have it perform well.”
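The partitioning Stevens describes, keeping small pieces of the data close to the processor, is usually done by hand with loop tiling (cache blocking). The minimal sketch below shows that loop structure for a blocked matrix multiply. The block size of 32 is an arbitrary assumption. In pure Python the example shows only the restructuring, not the cache speedup it targets in a compiled language.

```python
def blocked_matmul(A, B, block=32):
    """Multiply A (n x k) by B (k x m) one tile at a time.

    Each (ii, kk, jj) step touches only a block x block tile of A, B and C,
    so in a compiled language those tiles can stay resident in cache.
    """
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, k, block):
            for jj in range(0, m, block):
                # Work entirely inside the current tiles.
                for i in range(ii, min(ii + block, n)):
                    row_a, row_c = A[i], C[i]
                    for p in range(kk, min(kk + block, k)):
                        a, row_b = row_a[p], B[p]
                        for j in range(jj, min(jj + block, m)):
                            row_c[j] += a * row_b[j]
    return C

def naive_matmul(A, B):
    """Reference triple loop with no tiling, for comparison."""
    return [[sum(A[i][p] * B[p][j] for p in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(blocked_matmul(A, B, block=1))
```

With ample memory bandwidth, Stevens argues, the natural form (naive_matmul) would perform well without this extra bookkeeping.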
- Data Movement
Data movement, retrieving data from storage devices and returning them, allows HPC systems to analyze enormous stores of data, and storage capacity has grown at a pace comparable to Moore’s Law. However, the rate at which data can be moved has flat-lined over the last five to 10 years, and with the explosion in data-intensive computing, data movement has become a major power, cooling and cost factor.
“We have a problem of greater storage capacity without more bandwidth,” said Stevens, “so to get more bandwidth we have set up schemes in which many, many more storage devices are working in parallel.”
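The parallel-device scheme Stevens describes is commonly implemented as striping. A file is cut into fixed-size stripes laid out round-robin across devices, so a large read engages all of them at once. The sketch below assumes simple round-robin striping in the style of RAID 0. The 1 MB stripe size, the device counts and the 200 MB/s per-device bandwidth are hypothetical.

```python
MB, TB = 10**6, 10**12

def locate(offset, stripe_size, n_devices):
    """Map a byte offset in a striped file to (device, offset on device)."""
    stripe = offset // stripe_size   # which stripe holds the byte
    device = stripe % n_devices      # stripes are dealt out round-robin
    # Every full round over all devices adds one stripe to each device.
    device_offset = (stripe // n_devices) * stripe_size + offset % stripe_size
    return device, device_offset

def read_time_seconds(dataset_bytes, n_devices, device_bw_bytes_per_s):
    """Ideal read time if every device streams at full speed in parallel."""
    return dataset_bytes / (n_devices * device_bw_bytes_per_s)

print(locate(5 * MB, stripe_size=1 * MB, n_devices=4))
print(f"{read_time_seconds(100 * TB, 1, 200 * MB) / 3600:.0f} hours on 1 device")
print(f"{read_time_seconds(100 * TB, 1000, 200 * MB) / 60:.1f} minutes on 1,000 devices")
```

The ideal figure ignores the coordination, metadata and failure handling that make thousands of drives hard to program and manage.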
Spreading data across thousands of drives creates extreme programming and integration complexity, along with latency problems and a general drain on computing and staff resources.
“We’re generating a lot of data, we’re reading a lot of data, and we want to be able to write programs that can handle the increased volumes of data in the simplest possible way,” Stevens said. “We need simplicity not because we’re stupid, it’s so we can have broadly applicable systems across multiple application environments, for medical research, data analytics, entertainment. It’s so we don’t have to have a roomful of Ph.D. computer scientists trying to figure out how to make systems work. What we want is an architecture that is naturally high-performance, so more people can write applications that get the job done.”
- Aggregate Processors
With the explosion in processing power in today’s systems, it seems logical that more compute power makes a better computer, right?
Not so fast.
Many-core systems attack computational problems by distributing them across thousands of processors working in parallel. The partial results are then combined into an outcome: a simulation, a model or a number. But distributing problems across many processors poses a major programming challenge.
“It’s not about bigger systems with more aggregate processors,” said Stevens. “Bigger systems in fact tend to narrow the class of problems that can be solved. A system that grows from 1000 processors to 1 million is more powerful, but it’s also 10 or 100 times harder to use because of where I put the data, whether I have enough bandwidth for the data and enough I/O. So as systems get bigger they must also allow us to move applications to larger-scale systems.”
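Stevens’ point that bigger machines narrow the class of solvable problems can be illustrated with Amdahl’s law. This is a standard model, not one the article cites, in which a fraction p of a program’s work parallelizes perfectly and the rest runs serially. The 99.9% parallel fraction below is a hypothetical example.

```python
def amdahl_speedup(parallel_fraction, processors):
    """Ideal speedup when only parallel_fraction of the work scales."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / processors)

p = 0.999  # hypothetical: 99.9% of the work parallelizes
for n in (1_000, 1_000_000):
    print(f"{n:>9,} processors -> speedup {amdahl_speedup(p, n):,.0f}x")
```

Under this assumption, a thousandfold increase in processors only roughly doubles the speedup, so only problems with an even smaller serial fraction benefit fully from the larger machine. The model also ignores the data placement and bandwidth costs Stevens emphasizes, which make the real picture harder still.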
This programming complexity relates directly back to the need for greater balance: systems that can support the raw power of many-core environments across the spectrum of HPC building block technologies. It’s this complexity that keeps today’s architectures from becoming the broadly applicable systems, across multiple compute- and data-intensive application environments, that the HPC and enterprise communities want them to be.
“Today we have to be aware of the topology of the interconnect fabric and the placement of processes,” said Stevens. “What we lack is a very distributed memory architecture, with large amounts of addressable memory, in which programmers don’t have to worry where all the data goes or worry about the message passing among processors, in which they can write the code in a way that is unaware of crossing the boundaries between processors.”
- Power Consumption
The enormous amount of electrical power consumed by today’s supercomputers is a pressing issue for systems administrators and the C-suite (the people who pay the electric bills), though less so for end users, to whom power consumption is invisible. Still, the rapid rise in power use must be controlled for HPC systems to remain a practical computing platform.
“We’ve been improving supercomputing performance almost 2X every year, so supercomputers are approximately 1,000 times faster than they were 10 years ago,” said Gara. “At that rate, if we don’t significantly improve power efficiency we would have computers that took a thousand times more power than today. So energy efficiency is enormously important.”
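Gara’s arithmetic follows from compounding: doubling every year for 10 years gives 2^10 = 1,024, or roughly 1,000 times. The sketch below assumes that performance and energy efficiency (FLOPs per watt) both compound annually, and shows how power grows when efficiency lags performance. The 1.5x-per-year efficiency rate is a hypothetical example.

```python
def power_multiplier(perf_growth_per_year, efficiency_gain_per_year, years):
    """Factor by which power draw grows: performance / (FLOPs per watt)."""
    return (perf_growth_per_year / efficiency_gain_per_year) ** years

print(2 ** 10)                                   # performance growth over 10 years
print(power_multiplier(2.0, 1.0, 10))            # no efficiency gains at all
print(power_multiplier(2.0, 2.0, 10))            # efficiency keeps pace with performance
print(round(power_multiplier(2.0, 1.5, 10), 1))  # hypothetical 1.5x/yr efficiency
```

Holding power flat therefore requires efficiency to improve as fast as performance, roughly 2x per year in Gara’s scenario.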
Conway adds that according to IDC surveys of 10 years ago, power consumption was not a Top 10 issue. Now it consistently ranks number two or three. Some data centers are consuming as much electricity as a small city, bumping up against the limits of power available from local utilities. There’s even discussion of equipping data centers with small nuclear reactors.
Across all of these issues (memory bandwidth, data movement, many-core programming and power consumption), the good news is that new technologies are coming online that deliver significant performance improvements. How to combine them within a distributed framework that delivers more balanced, versatile and efficient systems is the focus of the major HPC vendors.
“Architectures over time continuously adapt,” said Mark Seager, Chief Technology Officer for the HPC Ecosystem, Intel. “This is driven by the emergence of new technologies. It’s like building a house: you build with the best materials if you want to build the best possible house. But those materials get better all the time, and eventually the house you built years ago becomes somewhat out of date. There are tremendous opportunities coming forward that are resulting in new architectural directions – new opportunities in memory, in optical technology, in transistors, new opportunities to integrate these technologies and allow us to do things we couldn’t do before. We’re making great strides in technology integration, integrating memory and processing. This requires looking at things from a system perspective, a holistic view of what is the optimal system and then out of that really falls the definition of a new framework and the development of those ingredients.”
In the second article of this series, we’ll talk with HPC experts on what needs to happen for the community to move forward with a new architectural direction.