If you want to understand where we are going with computer architectures and the compilers that drive them, it is instructive to look at how compilers have made the leap from architecture to architecture starting six decades ago.
Let’s go back to the very first compiler, IBM Fortran, in 1957. It was an amazing piece of technology. If you look at where it started and what they came up with, it was the kind of effort that you can’t even imagine someone undertaking today.
IBM wanted to sell computers, and to sell computers it needed to enable more people to program them. At the time programming was done with assembly language. This was too hard, and IBM knew that. So Big Blue wanted a way for people to write programs more quickly, without sacrificing performance. The Fortran developers – meaning those who created Fortran, not the programs that used the compiler and language – wanted to deliver performance as close as possible to hand-tuned machine code from programs written in what today we call a high-level programming language.
When it comes to compilers, you have to reckon with the three Ps: performance, productivity, and portability. The creators of Fortran were comparing performance to machine code. The productivity benefit was that programmers no longer had to write that machine code. I don’t know IBM’s original intent regarding portability, but you could not patent software in 1957 and IBM didn’t complain about other organizations implementing Fortran. So, shortly thereafter, Fortran compilers for other machines from other vendors appeared. This immediately gave Fortran programs portability that was inconceivable with machine code.
Compilers were off and running from this point forward. Let’s look at it by decade.
The First Decade
It was in the 1960s, just before my time, that computer architects and therefore compiler writers first started to think about parallelism. Even then people were thinking that computers were not fast enough, and were not getting faster quickly enough, and saw parallelism as a possible answer to that problem. We saw the introduction of instruction-level parallelism in Seymour Cray’s CDC 6600 and CDC 7600, and in the IBM System/360 Model 91. Even more aggressive was the development of the ILLIAC IV designed by researchers at the University of Illinois and built by Burroughs, the Control Data Corp STAR-100, and the Texas Instruments Advanced Scientific Computer. The CDC and TI systems were memory-to-memory long vector machines, while the ILLIAC IV was what today we would call a SIMD machine. The ILLIAC was limited in what it could do because there was no primary scalar processor, and it was really hard to program. The Fortran compiler for the STAR-100 added syntax to describe long, contiguous vector operations. The TI ASC machine was the most interesting because it had the first automatic vectorizing compiler. TI did some amazing work and really stretched the state of the art for compiler analysis at the time.
The Second Decade
In the 1970s, the Cray-1 became the first commercially successful supercomputer. It had many follow-ons and many copycats, which a lot of readers probably know very well. The success of the Cray machines was due in large part to the introduction of vector registers. Just like scalar registers, they allowed programs to do many operations on small vectors of data without having to load from and store to memory. Cray Research also developed a very aggressive vectorizing compiler. It was similar in many respects to the earlier TI compiler, but the Cray compiler had additional capabilities that made it quite interesting. One of the most important was the ability to provide compiler feedback to the programmer.
When developers compile a program today, the compiler generates error messages if the program contains syntax errors. The programmer keeps fixing these errors and re-compiling until they have a working program. If a developer wants a program to run faster, they can enable a compiler optimization flag to make the generated executable run faster – they hope. The difference between optimized code and non-optimized code on most machines is maybe a factor of two, and usually much less – say 10 percent, 20 percent, or maybe 50 percent – better performance. Contrast that with the vector instruction sets available on the original Cray machines. In that case, programmers would often see a factor of 5X to 10X improvement when code was optimized and vectorized by the Cray compiler. Programmers, and especially HPC programmers, are willing to do a lot for a factor of 5X or more in performance improvement.
The feedback from the Cray compiler would tell the programmer that it had vectorized a loop at line 110 and another loop at line 220, but it did not vectorize the loop at line 230 because there was an I/O statement or a function call at line 235 which couldn’t be vectorized. Or, maybe there was an array reference that the compiler couldn’t analyze, or an unknown variable in the second subscript of a particular array that hindered dependence analysis, so the loop couldn’t be vectorized. Cray programmers, wanting to achieve vector performance, paid careful attention to these messages. They modified their code according to this feedback, maybe taking an I/O statement out of a loop, or pushing a loop into a subroutine, or modifying an array reference to remove an unknown variable. Sometimes they would add a compiler directive to convey to the compiler that it’s safe to vectorize even if the compiler can’t determine that through dependence analysis.
Thanks to that compiler feedback, three things happened.
First, more programs were vectorized and more programmers benefited from Cray vector performance. Second, Cray programmers were trained to stop putting I/O statements, conditional statements, and procedure calls in the middle of their loops. They learned that stride one access to data was important and made sure array accesses in their inner loops were stride one. Third, programs that auto-vectorized with the Cray compiler could be recompiled on a lot of similar vector machines from Alliant, Convex, Digital, Fujitsu, Hitachi, IBM, and NEC. All of these systems had vector processors with vector registers and automatic vectorizing compilers, and code previously optimized for a Cray happened to vectorize and run pretty well on all of these machines. In short, Cray programmers ended up with a trifecta of performance, productivity, and portability.
I can’t over-emphasize how important this compiler feedback was, and how successful it was in training a whole generation of HPC developers to program vector machines. Everybody that used it was very happy. Back in those days, I was at the University of Illinois and we thought about starting a company based around parallelizing compiler technology. Of course, our technology was better than anyone else’s because we were the smart academics. In the early days of the Cray compiler, users complained that it couldn’t vectorize anything and that they had to rewrite their code. We thought we could solve those problems and started this little company to do just that. When we approached those same users a couple of years later and showed them our tools to parallelize and vectorize loops, they told us they didn’t need such tools because the Cray compiler already vectorized all of their loops. It wasn’t so much that the Cray compiler got smarter, although I’m sure it got better over time. Primarily, it was that programmers had been effectively trained in how to write vectorizable loops.
The Third Decade
The 1980s brought the widespread implementation and use of multiprocessors. There had been multiprocessors earlier, including IBM System/370s and Burroughs systems. But it was the introduction of 32-bit single-chip microprocessors that drove multiprocessing into the mainstream. To start a computer company, you no longer needed to design a processor – you could buy that. You didn’t have to write an operating system – you could license Unix. All you needed was someone to screw it all together and put a nameplate on it. Oh, and it would be good to have a compiler for all of that. Sequent, Encore, and SGI all built systems that had great performance with one microprocessor, but if you could get a bunch of them running in parallel it was even better.
Unlike automatic vectorization, which was incredibly successful, automatic parallelization was basically a bust. It worked great for innermost loops, but to achieve meaningful parallel speed-ups you typically have to parallelize outer loops. But of course, outer loops add complexity to the control flow and often include procedure calls, and now your compiler analysis completely falls apart. An approach that did work was to get the programmer involved, analyzing the parallelism and entering compiler directives to drive it. Out of this, we saw the emergence of a variety of directive sets for parallel loops. Cray, Encore, IBM, and Sequent each had their own directive sets, the only commonality being that SGI adopted the Sequent directives. Similarly, there were a number of scalable systems with network message passing, which drove the development of a variety of vendor-specific and academic message-passing libraries.
The Fourth Decade
In the 1990s, all those message-passing libraries were supplanted by the Message Passing Interface (MPI), and all those vendor-specific parallelization directives were replaced by OpenMP. More scalable parallel systems appeared, such as the Thinking Machines CM-5 and others with thousands of commodity microprocessors. Microprocessor-based scalable systems were programmed largely with MPI, but MPI is a library and is opaque to compilers. The advantage of MPI as a programming model is that while communication is expensive, it is very much exposed in the program. MPI programmers know when they are inserting explicit communication calls, and they work very hard to minimize and optimize communication as much as possible. The downside is that from a programming model point of view it’s very low level, and MPI programmers get no help from compilers.
OpenMP was born out of that variety of vendor-specific parallel directive sets because users demanded it. They refused to rewrite their programs 12 times just to be able to run them on all the different systems that were available. The whole charter of OpenMP was to spell “PARALLEL DO” the same way. Unlike MPI, directives must be supported by a compiler because they are part of the language. You can’t implement OpenMP as a library.
Also in the 1990s, we saw the emergence of SIMD instruction sets for single-chip microprocessors, like SSE and SSE2 from Intel, and we saw compilers resurrecting the same vectorization technology that was by then 20 or 25 years old in order to exploit those SIMD instructions automatically.
The Fifth Decade
Shortly after 2000, multicore microprocessors became commonly available from a variety of vendors. The world suddenly realized that this parallel programming problem had to be solved for the masses.
At the time, a lot of applications used a flat MPI programming model across all the cores on a chip as well as across distributed-memory nodes. You can always add more MPI ranks to use more parallelism. This works to a point, but some MPI programs encounter scaling issues when you start to get to really large numbers of ranks. The memory usage can get a little out of control when data is replicated in every MPI rank. Once you start putting a dozen or two dozen cores on each node, that is 12X or 24X the amount of memory in use for each item that is replicated. For programming a single node, OpenMP and other shared-memory parallel programming models began to gain traction. For instance, in the 1990s Intel introduced its Threading Building Blocks (aka TBB) and by some claims it’s the most popular parallel programming language today.
Around this same time heterogeneous HPC systems started to appear, although heterogeneity was not really new. We had heterogeneity in the 1960s with attached co-processors for array processing. Attached processors were very popular because they could do specialized operations faster than the CPU. They could often deliver the performance of a mainframe at the price of a minicomputer, something that Floating Point Systems did well for about 15 years. In the early 2000s, there were several accelerators developed specifically or re-purposed for the HPC market, such as the ClearSpeed accelerator and the IBM Cell processor. Both products showed some real technical success, but the HPC market proved to be too small to support the development of custom processor chips on its own.
That left a gap open for GPU computing. The advantage of GPUs over custom accelerators is that a GPU has a very good day job – in the early days it was graphics and gaming, and now it includes AI and deep learning as well – so the very high cost of developing a processor on the latest silicon techniques and the software necessary to drive it can be sustained.
With the rise of heterogeneity came the obvious question – how do we program these things? Back in the day, Floating Point Systems delivered a subroutine library that used their accelerators under the hood. Programmers called into the library, and the HPC masses could effectively use the accelerator without actually programming it (though there were some programmers who did). Today, to address the range of applications that have been developed through the years on scalar, vector, and scalable systems, HPC developers need to be able to program accelerators effectively and they want their programs to look as normal as possible. Using OpenCL or CUDA gives you a lot of control but can be very intrusive with respect to the impact on existing HPC source code. The programmer typically extracts code to be run on the accelerator into specially annotated functions, and often writes it in a very different way than it’s written for the host. This is where directive-based programming models and languages specifically designed for general-purpose parallel programming have some advantages, and a good optimizing compiler can come in very handy.
Living In A Parallel Universe
This all brings us to another very important point. All of the technologies developed in the 1960s for instruction-level parallelism are now inside of your microprocessor. The vector processing concepts from the 1970s live on in the SIMD registers inside of your microprocessor. The multiple processors of the 1980s Crays and commodity processor-based systems are all inside of your microprocessor in the form of multiple cores. The microprocessor today is basically the sum of all architectural work done in the past 50 years, and we can afford to do that because of the fantastic transistor budget – billions of gates – that we now have. And we also now have heterogeneity, and, in some cases, we even have that on the same chip. For instance, the Sunway chip in the TaihuLight system in China is one chip from a packaging standpoint, but each of the quadrants on the die includes a master processor and 64 compute processors. So, it is fundamentally heterogeneous, and you program it more like a CPU-GPU hybrid than a multicore processor.
To take full advantage of heterogeneous GPU-accelerated nodes, a program needs to expose a lot of parallelism – and the right kind of parallelism. These highly parallel processors deliver performance not from a faster clock, and not because of some extremely exotic architecture. Rather, it’s because more of the available gates are used for parallel cores than are used for cache, out of order execution, or branch prediction. All of those staples of commodity CPUs are left out of GPUs, and the result is a massively parallel processor which delivers more throughput on suitable kernels and applications.
Which brings us back to the three P’s, and the fact that performance is not the only goal for HPC developers. Programmers want high performance, good productivity, and broad portability. And most of them want to write their programs only once, not multiple times in the same file with different paths selected based on conditional compilation flags.
In the days when all that you had were single processors on a node and all parallelism spanned across nodes, programmers could get away with using flat MPI. Such programs worked well across a broad family of machines, but that is no longer the HPC world we live in today. We have multiple processors in a node, we have SIMD instructions, and we have heterogeneous accelerators that introduce even more types of parallelism. To achieve portability, we need to be able to write programs that will map efficiently across differing numbers of cores, different SIMD or vector lengths, and systems that are either homogeneous or heterogeneous – without having to rewrite programs each time we move to a new system. For productivity, we need to abstract as much of this away as possible. What we’re trying to deliver in compilers for today’s HPC systems are the same abstract qualities that IBM delivered six decades ago with the first Fortran compiler, which really was the first high-level language compiler of any type.
Getting as close as possible to hand-coded performance on today’s extremely complicated hardware is the goal. Today, the PGI OpenACC compiler targeting NVIDIA GPUs can often deliver close to native CUDA performance for small, straightforward code blocks. For more complex code sequences like large functions, or when whole call trees are being ported to the GPU, it’s much more difficult to approach native code performance.
To be fair, the OpenACC compiler is at an inherent disadvantage. When rewriting a code in CUDA, a programmer might determine that a given data structure is not right for the GPU and might change it to increase parallelism or performance. Perhaps the programmer determines part of the program logic isn’t quite right for a GPU, or that data access patterns are suboptimal, and rewrites that logic or loop to improve data access patterns. If you make these same changes in the equivalent OpenACC program, you often get much better performance as well. But you would rather not have to do that, especially if the rewrite slows down the code on a multicore CPU. As a result, the OpenACC code may not get the same level of performance as if you had expended that coding effort. Even so, if the performance of the OpenACC code is close enough to CUDA on the GPU, the productivity and portability benefits of a directive-based programming model will often prevail.
Despite all the progress we’ve made in compiler technology over the last 60 years, some people are still of the opinion that compilers are more of a problem than a solution. They want predictability from a compiler, not the advanced analysis and code transformations that optimizing compilers do behind the scenes. That can lead down a path where we try to take functionality out of compilers and put more responsibility in the hands of the programmer. One can argue that this is exactly the path that OpenMP is following today, and the danger I see is that we may end up sacrificing portability and productivity for predictability. For instance, instead of automatic vectorization for the SIMD instructions on Intel Xeon processors, imagine if you still had to vectorize every loop manually using Intel’s SSE and AVX intrinsic functions. You’re effectively writing inline assembly code. That is very predictable, but few programmers want to write all of their compute-intensive code at that level. Particularly when you have to rewrite that code for each generation of SIMD instructions, or to use the SIMD instructions on non-X86 CPUs.
Any reasonable advance in computer architecture can and should come with a reasonable advance in compiler technology. The benefit is not measured only by the performance those architectures and compilers can deliver. It also must be measured by the human effort saved and productivity gained when programs can be ported from one generation of HPC system to the next without having to be completely rewritten.
Michael Wolfe has worked on languages and compilers for parallel computing since graduate school at the University of Illinois in the 1970s. Along the way, he co-founded Kuck and Associates (acquired by Intel), tried his hand in academia at the Oregon Graduate Institute (since merged with the Oregon Health and Sciences University), and worked on High Performance Fortran at PGI (acquired by STMicroelectronics and more recently by Nvidia). He now spends most of his time as the technical lead on a team that develops and improves the PGI compilers for highly parallel computing, and in particular for Nvidia GPU accelerators.