Hunting the Mythical Automatic Parallelizing Compiler

As we have heard time and again, one of the greatest challenges for future exascale-class computing, second only to power consumption, is how codes are ever going to evolve to take advantage of that much compute. Even several steps down the computing food chain, there are plenty of companies with mid-sized clusters that have relied on the same homegrown code for decades but are locked out of the performance, efficiency, and scale those systems offer because their outdated applications cannot tap into the potential of massive parallelism.

As the size of the problems developers want to solve scales, so too do the parallelization challenges. To take advantage of the massive amount of compute power sitting idle under sequential codes, one of two things needs to happen: either the code must be rewritten from scratch to match the capabilities of newer high core count systems, or the existing code must be parallelized. As one might imagine, neither of these tasks is particularly attractive.

Outside of research and academic HPC (where there are graduate students to do the necessary parallelization), commercial high performance computing centers are at a crossroads code-wise. Consider the oil and gas industry, where a great deal of advanced modeling and simulation work runs on top-tier supercomputers (for instance, the 6.7 petaflops-capable machine energy giant Total just bought), quite often using internally developed code. While even the off-the-shelf commercial codes are hitting their scalability limits, it ultimately comes down to internal expertise to bring homegrown code into the parallel era.

But there’s another problem. And there’s no way to beat around the bush here: parallelizing code is damned hard. It is hard because, in large part, it is a manual process. Worse yet, the people who developed these codes are moving to Florida and taking up canasta. This retirement boom means a dearth of software talent for companies in oil and gas and other areas where the influx of HPC hardware has outpaced outmoded code, which makes the need for smarter parallelizing compilers even more pressing. And even if the work itself were not so difficult, doing it properly and minimizing errors is harder still. The real value of automating parallelization would be accuracy, if not pure productivity.

What the world really needs, especially as architectures tend toward ever more cores (with accelerators tacked on) and far more memory to exploit, is a magical way to auto-parallelize complex numerical codes. This would be a dream come true for developers the world over, so why hasn’t automatic parallelization become a reality?

A full answer to that question would require a separate technical article exploring all the reasons it does not work in application-specific contexts. There is a range of compiler tools and libraries from Intel, PGI, and others to make parallelization of complex scientific and technical computing codes easier, and there is strong momentum behind OpenCL, OpenACC, and related standards. The problem is that even with the best use of these tools, the process is semi-automatic at best, and again, prone to error. They cannot work around the limitations barring full automation, which range from untangling dependencies and understanding how loops are constructed to managing the differing requirements for compute, memory, and I/O resources. On top of that, many of these codes, especially the older Fortran ones, are simply too idiosyncratic to pry apart in an automated fashion.
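To make the dependence problem concrete, here is a minimal C sketch (an illustrative example, not drawn from any of the codes mentioned above) contrasting a loop today’s tools can usually handle with one whose loop-carried dependence defeats naive automation:

```c
#include <stddef.h>

/* Independent iterations: a[i] depends only on b[i] and c[i], so a directive
 * such as OpenMP's "parallel for" can safely split the loop across threads. */
void scale_add(double *a, const double *b, const double *c, size_t n)
{
    #pragma omp parallel for
    for (size_t i = 0; i < n; i++)
        a[i] = 2.0 * b[i] + c[i];
}

/* Loop-carried dependence: iteration i reads the value written by iteration
 * i - 1 (a running prefix sum).  Blindly adding "parallel for" here produces
 * wrong answers, so an automatic tool must either prove the recurrence can be
 * rewritten (for example, as a parallel scan) or leave the loop sequential. */
void prefix_sum(double *a, size_t n)
{
    for (size_t i = 1; i < n; i++)
        a[i] = a[i - 1] + a[i];
}
```

Deciding which of these two cases a real loop falls into, across thousands of lines of decades-old Fortran or C, is exactly the part that resists automation.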

The compilers and tools of today do help, but what might be missing? At the recent Rice Oil and Gas Workshop, which is devoted to the systems side of the energy business, we heard from Appentra, whose aim is to deliver an as-automatic-as-possible parallelizing compiler that can work alongside the existing array of HPC compilers and tools and handle some higher-level parallelization functions automatically, specifically by wrangling the loop problems in many sequential technical computing codes.

Appentra, which got its start in university research labs working to make codes and HPC hardware fit better together, has developed what it considers the closest thing to date to the mythical “magical parallelizing compiler.” Most of the execution time of large-scale numerical codes is typically concentrated in large loops; the generally cited figure is that these loops account for 90 percent of the execution time despite being only 10 percent of the code. This seemed like a good place to start, and it is where Appentra spins its web. Like other compilers for parallel jobs, its job is to automatically find such coarse-grained parallelism in the sequential source code. The difference is that it then generates parallel source code that maximizes performance on modern computing systems, minimizing the synchronization and memory overhead needed to take advantage of parallelism.
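As an illustration of what a source-to-source pass produces, consider the before-and-after below. This is a hypothetical C example, not actual Parallware output: the sequential accumulation into sum is recognized as a reduction and expressed with OpenMP’s reduction clause, which keeps synchronization down to a single per-thread combine rather than a lock around every update.

```c
/* Input: a sequential dot product. */
double dot(const double *x, const double *y, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}

/* Plausible output of a source-to-source parallelizer: the recurrence on
 * "sum" is classified as a reduction, so the generated code uses per-thread
 * partial sums combined once at the end instead of a critical section
 * around every update. */
double dot_parallel(const double *x, const double *y, int n)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}
```

The appeal of working at the source level is that the result is still ordinary annotated C or Fortran, which the usual optimizing compilers then turn into binaries.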

“If we look at different compilers on the market, ours is designed to be complementary with other HPC compilers,” explained Manuel Arenaz from Appentra. Both his company’s compiler and the standard HPC compilers from Intel, PGI, and so on (as well as semi-automatic tools like OpenMP, OpenCL, and others) can find coarse-grained parallelism, but Appentra’s sits on top as a source-to-source parallelizing compiler, whereas the others are source-to-binary optimizing compilers.

Arenaz says that the key to HPC developer productivity and performance is a compiler that takes care of where parallelization takes place and how it is done. “If you think about classical compiler theory, it is built around data dependence analysis and idiom recognition. It is extremely sensitive to variations in the syntax and quality of different implementations of a given algorithm. As a result of these technical limitations, it has not been shown to be effective for the automatic parallelization of full-scale, real programs,” he explained.
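A small illustration of that sensitivity (again a made-up example, not one of Arenaz’s): both functions below compute the maximum of an array, but the second hides the reduction behind a pointer update, which can be enough to keep a purely pattern-based analysis from recognizing the idiom.

```c
/* Textbook form of a max reduction: easy for idiom recognition to classify
 * and therefore to parallelize. */
double max_v1(const double *x, int n)
{
    double m = x[0];
    for (int i = 1; i < n; i++)
        if (x[i] > m)
            m = x[i];
    return m;
}

/* The same algorithm with the running maximum updated through a pointer.
 * The semantics are identical, but a syntax-driven analysis may no longer
 * match the reduction pattern and will conservatively keep the loop
 * sequential. */
double max_v2(const double *x, int n)
{
    double m = x[0];
    double *best = &m;
    for (int i = 1; i < n; i++)
        if (x[i] > *best)
            *best = x[i];
    return m;
}
```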

Challenges in achieving peak performance throughout the HPC workflow include extracting parallelism, managing data communications, and exploiting locality. “All the work that has already been done on sequential codes can be carried over—and instead of just trying to get better performance, we can start to think about how we can get these codes to achieve peak performance on manycore and multicore architectures.”
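Locality is the easiest of those three to show in a few lines. The sketch below is a standard textbook transformation, not anything specific to Parallware, assuming square row-major matrices and a hypothetical TILE block size: the tiled version performs the same arithmetic but reorders it so each block stays in cache while it is reused, and the independent tile loops can then be parallelized.

```c
#define TILE 64  /* assumed block size; tuned to the cache in practice */

/* Naive matrix multiply, C += A * B, with n x n row-major matrices.  Each
 * pass over B streams the whole matrix through cache, so large problems
 * become memory-bound. */
void matmul_naive(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                C[i*n + j] += A[i*n + k] * B[k*n + j];
}

/* Tiled version: the same arithmetic reordered so TILE x TILE blocks are
 * reused while still resident in cache.  Each (ii, jj) pair owns a distinct
 * block of C, so the two collapsed tile loops run in parallel without races. */
void matmul_tiled(int n, const double *A, const double *B, double *C)
{
    #pragma omp parallel for collapse(2)
    for (int ii = 0; ii < n; ii += TILE)
        for (int jj = 0; jj < n; jj += TILE)
            for (int kk = 0; kk < n; kk += TILE)
                for (int i = ii; i < ii + TILE && i < n; i++)
                    for (int j = jj; j < jj + TILE && j < n; j++)
                        for (int k = kk; k < kk + TILE && k < n; k++)
                            C[i*n + j] += A[i*n + k] * B[k*n + j];
}
```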

The company’s automatic parallelizing compiler, Parallware, has been tested on a series of scientific codes over the last ten years, and the results of that work have been published. While it is hard to see this as a true magic bullet for bringing codes into the modern era, it is an important step, and one that will be showcased on Top 500-level systems over the coming years for key simulation codes.


2 Comments

  1. Two of the (many) serious problems with generating parallel applications are:

    (1) Programming languages that are fundamentally serial, with limited ability to specify particular kinds of parallelism bolted on as afterthoughts.

    (2) Hardware architectures that are fundamentally serial, with just enough restrictions added to enable communication and synchronization to be implemented correctly.

    The combination of these two factors is a big problem. Many attempts to develop programming languages that enable fine-grained parallelism have failed because fine-grained communication and synchronization are remarkably slow on current hardware. The smaller number of attempts to support fine-grained parallelism in hardware have largely failed because there are no widely used programming models that require this functionality.

    At larger scales of parallelism, the inability of programming languages to specify data motion (both through the memory hierarchy and across memory hierarchies) and the inability of hardware to explicitly control data motion provide the next stumbling block. As with fine-grained parallelism, there have been numerous failed attempts to include data motion control in software, but without explicit hardware support these are doomed to be inefficient. Likewise, attempts to provide explicit control over data motion in hardware have failed because of the lack of mature software support.

    The same software/hardware “chicken and egg” problem applies to the ubiquitous failure to develop secure systems. You can’t really do it without new programming languages running on new hardware.
