Evolving the HPC Software Stack with Fresh Hardware

There is a gentle touch of hypocrisy about the supercomputing world. We proudly, and rightly, argue that the ability of HPC to deliver results to science, engineering and business relies not just on hardware but also on software and people. However, many of us still easily have our attention grabbed by the hardware – future processors, new architectures, esoteric options such as quantum computing – leaving the needs of programming, user interfaces, skills development, careers and diversity hiding at the back of the room.

When planning the development or enhancement of application software to use HPC facilities, or planning for future HPC system options, one of the most important technology questions is which programming technologies to use.

If we cut it down to its simplest form, there are two choices to make when programming for HPC. The first is the programming language(s) to use, and the second is the choice of parallel programming model(s). There is, of course, a surplus of options for both.

Programming languages are already numerous and new ones are regularly thrown into the arena from computer science research teams. However, the choice of which language to use for HPC applications is actually fairly easy.

The two dominant languages for HPC (indeed for scientific computing in general) remain Fortran and C/C++, although there is increasing use of Python. The popularity of Python is partly because its ease of use is seen to aid programmer productivity, and partly because it makes it easy to call routines written in other languages (most performance libraries used from Python, such as NumPy and SciPy, are written in C and Fortran).
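As a minimal sketch of that cross-language call path, the Python below uses the standard `ctypes` module to call the `cos` routine from the system C maths library. It assumes a Unix-like system where `ctypes.util.find_library` can locate that library, and the wrapper name `c_cos` is invented for illustration. Libraries such as NumPy typically rely on compiled extension modules rather than `ctypes`, so this shows the principle, not their mechanism.

```python
import ctypes
import ctypes.util
import math

# Assumption: a Unix-like system where the C maths library can be located.
_lib_name = ctypes.util.find_library("m")
if _lib_name is None:
    raise OSError("could not locate the C maths library on this system")
_libm = ctypes.CDLL(_lib_name)

# Declare the C signature: double cos(double).
_libm.cos.argtypes = [ctypes.c_double]
_libm.cos.restype = ctypes.c_double


def c_cos(x: float) -> float:
    """Hypothetical wrapper: evaluate cos(x) using the compiled C routine."""
    return _libm.cos(x)


if __name__ == "__main__":
    # Compare the C result with Python's own math module.
    print(c_cos(1.0), math.cos(1.0))
```

Declaring `argtypes` and `restype` matters here: without them `ctypes` would assume integer arguments and return values, and the call would give wrong results.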

These – Fortran, C/C++, or Python with Fortran/C support – are thus the default choice unless the programmer has a considered and well-articulated reason to choose an alternative.

That is not to diminish the value of new and clever languages. There are several of these created with HPC in mind from the start – such as Chapel (supported by Cray) and X10 (supported by IBM), both of which support abstractions designed to simplify programming at scale. However, a cynical (or visionary – it’s only a matter of perspective) observer might suggest that the best ideas of these will be (eventually) absorbed into mainstream options – either directly into the Fortran or C languages, or through community add-ons, or through the parallel programming models.

The dominant programming model in HPC for the last two decades has been MPI (Message Passing Interface), a language-independent protocol for inter-process communication that is typically implemented as a library. MPI supports both collective and point-to-point communication and has proven to be both portable and scalable. MPI has demonstrated an ability to absorb the best ideas from elsewhere – e.g. one-sided communications. MPI is almost always available on supercomputers and HPC facilities, so choosing MPI makes your code highly portable.
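To make the two communication styles concrete, here is a small sketch that imitates them with Python threads and queues. It is not MPI and uses no MPI library; the names `run_ranks`, `mailboxes` and `worker` are invented for illustration. Each simulated rank works only on its own data, sends its partial sum to rank 0 (point-to-point), and rank 0 then sends the total back to every rank (a hand-rolled broadcast standing in for a collective).

```python
import queue
import threading


def run_ranks(chunks):
    """Simulate len(chunks) ranks; return the total each rank ends up holding."""
    size = len(chunks)
    mailboxes = [queue.Queue() for _ in range(size)]  # one inbox per rank
    results = [None] * size

    def worker(rank):
        partial = sum(chunks[rank])          # work on this rank's data only
        if rank == 0:
            total = partial
            for _ in range(size - 1):        # receive one message per other rank
                total += mailboxes[0].get()
            for dest in range(1, size):      # send the result back to everyone
                mailboxes[dest].put(total)
        else:
            mailboxes[0].put(partial)        # point-to-point send to rank 0
            total = mailboxes[rank].get()    # block until the total arrives
        results[rank] = total

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(run_ranks([[1, 2], [3, 4], [5, 6]]))  # [21, 21, 21]
```

In a real MPI code the ranks would be separate processes, possibly on different nodes, and the gather-then-broadcast pattern would normally be a single collective call provided by the library.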

The second major parallel programming model is OpenMP, designed for shared-memory parallel computing, i.e., computing on a single node of an HPC machine. OpenMP is implemented by a set of directives which don’t modify the semantics of the program but which tell the compiler how to parallelise it. As the OpenMP standard has evolved it has become much more powerful: OpenMP 3.0 introduced task parallelism, which is particularly well-suited to programs with irregular workloads, and OpenMP 4.0 introduced the notion of task dependencies, which allow the OpenMP runtime to schedule tasks to maximise throughput. There is a set of proposed extensions to OpenMP for programming accelerators, but so far only Intel has implemented them for the Xeon Phi. OpenMP is also almost always available on supercomputers and HPC facilities – and also on workstations etc. – so, again, choosing OpenMP makes your code highly portable.
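The scheduling freedom that task dependencies give a runtime can be illustrated outside OpenMP. The sketch below is plain Python, not OpenMP: it uses the standard-library `graphlib.TopologicalSorter` (Python 3.9 or later) to group hypothetical tasks into waves, where every task in a wave has all of its dependencies satisfied and could therefore run concurrently. The task names and the `ready_waves` helper are invented for illustration.

```python
from graphlib import TopologicalSorter


def ready_waves(dependencies):
    """Group tasks into waves; tasks within one wave do not depend on each other.

    `dependencies` maps each task to the set of tasks it must wait for.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()                          # raises CycleError on cyclic input
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())    # everything runnable right now
        waves.append(ready)
        sorter.done(*ready)                   # mark finished, unlocking successors
    return waves


if __name__ == "__main__":
    # Hypothetical task graph: two transforms both need the loaded data,
    # and the final step needs both transforms.
    deps = {
        "load": set(),
        "fft_x": {"load"},
        "fft_y": {"load"},
        "combine": {"fft_x", "fft_y"},
    }
    print(ready_waves(deps))  # [['load'], ['fft_x', 'fft_y'], ['combine']]
```

An actual runtime does not wait for a whole wave to finish; it starts each task as soon as its own dependencies complete, which is what lets it keep cores busy on irregular workloads.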

However, HPC would be boring if it were that simple. There is a range of potentially powerful HPC hardware technologies that mean MPI and OpenMP are not sufficient for all cases. This is the hot topic of GPUs, or compute accelerators in general.

For programming NVIDIA GPUs, the “native” language is CUDA. An alternative to CUDA is OpenCL, which can target a variety of hardware including NVIDIA and AMD GPUs, Intel Xeon Phi, ARMv8 and even x86 CPUs. However, while programs may be syntactically portable, getting good performance on a specific platform requires a great deal of tuning (this is true even when moving code from one NVIDIA card to another). It is possible to get similar performance with both CUDA and OpenCL on NVIDIA hardware, but arguably CUDA is a much nicer language to program in, with better abstractions and a rich set of standard libraries.

If CUDA and OpenCL are too low-level, then the HPC community is of course delighted to offer yet more choices – in this case OpenACC, which uses a similar directive-based approach to OpenMP. OpenACC is supported by the Cray, PathScale and PGI compilers, and work is underway to add it to the GNU compiler suite. Unfortunately, there are still some issues with OpenACC. OpenACC directives are descriptive as opposed to prescriptive (the case in OpenMP), which means that compilers are free to interpret the directives as they see fit. This can lead to different behaviour in a single code compiled with different compilers. Another issue with OpenACC is that, although in principle it handles data transfer between host and device automatically, it won’t do a deep copy of a data structure which contains pointers. In this case the programmer has to manage the data transfer by hand, which can be more complicated than writing the code in CUDA/OpenCL in the first place. These issues are being addressed by OpenACC developers (and in the latter case the problem will go away with unified addressing), but for now OpenACC seems best suited to small, fairly straightforward applications, or maybe clearly understood and tested kernels within larger codes.

Of course, why stick with just one programming model when you can combine several? There are good reasons to use a hybrid programming model which combines two or more of the above. For example, one might have one MPI process per node and use OpenMP within that process to spread work across the processors on the node; or, where a node has a number of NUMA regions, run one MPI process on each region with OpenMP inside that. We have also seen MPI+CUDA, and there is no reason in principle why you couldn’t combine MPI with OpenCL or OpenACC. These hybrid models are used to access better performance, improve scalability, or support more helpful code structures.
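The two MPI+OpenMP layouts described above come down to simple arithmetic. The sketch below works it through for a hypothetical machine; the function `hybrid_layout` and the node counts are illustrative assumptions, not a description of any particular system or job launcher.

```python
def hybrid_layout(nodes, numa_regions_per_node, cores_per_numa_region, per="node"):
    """Return (mpi_processes, openmp_threads_per_process) for a hybrid run.

    per="node": one MPI process per node, threads span the whole node.
    per="numa": one MPI process per NUMA region, threads stay inside it.
    """
    if per == "node":
        processes = nodes
        threads = numa_regions_per_node * cores_per_numa_region
    elif per == "numa":
        processes = nodes * numa_regions_per_node
        threads = cores_per_numa_region
    else:
        raise ValueError("per must be 'node' or 'numa'")
    return processes, threads


if __name__ == "__main__":
    # Hypothetical cluster: 4 nodes, 2 NUMA regions per node, 12 cores per region.
    print(hybrid_layout(4, 2, 12, per="node"))  # (4, 24)
    print(hybrid_layout(4, 2, 12, per="numa"))  # (8, 12)
```

Both layouts occupy the same 96 cores; they differ in how many processes exchange messages and in whether a thread team ever reaches across a NUMA boundary.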

There are alternatives to MPI, particularly Partitioned Global Address Space (PGAS), which assumes a logical global address space divided between a number of physical devices. The two most notable implementations of this are coarrays, part of the Fortran 2008 standard (Co-Array Fortran), and Unified Parallel C (UPC), both of which support a compact, elegant syntax for sharing data between processes. Many PGAS languages are based on the SHMEM libraries, which provide support for one-sided communication.
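The idea of a logical global address space divided between devices can be sketched as an index mapping. The Python below is not a PGAS language or library; it only shows, for an assumed block distribution, how a global array index translates into an owning process and a local offset. The helper names `local_size` and `owner_and_offset` are invented for illustration.

```python
def local_size(n, nprocs, rank):
    """Number of elements of an n-element array held by `rank` (block distribution)."""
    base, extra = divmod(n, nprocs)
    return base + (1 if rank < extra else 0)   # the first `extra` ranks hold one more


def owner_and_offset(global_index, n, nprocs):
    """Map a global index to (owning rank, local index on that rank)."""
    if not 0 <= global_index < n:
        raise IndexError("global index out of range")
    base, extra = divmod(n, nprocs)
    boundary = extra * (base + 1)              # elements held by the larger blocks
    if global_index < boundary:
        return divmod(global_index, base + 1)
    rank, offset = divmod(global_index - boundary, base)
    return extra + rank, offset


if __name__ == "__main__":
    # Hypothetical case: 10 elements over 4 processes gives blocks of 3, 3, 2, 2.
    print([local_size(10, 4, r) for r in range(4)])
    print(owner_and_offset(6, 10, 4))  # (2, 0)
```

A PGAS language hides this translation behind its syntax, so the programmer writes what looks like an ordinary array access while the runtime turns it into a local load or a one-sided remote transfer.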

Finally, there is also the promise of a better world. More natural programming, closer to the science/engineering/maths, yet able to deliver performant and scalable code. Domain Specific Languages (DSLs) allow the developer to write code at a level closer to the natural language of the application (e.g., in mathematical formulae or by describing the algorithms). The DSL engine usually then generates low level code (e.g., Fortran, MPI, OpenMP, CUDA, …) from the high level DSL code. Arguably most DSLs are still at an experimental stage. The DSL option attracts debate from both passionate advocates and hardened sceptics. Some are able to tell stories where DSLs have worked well; others claim DSLs are often too use-case-specific to be generally useful even within a field.

So where does all this leave the programmer or HPC service planner?

The short term is easy. Stick to Fortran (or C/C++ or Python). If maintaining existing code in Fortran/C/C++, then it is worth considering upgrading to recent standards. However, Fortran in particular has gone through a number of revisions and many compilers still do not support all of Fortran 2003 and 2008, even though a new standard is expected in 2016.

For HPC platforms, whether developing new or existing codes, the most supported and portable model for parallel programming is MPI with or without OpenMP. For accelerators the best current advice is to use the native language supported by the vendor.

Looking further ahead, as machines evolve, more and more of the potential performance will come from parallelism. Indeed, parallel processing techniques are already responsible for perhaps more than 99% of the theoretical performance of modern CPUs or GPUs. The important thing is to identify all the parallelism in your application and, if necessary, re-write it to make it explicit.
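A back-of-the-envelope calculation shows where a figure like that can come from. The processor parameters below are hypothetical, chosen only to illustrate the arithmetic: theoretical peak scales with core count, SIMD width and the number of fused multiply-add (FMA) units, so a scalar, single-threaded code can reach only a small fraction of it.

```python
def parallel_share_of_peak(cores, simd_lanes, fma_units_per_core):
    """Fraction of theoretical peak that is only reachable through parallelism.

    Baseline: one core issuing one scalar fused multiply-add per cycle.
    Clock speed cancels out because it applies equally to both cases.
    """
    speedup_over_scalar = cores * simd_lanes * fma_units_per_core
    return 1.0 - 1.0 / speedup_over_scalar


if __name__ == "__main__":
    # Hypothetical CPU: 16 cores, 8 double-precision SIMD lanes, 2 FMA units per core.
    share = parallel_share_of_peak(cores=16, simd_lanes=8, fma_units_per_core=2)
    print(f"{share:.2%}")  # 99.61%
```

With these assumed numbers the peak is 256 times the scalar single-core rate, so a code that uses neither threads nor vector instructions leaves more than 99% of the machine idle.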

Hybrid models of parallelism are likely to become more important as HPC nodes become more heterogeneous and the memory hierarchies become more complex.

Hardware is changing and there is a good possibility that the “accelerator model” with separate address spaces on different devices will disappear within the next five years. It is also possible that a standard for directive-based programming of heterogeneous machines will be adopted by a critical mass of stakeholders, either based on OpenACC or on OpenMP 4.0.

However much we may close our eyes and hope, software should never be a static beast around which we deploy hardware. That is simply a path to reducing impact and dwindling competitiveness. Whichever programming language and model you choose, all codes will need continued attention to take full advantage of developments in programming models and hardware. Successful organisations recognise that software encapsulates significant value, IP, and opportunity – and make the necessary investments to keep it up-to-date and maximise its benefits to the business, science or engineering goals.

The Numerical Algorithms Group (NAG), in partnership with Red Oak Consulting, recently launched the HPC Technology Intelligence Service. In this exclusive for The Next Platform, based on an extract from the first report, we discuss one of the most important technology choices that HPC users and planners have to consider.
