GPU-accelerated supercomputing is not a new phenomenon; many high performance computing codes are already primed to run on Nvidia hardware in particular.
However, for some legacy codes with special needs (frequently changing models, high computational demands), particularly in areas like weather, the gap between those codes and the promise of GPU acceleration remains rather large, even with higher-level tools like OpenACC that aim to bridge the divide without major code rewrites.
Given the limitations of porting some legacy Fortran codes to GPUs, a research team at Tokyo Tech has devised what it calls “Hybrid Fortran,” which is designed to “increase productivity when re-targeting structured grid Fortran applications to GPU.” Using the ASUCA weather code (one of the main operational weather forecasting models used in Japan as of 2014) to test the concept, they claim the approach offers advantages over OpenACC in a few ways, which we will describe in a moment.
The team behind the Hybrid Fortran concept has been developing GPU-accelerated Fortran applications for weather prediction for some time; the term first came up at the GPU Technology Conference in 2013. Since then, much more work has gone into Hybrid Fortran for other complex codes on more advanced GPU hardware (Pascal-generation GPUs), and the advantages over OpenACC are quite noteworthy.
By investigating the necessary code changes with a completed implementation based on Hybrid Fortran, the Tokyo Tech researchers demonstrated that this method enables high productivity and performance when re-targeting ASUCA to GPU. According to the researchers, “More than 85% of the hybridized codebase is a one-to-one copy of the original CPU-only code – without counting white-space, code comments and line continuations. An equivalent OpenACC-based solution of ASUCA is estimated to require more than ten thousand additional code lines, or 12.5% of the reference codebase.”
The team says that this new implementation performs “up to 4.9x faster when comparing one GPU to one multi-core CPU socket. On a full-scale production run with 1581 x 1301 x 58 grid size and 2km resolution, 24 Tesla P100 GPUs are shown to replace more than 50 18-core Broadwell Xeon sockets”.
As for its advantages over OpenACC approaches to extracting parallelism and performance from a Fortran-native code with GPU acceleration, the team says one key is the abstraction of granularity. “Users have multiple granularities defined in the same codebase, depending on the targeted hardware architecture. This is a crucial advantage in order to implement ASUCA’s physical process on GPU—a code that originally has a very coarse granularity, which is ill-matched for GPUs.”
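The granularity idea can be sketched in plain Python. This is a conceptual illustration only, not Hybrid Fortran's actual directive syntax, and names such as `physics_column` are invented for the example: the same point-wise physics is reused, while the level at which horizontal parallelism is applied changes per target.

```python
NX, NY, NZ = 4, 3, 5

def physics_column(column):
    # toy "physics": cumulative sum down one vertical column
    out, total = [], 0.0
    for v in column:
        total += v
        out.append(total)
    return out

def run_coarse(grid):
    # CPU-style coarse granularity: the (i, j) loop sits once around
    # whole columns, giving one large task per horizontal point
    return {(i, j): physics_column(grid[(i, j)])
            for i in range(NX) for j in range(NY)}

def run_fine(grid):
    # GPU-style fine granularity: the (i, j) parallelism is pushed down
    # so every elementary update is an independent work item
    out = {(i, j): [0.0] * NZ for i in range(NX) for j in range(NY)}
    for k in range(NZ):              # sequential vertical dependency
        for i in range(NX):          # these two loops would map to
            for j in range(NY):      # GPU threads, one per grid point
                prev = out[(i, j)][k - 1] if k > 0 else 0.0
                out[(i, j)][k] = prev + grid[(i, j)][k]
    return out

grid = {(i, j): [float(i + j + k) for k in range(NZ)]
        for i in range(NX) for j in range(NY)}
assert run_coarse(grid) == run_fine(grid)
```

The column routine itself is written once; only where the horizontal parallelism attaches differs between the two drivers, which is the restructuring a coarse-grained code like ASUCA's physics needs before it maps well to a GPU.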
Another advantage of hybrid Fortran, the team says, is that the details of memory layout are also abstracted without losing vital elements of the existing code. “The layout is reordered at compile-time to match the target architecture and extended with additional dimensions to match the specified parallelization granularity.”
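What compile-time layout reordering means can be shown with a minimal sketch, again in illustrative Python rather than anything the framework actually generates; `idx_cpu` and `idx_gpu` are invented names:

```python
NX, NY, NZ = 8, 6, 4

def idx_cpu(i, j, k):
    # CPU-friendly layout: k varies fastest, so each vertical
    # (i, j) column is contiguous for sequential sweeps
    return (i * NY + j) * NZ + k

def idx_gpu(i, j, k):
    # GPU-friendly layout: i varies fastest, so neighbouring threads
    # (parallel over i) touch neighbouring memory (coalesced access)
    return (k * NY + j) * NX + i

# both layouts address every cell exactly once,
# mapping onto 0 .. NX*NY*NZ-1 with no gaps or collisions
cells = [(i, j, k) for i in range(NX) for j in range(NY) for k in range(NZ)]
assert sorted(idx_cpu(*c) for c in cells) == list(range(NX * NY * NZ))
assert sorted(idx_gpu(*c) for c in cells) == list(range(NX * NY * NZ))
```

Because the user code only ever writes the canonical `(i, j, k)` indices, swapping one index function for the other at compile time retargets the whole application's memory layout without touching the numerics.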
There are other approaches to getting efficient ports of legacy codes onto hybrid systems. Stencil domain-specific languages are one, but the team says these involve far too much code rewriting. Directive-based porting methods are another, but the team found these error-prone on GPUs due to their coarse-grained parallelization (more detail and examples in the paper). There are also granularity optimization methods, including kernel fusion, as well as broader memory layout and abstraction methods. Ultimately, the team says that “no existing method that we are aware of combines memory layout abstraction and a flexible parallelization granularity with the ability to reuse existing Fortran code for GPGPU” until Hybrid Fortran.
For CPU targets, Hybrid Fortran generates an OpenMP code version, with multi-core parallelization applied to the outermost loop. For GPU targets it defaults to CUDA Fortran kernels (generating all the necessary host and device code boilerplate as well as data copy operations), with an option to use OpenACC kernels with CUDA-compatible data structures. The template attribute specifies a macro suffix used for the generated block size parameters; this allows a central configuration for the block sizes used in an application, rather than leaking this architecture-specific optimization into the user code in each kernel.
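The centralized block-size configuration can be imagined along these lines. This is a hedged sketch of the idea, not the framework's API; `BLOCK_SIZES` and `kernel_config` are invented names:

```python
# one central, architecture-specific table of block sizes, keyed by the
# template name a kernel refers to (one place to tune, per machine)
BLOCK_SIZES = {
    "DEFAULT": (32, 4, 1),
    "PHYSICS": (128, 1, 1),   # e.g. column-wise physics kernels
}

def kernel_config(n_i, n_j, template="DEFAULT"):
    # derive a (grid, block) launch configuration from the named template,
    # rounding the grid up so every point of the domain is covered
    bx, by, bz = BLOCK_SIZES[template]
    grid = ((n_i + bx - 1) // bx, (n_j + by - 1) // by, 1)
    return grid, (bx, by, bz)

# cover the 1581 x 1301 horizontal grid of the production run above
grid, block = kernel_config(1581, 1301, template="PHYSICS")
assert grid[0] * block[0] >= 1581 and grid[1] * block[1] >= 1301
```

Kernels name a template instead of hard-coding CUDA block dimensions, so retuning for a new GPU generation changes one table rather than every kernel in the user code.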
“It is notable that CUDA Fortran requires a fairly large amount of boiler plate code for grid setup, iterator setup, host- and device code separation as well as memory- and error handling – Hybrid Fortran allows the user to pass on the responsibility for that to the framework. Compared with the code generated by OpenACC however (assembly-like CUDA C or NVVM intermediate representation), the Hybrid Fortran generated CUDA Fortran code remains easily readable to programmers experienced with CUDA. Experience shows that this is a productivity boost, especially in the debugging and manual performance optimization phase of a project.”
Ultimately, the researchers conclude that considering that many of the changes still required in the user code are “mechanical” in nature, “we expect additional productivity gains to be possible from further automation. We strive to achieve a solution where a Hybrid Fortran-based transformation can be applied to large structured grid applications wholesale with minimal input required by the user.”
The full paper can be found here.