The era of vector supercomputing might sound like ancient history to some, but it’s still deeply rooted in major commercial and government institutions. While that is not optimal or desirable in most cases, rewriting or porting old codes that still do heroic work has often not been possible or practical given the costs and challenges involved, especially in demanding application areas in high performance computing.
The U.S. Naval Research Laboratory is among the organizations hoping to salvage long-used vector codes on modern systems without high-overhead code refactoring. Specifically, they’ve looked at a legacy computational fluid dynamics (CFD) solver created at the U.S. Air Force research hub, which was written in Fortran and has been extended over the years with Fortran 90 and MPI tweaks.
The FDL3DI code, which made its first appearance in the early 1990s, was designed for vector processing and is still used in aerospace and other domains, almost exclusively in government applications. Keith Obenschain from the Naval Research Laboratory has collaborated with NEC and Syntek Technologies to bring the benefits of modern HPC to bear, via the NEC Vector Engine, without all the trauma of code rewriting or porting.
NEC’s vector history goes all the way back to 1983, as does that of some of the codes still used today, but the company has managed to scale compute capability in the NEC Vector Engine in a thoroughly modern way. Each Vector Engine has 8 cores for a combined 2.15 teraflops of double-precision performance, along with what you might expect from other leading processors (six HBM memory modules and 48GB of memory, for instance). The secret sauce is in NEC’s scalar processing unit, which handles the non-vector instructions in each code, while the vectorized C, C++, and Fortran with MPI run on the VE. The setup is scalable, with each host handling up to 8 VEs (in the case of the Naval Research Lab, these were housed in an HPE Apollo 6500 Gen10 system with 8 VEs).
“The goal was to assess how the NEC Vector Engine’s performance and ease of use compare with that of existing CPU architectures using a legacy CFD solver,” Obenschain and colleagues explain. “FDL3DI was originally vectorized and optimized for efficient operation on vector processing machines. The NEC VE’s architecture, high memory bandwidth, and ability to compile Fortran was the primary motivation for the evaluation.”
With optimizations, the vector architecture was found to be 3× faster for main-memory-bound problems, while the CPU architectures were competitive at smaller problem sizes. That this performance was reached using standard, well-known techniques is considered a key benefit of the architecture.
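To make "main-memory bound" concrete, here is a minimal sketch in C. It is not taken from FDL3DI or the NRL benchmarks; the function names `triad` and `triad_flops_per_byte` are hypothetical, and the kernel is the familiar STREAM-style triad, chosen only because its arithmetic is easy to count. Each element costs two floating-point operations but moves three doubles (two loads and one store), so throughput on large arrays tends to be set by memory bandwidth rather than by peak flops.

```c
#include <stddef.h>

/* Illustrative STREAM-style triad: a[i] = b[i] + s * c[i].
   `restrict` (C99) promises the compiler that the arrays do not overlap. */
void triad(double *restrict a, const double *restrict b,
           const double *restrict c, double s, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        a[i] = b[i] + s * c[i];
    }
}

/* Arithmetic intensity of the triad, counting only the two loads and
   one store per element: 2 flops per (3 * sizeof(double)) bytes. */
double triad_flops_per_byte(void)
{
    return 2.0 / (3.0 * (double)sizeof(double));
}
```

With 8-byte doubles that works out to 2 flops per 24 bytes. Kernels with a ratio this low are the kind of workload where high memory bandwidth, rather than core count or clock speed, would be expected to matter most.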
By profiling and modifying the key compute kernels using typical vector optimizations and NEC VE-specific ones, the team got the code to use the vector engine hardware with minimal modification. Scalar code developed later in FDL3DI’s lifetime was replaced with vector-friendly implementations.
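As a rough illustration of what replacing scalar code with a vector-friendly implementation can look like, here is a minimal sketch in C. It is not code from FDL3DI, and the function names are hypothetical. Both routines compute the same first differences; the first carries a scalar from one iteration to the next, while the second expresses every iteration purely in terms of its index, which is the form vectorizing compilers generally handle most readily.

```c
#include <stddef.h>

/* Scalar-style version: `prev` is carried from one iteration to the
   next, so the iterations appear to depend on each other. */
void diff_scalar_style(const double *q, double *d, size_t n)
{
    if (n < 2) return;
    double prev = q[0];
    for (size_t i = 1; i < n; ++i) {
        double cur = q[i];
        d[i - 1] = cur - prev;
        prev = cur;
    }
}

/* Vector-friendly version: each iteration reads and writes by index
   only, and `restrict` (C99) promises that q and d do not overlap,
   so the iterations are visibly independent. */
void diff_vector_friendly(const double *restrict q,
                          double *restrict d, size_t n)
{
    if (n < 2) return;
    for (size_t i = 0; i + 1 < n; ++i) {
        d[i] = q[i + 1] - q[i];
    }
}
```

Whether a given compiler vectorizes either loop depends on the compiler and its settings. The point of the second form is that it leaves nothing for the compiler to prove about dependencies between iterations.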
The Naval Research team found that their codes could run without any changes, but the performance improvements required some optimization. Generalized, the lessons might be useful to anyone considering how to make old codes new again.
They explain that codes designed for vector machines have gone through various optimizations for different architectures over the years, and if one of those rounds optimized for scalar execution, bringing the code back to a vector machine will take more work. Another finding is that the NEC VE works particularly well with codes limited by memory bandwidth, while at smaller problem sizes its performance is comparable to that of CPUs such as AMD Epyc.
As the Naval Research group continues its work of breathing new performance life into old codes, they will explore MPI scaling beyond a couple of VE units, but they expect challenges from the MPI stack and overall communication. They will also take a look at other applications that might be a good fit for retrofitting via the NEC VE.
Some of the benchmark and performance results based on FDL3DI can be found here.