Wanted: A Complete – And Heavily Customizable – HPC Software Stack

Sponsored Feature. The HPC centers and hyperscalers of the world have a lot in common, and one of those things is their attitude toward software. They like to control as much of their systems software as they can, because that control lets them squeeze as much performance as possible out of their systems. However, the time, money, and expertise needed to create what amount to custom operating systems, middleware, and runtime environments are beyond most other organizations that should be benefiting from HPC in its many guises.

With a rapidly expanding number and variety of compute engines in the datacenter and a widening set of HPC applications – which include traditional simulation and modeling as well as data analytics and machine learning and, increasingly, a hodge-podge of these techniques stacked up in workflows that constitute a new kind of application – creating and maintaining a comprehensive HPC software stack is a serious challenge.

What if this could be more of a group effort? What if there were a way to create a complete HPC software stack that was still optimizable for very specific use cases? Would this not benefit the larger HPC community, and particularly the academic, government, and corporate centers that do not have the resources to create and maintain their own HPC stacks?

It is hard to argue against customization and optimization in the HPC arena, so don’t think that is what we are doing here. Quite the contrary. But we are thinking about a kind of organized, mass customization that benefits more HPC users and more diverse architectures – and does so precisely because system architectures are getting more heterogeneous over time, not less so.

Every maker of CPUs, GPUs, or FPGA accelerators, not to mention the custom ASIC suppliers, creates its own compilers and often its own application development and runtime environments in the endless task of wringing more performance out of the expensive HPC clusters that organizations build from these compute engines and networks. (It is hard to separate compute performance from network performance in a clustered system, after all. Which is one of the reasons why Nvidia paid $6.9 billion for Mellanox.)

The list of important HPC compilers and runtimes is not long, but it is diverse.

Intel had its historical Parallel Studio XE stack, which included C++ and Fortran compilers and a Python interpreter plus the Math Kernel Library, the Data Analytics Acceleration Library, the Integrated Performance Primitives (for algorithm acceleration in specific domains), and Threading Building Blocks (for shared memory parallel programming), as well as an MPI library for implementing message passing across clusters and optimizations for the TensorFlow and PyTorch machine learning frameworks. All of this is now included in Intel’s oneAPI toolkits.
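To make that library list a bit more concrete, here is a minimal sketch of shared memory parallelism with Threading Building Blocks (now oneTBB in the oneAPI toolkits). The SAXPY-style loop, the problem size, and the build line are our own illustrative assumptions, not anything prescribed by Intel:

```cpp
// saxpy_tbb.cpp -- a simple shared memory parallel loop using oneTBB.
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <vector>
#include <cstdio>

int main() {
    const std::size_t n = 1 << 20;
    std::vector<double> x(n, 1.0), y(n, 2.0);
    const double a = 3.0;

    // TBB splits the index range into chunks and runs them across all cores.
    tbb::parallel_for(tbb::blocked_range<std::size_t>(0, n),
        [&](const tbb::blocked_range<std::size_t>& r) {
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                y[i] += a * x[i];
        });

    std::printf("y[0] = %f\n", y[0]);  // expect 5.0
    return 0;
}
```

Built with something like icpx -std=c++17 saxpy_tbb.cpp -ltbb (or any C++ compiler that can see the oneTBB headers and library), the same source runs unchanged on whatever CPU the compiler targets – which is exactly the sort of portability we will keep coming back to.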

Nvidia created its Compute Unified Device Architecture, or CUDA, to make it easier to move compute jobs from CPUs to GPUs without having to resort to graphics APIs like OpenGL. Over time, the CUDA development environment and runtime have added support for the OpenMP, OpenACC, and OpenCL programming models. In 2013, Nvidia bought the venerable PGI C, C++, and Fortran compilers, which trace their lineage back to mini-supercomputer maker Floating Point Systems decades ago, and for more than a year now the PGI compilers have been distributed as part of the Nvidia HPC SDK stack.
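To give a flavor of what that directive-based offload looks like in practice, here is a minimal OpenACC sketch in C++. The loop, the array sizes, and the data clauses are our own illustrative assumptions rather than anything taken from Nvidia’s documentation:

```cpp
// saxpy_acc.cpp -- a GPU-offloaded SAXPY loop using OpenACC directives.
// Compiled with an OpenACC-capable compiler (for example, nvc++ -acc from
// the Nvidia HPC SDK), the pragma below becomes a GPU kernel plus the data
// transfers; compiled without OpenACC support, the loop simply runs on the CPU.
#include <vector>
#include <cstdio>

int main() {
    const int n = 1 << 20;
    std::vector<float> x(n, 1.0f), y(n, 2.0f);
    const float a = 3.0f;
    float* xp = x.data();
    float* yp = y.data();

    // Copy x to the device, copy y both ways, and parallelize the loop there.
    #pragma acc parallel loop copyin(xp[0:n]) copy(yp[0:n])
    for (int i = 0; i < n; ++i)
        yp[i] += a * xp[i];

    std::printf("y[0] = %f\n", yp[0]);  // expect 5.0
    return 0;
}
```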

AMD has the Radeon Open Compute platform, or ROCm for short, which leans heavily on the Heterogeneous System Architecture runtime and has a compiler front end that can generate hybrid code to run on both CPUs and GPU accelerators; importantly, the tools that make up the ROCm environment are open source. ROCm supports the OpenMP and OpenCL programming models and adds its own Heterogeneous-Compute Interface for Portability (HIP) programming model, a C++ kernel language and runtime for GPU offload that can generate code for either AMD or Nvidia GPUs and can also convert code written for Nvidia’s CUDA environment to run atop HIP, which provides some measure of portability.
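Here is a minimal sketch of what that single-source portability looks like with HIP; the vector-add kernel, the sizes, and the launch configuration are our own toy choices. Compiled with hipcc, the same file targets AMD GPUs natively or Nvidia GPUs when HIP is configured with its CUDA backend:

```cpp
// vecadd_hip.cpp -- a HIP kernel that can be built for AMD or Nvidia GPUs.
#include <hip/hip_runtime.h>
#include <vector>
#include <cstdio>

__global__ void vec_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    const std::size_t bytes = n * sizeof(float);
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);

    // Allocate device buffers and move the inputs over.
    float *da, *db, *dc;
    hipMalloc(reinterpret_cast<void**>(&da), bytes);
    hipMalloc(reinterpret_cast<void**>(&db), bytes);
    hipMalloc(reinterpret_cast<void**>(&dc), bytes);
    hipMemcpy(da, a.data(), bytes, hipMemcpyHostToDevice);
    hipMemcpy(db, b.data(), bytes, hipMemcpyHostToDevice);

    // Launch one thread per element.
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    hipLaunchKernelGGL(vec_add, dim3(blocks), dim3(threads), 0, 0, da, db, dc, n);

    hipMemcpy(c.data(), dc, bytes, hipMemcpyDeviceToHost);
    std::printf("c[0] = %f\n", c[0]);  // expect 3.0

    hipFree(da); hipFree(db); hipFree(dc);
    return 0;
}
```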

The Cray Linux Environment and compiler set, now sold by Hewlett Packard Enterprise as the Cray Programming Environment suite, comes immediately to mind. It can be used on HPE’s own Cray EX systems with Intel or AMD CPUs and Nvidia, AMD, or Intel GPUs (by incorporating each vendor’s tools), as well as on the Apollo 80 machines based on Fujitsu’s heavily vectorized A64FX Arm server processor. Arm has its Allinea compiler set, which is important for the A64FX processor as well as for the Neoverse processor designs with vector extensions that will be coming out in the next few years. Fujitsu has its own C++ and Fortran compilers that target the A64FX chip, and of course there is also the open source GCC compiler set.

There are other important HPC compiler and runtime stacks with acceleration libraries for all kinds of algorithms used in various simulation, modeling, financial services, and analytics domains. The more the merrier. But here is the important lesson illustrated by HPE’s launch of the Apollo 80 system with the A64FX processor: Not every compiler is good at compiling every kind of code. This is something that all academic and government supercomputing centers, particularly those that shift architectures often, know full well. Diverse computing is going to mean diverse compiling.

And, therefore, it is best to have many different compilers and libraries in the toolbox to choose from. In fact, what the HPC market really needs is a hyper-compiler that can look at code and figure out which compiler should be used, across a wide and possibly diverse mix of compute engines, to yield the best performance. We don’t think the HPC industry needs many different complete HPC SDKs tuned up by their vendor advocates so much as it needs compilers and libraries from many different experts that can all be snapped into a single, broad, and complete SDK framework for HPC workloads.

Go up a level in the HPC software stack and the situation gets more complicated still: every HPC system maker has its own Linux environment, or one that has been anointed as the chosen distribution from IBM’s Red Hat unit or SUSE Linux or Scientific Linux, or one that is cobbled together by the HPC center itself.

In an HPC world where both security and efficiency are of paramount concern, what we need is a stack of operating systems, middleware, compilers, and libraries that is conceived of as a whole, with options you can slip into and out of the stack as needed, but which gives the broadest optionality. This software does not have to be open source, but it does have to be able to be integrated consistently through APIs. For inspiration for this HPC stack, we take the OpenHPC effort spearheaded by Intel six years ago and the Tri-Lab Operating System Stack (TOSS) platform developed by the US Department of Energy – specifically, by Lawrence Livermore National Laboratory, Sandia National Laboratories, and Los Alamos National Laboratory. The TOSS platform is used on the commodity clusters shared by these HPC centers.

The OpenHPC effort seemed to be getting some traction a year after its launch, but another couple of years came and went, and by then almost no one was talking about OpenHPC. Instead, Red Hat was creating its own Linux distribution tuned to run traditional HPC simulation and modeling programs, and the two biggest supercomputers in the world at the time, “Summit” at Oak Ridge National Laboratory and “Sierra” at Lawrence Livermore, were running Red Hat Enterprise Linux 7. The OpenHPC effort was a little too Intel-centric for many, though that focus was understandable to a certain extent when there were no AMD CPUs or GPUs and no Arm CPUs in the HPC hunt. But the mix-and-match nature of the stack was the right idea.

Our thought experiment about an HPC stack goes further than just allowing anything to plug into OpenHPC. What we want is something designed more like TOSS, which was profiled four years ago at SC17. With TOSS, the labs created a derivative of Red Hat Enterprise Linux that uses consistent source code across X86, Power, and Arm architectures, plus a build system that trims out the pieces of RHEL that are extraneous to HPC clusters and adds in other software that is needed.

In a talk about exascale systems in 2019, Bronis de Supinski, CTO for Livermore Computing, said Lawrence Livermore had pulled 4,345 packages from the more than 40,000 packages that comprise Red Hat Enterprise Linux, patched and re-packaged 37 of them, and then added another 253 packages that the Tri-Lab systems require, creating a TOSS platform with 4,598 packages (the 4,345 drawn from RHEL plus the 253 additions). The surface area of the software is greatly reduced, obviously, while still supporting various CPUs and GPUs for compute, various networks, various kinds of middleware abstractions, and the Lustre parallel file system.

What is also interesting about the TOSS platform is that it has an add-on development environment, called the Tri-Lab Compute Environment, which layers on compilers, libraries, and the like.

If three of the big HPC labs in the United States can make an HPC Linux variant and development tool stack that provides consistency across architectures, allows for a certain amount of application portability, and reduces the total cost of ownership of the commodity clusters they use, how much more of an effect could a unified HPC stack, with all of the current providers of compilers, libraries, middleware, and such participating, have on the HPC industry at large? Imagine a build system shared by the entire community that could kick out only the components necessary for a particular set of HPC application use cases, limiting the security exposure of the stack in use. Imagine if math libraries and other acceleration routines were more portable across architectures. (That is a topic for another day.)

It is good that each HPC compute engine or operating system vendor has its own complete and highly tuned stack. We applaud this, and for many customers in many cases it will be sufficient to design, develop, and maintain HPC applications appropriately. But it is very likely not going to be sufficient to support a diverse set of applications on a diverse set of hardware. Ultimately, what you want is a consistent framework across vendors for compilers and libraries, one that would allow any math library to be used in conjunction with any compiler, along with a tunable Linux platform.
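As a small illustration of the library-level portability we have in mind, here is a hedged sketch that calls matrix multiplication through the standard CBLAS interface. The matrices and sizes are toy choices of ours; the point is only that the same source can be linked against MKL, OpenBLAS, BLIS, or a vendor BLAS, with only the link line changing:

```cpp
// dgemm_portable.cpp -- one CBLAS call, many possible BLAS libraries behind it.
#include <cblas.h>   // with MKL, include <mkl_cblas.h> instead
#include <vector>
#include <cstdio>

int main() {
    const int n = 4;                          // tiny n x n matrices for illustration
    std::vector<double> A(n * n, 1.0);
    std::vector<double> B(n * n, 2.0);
    std::vector<double> C(n * n, 0.0);

    // C = 1.0 * A * B + 0.0 * C, row-major storage, no transposes.
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A.data(), n, B.data(), n, 0.0, C.data(), n);

    std::printf("C[0][0] = %f\n", C[0]);      // expect 8.0
    return 0;
}
```

Whether the symbol behind cblas_dgemm comes from MKL, OpenBLAS, or some future accelerator-aware library is a link-time decision, which is precisely the kind of snap-in choice a unified HPC stack should make routine.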

Sponsored by Intel.
