There are many ways to scale up and scale out systems, and that is a problem as much as it is a solution for distributed systems architects.
Once they are powered on, distributed systems are rarely asked to do one thing and one thing only – and they share this trait with the big NUMA shared memory systems that run back office applications in the corporate world. Both types of machines do many things, often at the same time in parallel, or in series, one after the other.
And that means there is usually an impedance mismatch between what any given application needs and the hardware that is available to run it. It is an impossible situation, and it therefore requires an “impossible server.”
This server should be malleable, so it can fit the changing mix of workloads and their specific needs for CPU and accelerator capacity and for the ratio between the two. At the same time, the impossible server allows such resources to be shared, driving up the utilization of these components over time so that the enormous sums organizations pay for clustered systems are not wasted.
“True hardware heterogeneity is difficult to attain without composability and fabrics,” Alan Benjamin, co-founder and chief executive officer at GigaIO, tells The Next Platform. “The world of just running applications on general purpose X86 CPUs is over. And while the Nvidia A100 is a great accelerator, it is also not the answer for everything. If you build static infrastructure, then whatever you build is probably great for one application and either imbalanced for other applications or not really appropriate at all. However, if you build an impossible server – one that has elements composed on the fly from pools of CPUs, GPUs, FPGAs, custom ASICs and what have you – then any given collection of specific machines composed through PCI-Express fabrics can exist for only the time that is required and when one job is done can be put back into the pool.”
As is the case with all-CPU and hybrid CPU-GPU supercomputers today, it is up to the research institutions and enterprise customers to keep a queue full of work ready to run and to keep the job schedulers playing Tetris, fitting jobs to the unused resources in the cluster. And for that purpose, GigaIO has created northbound integrations with the popular HPC job schedulers, including the open source SLURM and PBS, as well as with cluster configuration tools such as Bright Computing’s Bright Cluster Manager (now part of Nvidia).
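To make the scheduler side of that concrete, here is a rough sketch of what a submission could look like from the researcher’s side: a standard Slurm batch job asking for four GPUs through sbatch. The script contents, the train.py workload, and the notion that a composition step fills the GPU request from the fabric pool before the job lands are illustrative assumptions, not GigaIO’s documented integration.

```python
import subprocess
import tempfile

# Minimal sketch: submit a GPU job through Slurm's standard sbatch CLI.
# The --gres request is ordinary Slurm syntax; how the GPUs end up attached
# to the node (for example, a composition step in a prolog) is assumed here.
JOB_SCRIPT = """#!/bin/bash
#SBATCH --job-name=train-demo
#SBATCH --nodes=1
#SBATCH --gres=gpu:4          # ask the scheduler for four GPUs
#SBATCH --time=02:00:00
srun python train.py          # hypothetical workload
"""

def submit(job_text: str) -> str:
    """Write the batch script to a temporary file and hand it to sbatch."""
    with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
        f.write(job_text)
        path = f.name
    result = subprocess.run(["sbatch", path], capture_output=True, text=True, check=True)
    return result.stdout.strip()   # e.g. "Submitted batch job 12345"

if __name__ == "__main__":
    print(submit(JOB_SCRIPT))
```

The point of the integrations is that nothing in the submission itself has to change; the composability sits below the scheduler.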
Another aspect of the impossible server is that it can be configured in ways that are not typically possible inside of physical servers. For instance, many jobs may need only two or four GPUs attached to the CPU host, which could have one, two, or perhaps even four CPUs, depending on the serial processing and memory capacity needs of the host machine.
But what happens, such as with AI training workloads, when the best configuration is to have sixteen GPU accelerators attached to one server? There are very few OEM machines that do this, and it can be very expensive to buy them.
But with a PCI-Express fabric like GigaIO’s FabreX and its composability software, using PCI-Express to link lots of GPUs or other PCI-Express-based accelerators to a single node is very easy and pretty much instant. In a future release of Bright Cluster Manager, slated for delivery in early 2022, the reconfiguration of hosts and accelerators will happen automatically. And fast.
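In practice, devices composed onto a host over the fabric appear in that host’s PCIe tree like locally installed cards. Here is a minimal sketch of how an administrator might confirm that after a recomposition, assuming an Nvidia GPU host with the standard nvidia-smi utility installed:

```python
import subprocess

def visible_gpus() -> list[str]:
    """List the GPUs the host can currently see, using the standard
    nvidia-smi tool. Devices composed over the fabric show up here just
    like locally installed PCIe cards."""
    out = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True, check=True)
    return [line for line in out.stdout.splitlines() if line.startswith("GPU")]

if __name__ == "__main__":
    gpus = visible_gpus()
    print(f"{len(gpus)} GPU(s) visible to this host:")
    for gpu in gpus:
        print(" ", gpu)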
“In an optimal environment, where you are starting from clean hosts, you can recompose a host and have new GPUs set up the way you want to in a matter of five seconds per node,” says Matt Demas, chief technology officer for global sales and vice president of sales engineering at GigaIO, who joined the company from composability rival Liqid in August.
“So if I am doing a ten-node cluster, it can be done in less than a minute. In some cases, depending on the scenario, you may have to reboot the host and you have to factor that into the equation.”
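Demas’s numbers reduce to simple arithmetic, sketched below with the per-node reboot penalty left as an assumed variable, since it depends on the scenario:

```python
def recompose_estimate(nodes: int, per_node_s: float = 5.0, reboot_s: float = 0.0) -> float:
    """Rough sequential estimate of recomposition time for a cluster.
    per_node_s reflects the quoted ~5 seconds per node in the clean-host case;
    reboot_s is an assumed extra per-node cost when a reboot is required."""
    return nodes * (per_node_s + reboot_s)

# Ten clean hosts: 10 * 5 s = 50 s, i.e. under a minute, as quoted.
print(recompose_estimate(10))
# The same ten hosts with a hypothetical 90-second reboot each: 950 s.
print(recompose_estimate(10, reboot_s=90.0))
```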
The other interesting aspect of this “impossible server,” according to GigaIO, is the company’s ability to treat the entire rack as one server today, because its IP centers on extending PCI-Express as a routable network throughout the rack using DMA (Direct Memory Access). In effect, this builds a memory ring around all the resources. With the advent of CXL, which GigaIO has supported from the get-go as a contributing member of the consortium, the reality of what we described in a previous article on PCI-Express and CXL is closer than ever: “In essence, the rack becomes the server. There are some challenges that have to be dealt with, of course. But having a whole rack that is load/store addressable is very interesting indeed.”
Social Organization Of Computing
The San Diego Supercomputer Center (SDSC) is a flagship, early adopter customer of GigaIO that is using FabreX in an HPC environment.
“What they want is for SDSC researchers to submit jobs through SLURM, as they are already doing, and they want to be able to control the cluster through Bright Cluster Manager, just as they are currently doing,” explains Benjamin.
“We need to meet them where they want to be met, and we have been able to do that with our integrations with SLURM and Bright, which means researchers do not have to change what they are doing but they can still add composability and drive up utilization and efficiencies. We have other northbound integrations we have done for other customers, such as for OpenStack in the fintech area, and we also integrate with Singularity and Kubernetes to deliver hardware composability underneath containers.”
The idea here is that customers don’t need yet another pane of glass to manage the underlying composability, but instead get integrations into the job schedulers, cluster managers, and container platforms they are likely already using. Some customers prefer to use their own tools, and they appreciate the open standards approach: a robust command line interface and a Redfish API handle the underlying composition on FabreX.
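Since Redfish is just JSON over HTTPS, driving composition from a customer’s own tooling can look like an ordinary REST call. The sketch below is purely illustrative: the manager URL, action path, payload fields, and device identifiers are hypothetical placeholders, not GigaIO’s actual Redfish schema.

```python
import requests

FABRIC_API = "https://fabrex-manager.example.com/redfish/v1"   # hypothetical endpoint
AUTH = ("admin", "secret")                                      # placeholder credentials

def attach_device(host_id: str, device_id: str) -> None:
    """Illustrative composition request: ask the fabric manager to attach a
    pooled PCIe device (GPU, FPGA, and so on) to a host. The path and payload
    are hypothetical; a real deployment would follow the vendor's schema."""
    url = f"{FABRIC_API}/CompositionService/Actions/Compose"    # hypothetical action path
    payload = {"Host": host_id, "Devices": [device_id]}
    resp = requests.post(url, json=payload, auth=AUTH, timeout=30)
    resp.raise_for_status()

if __name__ == "__main__":
    # Hypothetical identifiers for a host and a pooled GPU.
    attach_device("node01", "gpu-a100-07")
```

A similar call in the other direction would return the device to the pool once the job is done.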
“If we have done our job right, people don’t even know that FabreX exists – it is just something that happens transparently underneath the applications and invisibly across the pool of resources in the cluster,” says Benjamin.
That would indeed be the ideal thing. And, as it turns out, it is a necessary thing for what Frank Würthwein, director of SDSC, calls the “social organization of computing,” which he spoke about in a Birds of a Feather session with GigaIO at the recent SC21 supercomputing conference.
Würthwein explained the three use cases that SDSC manages and its desire to use the same hardware to support the different modes and stacks. The first is traditional HPC simulation and modeling under the control of SLURM, which submits HPC jobs in batch mode against the hardware. Then there is another system that pods up containers to run science workflows and services atop Kubernetes. And then there is bare metal for researchers to run testbeds such as Chameleon, CloudLab, and Fabric.
“These three ways of using the same hardware in different environments are fundamentally neither interoperable nor consistent nor easily co-existable,” says Würthwein. “And that poses a challenge to shops like SDSC because rather than buying three of each, we would much rather be able to boot – or whatever you want to call it – into one dynamically and re-use the hardware for all three.”
Diverse Hardware Environments
Diverse use cases, however, are only one driving factor. SDSC also has diversity across its hardware. The existing systems at SDSC include a variety of CPU host systems, with different architectures, core counts, and performance levels.
They include machines with six different types of Alveo FPGA accelerator cards from Xilinx and five different “Ampere” GPU accelerators from Nvidia, and that doesn’t count older GPU cards or the AMD Instinct and Intel Xe HPC cards it expects to procure in the future. On top of that – or perhaps rather between them – the SDSC systems also include various kinds of SmartNICs and DPUs with FPGA or GPU accelerators on them.
“In an ideal world, we would like to be able to dynamically reconfigure hosts with different devices – FPGAs, GPUs, DPUs, SmartNICs, you name it – for a purpose that is desired by a science use case,” Würthwein explains. “For this, we are preparing to use GigaIO in the new systems that we are getting into the shop and that we will talk at length about at SC22.”
The Old And The New: Two Pools Of Resources
At Purdue University, a major US public research institution that is starting proof of concept trials with GigaIO’s FabreX, a new cluster is brought online every year or so. As Alex Younts, principal research engineer for the Information Technology Research Computing group, tells The Next Platform, Purdue “likes to keep them refreshed.”
In the past five years or so, the university has started adding GPUs and other kinds of accelerators to the systems, ranging from the latest and greatest devices to older versions “that we don’t seem to ever throw out,” as Younts put it.
“We have these two pools of resources, with lots of CPUs and lots of GPUs, and the two don’t meet,” he says. “And our mission is not only to support research, but also instruction, and having strict pools of resources for these two missions gives us problems because we cannot mix and match across them. What we are working towards is having one of those yearly community clusters, with maybe 500 or 1,000 nodes, and disaggregating all of our CPUs and GPUs at the rack scale and giving us the ability to meet research as well as instruction needs based on Kubernetes container workflows and virtual graphics workstations to support engineering classes.”
This is exactly the kind of thing that we expect to see at every academic supercomputer center, and there are several thousand of them the world over that want to converge supercomputing and instruction onto the same infrastructure. That’s a big piece of business right there, and with an immediate payoff for the institutions.
Sponsored by GigaIO