It is a rare HPC cluster that is actually upgraded – meaning that some of the components in the servers, networks, or storage that comprise the system are swapped out roughly halfway through its lifecycle and replaced with cheaper, faster, or more capacious ones. An HPC cluster is designed, installed, and largely left alone until it is replaced. It is also static in another sense: resources in one node, such as GPU accelerators and NVM-Express flash drives, are not available for use in other nodes, and at sites that have multiple clusters, resources in one cluster are not available for use in the others.
It doesn’t have to be this way. The compute, storage, and networking hanging off the PCI-Express bus inside each node can be broken up and virtualized, with a mesh of PCI-Express switches forming a fabric that disaggregates these components from their physical servers and then composes them on the fly to specific nodes across that fabric. Within the limits of each server’s BIOS, nodes can have components attached to them as needed and freed up for other nodes to use when they are no longer needed.
This disaggregation and composability is the holy grail of infrastructure, and a number of system makers, such as Hewlett Packard Enterprise and Cisco Systems, have built it into selected server lines aimed at Internet service providers. Cisco was a bit early and shut down its UCS M-Series line, but HPE is getting traction with service providers with its Synergy line. It is unclear just how far along the hyperscalers and cloud builders are in building composable infrastructure, but we know for sure that they have disaggregated compute and storage in their datacenters, which typically house on the order of 100,000 servers, so that any server can talk to any other server and to any storage server across a massive Ethernet network in a Clos topology. And clearly the big public clouds are doing a certain amount of composability to deliver thousands of different instance types, with different accelerator and storage options paired with virtual servers.
A number of companies have tried to use the PCI-Express protocol to create switched fabrics for connecting peripherals to servers, and the reason is simple: With most CPUs having PCI-Express controllers on them and most peripherals built as PCI-Express endpoints, why not just stretch the PCI-Express links with switches so that GPUs, FPGAs, and SSDs can be pooled on the PCI-Express fabric and then carved up and attached to specific nodes in a cluster as needed? This would cut out Ethernet and InfiniBand entirely as the transport for linking these devices to each other and to host CPUs.
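To make the idea concrete, here is a minimal, purely illustrative Python sketch of the bookkeeping a composition layer has to do: a shared pool of PCI-Express devices, plus attach and release operations that bind devices to nodes and return them to the pool. The class and method names are hypothetical and do not represent the API of any real fabric manager, GigaIO’s FabreX included.

```python
from dataclasses import dataclass, field

@dataclass
class Device:
    """A PCI-Express endpoint sitting in the shared pool (GPU, FPGA, NVM-Express drive)."""
    dev_id: str
    kind: str                    # e.g. "gpu", "fpga", "nvme"
    bound_to: str | None = None  # node currently holding the device, if any

@dataclass
class FabricPool:
    """Hypothetical bookkeeping for a composable PCI-Express device pool."""
    devices: dict[str, Device] = field(default_factory=dict)

    def add(self, device: Device) -> None:
        self.devices[device.dev_id] = device

    def attach(self, node: str, kind: str) -> Device:
        """Bind the first free device of the requested kind to a node."""
        for dev in self.devices.values():
            if dev.kind == kind and dev.bound_to is None:
                dev.bound_to = node
                return dev
        raise RuntimeError(f"no free {kind} in the pool")

    def release(self, dev_id: str) -> None:
        """Return a device to the pool so another node can use it."""
        self.devices[dev_id].bound_to = None

# Compose a node for a GPU-heavy job, then free the hardware for the next job.
pool = FabricPool()
pool.add(Device("gpu0", "gpu"))
pool.add(Device("nvme0", "nvme"))
gpu = pool.attach("node01", "gpu")
flash = pool.attach("node01", "nvme")
pool.release(gpu.dev_id)
pool.release(flash.dev_id)
```

In a real system, the attach step would also reprogram the switch fabric and trigger a PCI-Express hot-add on the target node; this sketch captures only the pool-and-bind logic that makes composability possible.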
GigaIO Networks, an upstart founded in 2012, is bringing a new twist on PCI-Express switching to market, and one that should resonate with HPC centers, particularly those that have diverse workloads and a limited budget for adding accelerators and flash to all of their servers, as the big national labs are able to do. By speaking PCI-Express directly between components, as we explained in our coverage of the company’s FabreX PCI-Express switch fabric back in October, all of the overhead of converting from PCI-Express to Ethernet or InfiniBand to talk to components on adjacent nodes is removed. PCI-Express does not have nearly the bandwidth, radix, or range of Ethernet or InfiniBand interconnects, but it provides extremely low latency – better than InfiniBand and much better than Ethernet. And with current PCI-Express 3.0 switches and soon-to-be-announced PCI-Express 4.0 switches with twice the bandwidth, there is a way to interconnect components across one, two, or three racks using PCI-Express and provide not just a low latency interconnect, but also a means of disaggregating these PCI-Express components and composing them to specific nodes in a cluster using GigaIO’s FabreX software.
Interestingly, the HPC centers that are starting to explore PCI-Express switching fabrics are just as interested in flexibility as they are in performance.
“For the HPC customers that we have talked to, such as a number of the national labs, what they keep telling us is that they are much more interested in exploring the benefits of composability,” Scott Taylor, director of software development at GigaIO, tells The Next Platform. “They want the flexibility of being able to add some amount of NVM-Express drives or add some amount of GPUs to a configuration without having to recable things. They don’t want to have everything locked inside a server.”
This is particularly true of clouds that specialize in running GPU-accelerated HPC applications. Not every HPC application has the same balance of CPU compute, GPU compute, and flash storage, so being able to make the iron more malleable, as GigaIO’s FabreX software layer on its PCI-Express switch fabric allows, is critical for supporting a diverse set of HPC workloads and a wider array of customers. The same will hold true of HPC centers that have an equally diverse base of users and applications and that also have to justify the cost of hardware against the rates they charge their end users for processing and storage capacity.
At the moment, disaggregation and composability are just getting started in the HPC sector. But as awareness grows of the flexibility and the higher utilization that this approach to infrastructure offers, there is every reason to believe the HPC crowd will make disaggregation and composability a higher priority in their cluster designs. That is what GigaIO believes, and it will be talking about the opportunity for HPC shops at our HPC Day event on November 17 in Denver. If you can’t make it to HPC Day, we will be covering the event here at The Next Platform.