Long gone are the days when high performance computing was limited solely to traditional simulation and modeling at academic and government research labs. HPC now encompasses other forms of distributed computing, including advanced data analytics and artificial intelligence. HPC systems have had to embrace new technologies, new software frameworks and applications, new algorithms, and new classes of users with very different backgrounds and expertise.
The one thing that organizations need is a consistent and flexible way to set up and manage the compute clusters that run these diverse workloads, which will typically have a mix of processor and accelerator technologies, depending on the applications, and which will be considerably more dynamic in nature than the traditional HPC applications running on MPI middleware that were relatively static for four or five years.
The need for flexibility is not just about being able to make use of diverse hardware for a mix of different workloads – traditional HPC, data analytics, and machine learning are the key ones these days – but also in a deployment model that spans from on-premises clusters to compute, storage, and networking capacity in public clouds and even out to the edge where AI and analytics are being increasingly deployed because of latency and data movement issues. And these workloads might be deployed on bare metal with workload managers and job schedulers, virtual servers, or containers with Kubernetes orchestrating them to make matters even more complicated – and flexible.
“In a relatively short period of time, there’s been a lot of change,” Bill Wagner, chief executive officer at Bright Computing, tells The Next Platform. And the company Wagner leads knows because it has been one of the forces behind that change. Bright Computing was founded in 2009 and makes cluster management software that is currently in use at more than 700 customers, including Boeing, Samsung, Sandia National Laboratories, Tesla, San Diego Supercomputing Center, Cummins, Applied Materials, Caterpillar, the National Security Agency, Virginia Tech, Drexel University, the National Institutes of Health, the National Oceanic and Atmospheric Administration, and 50 of the Fortune 500.
Buying supercomputers used to be like buying a laptop or a server for one set of specific users and one specific set of applications. Admittedly, a supercomputer is a very large system and more complex, but the process of specifying a machine was done once, it was set up, and used for four or five years and then replaced wholesale with a new machine in a new budgeting cycle. Those days are fading.
“What we are seeing with customers directly and through our resellers in the past year and a half is that there is this day of reckoning that’s kind of coming for HPC,” explains Wagner. “Based on all of these changes that are taking place, it’s no longer good enough to build a cluster that is going to support an organization for five years. They now need to think about building a cluster that’s able to change over the next five years. It needs to be dynamic because it is no longer just about one type or one set of applications, it’s about a broad set of very different applications, maybe in different sectors. From the technology side, it is no longer just about bare metal. Virtual machines are becoming more capable and containers, while still in their infancy, are definitely going to come into the fold here quickly. And it’s not just about on-premises clusters, either. Whatever they build, they will have to include the public cloud and it will eventually include the edge, too. It is not just about Intel Xeon CPU clusters as it has been. Now we have AMD Epyc processors and Arm processors and GPU accelerators and other kinds of accelerators on top of that. And so, the way that organizations need to think about building their infrastructure has to be very different. Obviously, this is a lot more complex – and it has got to be flexible. HPC systems never inherently needed to be flexible in the past.”
The hardware is changing and the software is also changing, and this is new, too. To a certain extent, the software stack on HPC clusters always changed, but nothing like we are seeing today. There were always updates to the software, but now you might be running simulations and models on one portion of the cluster, AI workloads on another portion, and data analytics on yet another, and sometimes the whole machine might be used to run one big workload for a length of time. The workloads are changing across space and they are changing across time, and the software has to respond to that. It’s not that the cluster changes once, it is that it is fluid on many different dimensions.
“What you don’t want to do is partition the cluster, because all that does is create silos, which are themselves rigid and inflexible,” says Wagner. “Siloed partitions might be incrementally better than perhaps having entirely separate systems. But what you really want is the ability to dynamically allocate as much or as little of the cluster as possible at various moments of time to work, with priorities that can change every day, week, or month.”
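To make the contrast with fixed partitions concrete, here is a minimal sketch of priority-driven allocation over a shared pool of nodes. It is purely illustrative and is not how Bright Cluster Manager or any particular scheduler works; the `Workload` class, the `allocate` function, and the node counts are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    priority: int  # higher number = more important right now (hypothetical scale)
    demand: int    # number of nodes the workload would like to use


def allocate(total_nodes: int, workloads: list[Workload]) -> dict[str, int]:
    """Grant nodes from one shared pool, highest priority first.

    Nothing is reserved per workload type, so when priorities change
    the same pool is simply divided up differently.
    """
    remaining = total_nodes
    grants: dict[str, int] = {}
    for w in sorted(workloads, key=lambda w: w.priority, reverse=True):
        granted = min(w.demand, remaining)
        grants[w.name] = granted
        remaining -= granted
    return grants


# Hypothetical 100-node cluster; simulation is the top priority this week.
this_week = allocate(100, [
    Workload("simulation", priority=3, demand=60),
    Workload("ml", priority=2, demand=50),
    Workload("analytics", priority=1, demand=30),
])

# Next week the priorities flip, and the same nodes are reassigned.
next_week = allocate(100, [
    Workload("simulation", priority=1, demand=60),
    Workload("ml", priority=3, demand=50),
    Workload("analytics", priority=2, demand=30),
])
```

With static partitions, the machine learning workload could never grow past its silo. Here it receives 40 nodes one week and its full 50 the next, with no hardware being re-partitioned.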
And the flexibility that HPC customers need is not going to be truly useful unless and until it can be extended to the public clouds. Whatever customers deploy, the same set of tools that make their on-premises machines malleable should be able to be extended out to the cloud, where certain workloads may run in burst mode or may run entirely on the cloud as conditions dictate. And at some point, many organizations will also need to extend out into the edge, where a certain amount of their data and computing will live.
“Over the past twelve months, we have definitely seen an increase in the number of organizations that want Kubernetes on the same environment where they run traditional HPC simulation and modeling applications using workload schedulers,” says Wagner. “They are definitely trying to bring in different architecture environments, including different hardware vendors, different CPU architectures, as well as accelerators based on different kinds of compute such as GPUs.”
What HPC shops are not doing, by the way, is suddenly deciding that they need to containerize all of their HPC applications and run them inside of Kubernetes. No one is containerizing for the sake of containerizing, any more than they were adding hypervisors and putting code in virtual machines for the past decade. There are operational benefits to using VMs and containers, but they are not free and they carry a performance penalty, and hence HPC applications have largely stayed on bare metal. But new workloads increasingly run in VMs or containers by default, so HPC centers adding these new analytics and machine learning workloads have to be able to absorb these technologies, even if it is only going to be for a portion of their workloads for now.
The complexity that organizations have to deal with requires flexible management frameworks like Bright Cluster Manager, which:
- Automates the process of building and managing high-performance Linux clusters, including provisioning servers from bare metal
- Sets up networking, security, DNS, and user directories
- Runs tests across the cluster before it is made available to users
- Enables simultaneous hosting and resource sharing of apps on bare metal, VMs and containers
- Provides detailed health checks of all hardware and software on the cluster
- Provides job and user resource reporting
- Provides automated one-step updates of nodes
Rather than have different tools for different kinds of applications, which is what organizations tend to do with siloed infrastructure – meaning separate clusters not always running at the highest utilization – they could have a single tool like Bright Cluster Manager that can support these different compute paradigms and become a single foundation for all of them – whether they are running on-premises, in the cloud, or at the edge.
This is done through a piece of software called a director, explains Wagner. The director resides in the cloud or at the edge and acts as a proxy for the on-premises head node, managing the resources that will be deployed on these distributed forms of compute. It then provisions and images that capacity as needed for particular workloads, in the same manner as on the on-premises cluster. The most important thing is that Bright Cluster Manager monitors the cloud jobs that are running and is able to shut off the rented infrastructure as soon as a job is completed, so customers truly only pay for the capacity they need. At the moment, Bright Computing supports provisioning and monitoring of capacity on Amazon Web Services and Microsoft Azure, and Wagner says that demand for Google Cloud is increasing and probably warrants doing an integration there soon.
There is a feature in Bright Cluster Manager called cmjob that stages data on the cloud so it is ready before a job needs to run. Customers do not have to move a parallel file system to the cloud, either. Data can be dumped into AWS S3 or Azure Blob storage, and Bright uses Amazon FSx for Lustre or Azure NetApp Files for applications that are I/O intensive and require a parallel file system. The clouds have nearly infinite storage capacity and their networks have a crazy amount of bisection bandwidth, so the storage can feed the compute fast enough and take data off it fast enough, too.
It is still early days for the cloud and the edge among Bright Computing’s customers. At this point, there are about a dozen who are extending out into the cloud, with interest rising fast, and there are only two customers who are starting to use Bright Cluster Manager out at the edge.
The obvious question, after more than a decade of talking about this, is why haven’t more HPC shops extended workloads out to the cloud?
“First and foremost, the answer is that they have done the math,” says Wagner. “They know that once they get above 35 percent to 40 percent utilization, they are better off just owning the infrastructure. If you couple that with performance and security considerations, there has not been a strong catalyst to pull them in that direction.”
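The math Wagner refers to can be sketched as a simple break-even calculation. The example below is only an illustration under stated assumptions: owned capacity is paid for every hour whether or not it is busy, cloud capacity is paid for only while jobs run, and the dollar figures are hypothetical placeholders rather than numbers from Bright Computing or any cloud provider.

```python
def break_even_utilization(owned_cost_per_node_hour: float,
                           cloud_price_per_node_hour: float) -> float:
    """Utilization above which owning a node is cheaper than renting one.

    Owned cost per wall-clock hour:  owned_cost_per_node_hour
    Cloud cost per wall-clock hour:  cloud_price_per_node_hour * utilization
    Setting the two equal and solving for utilization gives the ratio below.
    """
    if cloud_price_per_node_hour <= 0:
        raise ValueError("cloud price must be positive")
    return owned_cost_per_node_hour / cloud_price_per_node_hour


def cheaper_option(utilization: float,
                   owned_cost_per_node_hour: float,
                   cloud_price_per_node_hour: float) -> str:
    """Return 'own' or 'cloud' for a given average utilization (0.0 to 1.0)."""
    cloud_cost = cloud_price_per_node_hour * utilization
    return "own" if owned_cost_per_node_hour < cloud_cost else "cloud"


# Hypothetical figures: $0.75 per node-hour fully amortized on premises
# (hardware, power, staff) versus $2.00 per node-hour on demand in the cloud.
OWNED = 0.75
CLOUD = 2.00

threshold = break_even_utilization(OWNED, CLOUD)   # 0.375, i.e. 37.5 percent
busy_site = cheaper_option(0.50, OWNED, CLOUD)     # "own"
quiet_site = cheaper_option(0.20, OWNED, CLOUD)    # "cloud"
```

With these placeholder prices the threshold lands inside the 35 percent to 40 percent range Wagner cites, but the actual break-even point depends entirely on a site's real costs and negotiated cloud pricing.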
And part of the reason, we think, is that getting a consistent environment across many public clouds and on-premises clusters has been difficult, to say the least. The public clouds also need to cut their prices to reduce that economic barrier. And the other thing to note – and this is important – is that HPC centers are not thinking about moving to the cloud wholesale at any price, says Wagner, but about operating in a hybrid environment where they extend out to the cloud for certain workloads some of the time, perhaps for side projects or when emergency work – like trying to figure out the COVID-19 disease – takes up more time on the primary systems.
What is clear is that having a common tool across these three environments will help grease the skids and get people kicking the tires with proof of concept projects and then modest hybrid scenarios. And to help remove even more of the friction, Bright Computing has created the Easy 8 package, which allows customers to use Bright Cluster Manager for free on clusters up to eight nodes. Now, they can try it out at a reasonable scale before they buy it, and see how they might unify their HPC infrastructure across datacenter, cloud, and edge.