Institutions supporting HPC applications are finding increased demand for heterogeneous infrastructures to support simulation and modeling, machine learning, high performance data analytics, collaborative computing and analytics, and data federation.
For some, the rise of Covid-19 emphasized the need for more flexible HPC solutions. As they turned to cloud deployments to meet their needs, they discovered the benefits of expanding beyond their on-premises solutions or of presenting their on-premises services as private clouds.
The following are a few examples of organizations that have migrated to the cloud, running their optimized workloads on cloud instances deployed with various generations of x86 processors.
The Broad Institute – Accelerating Life Sciences Research on Google Cloud
The Broad Institute is a world-renowned research center that uses genomics to advance the understanding of the biology and treatment of human disease.
“We are a sequencing center, so we generate genome sequences on the order of a whole genome about every three to five minutes,” explains Geraldine Van der Auwera, director of outreach at Broad Institute. “Each genome generates approximately 350 gigabytes of data.”
Broad Institute runs its genome processing pipelines nearly 24 hours a day, resulting in more than 30 petabytes of genomic data to manage. In 2014, in addition to maintaining its on-premises infrastructure, Broad began developing a Google Cloud-based platform.
“We decided to go to the cloud for several reasons,” says Van der Auwera. “One was logistics and economics of operating our processing pipelines and data storage. We could scale as needed for both compute and storage, paying only for the capacity we used. But also, the cloud would allow a whole new level of data federation and collaboration among scientists.”
According to Broad Institute, the organization was able to leverage Intel technologies on Google Cloud to run its pipelines faster and help reduce the costs of sequencing from $45 to about $7 per genome. Part of what led to lower costs was modularizing their workloads and right-sizing the Google Cloud instances for each task.
“With on-premises infrastructure, you don’t have the profiles of systems like the cloud,” stated Van der Auwera. “By being smart about modularizing our pipeline and right-sizing the Google instances for each workload, we could cut costs considerably.”
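The right-sizing idea can be illustrated with a minimal Python sketch. It assumes a made-up set of machine shapes and pipeline tasks; the names and sizes are hypothetical and are not Broad Institute's actual pipeline or Google Cloud's instance catalog. For each task, it picks the smallest shape that still meets the task's CPU and memory needs, using size as a rough proxy for cost.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Shape:
    """A hypothetical machine shape (not a real cloud instance type)."""
    name: str
    vcpus: int
    memory_gb: int


@dataclass(frozen=True)
class Task:
    """A hypothetical pipeline step with its resource needs."""
    name: str
    vcpus: int
    memory_gb: int


def right_size(task: Task, shapes: List[Shape]) -> Optional[Shape]:
    """Return the smallest shape that fits the task, or None if none fits."""
    fitting = [
        s for s in shapes
        if s.vcpus >= task.vcpus and s.memory_gb >= task.memory_gb
    ]
    # Size (vCPUs first, then memory) stands in for cost in this sketch.
    return min(fitting, key=lambda s: (s.vcpus, s.memory_gb), default=None)


# Illustrative values only.
SHAPES = [Shape("small", 2, 8), Shape("medium", 8, 32), Shape("large", 32, 128)]
TASKS = [Task("align", 8, 24), Task("sort", 2, 6), Task("call-variants", 4, 40)]

if __name__ == "__main__":
    for task in TASKS:
        shape = right_size(task, SHAPES)
        print(task.name, "->", shape.name if shape else "no fit")
```

With a single fixed server profile, every step would run on the largest machine; splitting the pipeline into modules lets each step be matched to a smaller shape where one is sufficient.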
A key innovation was how they moved data. With modular components, instead of moving an entire file of genomic data from object storage to a VM, Broad Institute moves only a subset of the data. Only what the task needs to complete its function is copied to the instance. This reduces the amount of storage and memory needed and reduces costs.
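The subset-copy approach can be sketched in Python as follows. This is a simplified illustration, not Broad Institute's implementation: the object store is simulated with an in-memory dictionary, and the object name, region names, and index layout are all hypothetical. The point is that a task looks up the byte range it needs and copies only that range.

```python
from typing import Dict, Tuple

# Stand-in for an object store: object name -> bytes. Hypothetical; a real
# pipeline would issue ranged reads against a cloud storage service instead.
ObjectStore = Dict[str, bytes]
# Hypothetical index: region name -> (byte offset, byte length) in the object.
RegionIndex = Dict[str, Tuple[int, int]]


def read_range(store: ObjectStore, key: str, start: int, length: int) -> bytes:
    """Return only bytes [start, start + length) of the named object."""
    if start < 0 or length < 0:
        raise ValueError("start and length must be non-negative")
    return store[key][start:start + length]


def fetch_region(store: ObjectStore, key: str,
                 index: RegionIndex, region: str) -> bytes:
    """Copy just the bytes belonging to one region, using the index."""
    start, length = index[region]
    return read_range(store, key, start, length)


def build_demo() -> Tuple[ObjectStore, RegionIndex]:
    """Build a toy object made of three regions, plus its index."""
    regions = {"chr1": b"A" * 1000, "chr2": b"C" * 800, "chr3": b"G" * 600}
    index: RegionIndex = {}
    blob = bytearray()
    for name, data in regions.items():
        index[name] = (len(blob), len(data))
        blob.extend(data)
    return {"sample.dat": bytes(blob)}, index


if __name__ == "__main__":
    store, index = build_demo()
    subset = fetch_region(store, "sample.dat", index, "chr2")
    total = len(store["sample.dat"])
    print(f"copied {len(subset)} of {total} bytes")
```

In this toy case, a task that works on one region copies 800 bytes rather than the full 2,400-byte object, which is the same reason a modular pipeline step needs less local storage and memory.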
To support the new cloud approach, Google and Broad Institute, together with Verily and Microsoft, designed a cloud-based data management and research enablement platform called Terra. Terra integrates an expansive data library of publicly available and access-controlled datasets, the Workflow Description Language (WDL), and shareable workspaces for wide collaboration. Scientists can build optimized workflows, integrating pipelines from Broad Institute and other organizations, complementary toolsets, and interactive analyses. Terra opens new doors for extending research and discovery capabilities for Broad Institute’s users.
Broad Institute runs its pipelines on Google N1 instances. Those pipelines are freely available on GitHub, so organizations can adapt them to their needs and run them on systems of their choice, whether on-premises or in the cloud, including on Google N1 or N2 instances.
Intel-Optimized HPC Clusters On Google Cloud
While some organizations, like Broad Institute, develop their solutions internally, Google, in collaboration with Intel, has produced a turnkey solution that automates the creation of clusters compliant with the Intel Select Solutions for Simulation and Modeling specification on Google Cloud.
“Through our collaboration, we’re making it simpler than ever to have an HPC environment in the cloud that runs user workloads without modification, delivers optimized performance, and lowers the barriers to cloud adoption for HPC,” Bill Magro, chief technologist for HPC at Google Cloud, explains. “With this turnkey solution, customers can create an auto-scaling HPC cluster that has achieved Intel Select Solution verification and provides performance and compatibility with a wide range of HPC applications.”
These Intel Select Solutions for Simulation and Modeling HPC clusters can be built on Google Cloud C2 instances, which run on Intel “Cascade Lake” Xeon SP processors.
DiRAC – Building Cloud-Driven HPC for Theoretical Research Communities Across the UK
The Distributed Research utilizing Advanced Computing (DiRAC) service manages access to specialized, heterogeneous HPC resources for the science theory research community in the United Kingdom. DiRAC’s computational services are hosted on different HPC architectures designed for specific types of workloads: traditional simulation and modeling, artificial intelligence (AI) and machine learning, large memory computing, and extreme scaling. DiRAC resources are provided by the Cambridge Service for Data-driven Discovery (CSD3 systems Cumulus and Wilkes), the Data Intensive at Leicester (DIaL) cluster, the memory intensive service at the University of Durham’s Institute for Computational Cosmology (ICC), and the Extreme Scaling Service provided by Tesseract and Tursa at EPCC, University of Edinburgh.
“Cloud is very much on our agenda within DiRAC and more broadly across the United Kingdom,” says Mark Wilkinson, director of DiRAC. “We’ve seen the need for various HPC services to be offered to a range of different communities. But, not everyone wants to interact with a terminal. Scalable services need to be presented in ways that users have experience with and then dynamically allocated for their work. Anything else slows down discovery and innovation.”
DiRAC’s cloud enablement has been ongoing for several years. When Cumulus was first deployed, it was a purely bare-metal machine. One of the partner projects on Cumulus is the international Square Kilometer Array Observatory (SKAO) project. SKAO antennas are being built in Australia and South Africa, and the South Africa Center for High Performance Computing (CHPC) is also a member of SKAO.
SKAO has supported work with StackHPC to deploy an OpenStack cloud environment on Cumulus for several years. DiRAC and the STFC-IRIS consortium partnered to build a scalable science cloud, and now other communities are becoming interested. In October 2020, Cumulus was expanded with new nodes built on Intel “Cascade Lake” Xeon SPs and deployed entirely as a cloud-native environment using OpenStack and StackHPC software. Since then, it has been used for a mixture of cosmology, astrophysics, and particle physics simulations deployed through the OpenStack environment.
When a DiRAC user accesses Cumulus for a simulation, they are presented with a real volume that includes high-performance storage and CPU and GPU compute nodes, depending on their workloads. The instance delivers bare-metal performance.
“Delivering bare-metal performance is absolutely critical,” adds Wilkinson. “There is no value to a cloud presentation at all if users can’t get the performance they are used to on bare-metal. A lot of work has gone into essentially being able to tunnel through all the virtualization layers and get bare-metal performance for users.”
Users have different requirements regarding security, so that is also a high-priority deliverable. The OpenStack system running on the ISO-certified Cumulus infrastructure can ingest both sensitive and non-sensitive data and process it with the level of protection the data requires.
“Cosmology simulations with non-sensitive data can run alongside live medical diagnostic image analysis,” continues Wilkinson. “Essentially, for diagnostic workloads, the system needs to ingest patient data from the scanners at the hospital and provide modeling information back to the clinicians as quickly as possible, all in a secure manner.”
DiRAC’s Cumulus users have been seamlessly deploying their workloads through OpenStack for nearly a year. Wilkinson is satisfied with the migration to OpenStack so far.
“The proof of success for me is how transparent the new deployments have been,” concludes Wilkinson. “Users are unaware that behind the familiar terminal and command-line interface the system is actually an OpenStack cloud. It’s that seamless.”
CHPC – Enabling South Africa To Manage Covid-19
South Africa’s Center for High Performance Computing (CHPC) hosts Lengau, the largest supercomputer on the continent. Lengau was originally deployed for scientific simulation and modeling, but users began requesting access to its nodes for other kinds of computing.
To accommodate these users, CHPC first built a virtualized environment using virtual machines and Lengau’s distributed file system for storage. However, increasing demand began to degrade storage performance for large HPC jobs. Both this impact and the need to support the SKAO project’s Science Data Processor (SDP) with Lengau nodes led CHPC to expand the system using an OpenStack private cloud with a Ceph storage environment.
“There were several reasons to consider a private cloud,” explains Dr. Happy Sithole, CHPC’s director. “Since we support many governments and businesses in addition to researchers, we needed to address their concerns, such as where instances would be deployed and data sovereignty. We desired greater control over the architecture, access, and security. A private cloud gave our stakeholders more confidence.”
CHPC added nodes in early 2020, using the OpenStack deployment on Cumulus as a model for its new OpenStack Production Cloud. The new system was built on Supermicro TwinPro servers with Intel “Cascade Lake” Xeon SPs, 1.5 PB of disk storage, and more than 220 TB of Intel SSDs.
“The new cloud system was designed to support many virtual jobs related to ongoing research, such as custom workflows, pleasingly parallel workloads, and web hosting,” says Dora Thobye, CHPC technical manager for HPC resources.
Users were migrated from the VMware environment to the new OpenStack on-premises cloud. On March 23, 2020, the new system went into production. Three days later the country went into lockdown due to Covid-19, and everything changed. Agencies across the government found themselves scrambling for computing capacity and storage resources for population tracking and tracing, remote learning programs, and SARS-CoV-2 research such as DNA sequencing and virus analysis.
“Because of the pandemic and all the new users it brought to us, we were running out of compute and storage resources,” explains Thobye.
Lengau was utilized as much as possible, but the OpenStack Production Cloud was overwhelmed. In early 2021, the OpenStack Production Cloud was further expanded with new compute and storage nodes to address the demand.
The expanded cloud supports ongoing pandemic activities by the Department of Higher Education and Training, the Department of Health, university researchers, and other public and private projects addressing needs arising from the pandemic. But it also paves the way for CHPC’s future.
“OpenStack offers a foundation for our existing heterogeneous computing needs and for a future converged infrastructure that provides both supercomputing and general purpose services,” concluded Sithole. “And it provides a transparent environment for users around the world to analyze SDP data from the SKAO.”
As institutions need more diverse computing capabilities to enable their investigators, researchers require easy access to those resources using the tools they are familiar with. On-premises and public cloud services have proven capable of providing these resources from widely heterogeneous infrastructures to enable next-generation research.