Let’s face it: there is nothing simple about large-scale sequencing and targeted analysis of genomic data. It becomes even more complicated when teams move beyond extracting a few key elements from sequencing data and try to understand broader trends in organisms like bacteria and viruses. The IT hurdles alone can be higher than general-purpose HPC centers want to cross, and for specialized projects like the CLIMB effort in the UK, the diversity of missions adds to the complexity.
From a systems standpoint, compute capacity is the simplest element of a bioinformatics system. While many centers use the building blocks of high performance computing (many thousands of powerful parallel processors, low-latency networks, and so on), they have to use them differently. High performance and high throughput are not the same thing, so the right system balance for genomics and related applications in microbial biology is not the same either. There are further complications on the software side, and if the data is used in conjunction with healthcare entities, a security component is critical as well. In short, while HPC is critical to bioinformatics, it is in a category of its own.
We talked to Dr. Thomas Connor, PI for the CLIMB project’s Cardiff group, about this re-balancing act for bio HPC systems. He walked us through a few of the ways in which “compute is the easy part” for the kind of work his teams do. As detailed more extensively here, the work is not only demanding on IT resources; it has a mission-critical angle that puts an even heavier strain on system design and use. He says that while having the expertise of the Dell Technologies HPC team on his side has been game-changing for the CLIMB project’s successes, there are many things to consider as we look ahead in bioinformatics, an area that will be of greater interest than ever given the current novel coronavirus.
Even though the focus of our conversation was on systems, Connor began by talking about HPC and software. Oddly enough, this was where the co-design challenge began for his teams as they consulted with Dell about how best to tackle the compute, storage, reproducibility, and reliability challenges they faced. Recall, as in the linked article, that the team provides genetic analysis of virulent diseases to the National Health Service and clinicians. This means the results need to be accurate and replicable. That might sound like a simple request, since HPC systems and software are designed with those goals in mind, but again, bioinformatics is a different animal.
“Sequencing remains, even today, mostly a research activity, not something done routinely in healthcare systems,” Connor explains. “So the challenge is that when you want to do something with sequence data, you have to use research software developed by academics in groups without stable versions managing that process. And when you go through a process, like the one we have with HIV, for instance, there are maybe 10-15 steps, each with a different piece of software that has to run nicely together for us to generate results. And the problem is that there is an accredited process in the UK; these things are all part of a standard. One critical piece of this accreditation is reproducibility.”
In other words, methods developed on or for HPC research systems have to evolve rather dramatically to achieve the coherence and replicability necessary for clinical use. To borrow a term from another world, they have to be “enterprise grade” tools that produce exactly the same results every time. Interestingly enough, for Connor’s teams and the Dell Technologies HPC advisors, this became a hardware, storage, systems, and workflow problem to sort out, which they did in some unique ways. They turned their shared resources into a fully automated cloud, with containerized processes and both block and object stores, letting workloads float to wherever resources were available while maintaining the necessary reproducibility.
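The accreditation requirement Connor describes, a chain of 10-15 tools whose combined output must be identical run after run, is often handled by pinning every step to an exact version and folding the outputs into a single provenance hash that two runs can be compared on. Here is a minimal sketch of that idea; the step names, versions, and toy transforms are invented for illustration and are not CLIMB’s actual stack:

```python
import hashlib
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Step:
    name: str       # tool name (hypothetical, e.g. "trim")
    version: str    # pinned release; any version bump changes the provenance hash
    run: Callable[[bytes], bytes]  # pure transform: input bytes -> output bytes

def run_pipeline(steps: list[Step], data: bytes) -> tuple[bytes, str]:
    """Run each step in order, folding tool names, pinned versions, and
    per-step output digests into one provenance hash for the whole run."""
    prov = hashlib.sha256()
    for step in steps:
        data = step.run(data)
        prov.update(f"{step.name}=={step.version}".encode())
        prov.update(hashlib.sha256(data).digest())
    return data, prov.hexdigest()

# Two stand-in steps; a real pipeline would shell out to pinned containers.
steps = [
    Step("trim", "1.2.0", lambda d: d.strip()),
    Step("upper", "0.9.1", lambda d: d.upper()),
]

out1, digest1 = run_pipeline(steps, b"  acgt  ")
out2, digest2 = run_pipeline(steps, b"  acgt  ")
assert out1 == b"ACGT"
assert digest1 == digest2  # same input + pinned steps => same provenance hash
```

The point of the provenance hash is that reproducibility becomes checkable: if any tool version or intermediate output drifts, the hash changes, which is the kind of assurance an accredited clinical process demands.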
“What we were looking at would have been almost impossible with an HPC system where you offload different modules and get inconsistencies between libraries and different pieces of software interfering with each other. We also have multiple users to further complicate things. We had to move to a more automated system, one that would be locked down, with processes defined in terms of pre-defined workflows that make use of specific containers,” Connor says. In their analysis of all the reproducibility, interoperability, and even security constraints with Dell’s HPC team, they found that this setup, which does not sound much like any other HPC site, was the only way to get the level of assurance needed for results that clinicians could count on.
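One common way to lock a workflow down to “specific containers” is to reference each image by its immutable content digest rather than a mutable tag like `:latest`, so the same bytes run every time. A small sketch of that pattern, assuming Docker as the runtime; the registry path and digest below are placeholders, not a real CLIMB image:

```python
import shlex

# Hypothetical registry of approved pipeline tools, each pinned to an
# immutable image digest; tags are never used, so updates cannot sneak in.
PINNED_IMAGES = {
    "aligner": "example.org/bio/aligner@sha256:" + "ab" * 32,  # placeholder digest
}

def container_command(tool: str, args: list[str]) -> list[str]:
    """Build a `docker run` invocation against the pinned digest for `tool`.
    Raises KeyError if the tool is not in the approved, locked-down set."""
    image = PINNED_IMAGES[tool]
    return ["docker", "run", "--rm", image, *args]

cmd = container_command("aligner", ["--input", "sample.fastq"])
print(shlex.join(cmd))
```

Because the command is built from a fixed allow-list, a multi-user system can expose only pre-defined workflows while guaranteeing that every user runs byte-identical tooling.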
While compute might be the easy part, that does not mean performance is not critical. “We have to turn samples around fast,” Connor says, adding that in the research world, teams would generate datasets of several genomes and then spend months curating them, analyzing them, and writing a paper. The clinical world is far different. “With something like our HIV work, sample turnaround times were a few days maximum. We couldn’t have an analysis that would fail or take a lot of work simply to keep running. It had to be reliable every time and run the same every time.”
Connor says the bottom line is that standard HPC systems are not well-sized or well-balanced for their evolving set of demands, and even if these workloads ran on a traditional HPC cluster for research, they would be disruptive from a cluster resource perspective. When the team made its initial application to funding agencies for infrastructure five years ago, the stated goal had been to build an infrastructure that could enable researchers to tackle microbial questions on appropriately sized and scaled hardware. He says that this goal has been met, with support from the Dell HPC team, which understood the difference between what works for most HPC sites and this new set of demanding use cases.
“The effort has been successful from the infrastructure standpoint and has been used in everything from translational activities through to the Ebola outbreak a few years ago, where all the analysis of the sequences generated was done on CLIMB. There is a lot of compute capacity on demand, and all the right tools to make use of it effectively,” Connor concludes.