Bioinformatics research that targets infectious diseases is more important now than ever. However, the systems required to identify and analyze the nature, treatments, and transmission patterns of epidemics, or even pandemics, are incredibly complex. While AI brings new tools that have their own learning curve on the IT side, the promise of integrating GPU-accelerated deep learning into several scientific workloads will make the new addition worth the effort.
One project, among many worldwide, that is seeing the value of AI in future bioinformatics for public health is the Cloud Infrastructure for Microbial Bioinformatics (CLIMB) project, which is one of the world’s largest collaborations of its kind. The program’s goal is to provide leading infrastructure for the microbial bioinformatics community in the United Kingdom, which includes distributed computing resources, a large pool of shared storage, enhanced system and workflow management tooling, and many analytics applications ranging from sequence analysis to population trend studies.
It is a collaboration between academic and HPC specialists at four universities in the UK: Birmingham, Cardiff, Swansea, and Warwick. There are several specific programs that fall under its purview, including providing research and guidance to health authorities in the UK around infectious diseases such as HIV and tuberculosis. Given the wide range of applications on the shared infrastructure, it is no surprise that high performance computing teams needed to think outside the traditional bounds of scientific computing.
As we have described previously – see Building Bulletproof Bioinformatics Storage and also For UK Genomics Initiative, Compute Is The Easy Part – one important aspect of the work relates to genetic sequencing of bacteria and viruses to better target treatment and understand transmission trends. CLIMB teams have been innovative in their infrastructure choices, opting for cloud-based collaboration, but they are also in front of trends with new methods, including using GPU-accelerated AI to speed some of the individual pipelines.
Thomas Connor, the PI for the Cardiff division of the CLIMB project, reports that they expect GPUs will be important in the genomics pipeline in several key ways, including in base calling from instruments such as the Nanopore sequencer, which is the sequencing device the team currently uses for its work. These sequencers are highly compact, about the size of a stapler, and can sequence many thousands of bacterial genomes on a single device.
The Nanopore device presents a silicon substrate where proteins sit in a nanoscale pore as DNA is pulled through. More specifically, the protein is its own molecular motor, pushing DNA through the pore and across the silicon substrate, producing a change in voltage that depends on the bases moving through. This gives off a “signal” that is streamed in real time to a laptop. That electrical signal then needs to be turned into a series of base calls: A, C, T, and G (the nucleotide bases of a DNA strand). The approach that Connor and his distributed teams consider best for this is neural networks, which can be trained on the NVIDIA V100 GPUs available at the CLIMB infrastructure sites.
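To make the final step of base calling concrete, here is a minimal sketch of how per-sample network outputs might be turned into a base sequence. It assumes, purely for illustration, that a trained network emits one probability row per signal sample over a blank symbol plus the four bases, and that the rows are decoded with a greedy, CTC-style collapse (merge repeated labels, then drop blanks). The article does not say which decoding scheme the CLIMB teams use; the alphabet ordering and the function names `argmax_labels` and `greedy_collapse` are hypothetical.

```python
from itertools import groupby

# Assumption: index 0 is a "blank" (no base emitted), followed by the four bases.
ALPHABET = "-ACGT"
BLANK = "-"


def argmax_labels(prob_rows, alphabet=ALPHABET):
    """Pick the most likely label for each signal sample (one row per sample)."""
    return [alphabet[max(range(len(row)), key=row.__getitem__)] for row in prob_rows]


def greedy_collapse(labels, blank=BLANK):
    """Merge consecutive repeats of a label, then drop blanks."""
    merged = [label for label, _ in groupby(labels)]
    return "".join(label for label in merged if label != blank)


# Hypothetical network output for six signal samples.
probs = [
    [0.10, 0.70, 0.10, 0.05, 0.05],  # A
    [0.10, 0.60, 0.10, 0.10, 0.10],  # A (same base, still in the pore)
    [0.80, 0.05, 0.05, 0.05, 0.05],  # blank
    [0.10, 0.10, 0.60, 0.10, 0.10],  # C
    [0.90, 0.02, 0.03, 0.03, 0.02],  # blank
    [0.10, 0.10, 0.50, 0.20, 0.10],  # C (a second C, separated by a blank)
]
sequence = greedy_collapse(argmax_labels(probs))
```

The blank symbol is what lets a decoder of this kind distinguish one base that lingers in the pore for several samples from two identical bases in a row.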
Other areas where the team works on identifying the transmission of bacterial or viral illnesses also stand to benefit from integrating AI into existing HPC workflows. For instance, Connor says that his teams now check tuberculosis samples for resistance against a catalogue of known resistances. That catalogue was developed by using AI on a set of genome sequences for which they have characterizations (i.e. the strains have been grown on a plate with an antibiotic to gauge resistance). It took 10,000 of these strains, where resistances were known, for teams to use AI to identify genes that were associated with resistance across a massive dataset.
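Once such a catalogue exists, the checking step itself can be very simple. The sketch below assumes a catalogue keyed by (gene, amino-acid change) pairs that maps each entry to the drug it is associated with; this structure, the two example entries, and the function name `check_resistance` are illustrative assumptions, not the CLIMB teams' actual catalogue or code.

```python
# Hypothetical, two-entry catalogue for illustration only. A real catalogue
# would hold many more entries, each backed by laboratory characterization.
CATALOGUE = {
    ("rpoB", "S450L"): "rifampicin",
    ("katG", "S315T"): "isoniazid",
}


def check_resistance(observed_mutations, catalogue=CATALOGUE):
    """Group the observed mutations that appear in the catalogue by drug."""
    hits = {}
    for mutation in observed_mutations:
        drug = catalogue.get(mutation)
        if drug is not None:
            hits.setdefault(drug, []).append(mutation)
    return hits


# Mutations called from one sequenced sample (made-up input; "geneX" is a
# placeholder for a mutation that is not in the catalogue).
sample = [("katG", "S315T"), ("geneX", "A1B")]
report = check_resistance(sample)
```

A lookup like this can only flag resistances that are already in the catalogue, which is why building the catalogue from thousands of characterized strains is the expensive part.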
Further, Connor explains: “As we generate sequence data that is rich and highly complex, there’s real potential for integration with all the other data streams we’re collecting within the National Health Service (NHS) in the UK. One thing that is interesting from a research perspective is looking into the question of using all these data streams, collections of patient records going back 20 to 30 years, and using machine learning to classify things that we very likely could have missed due to the complexity and vastness of data. That has a big effect in terms of how we do things like infection prevention and control.”
He adds that the endpoint might be a real-time system that is not just monitoring the sequence data, but also the real-time feeds from other diagnostics and instruments in hospitals to provide real-time alerting for prompt intervention. During a pandemic like coronavirus, this could have far-reaching potential effects.
“We’ve spent the last few years doing work to build our core diagnostic systems to the point that we can generate the sequence data and analyze it efficiently, but we are thinking about how to move ahead and improve,” says Connor. “AI is going to be a core part of that. That’s where HPC Wales in particular comes in with their NVIDIA V100 GPUs.”
Connor notes that all of the systems and the management layers in between have been purpose-built for the specialized bioinformatics work they do. Working with the Dell Technologies HPC team, they are ready for what lies just beyond the mission-critical work they do now, which targets infectious disease in particular. They are “future proofed” in the sense that they have the robustness of HPC systems and workload expertise from their vendor partners, something Connor says has made all of this possible.
You can find out more in this case study.