Building Bulletproof Bioinformatics Storage

Going from sequence data made up of millions of short DNA fragments to a final, simplified report for clinicians is no small undertaking from an IT standpoint. Bioinformatics workflows constrain nearly every aspect of system design: they require large computing resources, robust storage, and seamless integration and management. While there are some standards across the bioinformatics domain, each organization has its own requirements that shape system design and deployment.

For academic research centers, the requirements for performance, reliability, and reproducibility are quite different from those for centers that operate on real-world healthcare data. For organizations that walk the fine line between academic research and robust healthcare analytics with quantifiable, bulletproof outcomes, meeting both sets of requirements can be a challenge.

One especially interesting example of how a mission-critical bioinformatics workflow has been designed and implemented can be found in the UK via the Cloud Infrastructure for Microbial Bioinformatics (CLIMB) project.

Building a system that combined rigorous research from several participating institutions with a design that could ensure reliability, reproducibility, efficiency, collaboration, security, and more was quite a challenge, according to Dr. Thomas Connor, Principal Investigator for the CLIMB project’s Cardiff division. The overall CLIMB team, spread across several academic research communities in the UK, gathers and analyzes gene sequences from infectious microbes to provide transmission and personalized treatment data for many of the UK’s National Health Service (NHS) programs. This work is important for understanding infectious diseases such as tuberculosis and HIV, and the novel coronavirus as well.

Connor’s team has a small, 160-core HPC cluster that sits next to the sequencing machine, providing the low-latency connection required for an initial round of analysis. This on-site cluster is critical, Connor says, because in most clinical settings the network infrastructure is not strong enough to support streaming DNA analysis. The pre-processed data is then sent to two other HPC sites via an OpenStack gateway to provide further testing environments as well as resiliency. Core to that off-site Ceph storage from Dell Technologies is a backend federated research cloud for UK bioinformatics work that serves over 300 separate groups.

The compute challenges of CLIMB are enough to fill volumes, but when it comes to boots-on-the-ground scientific results that the NHS can use to inform public health policy and provide treatment guidance, storage is where the real meat of the story rests. For Connor and his teams, however, it goes far beyond having the right capacity. Storage considerations for mission-critical bioinformatics mean factoring in virtualization and cloud while maintaining high levels of security, reliability, and fast reproduction of results at any moment. The CLIMB team’s current Dell-integrated storage system layers Red Hat and Ceph across 7 PB of storage with clever use of containers and an innovative workflow management system.

“Storage is critical and even more important is making sure everything we do is reproducible. In academia, we build software and do analysis, something that might take a few weeks then be changed or tinkered with. In healthcare, this isn’t the case. Everything has to be locked down. Each time you run an analysis process, you’ll need to achieve the same result with the same data.”

The CLIMB team has been using virtualization to ensure reproducibility of results, specifically Singularity containers combined with the NextFlow workflow management system, to lock down pipelines and processes so that results stay consistent. The same approach lets those locked-down pipelines run on far more cores than the initial-stage cluster next to the sequencer. The team can branch out to another 2,000-core Dell EMC cluster at its main datacenter in Cardiff, or to another thousand cores in a different facility, using OpenStack to burst into one of those on-prem clouds when more compute capacity is needed. One of the newest clusters ready to receive this bioinformatics work, also from Dell Technologies, is at Cardiff and features AMD Epyc processors with OpenStack, Kubernetes, and Ceph built in.
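
The reproducibility requirement Connor describes (same data in, same result out) can be checked mechanically by comparing checksums of a step's output across repeated runs. The sketch below is a minimal illustration of that idea, not CLIMB's actual tooling: `count_bases` is a hypothetical stand-in for a pipeline step, and in a real deployment the step would run inside a pinned container image under a workflow manager rather than as a plain Python function.

```python
import hashlib
import json


def count_bases(reads):
    """Hypothetical stand-in for a pipeline step: tally bases across reads."""
    counts = {}
    for read in reads:
        for base in read:
            counts[base] = counts.get(base, 0) + 1
    return counts


def result_digest(result):
    """SHA-256 of a canonical JSON encoding, so key order cannot alter the hash."""
    canonical = json.dumps(result, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_reproducible(step, data, runs=3):
    """Run `step` several times on the same input and compare output digests."""
    digests = {result_digest(step(data)) for _ in range(runs)}
    return len(digests) == 1


if __name__ == "__main__":
    reads = ["ACGT", "GGCA", "TTAC"]  # toy input, not real sequence data
    print("reproducible:", is_reproducible(count_bases, reads))
```

Hashing a canonical serialization rather than the raw object keeps the comparison independent of incidental details such as dictionary ordering, which is the kind of variation a locked-down pipeline is meant to rule out.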

The team’s new lab system, built in close partnership with Dell Technologies, has 160 cores running a hyperconverged Ceph setup in which each node is backed by SSDs. The Dell Ceph install comprises 27 Dell EMC PowerEdge R730XD servers, each with sixteen 4 TB disks and two 240 GB SSDs, all connected by 10GbE Brocade switches. This interacts with the new Kubernetes system.
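
As a back-of-the-envelope check on those figures, the raw capacity of the Ceph install follows directly from the server and disk counts. The sketch below is illustrative only: the three-way replication factor is an assumption (the article does not say how CLIMB configures its Ceph pools), and decimal units (1 PB = 1,000 TB) are assumed throughout.

```python
# Back-of-the-envelope capacity for the Ceph install described above.
# Server, disk, and SSD counts come from the article; the replication
# factor is an ASSUMPTION, not a documented CLIMB setting.

NODES = 27
HDDS_PER_NODE = 16
HDD_TB = 4
SSDS_PER_NODE = 2
SSD_GB = 240
ASSUMED_REPLICAS = 3  # hypothetical replication factor


def raw_hdd_tb(nodes=NODES, disks=HDDS_PER_NODE, size_tb=HDD_TB):
    """Total spinning-disk capacity in TB before any replication."""
    return nodes * disks * size_tb


def raw_ssd_tb(nodes=NODES, ssds=SSDS_PER_NODE, size_gb=SSD_GB):
    """Total SSD capacity in TB (decimal units)."""
    return nodes * ssds * size_gb / 1000


def usable_tb(raw_tb, replicas=ASSUMED_REPLICAS):
    """Capacity left once every object is stored `replicas` times."""
    return raw_tb / replicas


if __name__ == "__main__":
    raw = raw_hdd_tb()
    print(f"Raw HDD capacity: {raw} TB ({raw / 1000:.3f} PB)")
    print(f"Raw SSD capacity: {raw_ssd_tb():.2f} TB")
    print(f"Usable at {ASSUMED_REPLICAS}x replication: {usable_tb(raw):.0f} TB")
```

With the article's numbers this works out to 1,728 TB of raw disk, or roughly 1.7 PB, for this install on its own; the usable figure depends entirely on the replication or erasure-coding scheme actually in use.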

“The OpenStack system gives us what is effectively a blank slate to create and burn environments, letting us use the same tools on the VMs we spin up on the bare metal system. Our new system running Kubernetes is designed to quickly move us into the right position because using containers with NextFlow allows communication with Kubernetes to initiate jobs, which means we can position ourselves to use shared infrastructure while retaining all the governance requirements around patient data and the specific software needed for analysis.”

Moving from on-prem to cloud, even private clouds, is still viewed as a frightening prospect by some in the healthcare community, but with guidance from Dell’s systems and security experts, Connor’s team now has a robust security framework in place that dovetails with system functionality across on-site systems and containers and spans multiple institutions.

As with many systems that cater to HPC and complex workloads, even ones with security constraints as in this public/private CLIMB endeavor, the systems built to serve the CLIMB project’s teams, like those from Dell within Connor’s group, were the result of careful planning. “Over the last several years I have relied heavily on the great partnership with Dell to help craft these systems from start to finish. They have taken the time over many years to understand our evolving requirements and help us understand the options and where the shortcomings might be. These new systems are doing interesting things that might not have been clear possibilities without their continued help and expertise,” Connor says.



