The Translational Genomics Research Institute in Arizona is one of a handful of non-profit centers in the U.S. that are harnessing innovations in genetic sequencing technology to explore the roots of diseases and potential cures.
As one might imagine, such an undertaking requires some hefty systems and software—and keeping that infrastructure current is, as the center’s VP of Technology, James Lowey, told us, an unending research task in itself. With the cost of sequencing dropping, the price of software tools for genomics pushing skyward, and high performance computing investments a constant push and pull between performance and power, Lowey has a full plate—most recently with his effort to whittle down the cost of sequencing a single genome from an HPC systems and infrastructure perspective.
“The model is still being built,” Lowey tells The Next Platform, “but it is necessary to get a handle on our investment and the costs of computing a single genome, especially as we build out the next iteration of systems where we will try some new things, including the possibility of bursting into the cloud.” Other projects underway that will change the math on cluster costs at TGEN include swapping out the existing workload management framework, shifting from Torque/Maui to SLURM, which might provide better scalability, says Lowey.
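For readers unfamiliar with the two schedulers, much of such a migration amounts to translating job directives. A minimal sketch of what that looks like, with hypothetical resource values and a hypothetical run_alignment.sh job script (neither detail comes from TGEN):

```shell
#!/bin/bash
# Hypothetical SLURM batch script; each #SBATCH directive replaces the
# Torque/Maui #PBS equivalent shown in the trailing comment.
#SBATCH --job-name=align            # Torque: #PBS -N align
#SBATCH --nodes=1                   # Torque: #PBS -l nodes=1:ppn=16
#SBATCH --ntasks-per-node=16
#SBATCH --time=06:00:00             # Torque: #PBS -l walltime=06:00:00
#SBATCH --output=align_%j.out       # Torque: #PBS -o align.out

# Under SLURM the job is submitted with sbatch rather than qsub,
# and srun launches the tasks inside the allocation.
srun ./run_alignment.sh
```

SLURM also ships Torque compatibility wrappers (a qsub shim, among others) that can soften the cutover, but directives embedded in a site’s existing job scripts still need this kind of translation eventually.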
The current production environment at TGEN consists of Dell M1000 blade chassis stuffed full of M420 quarter-height blades, which allows the genomics center to fit over 500 cores into 10U. At the main site, they have around 2600 cores devoted to crunching genomic sequencing applications, all interconnected with 10 GbE, plus FDR InfiniBand for the storage system. TGEN is sitting on 1.2 petabytes of Lustre, which is used both for scratch and for the storage that is hooked directly into the sequencing machines. The goal, Lowey said, is to keep the genomics pipeline as clean and well-fed as possible, an art that has been refined over the years.
Although TGEN keeps an eye on emerging technologies, including the possibility of using Hadoop for key points in the genomics workflow, bursting into a public cloud for extra capacity (two topics we’ll get to in a moment), and the use of acceleration engines like GPUs, they tend to be conservative about what they put into full production. Still, Lowey said that they have tested some other Dell technology, including the Fluid Cache system, which keeps data close to compute. “We’re always looking for new technology to accelerate our pipeline because when there is, for instance, a child with cancer, time is no luxury. We need to speed our time to result however we can in certain cases, and make sure it’s efficient and fast at all other times.”
In terms of keeping the genomics pipeline clean and efficient, Lowey described how the team is evaluating where Hadoop and its HDFS file system will fit into the mix. Some parts of the pipeline, including read alignment, deliver far faster results there, he says, dropping from six hours on their conventional systems to two hours on an experimental Hadoop cluster. The team at TGEN is also looking at how Hadoop might be useful for other bottlenecks in the pipeline, including handling variant calls. What kept the TGEN team from exploring this before was a lack of proper tooling, especially for use on multicore systems. Further, the ability to run HDFS on top of Lustre is still a work in progress, especially for TGEN, which wants to prove everything well ahead of rolling it into production. “Intel has been doing work in this area, and we have the Dell teams helping us do what we need as well,” Lowey says.
Dell’s Fluid Cache will be coming to TGEN in the near term, and in preparation, the team is doing some early stage code optimization to make sure they get the most out of the implementation. Lowey expects immediate benefit from Fluid Cache and those optimizations, which are hefty because the teams have never had to go through and optimize their mix of open source and proprietary codes; performance has always been reasonable. During the deployment, they will roll in SLURM iteratively to avoid excessive downtime, along with other features, including Dell’s Boomi SaaS integration platform, which will allow them to connect to more data sources and expand their analytical capabilities.
While much of what Lowey described about the TGEN genomics IT environment is oriented around improving performance across various workloads, he noted that there has been a shift in how he and his team at TGEN consider HPC investments. “Five years ago it was all about the feeds and speeds. We wanted the fastest possible systems and as many of them as we could afford. But looking forward and dealing with current economic realities, we’re spending more time looking at TCO over a four or five year period. In other words, we’re looking at the all-in costs.” This is why Lowey is looking at an infrastructure-centric cost per sequence, as well as at other models that might bolster their efficiency in terms of both their sequencing goals and their CFO’s sensibilities.
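The shape of such an all-in model is simple amortization, even if the real inputs are anything but simple. A sketch with purely illustrative numbers (none of these figures come from TGEN):

```shell
# Hypothetical all-in cost-per-genome model: amortize hardware plus
# recurring operating costs over the TCO window, divide by throughput.
CLUSTER_COST=2000000     # up-front hardware spend, $ (illustrative)
OPEX_PER_YEAR=300000     # power, cooling, admin, support, $ (illustrative)
YEARS=5                  # the four-to-five-year TCO window Lowey describes
GENOMES_PER_YEAR=4000    # sequencing throughput (illustrative)

awk -v c="$CLUSTER_COST" -v o="$OPEX_PER_YEAR" -v y="$YEARS" -v g="$GENOMES_PER_YEAR" \
  'BEGIN { printf "infrastructure cost per genome: $%.2f\n", (c + o * y) / (g * y) }'
# With these inputs: $175.00 per genome
```

The arithmetic is trivial; the hard part, as Lowey notes below, is deciding which costs belong in the numerator at all.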
Viewed through this newer, less performance-centric lens, being able to burst out into the cloud is an attractive option for TGEN, but Lowey says there is still caution—and they are not in denial about the “secret” that data movement is going to be a big cost. Although he agrees that companies like Amazon Web Services have come a long way in making the cloud a better fit for genomics companies from a compliance and tooling standpoint, he suspects that plenty of research organizations like his own will remain reticent to put all of their data outside the firewall. Still, the cost benefits of bursting into the cloud versus investing in more hardware represent a decision that is still very much on the table.
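That data movement “secret” is easy to put rough numbers on. A back-of-the-envelope sketch with a hypothetical per-genome data footprint, link speed, and egress price (none of these figures come from TGEN or any particular provider):

```shell
# Hypothetical cloud-bursting data movement estimate: transfer time at line
# rate, plus the egress charge to pull results back inside the firewall.
DATA_GB=300         # raw reads plus alignments per genome, GB (illustrative)
LINK_GBPS=10        # WAN uplink, Gb/s (illustrative)
EGRESS_PER_GB=0.09  # $/GB to move data back out of the cloud (illustrative)

awk -v gb="$DATA_GB" -v rate="$LINK_GBPS" -v p="$EGRESS_PER_GB" 'BEGIN {
  printf "transfer at line rate: %.0f s\n", gb * 8 / rate   # GB -> gigabits
  printf "egress back out:       $%.2f\n", gb * p
}'
```

Real transfers rarely sustain line rate, and providers typically charge for data coming out but not going in, which is exactly why per-genome data volumes dominate the bursting math.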
“The cost per genome from an infrastructure point of view has dropped significantly, especially in the last couple of years. If you go ‘all in’ with the cost view and look at sequencing, informatics, and computation, all of these things are shrinking in cost. But the part we’re still finding challenging from a price perspective is how much all the downstream analytics and informatics cost—not just the price of those tools, but even getting a sense of how they fit into the cost per genome model, since there are so many variables.”
Lowey noted that he would be willing to share more about the model and how they break down investments over time per genome, so we’ll certainly follow up with TGEN at a later date for more—and find out how the Dell Fluid Cache offering stands up in the real world.