Many Life Sciences Workloads, One Single System

The trend at the high end, from supercomputer simulations to large-scale genomics studies, is to push heterogeneity and software complexity upward while driving infrastructure overhead down. That might sound like a case of dueling forces, but progress is being made on a unified framework for running multiple workloads simultaneously on one robust cluster.

To put this into context from a precision medicine angle, Dr. Michael McManus shared his insights from the years he spent designing infrastructure for life sciences companies and research organizations. Those fields have changed dramatically in just the last five years in terms of data volumes, compute requirements, algorithmic complexity, and regulatory pressure. Ultimately, McManus says, the trend toward simplification of infrastructure is strong, even if getting to the point where that is possible has been a long journey.

In both genomics and drug discovery, infrastructure design used to be more complex and decentralized because workflow steps were so often rooted in specific hardware elements, which led to data-movement bottlenecks and a less streamlined path to results. However, two elements have entered the ecosystem that can carve clearer paths through the infrastructure thicket and create a unified workflow, hardware and software alike. These new forces are, according to McManus, Intel's Scalable System Framework, an approach that defines a single architecture for multiple simultaneous workloads, and the addition of machine learning into that mix.

For genomics use cases, he says, it used to be customary to take the raw data off sequencing machines, move that data to a system that converted it into a file containing the list of variants, then process the variant file with other software, such as an annotation pipeline followed by an interpretation step, often on a different computing solution. This data movement proved to be yet another bottleneck.
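To make that hand-off problem concrete, here is a minimal sketch of how such a staged, multi-system workflow might look. The host names, paths, and stage scripts (`call_variants.sh`, `annotate_and_interpret.sh`) are hypothetical placeholders, not a real pipeline; the point is that every transition between systems is a bulk data copy.

```python
# Hypothetical sketch of a legacy, multi-system genomics workflow.
# Hosts, paths, and stage scripts are illustrative only.
import subprocess
from pathlib import Path

SEQUENCER_OUT = Path("/sequencer/runs/sample_001")   # raw reads land here
VARIANT_HOST = "variant-cluster"                      # system that calls variants
ANNOT_HOST = "annotation-server"                      # separate annotation system

def copy_to(host: str, src: Path, dest: str) -> None:
    """Each hand-off between systems is a bulk copy -- the bottleneck McManus describes."""
    subprocess.run(["scp", "-r", str(src), f"{host}:{dest}"], check=True)

def run_remote(host: str, command: str) -> None:
    """Run one workflow stage on the remote system that owns that stage's hardware."""
    subprocess.run(["ssh", host, command], check=True)

# 1. Move raw sequencer output to the variant-calling system.
copy_to(VARIANT_HOST, SEQUENCER_OUT, "/data/raw/sample_001")

# 2. Align reads and call variants there, producing a variant (VCF) file.
run_remote(VARIANT_HOST,
           "call_variants.sh /data/raw/sample_001 /data/vcf/sample_001.vcf")

# 3. Copy the variant file to yet another system for annotation and interpretation.
run_remote(VARIANT_HOST,
           "scp /data/vcf/sample_001.vcf annotation-server:/data/in/sample_001.vcf")
run_remote(ANNOT_HOST, "annotate_and_interpret.sh /data/in/sample_001.vcf")
```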

Using the single system framework as a springboard for running multiple simultaneous workloads across the larger workflow, genomics infrastructure can now be defined within one system for all of those previously disparate elements: sequence informatics, annotation, and interpretation. Further, there is significant promise that the largely human-led filtering and classification could be handled by machine learning algorithms, which can run on the same hardware platform as the rest of the genomics workflow. The point is simpler hardware (and thus simpler management), higher efficiency, and faster time to result.
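Consolidated on one cluster with a shared parallel filesystem, the same stages reduce to function calls over common storage rather than copies between machines. The sketch below is purely structural; the stage functions are placeholders under that assumption, not real tools.

```python
# Illustrative sketch of the same workflow consolidated on one cluster with a
# shared filesystem (e.g., Lustre); stage functions are placeholders, not real tools.
from pathlib import Path

SHARED = Path("/lustre/projects/precision_med/sample_001")  # visible to every node

def call_variants(raw_dir: Path) -> Path: ...      # sequence informatics stage
def annotate(vcf: Path) -> Path: ...                # annotation stage
def interpret(annotated_vcf: Path) -> dict: ...     # interpretation / ML filtering stage

raw = SHARED / "raw"               # sequencer output already on shared storage
vcf = call_variants(raw)           # no copy: output lands beside the input
report = interpret(annotate(vcf))  # downstream stages read the same filesystem
```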

“In genomics, there is growing interest in machine learning to decipher variant patterns in a person’s sequenced genome. Of the millions of identified variations that are compared against a reference genome, many are not relevant to the person’s medical issue. And these variations tend to occur in patterns that vary by different populations and subgroups. Machine learning could be used to filter through these different patterns and could assist in the interpretation process,” McManus says. “When you have three or four million variants, you have to understand this filtering and classification process is largely manual at present. Researchers can make use of a 56-gene list from the American College of Medical Genetics and Genomics (ACMG) to quickly screen for rare diseases.” The goal, he explains, is to speed the time to filter and classify those variants, leading to faster precision medicine results, and to do so on a system that is tied from start to finish to the entire genomic workflow.
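One way to picture the triage McManus describes: screen the millions of called variants against a curated gene list such as the ACMG list, then let a trained classifier prioritize what remains. The sketch below is illustrative only; the column names, features, and pre-trained model are assumptions, and only a few genes from the ACMG list are shown.

```python
# Minimal, illustrative variant triage: gene-list screen followed by ML ranking.
# Column names, feature choices, and the pre-trained model are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

ACMG_GENES = {"BRCA1", "BRCA2", "MLH1", "MSH2", "TP53"}  # small subset, for illustration

def screen_acmg(variants: pd.DataFrame) -> pd.DataFrame:
    """Keep only variants falling in genes on the curated reporting list."""
    return variants[variants["gene"].isin(ACMG_GENES)]

def rank_variants(variants: pd.DataFrame, model: RandomForestClassifier) -> pd.DataFrame:
    """Score remaining variants by predicted clinical relevance and sort."""
    features = variants[["allele_frequency", "conservation_score", "impact_score"]]
    scored = variants.assign(relevance=model.predict_proba(features)[:, 1])
    return scored.sort_values("relevance", ascending=False)

# variants_df would hold the three to four million called variants for one genome,
# and the classifier would be trained on previously interpreted cases:
# shortlist = rank_variants(screen_acmg(variants_df), trained_model)
```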

Barry Davis, General Manager of Intel’s Technical Computing team, adds, “Machine learning algorithms do not need special hardware in our view. You don’t need add-in cards for those algorithms when machine learning is really, from a system level, just another workload running on that machine, albeit an important one. There are cases when specialized hardware will be needed, which we are addressing with workload-optimized solutions such as our Nervana Deep Learning Engine or an FPGA-based solution.” Those architectures, which tie a custom processor to a standard Intel Xeon processor, are either still forthcoming or, in the case of FPGAs, recent additions to the datacenter, but the same approach applies to the company’s own self-hosted accelerated compute engine, the Intel Xeon Phi processor. Xeon Phi does not require an offload model like other accelerators and can fit seamlessly into a standard CPU cluster to serve certain parts of the workload, with a resource management solution like Intel HPC Orchestrator dividing the work as required by the application.

“Genomics is typically done on a Xeon, and machine learning can also be done on a Xeon, along with Intel Xeon Phi processors as an accelerator for part of that workload,” McManus says. “A single system architecture does not mean the whole cluster has to be homogeneous. You can now imagine a cluster with a mix of Xeon and Xeon Phi nodes, all interconnected with the Intel Omni-Path fabric and a parallel file system like Lustre, and see how a collection of separate job queues can feed work to the appropriate CPUs running simultaneously in the system for genomics and machine learning workloads alike.” He says that once the Xeon + FPGA and Xeon + Nervana chips enter the market, there will be an even more powerful collection of components to tie together in the same standardized system to run the entire collection of workflows.
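A rough picture of those separate job queues: under a Slurm-style scheduler (the kind typically bundled with OpenHPC-derived stacks such as Intel HPC Orchestrator), the genomics and machine learning stages can simply target different partitions of the same cluster while sharing the same filesystem. The partition names and batch scripts below are assumptions for illustration.

```python
# Illustrative dispatch of two workload types to different partitions of one
# cluster via a Slurm-style scheduler; partition names and scripts are hypothetical.
import subprocess

def submit(script: str, partition: str, job_name: str) -> None:
    """Queue a batch script on the requested partition (e.g., Xeon vs. Xeon Phi nodes)."""
    subprocess.run(
        ["sbatch", f"--partition={partition}", f"--job-name={job_name}", script],
        check=True,
    )

# Variant calling and annotation run on the general-purpose Xeon partition...
submit("call_variants.sbatch", partition="xeon", job_name="genomics-sample_001")
submit("annotate.sbatch", partition="xeon", job_name="annotate-sample_001")

# ...while ML-based variant classification is queued on the Xeon Phi partition,
# with both reading the same Lustre filesystem over the Omni-Path fabric.
submit("classify_variants.sbatch", partition="phi", job_name="ml-filter-sample_001")
```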

This is a boon for users, but also for the OEMs, Davis explains, noting that they can amortize their investment in a hardware solution across multiple workloads instead of following today’s appliance-driven model, in which each key workload gets its own specific system and the customer ultimately has to manage the side effects of that kind of heterogeneity.

Being able to support the variety of workloads from genomics, drug discovery, medical imaging, and other precision medicine-focused areas on a unified single system architecture can deliver far greater efficiency, says McManus. This will let research and pharma professionals get back to addressing other pressing challenges in the field, from the computational science side (developing more robust machine learning and deep learning algorithms) to more esoteric barriers, including validating new approaches to these problems in the face of FDA regulations.

Ultimately, McManus concludes, the key is to let researchers and drug discovery experts get out of the infrastructure management business and back into their core work. The single system architecture approach, by enabling a diversity of workloads, including machine learning algorithms, to operate simultaneously, is a key factor in speeding those discoveries.

Sponsored Content from Intel.
