From its start in 2000, The Centre for Genomic Regulation (CRG) has been the largest genomic research center in Spain and among the largest in Europe. The diversity of research in human, animal, and plant genome studies, coupled with a broad collaborative approach that requires the sharing of data and resources, creates some computational and data management challenges.
To address some of these issues, particularly as they relate to moving data and applications between different kinds of infrastructure, from workstations to local clusters to supercomputing resources like the Barcelona Supercomputing Center (BSC) and on to Amazon’s cloud, CRG has deployed software containers – ones that mesh well with its existing workflow management frameworks while still allowing flexibility to deploy on whatever infrastructure the job requires.
This container approach solves another important problem for genomics centers like CRG, which is the tricky art of reproducibility when workflows are run against different hardware environments and with different versions and releases of genomics software, which is constantly changing. As with other kinds of science, genomics researchers have to be able to reproduce results of their experiments to attest to their validity, and achieving this necessary result is difficult – if not impossible – if researchers are deploying different versions of application software.
Automating The Workflow, Containing The Software
As at other large genomics centers, the computational problems at CRG generally have far less to do with processor or memory capacity than with data movement on-site and among collaborators, where analysis procedures, protocols, pipelines, and source code must be shared. At CRG, the need for collaboration will become even more critical when the center gets access to a half million drug recipes that need to be analyzed against clinical and time series data.
For Paolo Di Tommaso, a software engineer in the Comparative Bioinformatics Group at CRG, the best way to route around current and future challenges, especially when using the center’s own 2,000-core cluster, the Mare Nostrum supercomputer at BSC, or the cloud – all the while preserving reproducibility – was to mesh the Grid Engine workload scheduler from Univa with Docker containers and thereby automate a large part of the process of submitting and managing jobs.
For its internal processing, CRG has a cluster with 150 nodes, the bulk of which are Xeon-based machines from Hewlett Packard Enterprise, with additional Xeon machines from IBM and Dell. The HP ProLiant nodes do the bulk of the heavy lifting and are linked to each other over a 10 Gb/sec Ethernet network, while the Dell and IBM machines are hooked up with older 1 Gb/sec Ethernet. (A lot of genomics work is not latency or bandwidth dependent, so this is not a huge issue.) For storage, CRG has two network-attached storage arrays. An 800 TB SFA1000 array provides disk space for user data, including snapshots, but does not have replication, and a larger EMC Isilon array has 1.9 PB of capacity and provides both snapshotting and replication of user data and application software stacks. The cluster talks to the storage using the NFS v3 protocol. The CRG cluster nodes run Linux, and Grid Engine is used to schedule applications on top of the cluster.
The integration of Docker containers with Grid Engine done by CRG predates the commercial support of Docker containers by Univa, and is related to a parallel tool developed by Di Tommaso and his team called Nextflow. The project was started three years ago and is akin to Kubernetes, the container management system open sourced by Google, except that Nextflow has been designed explicitly by CRG to manage the deployment of distributed software stacks on bare metal or within Docker containers.
In fact, as Univa CEO Gary Tyreman recalled, the Univa team was inspired by the work done at CRG to integrate Docker containers with Grid Engine as it developed its own commercial variant, called Grid Engine Container Edition. This Container Edition, now being tested at CRG, automates the deployment of Docker Engine runtimes on top of clusters and then dispatches container images packed with application software from Docker Hub, the public repository run by Docker (the company), or Docker Hub Enterprise, the on-premises variant of the repository. For performance reasons, using a local Docker repository is preferred over using the public one.
The Nextflow workflow engine allows the CRG team to deploy computational pipelines using the same code on any infrastructure, despite the difference in schedulers between, for example, BSC (which uses IBM’s Platform Computing LSF scheduler on its Mare Nostrum cluster, which started out based on IBM Power processors but now has Intel Xeons mixed in) and another site that uses Grid Engine. Users define their pipelines and hook together different tasks, and Nextflow takes over parallelizing those tasks across the desired resources. “The deployment of a pipeline is boiled down to a few images in the end,” Di Tommaso says.
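To make the idea of one pipeline definition running under different schedulers concrete, here is a minimal Python sketch. It is not Nextflow's code or API. It only illustrates how a single task description could be turned into either a Grid Engine `qsub` command or an LSF `bsub` command that runs the same Docker image. The `Task` class, the image name, the `align` command, and the `smp` parallel environment name are all hypothetical placeholders.

```python
import shlex
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str      # job name shown by the scheduler
    image: str     # Docker image that holds the task's software (hypothetical)
    cpus: int      # number of cores requested
    command: str   # command line to run inside the container (hypothetical)


def container_argv(task: Task) -> list[str]:
    """Wrap the task in `docker run` so every site executes the same image."""
    return ["docker", "run", "--rm", task.image, *shlex.split(task.command)]


def grid_engine_argv(task: Task, parallel_env: str = "smp") -> list[str]:
    # Assumption: parallel environment names are site-specific; `smp` is a placeholder.
    return ["qsub", "-N", task.name, "-pe", parallel_env, str(task.cpus),
            "-b", "y", *container_argv(task)]


def lsf_argv(task: Task) -> list[str]:
    return ["bsub", "-J", task.name, "-n", str(task.cpus), *container_argv(task)]


# One task definition, several possible back ends.
SUBMITTERS = {"gridengine": grid_engine_argv, "lsf": lsf_argv}


def submit_argv(task: Task, scheduler: str) -> list[str]:
    """Return the argv that would submit `task` to the named scheduler."""
    return SUBMITTERS[scheduler](task)


if __name__ == "__main__":
    task = Task("align", "example/aligner:1.0", 4, "align --input reads.fq")
    for scheduler in SUBMITTERS:
        print(scheduler, "->", shlex.join(submit_argv(task, scheduler)))
```

The container portion of the command is identical in both cases. Only the scheduler-specific wrapper changes, which is the property the article attributes to Nextflow pipelines.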
At the core of Nextflow, which runs seamlessly with Univa Grid Engine and other schedulers, is Docker. Di Tommaso and his team at CRG first found Docker when it was a small GitHub project and were early collaborators on the code as they sought to make it robust enough to span their large-scale computing environments. The big challenge was making the scheduler aware of the Docker images on each node. To get around this, CRG built an extension to Univa Grid Engine that tracks which images are present and allows for simpler job submission, with the number of cores, the memory required, and other data all bundled with the container request. The scheduler can then figure out which nodes are needed, download the Docker image from the registry, and move forward.
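The scheduling step described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the CRG extension itself. The `Node` and `JobRequest` classes are hypothetical, and the policy of preferring a node that already holds the image is an assumption added for the example. The article only says that the scheduler works out which nodes are needed and downloads the image from the registry.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    name: str
    free_cores: int
    free_memory_gb: int
    cached_images: set = field(default_factory=set)  # images already on the node


@dataclass(frozen=True)
class JobRequest:
    image: str       # Docker image the job runs in
    cores: int       # cores requested at submission
    memory_gb: int   # memory requested at submission


def pick_node(job: JobRequest, nodes: list) -> Optional[Node]:
    """Choose a node with enough free resources, preferring one that has the image."""
    candidates = [n for n in nodes
                  if n.free_cores >= job.cores and n.free_memory_gb >= job.memory_gb]
    if not candidates:
        return None
    # False sorts before True, so nodes that already cache the image come first;
    # ties are broken in favor of the node with the most free cores.
    return min(candidates,
               key=lambda n: (job.image not in n.cached_images, -n.free_cores))


def dispatch(job: JobRequest, nodes: list):
    """Reserve resources and report whether the image must be pulled first."""
    node = pick_node(job, nodes)
    if node is None:
        return None  # the job stays queued
    needs_pull = job.image not in node.cached_images
    if needs_pull:
        node.cached_images.add(job.image)  # stands in for pulling from the registry
    node.free_cores -= job.cores
    node.free_memory_gb -= job.memory_gb
    return node.name, needs_pull
```

With two nodes, one of which already caches the image, the first dispatch lands on the cached node and needs no pull. Later jobs fall back to the other node and trigger a pull once.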
While CRG is a long-time Grid Engine shop, other environments use different workload schedulers and CRG wants to be able to make use of any computational resource at the disposal of its researchers. The challenge was to create the same workflow across different schedulers, which adds another kink to the reproducibility problem. Further, with so many pieces of software running in tandem across a genomics research workflow and in mixed hardware environments, the burden was always configuring different platforms and infrastructure. Containerizing this was a way to isolate elements of the pipeline execution – and the need arose for CRG to do just this well before containers were all the rage, and hence the Nextflow project.
“For one genomic analysis you need to code different programs to do different things, so there are always several tasks with their own dependencies,” explains Di Tommaso. “With Nextflow, you bring together the tasks you want, even those with different binaries, dependencies, and other factors, and build a Docker image for each task. Nextflow knows how to parallelize the workflow, how to switch the images off and on, and how to distribute the containers across the nodes or blades or cloud instances without doing much.”
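As a rough illustration of what Di Tommaso describes, the sketch below takes a small pipeline in which each task has its own Docker image and its own upstream dependencies, and works out which tasks can run at the same time. The task and image names are hypothetical, and grouping tasks into waves with Python's standard `graphlib` module is a simplification chosen for this example, not a description of how Nextflow schedules work internally.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: task name -> (Docker image, upstream tasks it depends on).
PIPELINE = {
    "trim":   ("example/trimmer:1.0", []),
    "align":  ("example/aligner:2.1", ["trim"]),
    "qc":     ("example/qc:0.9", ["trim"]),
    "report": ("example/report:1.0", ["align", "qc"]),
}


def parallel_waves(pipeline):
    """Group tasks into waves; every task in a wave can run concurrently."""
    sorter = TopologicalSorter({task: deps for task, (_, deps) in pipeline.items()})
    sorter.prepare()  # raises graphlib.CycleError if the dependencies loop
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks whose dependencies are all done
        waves.append([(task, pipeline[task][0]) for task in ready])
        sorter.done(*ready)  # mark the wave finished to unlock the next one
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(parallel_waves(PIPELINE), start=1):
        print(f"wave {number}:", ", ".join(f"{t} [{image}]" for t, image in wave))
```

Here `align` and `qc` share a wave because both depend only on `trim`, so they could be dispatched to different nodes at once, each in its own container.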
Overall, the CRG team sees a bright future for containers, but Di Tommaso says that Docker might not be the container solution of choice. Projects such as Shifter, which was developed at NERSC, are more tuned to the needs of MPI and HPC applications. Still, finding Docker early and working with the Univa Grid Engine framework allowed CRG to solve its pipeline problems before they compounded. The goal, Di Tommaso says, will be to have everything containerized at CRG for greater collaboration and reproducibility, which means it would be useful if other centers worked toward similar ends.
Right now, the Nextflow engine is still in limited use at CRG because of security concerns about Docker images. Further, since the team is still using Docker 1.0, the setup is still considered experimental, but Di Tommaso and his team are working to bolster the software and its use at the center. According to internal benchmark tests, wrapping code in Docker containers rather than submitting it in bare metal fashion to the cluster adds about a 4 percent overhead for the genomics codes at CRG – a performance hit that many HPC shops would be willing to take given the advantages of simplifying software distribution and automating complex application workflows.
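To put the 4 percent figure in perspective, here is a small worked calculation in Python. The 10-hour job is a hypothetical example, not a number from CRG's benchmarks, and the sketch assumes the overhead applies uniformly to the whole runtime.

```python
def containerized_runtime(bare_metal_hours: float, overhead: float = 0.04) -> float:
    """Estimate runtime in a container, assuming a uniform fractional overhead."""
    return bare_metal_hours * (1.0 + overhead)


if __name__ == "__main__":
    bare = 10.0  # hypothetical bare metal runtime, in hours
    in_container = containerized_runtime(bare)
    extra_minutes = (in_container - bare) * 60
    print(f"{in_container:.1f} hours in a container, about {extra_minutes:.0f} extra minutes")
```

Under those assumptions, a 10-hour job would take roughly 24 minutes longer when run in a container.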
Maria Chatzou, a bioinformatics engineer at CRG, says that the combination of Grid Engine, Docker, and Nextflow has been transformational for the research institution.
In the old days, Chatzou explains, researchers running genomics code pushed out a pipeline by hand, manually setting up the software images and keeping track of all of the software releases, and then had to constantly check to see what had failed and what had run. Now, the software is put in containers; Nextflow automates when each part of the genomics processing chain is fired up and shut down – and what versions of the software are used – and Grid Engine grabs that software and schedules it on the CRG cluster so that all jobs submitted by researchers run efficiently. Instead of researchers managing different parts of the workflow by hand, it is all automated as a single workflow, and that makes the overall genomics pipeline – stepping through the processes, not the application runtimes – anywhere from 10X to 20X faster.
“It is changing our lives – the impact it’s having on our work is amazing,” says Chatzou. “I am speaking to that as a developer on this and also as a user. I have many different genomic pipelines I need to run, with different programs popping out every time. This stack lets me wrap everything up. In combination with Nextflow, which is the framework to ship things around without having your main code buried in it, it changes everything. You set it and forget it.”
For those interested in the combination of Nextflow and Docker on a Univa Grid Engine base, there is a detailed benchmark report available here.