Inside an Evolving Genomics Cluster
February 20, 2015 Nicole Hemsoth
From processors and memory to the network and beyond, architectural choices to support large-scale genomics research are often driven as much by trial and error as by empirical knowledge about what will work for a demanding application set.
Statistical genomics requires snappy rehashing of a central, consistent dataset (which can, luckily, be managed in cache) against an ever-evolving set of variables to find correlations. “The problem isn’t with processing genomic data; it’s fairly easy to understand and accelerate from a computational perspective. It’s rather the discovery, the statistical genetics, which is the top-down comparison of many thousands of genomes all at once,” Robert Esnouf, who co-manages the research computing core within the Wellcome Trust Centre for Human Genomics, tells The Next Platform. “That kind of ‘all against all’ comparison, where it’s not clear what we’re looking for, is what we were most focused on when we thought about this system.”
Selecting a system that balances the I/O and compute requirements for statistical genomics requires a “best of all worlds” approach, according to Esnouf. His teams have learned tough lessons about what works (and doesn’t) for high performance computing clusters handling variable research workloads, all of which have found their way into a new Fujitsu system for the center that strives for a balance between memory, compute, and storage performance to meet these particular challenges in genetics research.
Esnouf and his team have seen their share of multi-purpose clusters come and go, but for the kinds of statistical genomics workloads that set the center apart as a genome research center in Europe, they had to perform a balancing act with the I/O and compute requirements, all the while striving for the density needed to house the system with its additional cores and new GPFS-based DDN GRIDscaler SFA12k appliance. After shopping a number of potential clusters to support its software, Wellcome chose the Fujitsu BX900, an ultra-dense blade server system that packs 18 server blades, 8 connection blades, 6 power supply units, and 2 management blades into a tight 10U-high chassis.
The original Nehalem cluster that sat in the same floor space in three racks was swapped out for two of the Fujitsu cubes. The active rear-door cooling leads to around 18 kW per rack, which is in line with what Fujitsu told them to expect. The new system’s “Ivy Bridge” Xeon 2600 v2 CPUs provide a “2.6X performance increase over its predecessor built in 2011. It boasts 1,728 cores of processing power, up from 912, with 16 GB 1866 MHz memory per core compared to a maximum of 8 GB per core on the older cluster,” according to Esnouf.
“When you can’t guarantee your load will be 100 percent all the time, you’re better off taking slightly fewer but faster cores. And it’s also important to think about how many cores per NIC you’re looking at on the back in terms of I/O, which definitely influenced our decision around this,” he explained. On the memory front, statistical genomics is one of a growing number of areas that will continue to require massive amounts of memory. The 108 16-core machines provide 256 GB each, but he notes that the need for larger memory configurations will only continue to increase as the center grows.
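The figures quoted for the new cluster can be cross-checked with a little arithmetic. The sketch below uses only numbers from the article; the per-core speedup at the end assumes the 2.6X figure refers to aggregate throughput, which the article does not specify, so treat that last value as a rough illustration rather than a measured result.

```python
# Figures quoted in the article.
NODES = 108
CORES_PER_NODE = 16
MEMORY_PER_NODE_GB = 256
OLD_CLUSTER_CORES = 912
QUOTED_SPEEDUP = 2.6  # assumption: aggregate throughput, not per-core


def cluster_summary(nodes, cores_per_node, memory_per_node_gb):
    """Derive totals from per-node figures."""
    total_cores = nodes * cores_per_node
    return {
        "total_cores": total_cores,
        "memory_per_core_gb": memory_per_node_gb / cores_per_node,
        "total_memory_tb": nodes * memory_per_node_gb / 1024,
    }


summary = cluster_summary(NODES, CORES_PER_NODE, MEMORY_PER_NODE_GB)

# Growth in core count relative to the older cluster.
core_growth = summary["total_cores"] / OLD_CLUSTER_CORES

# If the quoted speedup is aggregate, this is the implied gain per core.
implied_per_core_gain = QUOTED_SPEEDUP / core_growth

print(summary)
print(f"core growth: {core_growth:.2f}x")
print(f"implied per-core gain: {implied_per_core_gain:.2f}x")
```

The per-node numbers reproduce the 1,728 cores and 16 GB per core that Esnouf cites, which fits his point about favoring fewer, faster cores: under the stated assumption, the core count alone accounts for less than the quoted performance gain.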
The team also made the switch from 40 Gb/sec QDR to 56 Gb/sec FDR InfiniBand as part of the move to the new cluster, but not without some bumps in the road. “We found on the older cluster that using gear from both QLogic and Mellanox creates some major problems.” Part of the RFP for the new cluster required compatibility with both QLogic/Intel and Mellanox gear going forward, even though the team has settled on FDR as its central fabric.
Wellcome Trust also learned other lessons from its prior X86 cluster that it applied to the bidding on the new cluster. For instance, they quickly realized the limitations of using the NFS file system for statistical genomics. Accordingly, they looked at GPFS as the primary file system for the new cluster since, according to Esnouf, Lustre loses its shine when it comes to distributed metadata handling, especially when so much of the data sits in cache waiting to be matched. Distributed metadata scales well on GPFS, especially on a moderate-sized cluster like the one at Wellcome Trust, according to Esnouf.
While Lustre was a consideration, the organization’s own benchmarks with statistical genomics found that GPFS was a clear leader, in large part due to the scalability and performance fed by distributed metadata. This is not to say that Lustre lacks this capability, although it is new in Lustre 2.5. What Lustre offers is not distributed metadata in the GPFS sense; instead, it uses multiple metadata servers, each of which shares responsibility for part of the namespace. This means you can allocate I/O to different metadata servers. On that same note, Lustre has caught up in its newest incarnation to include hierarchical storage management (HSM), which was a differentiating feature for GPFS for some time. Again, the system choices were based on the needs of statistical genomics, which is a perfect fit for a file system with robust distributed metadata.
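The idea of several metadata servers each owning part of the namespace can be pictured with a toy model. The sketch below is not Lustre's API or its actual placement logic; the server names, directory paths, and the `metadata_server_for` helper are all hypothetical, and it only illustrates how a path might resolve to whichever server owns the deepest assigned subtree.

```python
from pathlib import PurePosixPath

# Hypothetical assignment of namespace subtrees to metadata servers.
SUBTREE_TO_MDS = {
    "/": "mds0",
    "/projects/genomics": "mds1",
    "/projects/microscopy": "mds2",
}


def metadata_server_for(path, assignments=SUBTREE_TO_MDS):
    """Return the server owning the deepest assigned subtree containing path."""
    current = PurePosixPath(path)
    while True:
        key = str(current)
        if key in assignments:
            return assignments[key]
        if current == current.parent:
            # Reached the top without finding an owner (e.g. a relative path).
            raise KeyError(f"no metadata server assigned for {path!r}")
        current = current.parent


for example in (
    "/projects/genomics/cohort1/chr1.vcf",
    "/projects/microscopy/run7",
    "/home/user/notes.txt",
):
    print(example, "->", metadata_server_for(example))
```

In this picture, metadata traffic for each project directory lands on a different server, which is the sense in which load can be allocated across metadata servers.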
As a point of particular interest, Esnouf’s team decided to go against DDN’s suggestion to use 8 MB stripes for data, and instead trimmed those down to 1 MB. “When you make a request to GPFS for 8 MB at a time from the disks in the hopes that’s what you’re going to want, that’s great for certain applications that take data sequentially. But if you look at how these algorithms for genetics are built, they take data in small chunks, and a lot of these algorithms are the same way since they were developed against NFS, among other things.” Had Wellcome Trust taken DDN’s suggestion, a lot of the transferred data would have been discarded anyway, so even though it has lost some of that “headline performance” by shifting to 1 MB data stripes, the organization has what Esnouf describes as a more flexible and resilient system.
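The trade-off Esnouf describes can be approximated with a simplified model: if every small read pulls in a whole stripe, the useful fraction of transferred data shrinks as the stripe grows. The sketch below assumes aligned requests and whole-stripe fetches, and ignores caching, prefetching, and any sub-block handling in the file system. The 64 KiB request size is a hypothetical example, not a figure from Wellcome Trust.

```python
import math

KIB = 1024
MIB = 1024 * KIB


def bytes_transferred(request_bytes, block_bytes):
    """Bytes fetched if every touched block is read in full (aligned request)."""
    return math.ceil(request_bytes / block_bytes) * block_bytes


def useful_fraction(request_bytes, block_bytes):
    """Share of the transferred bytes that the application actually asked for."""
    return request_bytes / bytes_transferred(request_bytes, block_bytes)


request = 64 * KIB  # hypothetical small read issued by a genetics algorithm

for block in (8 * MIB, 1 * MIB):
    fraction = useful_fraction(request, block)
    print(f"{block // MIB} MiB stripe: {fraction:.2%} of transferred data is used")
```

Under these assumptions the smaller stripe wastes far less bandwidth on small random reads, at the cost of the sequential throughput that larger stripes favor.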
With OCF serving as the integrator, the Fujitsu BX922 compute nodes leverage the native Fujitsu SynfiniWay middleware for virtualization, orchestration, and scheduling. Esnouf says that while Wellcome Trust is leveraging the built-in tools from Fujitsu, his teams are used to gluing the stack together with open source code. As many HPC shops do, Wellcome Trust is using the CentOS variant of Red Hat Enterprise Linux as the operating system on the nodes, xCAT for management, and the open version of Grid Engine to orchestrate workloads on the cluster. The cluster runs an electron microscopy project alongside the statistical genomics jobs, and this application has GPU acceleration. As that electron microscopy workload grows, Wellcome Trust will need to look into a supported version of Grid Engine from Univa to keep pace. He says the team has already had Univa on site to evaluate its next steps into GPU accelerators and, potentially, Xeon Phi accelerators from Intel.
It was this ability to work around specific demands for the file system, network, and DDN storage appliance that ultimately led to Fujitsu’s capture of the deal, following bids from eight other major HPC hardware vendors (including Bull, Hewlett-Packard, IBM, Dell, and others Esnouf didn’t name), none of whom could touch Fujitsu on density, price, integration, and support. At the end of the day, Esnouf said that support is what really set Fujitsu apart, much to his own surprise. It is, after all, not one of the “standard” European HPC vendors, at least if you look at the vendor share in the United Kingdom in particular, but he took a chance anyway.
This sort of national bias for one HPC vendor over another is nothing new or unexpected, but choosing Fujitsu for a UK academic institution was the real challenge, and it made the Wellcome Trust team wary about the impending deal. The concern, he said, was around the company’s ability to offer the kind of ongoing support required, but by working with Fujitsu on a custom plan for the center they’ve “learned a lot from each other” and are having a successful relationship thus far.
As an aside to the HPC nationalism issue raised above, while the Top 500 is by no means a fair representation of actual system share for a particular vendor (these machines represent only a fraction of clusters in the world, and just those who submitted benchmark results that appear on the bi-annual list), one glance reveals that Fujitsu is not a prominent HPC vendor in Europe. The only non-Asia/Pacific region cluster on the list is a 16,384 core machine at Spain’s renewable energy resources laboratory. However, in the UK, there is one notable machine, installed to support HPC Wales, which runs a wide range of workloads for Wales’ national supercomputing projects. Fujitsu scored this award in 2010, which allowed for the first phase of the machine.
To be fair, Fujitsu does have some notable European ties, not to mention the fact that, as stated above, the Top 500 isn’t necessarily a perfect measure of the real influence of one vendor or another (since it covers only supercomputing, and only the highest end). Those ties come from machines built in Germany (as well as Japan) and from its historical relationship with what used to be Siemens’ server division, which Fujitsu eventually absorbed.
On that note, the Wellcome Trust team is also looking at new tools and system approaches that will support a sister Big Data Institute that will open its doors within the next five years. Here, Esnouf says, is where the two institutions will collaborate on using Hadoop, Ceph, OpenStack, and other tools that are not as common in traditional HPC environments like the one it has just built for running statistical genomics code. It could be that the lessons Wellcome Trust is learning alongside Fujitsu lead to other HPC and data-intensive system wins for Fujitsu in Europe, but Esnouf says this new big data center plan is already attracting a lot of notice from many other hardware vendors. So if Fujitsu wants to win the deal, it will have to fight for it.