Shared Memory Pushes Wheat Genomics To Boost Crop Yields
May 10, 2016 Timothy Prickett Morgan
Wheat has been an important part of the human diet for the past 9,000 years or so, and depending on the geography can comprise up to 40 percent to 50 percent of the diet within certain regions today.
But there is a problem. Pathogens and changing climate are adversely affecting wheat yields just as Earth’s population is growing, and the Genome Analysis Center (TGAC) is front and center in sequencing and assembling the wheat genome, a multi-year effort that is going to be substantially accelerated by some hardware and updated software.
With the world’s population expected to hit 10 billion by 2050 or so, there is actually not a lot of time to waste in the Seeds of Discovery project that TGAC, which is based in Norwich Research Park in East Anglia in the United Kingdom, is engaged in. The wheat genome, which has 21 chromosomes, is five times larger than the human genome, and it will take time for wheat breeders working downstream to figure out how to tweak wheat so it can better withstand current and future conditions so that people do not go hungry.
TGAC, which gets its funding from the Biotechnology and Biological Sciences Research Council in England, provided the computing power that let researchers deliver a first draft of the wheat genome last November, a milestone in the project. That effort delivered a genome assembly with 98,974 genes, which TGAC reckons is about 91 percent of the total genome for the plant; it weighs in at 13.4 GB, which is pretty fat for a text file. But work still needs to be done to fill in gaps in the wheat genome, which is not expected to be fully completed until around 2018 or so. (BBSRC invested over £509 million in various life sciences projects in 2014 and 2015 and has spent over £100 million alone on wheat research in the past decade.)
The genomics center has been an SGI customer since its inception, owing to the nature of its genomics sequencing and assembly applications and its desire to have a large shared memory to work within.
“At the time, from our point of view, SGI was the only vendor that had systems that allowed us to run our software with a large shared memory architecture,” Tim Stitt, head of scientific computing at TGAC, tells The Next Platform. “To give you an idea, the wheat genome typically requires somewhere between 4 TB and 6 TB of memory per run, and it will run for maybe three or four weeks to give us an assembly.”
The software that TGAC is using is a heavily modified genome assembly code called DISCOVAR, which was created by the Broad Institute genomics lab in Boston. Development on that code, which is written in C++, ceased a while back at Broad, but the algorithm team at TGAC, under the direction of Bernardo Clavijo, has created a heavily multithreaded variant of the genome assembly routine, called w2rap, that was tuned for a shared memory architecture. Back when TGAC was founded, if you wanted to stick with the X86 instruction set, then the only option was SGI UV machines. (There is no technical reason why w2rap could not have run on enterprise-class IBM Power or Oracle or Fujitsu Sparc NUMA machinery, which all had large memory footprints back then and still do today.)
TGAC has used a mix of machines for its various workloads, but it looks like a pair of UV 300 systems beefed up with newer Xeon processors and great gobs of flash storage will be big enough to take on a lot of the heavy lifting for the wheat genome work.
Up until now, this work was done by a pair of UV 100 machines, each with 768 cores and 6 TB of shared memory, and a larger UV 2000 machine with 2,560 cores and 20 TB of shared memory. The UV 2000 is partitioned into three distinct machines, two with 8 TB of memory and one with 4 TB. That gave TGAC effectively five machines to do its big wheat sequencing and assembly work, and it was running up against capacity limits on these physical and partitioned machines.
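To make that capacity squeeze concrete, here is a minimal Python sketch that tallies the five machines just described and checks which of them could hold an assembly run of a given size. The memory figures come from the text above; the `Machine` class, the machine labels, and the `fits` helper are hypothetical names made up for illustration, and the sketch ignores any memory the operating system or other jobs would need.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Machine:
    name: str        # illustrative label, not an actual hostname
    memory_tb: int   # shared memory available to a single run


# Two physical UV 100s plus the three partitions of the UV 2000.
MACHINES = [
    Machine("UV 100 #1", 6),
    Machine("UV 100 #2", 6),
    Machine("UV 2000 partition A", 8),
    Machine("UV 2000 partition B", 8),
    Machine("UV 2000 partition C", 4),
]


def fits(run_tb, machines=MACHINES):
    """Return the names of machines whose memory is at least run_tb."""
    return [m.name for m in machines if m.memory_tb >= run_tb]


total_tb = sum(m.memory_tb for m in MACHINES)

if __name__ == "__main__":
    print(f"Total shared memory: {total_tb} TB")
    for run_tb in (4, 6):  # the 4 TB to 6 TB range quoted for a wheat run
        print(f"{run_tb} TB run fits on: {', '.join(fits(run_tb))}")
```

Under these assumptions, a run at the top of the quoted range fits on only four of the five machines, and on the UV 100s it does so with no headroom at all.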
With the pair of UV 300 systems that TGAC is installing now, the machines are shifting to the latest “Haswell” Xeon E7 v3 processors from Intel, which were launched last May and which are due to be replaced imminently with a “Broadwell” Xeon E7 v4 update. The jump from the “Nehalem-EX” Xeon 7500 processors used in the UV 100s and the “Sandy Bridge” Xeon E5s used in the UV 2000, plus the move to faster DDR4 memory and the addition of 32 TB of P3700 flash-based SSDs for accelerating the file systems on the UV 300s, are expected to deliver a dramatic improvement in performance. Each of the UV 300 machines will have 12 TB of shared memory across 256 cores and, even with that much lower core count, could yield as much as an 80 percent improvement in performance on the w2rap workloads. (The UV 300s, as we explained in our overview of the UV architecture last year, make use of the NUMAlink 7 interconnect, which tightly couples the main memory and reduces latency.)
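One way to see how different the new machines are is to work out memory per core from the figures quoted in this article. The short Python sketch below does that arithmetic; the `gb_per_core` helper is a hypothetical name, and the conversion assumes 1 TB = 1,024 GB, which the article does not specify.

```python
GB_PER_TB = 1024  # assumption: binary terabytes

# (cores, shared memory in TB) per machine, as quoted in the article.
SYSTEMS = {
    "UV 100": (768, 6),
    "UV 2000": (2560, 20),
    "UV 300": (256, 12),
}


def gb_per_core(cores, memory_tb):
    """Shared memory available per core, in GB."""
    return memory_tb * GB_PER_TB / cores


ratios = {name: gb_per_core(*spec) for name, spec in SYSTEMS.items()}

if __name__ == "__main__":
    for name, ratio in ratios.items():
        print(f"{name}: {ratio:.0f} GB per core")
```

On these numbers the older UV 100 and UV 2000 machines both work out to 8 GB per core, while the UV 300 comes to 48 GB per core, which is consistent with a workload that is bound more by memory footprint than by core count.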
Stitt has just taken delivery of the machines and is now managing the work of recompiling the w2rap application using Intel’s Parallel Studio compiler suite as well as the open source GCC compilers to see which one will give the best results on the new iron. TGAC uses Red Hat Enterprise Linux on its machines, and will continue to do so, and uses the open source SLURM scheduler to dispatch work to the systems.
With the pair of UV 300s, TGAC will have one of the largest installations of SSDs in the world that have been equipped with NVM-Express ports linking the flash to the CPU complex. This is important, says Stitt, because when starting a genome assembly, the first thing w2rap does is take two input files that are somewhere on the order of 512 GB each and feed them into main memory. This will obviously be a lot faster from NVM-Express flash drives than from the disk-based file systems used in the UV 100s and UV 2000 that TGAC has been employing. How much this flash affects overall performance, Stitt is not sure because TGAC is still tweaking the code for the new hardware. But he hopes it will allow a sequencing and assembly run to take three weeks or less instead of four weeks.
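For a rough feel of what that initial load involves, the Python sketch below estimates how long it takes to stream two 512 GB input files into memory at a given sustained read rate. The file sizes come from Stitt's description; the two bandwidth figures are placeholder assumptions chosen only to show the calculation, not measurements from TGAC's disk or flash file systems, and `load_time_seconds` is a hypothetical helper.

```python
GB = 10**9  # decimal gigabytes, as storage vendors usually quote them


def load_time_seconds(total_bytes, bandwidth_bytes_per_s):
    """Time to read total_bytes at a constant sustained bandwidth."""
    return total_bytes / bandwidth_bytes_per_s


# Two input files of roughly 512 GB each, per the article.
input_bytes = 2 * 512 * GB

# ASSUMED sustained read rates in GB/s; substitute measured values.
ASSUMED_RATES_GB_S = {
    "disk-based file system (assumed)": 1.0,
    "NVM-Express flash (assumed)": 10.0,
}

if __name__ == "__main__":
    for label, rate in ASSUMED_RATES_GB_S.items():
        minutes = load_time_seconds(input_bytes, rate * GB) / 60
        print(f"{label}: about {minutes:.1f} minutes")
```

The calculation covers only the one-time read of the input files; any benefit from faster file system access later in a run would need to be measured on the real workload.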
If TGAC needs more oomph than the current UV 300 setups provide, it does have some expansion room. The two UV 300 machines each have only sixteen sockets with 16-core Xeon E7 v3 processors, and the system expands to up to 32 sockets. With the socket expansion, the UV 300 system can have 24 TB of shared memory, and Jorge Titinger, CEO at SGI, tells The Next Platform that the machine can be expanded to 48 TB of shared memory with super-dense 128 GB memory sticks that SGI has started shipping to selected customers. We presume that SGI will also be making available a processor upgrade option in the UV 300s once Intel ships the Broadwell Xeon E7s sometime later this year, so the peak core count should go up a little higher. (The Haswell and Broadwell Xeons share the same sockets.) It stands to reason that a new UV 400 will be launched sometime in 2017 or 2018 when the “Skylake” Xeon E7 v5 processors are delivered, which will use a different socket and have a different memory architecture than the current Xeons.
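The expansion figures above reduce to simple arithmetic, sketched here in Python. The socket counts, the 16 cores per socket, and the 128 GB stick size are from the text; the helper names are hypothetical, the 1 TB = 1,024 GB conversion is an assumption, and so is the premise that the 48 TB figure applies to the full 32-socket configuration.

```python
CORES_PER_SOCKET = 16  # 16-core Xeon E7 v3, per the article
GB_PER_TB = 1024       # assumption: binary terabytes


def total_cores(sockets):
    """Core count for a machine with the given number of sockets."""
    return sockets * CORES_PER_SOCKET


def dimms_needed(memory_tb, dimm_gb):
    """Number of memory sticks of size dimm_gb needed for memory_tb."""
    return memory_tb * GB_PER_TB // dimm_gb


current_cores = total_cores(16)          # as installed at TGAC
max_cores = total_cores(32)              # fully expanded system
dimms_for_48tb = dimms_needed(48, 128)   # with 128 GB sticks
# ASSUMPTION: 48 TB is reached in the 32-socket configuration.
dimms_per_socket = dimms_for_48tb // 32

if __name__ == "__main__":
    print(f"Installed: {current_cores} cores; expanded: {max_cores} cores")
    print(f"48 TB needs {dimms_for_48tb} sticks of 128 GB, "
          f"or {dimms_per_socket} per socket across 32 sockets")
```

If those assumptions hold, the 16-socket figure matches the 256 cores quoted for each TGAC machine, and a full 32-socket build would double that before any Broadwell upgrade.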
As Titinger has discussed with us in the past, SGI has a goal of getting the UV product line to drive at least 10 percent of its revenues in the current fiscal year, up from a low single digit percentage in years past, and he tells us that the company is on its way to meeting that goal. “I think the opportunity for the UV 300 architecture is going to grow,” he says. At the moment, the in-memory uses of UV 300 machines for SAP HANA and Oracle databases are driving about 60 percent of the UV revenues, and the remaining 40 percent is dominated by larger scale UV 3000 systems running more traditional simulation and modeling workloads, with a smattering of HPC work on UV 300 systems like that being done at TGAC.
While the UV machines at TGAC have been focused on the wheat genome assembly work, which is the most computationally challenging task at TGAC, this is not steady state work. The commodity cluster that the genomics lab uses for downstream analysis of the genomes it sequences and assembles on the UV iron, however, is kept extremely busy by researchers analyzing the wheat genome and runs at a very high utilization.
This cluster is based on server nodes from Supermicro, which use AMD Opteron processors and have 128 GB of memory in each node, with a total of more than 4,000 cores. This commodity cluster is networked with plain vanilla Ethernet interconnects because, as Stitt put it, there is very little MPI in bioinformatics code so you don’t need fancier InfiniBand or Ethernet links with RDMA. This cluster will be refreshed later this year, according to Stitt, and it stands to reason that TGAC will go for a system using the latest “Broadwell” Xeon E5 v4 processors from Intel.
“This sort of system is more about price/performance for us,” he says. “We don’t need high amounts of shared memory for these throughput jobs, we just need lots of cores, and at the time whatever processor configuration gives us the best price/performance, that is probably what we will go with.”
TGAC is also on the bleeding edge of optical processing research for genome sequencing, which we profiled last March.