One Small Shop, One Extreme HPC Storage Challenge
February 7, 2017 Nicole Hemsoth
Being at the bleeding edge of computing in the life sciences does not always mean operating at extreme scale. For some shops, advancements in new data-generating scientific tools require forward thinking at the infrastructure level—even if they don't require a massive cluster with exotic architectures. We tend to cover much of what happens at the extreme scale of computing here, but it's worth stepping back and observing how dramatic problems in HPC are addressed in much smaller environments.
This “small shop, big problem” situation is familiar to the Van Andel Research Institute (VARI), which recently moved from a genomics and molecular dynamics focus into the world of ultra-high resolution cryo-electron microscopy. The center acquired one of the most powerful Cryo-EM microscopes in the world and, as a result, was pushed into thinking big about hardware (storage in particular) within a relatively small cluster environment.
To better understand the intricacies of planning a high performance computing environment around such a powerful data-generating microscope, we talked with the institute’s Zachary Ramjan, who says the research teams using his gear value very high performance as well as the agility to quickly adopt new codes and tools from a largely open-source community base. To this end, he has architected VARI’s environment around these requirements, which include the need to store and manage 13 petabytes of data daily from the 2,000+ core clustered compute, equipped with some GPU nodes (interestingly, the team had Nvidia Tesla K80s but found better performance for their workloads using the workstation graphics-focused GTX 1080s, which are Pascal at the core but at less than one quarter of the cost).
Ramjan was initially tasked with building a high performance computing environment for VARI that was focused on genomics and molecular dynamics, but with the addition of the Cryo-EM, he said they had to take a fresh look at scalable storage. The center had been an Isilon shop in earlier days, but he says the pricing was far higher than he could justify, which pushed him to look at both GPFS (now called IBM Spectrum Scale) and Lustre—the two leading high performance file systems. He says that while Lustre had some benefits, the number of people required to manage it was going to be far too high. Further, he felt locked into disk geometries with the Lustre implementation he surveyed, and with growing data requirements, being stuck buying out-of-date spinning disk to fit the same geometries was not an option. While the team did not opt to buy SSDs, it did select GPFS delivered via a DDN GS7K parallel file system, as well as the company’s WOS object storage to match the private cloud delivery model Ramjan wants to solidify internally for his users.
“We went with GPFS because it was more robust in terms of functionality and required far less manpower. Initially we didn’t have tiers, but with Cryo-EM we wanted to have a multi-tiered solution. GPFS looked cleaner and easier here, and there are protocol nodes as part of the product, which meant we could also do NFS, Ceph, and object all under the same project.”
The storage environment when the shop was devoted to genomics and molecular dynamics was very siloed, Ramjan explains. “We had chunks of storage everywhere, which wasn’t economical or manageable or high performance. When I came in, we took the HPC approach of getting a big chunk of high performance storage that could be easily scaled out and added to incrementally, but then we went and bought this massive Cryo-EM microscopy technology and all of a sudden the storage projections were way off—we needed to increase 60% in size immediately.” He says they took a tiered approach, since not everything needed to sit on ultra-fast storage; with the policy-driven WOS object storage, cold data could live on the cheaper, dense tier, and the team was able to adapt to the demands of the new microscope.
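The policy-driven tiering Ramjan describes can be sketched in a few lines. The following is a hypothetical illustration of the idea—cold data migrating to a cheap, dense object tier based on access age—not DDN's actual policy engine; the threshold, file names, and `choose_tier` function are all assumptions for the sake of the example.

```python
from datetime import datetime, timedelta

# Hypothetical policy: files untouched for more than `COLD_AFTER`
# become candidates for the cheaper, dense object tier.
COLD_AFTER = timedelta(days=90)

def choose_tier(last_access: datetime, now: datetime) -> str:
    """Return 'fast' (parallel file system) or 'cold' (object store)."""
    return "cold" if now - last_access > COLD_AFTER else "fast"

# Classify a small, made-up catalog of Cryo-EM datasets by access age.
now = datetime(2017, 2, 1)
catalog = {
    "run_001.mrc": datetime(2016, 9, 1),   # months-old frames
    "run_412.mrc": datetime(2017, 1, 28),  # active dataset
}
placement = {name: choose_tier(ts, now) for name, ts in catalog.items()}
# placement == {"run_001.mrc": "cold", "run_412.mrc": "fast"}
```

In a real deployment this decision lives in the storage system's policy engine rather than user code, but the shape is the same: a rule evaluated over file metadata drives migration between tiers.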
Even though Ramjan’s users are more focused on their research than on wrapping their heads around a new approach to getting their results, he says the investments made in OpenStack and preparing for a more object-oriented world will eventually pay off. Currently, users at the center can access compute through the traditional PBS batch processing environment or via the OpenStack private cloud. “A lot of their work is open source and many of those tools only run in an old school cluster environment. But a lot of the new stuff that might not be mainstream in the scientific community yet will look more like Spark-based processing versus MPI. As those tools become more popular and fit into a cloud model—an object model versus a cluster one—we’ll be ready to go infrastructure-wise if they’ll adopt it.”
Even though OpenStack is available for researchers at VARI, Ramjan says it will take time before they begin using it in greater numbers. They are very sensitive to the performance of their applications, and he says that despite what he has read, the virtualization overhead is not insignificant. “When you’re at the fringes of computing with maximum CPU, RAM, and I/O, these problems with overheads really show up. At the extremes, I think there’s a higher penalty in terms of overhead.”
“Deploying an end-to-end storage solution has allowed us to elevate the standard of protection, increase compliance and push the boundaries of science on a single, highly scalable storage platform. We’ve also saved hundreds of thousands of dollars by centralizing the storage of our data-intensive research and a dozen data-hungry scientific instruments on DDN. With all these advantages it is easy to see why DDN is core to our operation and a major asset to our scientists.”
While VARI’s environment is facing many of the same challenges other HPC centers have in terms of performance for demanding applications, it is not operating at the same scale in terms of node counts or novel approaches. However, its journey through infrastructure requirements to meet the needs of data-generating tools is representative of that of other organizations in the life sciences and beyond. At the base, Ramjan says, is having the ability to be agile—to add new resources quickly. At the base of that stack is storage, which, if VARI is any indication, can provide a backbone to keep building and scaling up and out.