Tech Refined at ARM Targets HPC I/O Bottlenecks

As an FPGA engineer and developer based in a vibrant semiconductor and biotech center, the University of Cambridge in the UK, Rosemary Francis has been able to track new developments and emerging needs in both industries in real time. For those two areas in particular, understanding I/O issues at scale and stopping them before they nest deeper into systems is a major challenge. As she began to see the connections between such system problems at various UK companies and institutes, her small team began engineering around the problem, founding Ellexus along the way.

In her work with Commsonic, Simba HPC, ARM, and others, Francis quickly noticed how much money semiconductor designers were spending on software tools that simply didn’t work. The cause was not bugs, she says, but the tools’ complexity and the weighty IT infrastructure required to support them. One mistake or misunderstanding could bring the entire system down. With this as a springboard, she set to work designing tools to debug the various problems and built those initial concepts into what is now known as the Breeze framework, which has since been adopted by companies like ANSYS, among others, to make complex software work across users with different dependencies and infrastructure. “Little things can break really complex things in mysterious ways,” Francis explains, pointing to the early role their core product, Breeze, played at other companies, including longtime collaborator ARM.

At its core, Breeze profiles the I/O patterns of applications, while a newer piece of that stack, Mistral, runs against a live cluster and provides up-to-date information about which applications are causing problems for storage before those problems overload the storage system. Ellexus was able to get some early funding and initial reference customers beyond ARM, including the Sanger Institute and others, although the focus for the last several years has been almost strictly on semiconductor companies and bioinformatics shops.

These two segments share quite a few challenges on the infrastructure side, Francis tells The Next Platform. “The overall IT infrastructure is similar in bioinformatics and the semiconductor space as well as from a compute perspective. Chip designs are big, so are genomes, and in both cases you’re dealing with designs or genomes that are large, involve a lot of repetition in the tooling, pattern matching, and similar data movement… These are also shared compute areas with users submitting jobs that need access to a large pool of shared data and the workloads aren’t spread across multiple machines; they’re usually single-machine jobs so from an architecture view, we haven’t had to dramatically re-target our technology.” Further, she explains that although bioinformatics might favor Lustre and semiconductor folks prefer Isilon, NetApp or GPFS, the problem is the same: a single user’s access can still crash the system for everyone else.

If this sounds like a familiar problem, or one that seems profoundly disruptive, that is because it is. So why aren’t there ways of correcting it already? There are monitoring tools and ways to pinpoint I/O problems, but these often catch issues only after they have caused damage. Tracing everything back to a single application or user in a complex workflow is a challenge, and one that Mistral seems to have solved. The hard part is integrating tightly enough with an HPC workflow to do so in real time, triggering an alert before the storage system is flooded.
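To make the idea of alerting on a single job before storage is flooded more concrete, here is a minimal sketch of a per-job sliding-window rate check. It is an illustration only and does not reflect how Mistral is implemented; the class name `IoRateMonitor`, its parameters, the job IDs, and the byte limits are all hypothetical.

```python
from collections import defaultdict, deque


class IoRateMonitor:
    """Hypothetical per-job I/O rate tracker (not Mistral's actual design)."""

    def __init__(self, limit_bytes_per_sec, window_sec=10.0):
        self.limit = limit_bytes_per_sec
        self.window = window_sec
        # job_id -> deque of (timestamp, nbytes) samples inside the window
        self.samples = defaultdict(deque)

    def record(self, job_id, timestamp, nbytes):
        """Store one I/O sample; return True if the job is over its limit."""
        queue = self.samples[job_id]
        queue.append((timestamp, nbytes))
        # Discard samples that have aged out of the sliding window.
        while queue and queue[0][0] <= timestamp - self.window:
            queue.popleft()
        # Average rate over the full window length.
        rate = sum(n for _, n in queue) / self.window
        return rate > self.limit


if __name__ == "__main__":
    # Assumed limit: 100 MiB/s averaged over a 10 second window.
    monitor = IoRateMonitor(limit_bytes_per_sec=100 * 1024**2, window_sec=10.0)
    for second in range(5):
        for job_id, nbytes in (("job-a", 10 * 1024**2), ("job-b", 400 * 1024**2)):
            if monitor.record(job_id, float(second), nbytes):
                print(f"t={second}s: {job_id} exceeds its I/O limit")
```

The point of the sketch is that attribution happens per job rather than per storage system: the well-behaved job never trips the limit, while the heavy one is flagged by name, which is the kind of application-level signal the article describes.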

The question is where Breeze and Mistral might fit into an HPC center that is already likely to have something set up for this task. “It depends on the organizations, but what we most often find is a mix of homegrown solutions to tackle the I/O workload issues. There are tools for doing I/O profiling, but they do so in a different way. In such cases, some of the HPC people we are talking to look at those as something in addition to Mistral and Breeze because what something like Mistral is solving is largely around scaling. When users aren’t getting the performance from the storage they need and buying more storage doesn’t work either, they start looking at things like this.”

The ARM collaboration extends back to the start of Ellexus. ARM was using Breeze to get detailed profiles of its infrastructure and design flows, but eventually required something that could work on a live system, and so Mistral was born. “The same problem ARM was having was the same for everyone. It’s easy for one user to submit a job that has too much I/O associated with it that it crashes the cluster. Since then we’ve extended this tackling of the problem of live I/O monitoring and load balancing in addition to Breeze.”

As for Ellexus, being centered at the heart of the UK’s biotech and semiconductor businesses has been an advantage. Francis says that six years ago, she and her small team were able to raise enough capital to get off the ground and work with the existing system there in the UK. They are, however, targeting a range of other applications in oil and gas and have already done work with NERSC on the Edison supercomputer. The one thing they’re missing to tap into the broader HPC market is the ability to work with MPI applications, something that the small team is hard at work on now. Once that happens, their reach into the HPC segment could grow, as monitoring and I/O profiling tools that work against a live system are few and far between for high performance computing. DataDirect Networks and others offer them as part of their storage solutions, but the goal is to provide users with deeper application-level interaction rather than more simplified alerts.
