Containing the Complexity of the Long Tail

HPC software evolves continuously. Those now finding themselves on the front lines of HPC support are having to invent and build new technologies just to keep up with the deluge of layers upon layers of software stacked on software, and software is only part of the bigger picture.

We have talked in the past about computational balance and the challenges of unplanned data; both are real and tangible issues. Now, in addition to all of that, those in support roles at the sharp end of research are also faced with what is increasingly becoming a new problem of “software balance”, or worse still, “unplanned software”.

HPC and computing resources now reach an ever wider audience, extending out from the “usual suspects” in traditional modeling and simulation into what is now being called the proverbial “long tail of science”. Supporting this long tail requires a new breed of research software engineers, research computing facilitators and scientists. Even the computer systems themselves are getting monikers related to such long tails: the recent NSF machines Jetstream and Comet, two very visible examples, are each named as a hat tip to the idea.

Jim Kurose, professor of computer science and director at the National Science Foundation, has spoken at length about this broadening of science and the critical need for systems to keep in lockstep with this important refocusing of science and research. Looking into these “non-traditional environments”, it is crystal clear that recent investments from the NSF and others are driving many more diverse and enthusiastic researchers to turn to their local research computing groups for support of their brand new, and often unplanned, in-silico experiments, software and data.

As a real-world, high-profile example of this type of need for advanced scientific computing and research computing support, The Next Platform recently spoke with Ruth Marinshaw, CTO for Research Computing, and Vanessa Sochat, Research Software Engineer, of Stanford University. Their group, the Stanford Research Computing Center (SRCC), manages traditional HPC and related cluster resources: over 30,000 cores, more than 2,000 GPUs and about 20 petabytes of storage. More importantly, they also provide support for a considerably varied set of users across a huge range of research groups throughout the Stanford campus.

Marinshaw and Sochat are essentially canonical examples of this new breed of experts living right at the sharp end of advanced scientific computing support. In asking them how they manage to support science, one key overarching theme stood out: complexity. “The job requires not only keeping servers functioning, but functioning correctly and functionally working for the needs of large and varied user groups. While a good portion of our user base (more than 10,000 individuals) use the various clusters that we support, not all of them do. We fully recognize that traditional HPC, batch-scheduled resources are appropriate for some – but not for all. So we have other environments, and we depend on our faculty and dual science/CI experts to help link us to today’s science needs”.

It’s a “complexity soup” of equipment, software and people. Sochat went on to explain some of the challenges they face in supporting their community: “From the perspective of the user, the expectation is to log in to a secure environment and have access to any and all software and resources, and then submit a ticket and get a quick response if there is an issue”.

Science as a Service

As consumers of a multitude of professional internet services, we have become accustomed to ubiquitous and seamless access to technology. Take, for example, email. Compare how we consumed email in the early 2000s with the systems we use today: we used to tolerate outages, failures and downright clunky VT100-based interfaces, and we no longer do. That expectation of quality, performance and availability (rightly so) bleeds right over into our support of research computing interfaces, software and data. We have come to expect more from our computer systems, and we would no longer expect to have to “roll our own” email. We expect our systems to be interactive, elegant and fast. As electronic notebooks continue to develop in science, there is an even more insatiable appetite to abstract complex software away so researchers can focus on the science. That very appetite has driven the development of a number of “containerized” systems, in an attempt to both contain the complexity and sustain the appetite.

In trying to understand more about containers for science, The Next Platform asked Sochat what she thought was specifically “cool about containers”. Her response gets at the very subtlety of the challenge as she neatly reframes the question into one of providing “Science as a Service”:

“They are cool because a user is empowered to learn and build the software on their own. If I need a library or custom installation on my cluster, I don’t need to ask an administrator to install it for me. I build a container, and I use my container. I can have confidence that it can run elsewhere without needing to start again from scratch. Containerized workloads get closer to this goal of “Science as a Service,” where it could be possible to not know a thing about servers, programming, or containers, but be a really impeccable scientist that can write grants, get funded, collect data, and then analyze it with optimally developed pipelines delivered via containers.”

Performance

There is a further subtle, and more important, detail to containing complex software that we need to explain. As a real-world example, The Next Platform used the popular Linux utility “strace” to count the system calls needed simply to launch a large scientific software application. The results were surprising: over 66,000 system calls were made just to “start” the application. No science had taken place; the application had merely been opened, and nothing else. Over 20% of those system calls were open(), stat() and their friends, as the executable trawled the file system for the huge number of required binaries, libraries and modules. That is a serious amount of overhead, and it drags down the performance of your application before it even starts to carry out any of your actual intended science.
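For readers curious to repeat that kind of measurement on their own software, the rough sketch below counts file-lookup system calls at startup using strace’s summary mode. It assumes a Linux system with strace installed, and uses a plain Python interpreter launch as a stand-in for a large scientific application; swap in your own executable to see its cost.

```python
#!/usr/bin/env python3
"""Rough sketch: tally the system calls an application makes just to start.

Assumes Linux with strace available; "python3 -c pass" below is only a
stand-in for a real scientific application.
"""
import subprocess

# -c prints a per-syscall summary table, -f follows child processes.
# strace writes that summary to stderr.
target = ["python3", "-c", "pass"]
result = subprocess.run(["strace", "-c", "-f"] + target,
                        capture_output=True, text=True)

# System calls involved in locating and opening files on the filesystem.
file_calls = {"open", "openat", "stat", "fstat", "lstat", "newfstatat", "access"}

total = file_total = 0
for line in result.stderr.splitlines():
    parts = line.split()
    # Column layout follows recent strace releases: the call count sits in
    # the fourth column and the syscall name comes last. Skip headers and
    # separator rows, which carry no digits there.
    if len(parts) < 5 or not parts[3].isdigit():
        continue
    name, calls = parts[-1], int(parts[3])
    if name == "total":
        total = calls
    elif name in file_calls:
        file_total += calls

print(f"system calls at startup: {total}")
print(f"file lookup calls:       {file_total} "
      f"({100 * file_total / max(total, 1):.0f}%)")
```

On most modern Linux distributions the classic open() shows up as openat(), which is why both appear in the filter.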

So containing these huge science applications not only reduces their complexity and improves the environment for your customers, as Sochat described, but also has the potential to significantly improve performance. More performance, more science.

This subtle detail was clearly demonstrated by Greg Kurtzer at the recent GTC conference. The Next Platform sat in on Kurtzer’s talk and was keen to see our suspicions confirmed. Kurtzer, the author of Singularity and now CEO of Sylabs, which is commercializing it, presented a detailed example of Python interpreter invocation times (github). The underlying research, which involved launching over 1,000 concurrent Python interpreters, was carried out in partnership with Kurtzer by Wolfgang Resch, a computational biologist at the National Institutes of Health HPC Core Facility.
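That kind of measurement is easy to approximate locally, even if it will not reproduce the scale of the NIH experiment. The sketch below times repeated interpreter launches on the host and, if available, inside a container; the image name and the singularity exec line are placeholders for illustration, not the exact setup Resch and Kurtzer used.

```python
#!/usr/bin/env python3
"""Rough timing sketch: how long does it take just to start an interpreter?

An illustration of the kind of measurement discussed above, not the NIH
benchmark itself. The container image path is an assumption; adjust it for
your own site.
"""
import subprocess
import time

RUNS = 50

def time_launches(cmd, runs=RUNS):
    """Return the mean wall-clock seconds to run `cmd` to completion."""
    start = time.perf_counter()
    for _ in range(runs):
        subprocess.run(cmd, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return (time.perf_counter() - start) / runs

# Bare interpreter start-up on the host (shared filesystem, module stack, etc.)
host = time_launches(["python3", "-c", "pass"])
print(f"host python start-up:      {host * 1000:.1f} ms")

# The same interpreter launched from inside a container image (hypothetical path).
container_cmd = ["singularity", "exec", "python.simg", "python3", "-c", "pass"]
try:
    contained = time_launches(container_cmd)
    print(f"container python start-up: {contained * 1000:.1f} ms")
except (FileNotFoundError, subprocess.CalledProcessError):
    print("singularity or the image is not available; skipping container run")
```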

Provenance

Once you “contain” your application, additional capabilities quickly become apparent that didn’t exist when your software was “hard coded” to the operating system. You can reuse the container. You can send it to a non-expert user. You can effectively share it, along with the hard work that went into making it. This isn’t new territory; Docker et al. have been doing this for years. But what Sochat and Marinshaw are taking on in partnership with Kurtzer is something very specific around the challenge of scientific data provenance.

“It is still non trivial to move and organize data. It’s nontrivial to run an analysis once, and then do it again,” Sochat argues. To that end, their team devised a system called “Singularity Hub”, work described in their recent publication “Enhancing reproducibility in scientific computing: Metrics and registry for Singularity containers”. They care not only about containing workloads and keeping them close to the metal for performance with Singularity, but also about being able to “do it again”. Sochat and team are now extending the Hub into a full-blown “Registry” for science, described in her publication “Singularity Registry: Open Source Registry for Singularity Images”. The Hub, now on version 2.0, hosts over 600 “collections”, and the Registry aims to extend this by, as they describe it, “empowering researchers, academics, developers, and all users alike to share their containers”.
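In practical terms, “doing it again” from Singularity Hub comes down to pulling a published image by its shub:// URI and running it. The sketch below uses the stock vsoch/hello-world demonstration collection; treat the output filename and the exact pull syntax as placeholders, since they vary with the Singularity version installed.

```python
#!/usr/bin/env python3
"""Rough sketch: re-run a shared analysis by pulling its container from
Singularity Hub.

Assumes Singularity is installed; the collection name, output filename and
precise pull syntax may differ at your site and with your Singularity version.
"""
import subprocess

IMAGE = "hello-world.sif"

# Pull a published collection by its shub:// URI (vsoch/hello-world is the
# standard demonstration collection on Singularity Hub).
subprocess.run(["singularity", "pull", IMAGE, "shub://vsoch/hello-world"],
               check=True)

# Running the image executes whatever entry point its author defined, which
# is the point: the environment travels with the analysis.
subprocess.run(["singularity", "run", IMAGE], check=True)
```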

On the horizon, the team sees a future where science, data and software become ever more seamless, making it easier for researchers to engage with advanced computing systems. For instance, Sochat is finishing the first round of work, and a publication, on a specification for a Scientific Filesystem that describes a file system organization, environment namespace and interactions for modular scientific containers. This is a natural extension of Singularity, the Hub and the Registry, and it is clearly needed to support the ever-growing, complicated stack of software and libraries that modern science requires. They have also started on a new tool they call “Tunel”, which bridges endpoints via Globus so that containers and data in the Hub can be moved more seamlessly from one site to another. It is complex engineering, but very much needed, especially within modern and ever more distributed HPC environments.
