When it comes to using the public cloud, few market segments have a better story to tell than the life sciences sector. In this arena, data sizes have grown exponentially as the cost to generate and acquire the relevant datasets has been pushed down, and while this has meant an increase in the amount of computational resources required, there are now far more options than ever before, both in terms of combining in-house cores with cloud clusters and on the software orchestration and application fronts.
Many of the widely used bioinformatics applications for research in genomics, drug discovery, and other offshoots of the expanding life sciences tree are parallelizable, which makes them a more suitable fit for running in a cloud environment. All of the fundamental elements appear, at least from a distance, to be in place, especially since Amazon Web Services and other large-scale cloud resource providers are bolstering their enterprise appeal with far more sophisticated data management, application framework, storage, compute, and security tools. Still, there are some gaps. Accordingly, life sciences and genomics organizations are increasingly pairing their cloud approaches with vendors that offer domain-specific cloud services, just as many expected would happen once public cloud adoption became a mature proposition.
In addition to lacking some tailored, non-generalized compliance and security features, what life sciences companies are missing is a management system for dealing with petabytes of data and billions of objects, says Andrew Carroll, lead scientist at DNAnexus. “In addition, there are the challenges of operating at scale: it’s not that difficult to do something that will work once or a hundred times, but when it comes to having the same system work hundreds of thousands or millions of times, there are a lot of random errors and other lower-level problems that turn out to be a big deal. From a bit flip or node failure, when you’re running on the order of millions of jobs, this is a major issue.”
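Carroll's point about scale can be illustrated with some simple arithmetic. The sketch below assumes every job fails independently with the same small probability; the one-in-a-million figure is a hypothetical chosen for illustration, not a number from DNAnexus, and real failures are rarely this uniform.

```python
import math


def prob_any_failure(per_job_failure_prob: float, n_jobs: int) -> float:
    """Probability that at least one of n_jobs fails.

    Assumes independent jobs with an identical failure probability
    (an illustrative simplification). Computes 1 - (1 - p)^n using
    log1p/expm1 so that tiny probabilities do not lose precision.
    """
    return -math.expm1(n_jobs * math.log1p(-per_job_failure_prob))


def expected_failures(per_job_failure_prob: float, n_jobs: int) -> float:
    """Expected number of failed jobs under the same assumptions."""
    return per_job_failure_prob * n_jobs


if __name__ == "__main__":
    p = 1e-6  # hypothetical per-job failure probability
    for n in (1, 100, 100_000, 1_000_000):
        print(f"{n:>9} jobs: P(at least one failure) = {prob_any_failure(p, n):.6f}")
```

Under these assumptions, a failure that is practically invisible across a hundred runs becomes more likely than not (roughly a 63% chance) across a million, which is why retry and error-handling logic matters so much at this scale.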
DNAnexus is one of a handful of companies that are harnessing the Amazon cloud on behalf of users. It provides an environment that can be spun up relatively quickly with the right compliance and key management tools in place, and it lets developers port their code in and have it run on the most efficient machines inside Amazon EC2 for their workload demands, in terms of both time to solution and cost efficiency. While the compliance story is a strong one, what is interesting for us at The Next Platform is how the company’s end users, particularly on the genomics side, are making decisions about whether to build or buy their genomics and R&D infrastructure.
Carroll says that while many of DNAnexus’s larger-scale users already have clusters in-house, their on-site workloads tend to come in bursts, which means they need integrated ways to push workloads out into the cloud. But what strikes him most about these on-premises cluster users is how the cloud is making their existing hardware investments more valuable. “If you look at the efficiency of a local cluster, let’s say you’re running at 110%. This is not a good thing because that means there are wait times. For companies that are afraid of this scenario, they tend to overprovision, which on the other side, outside of those bursts where there’s maybe a 10,000 genome problem, the cluster might be used at 90% utilization. We’re seeing that bursting into the cloud is the most efficient way to use that combination of local and Amazon resources.”
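The bursting idea can be sketched in a few lines. This is a minimal illustration, not a description of how DNAnexus or any scheduler actually works: the function name `burst_plan`, the hourly demand trace, and the cluster size are all hypothetical.

```python
from typing import Sequence, Tuple


def burst_plan(hourly_demand_cores: Sequence[int], local_cores: int) -> Tuple[float, int]:
    """Fill the local cluster first and send any overflow to the cloud.

    Returns (local utilization as a fraction, cloud core-hours used).
    Each entry in hourly_demand_cores is the number of cores wanted
    during that hour (hypothetical data for illustration).
    """
    if not hourly_demand_cores or local_cores <= 0:
        return 0.0, sum(hourly_demand_cores)

    local_core_hours = 0
    cloud_core_hours = 0
    for demand in hourly_demand_cores:
        on_local = min(demand, local_cores)      # local cluster absorbs what it can
        local_core_hours += on_local
        cloud_core_hours += demand - on_local    # the rest bursts to the cloud

    utilization = local_core_hours / (local_cores * len(hourly_demand_cores))
    return utilization, cloud_core_hours


if __name__ == "__main__":
    # Hypothetical trace: steady load with one large burst in hour three.
    utilization, cloud_hours = burst_plan([40, 50, 300, 60], local_cores=100)
    print(f"local utilization: {utilization:.1%}, cloud core-hours: {cloud_hours}")
```

In this toy trace, a 100-core cluster handles the steady load and only the burst spills over, so nobody has to size the local cluster for the peak or make jobs wait for it.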
Interestingly, Carroll also sees some notable trends among the smaller life sciences companies that have never invested in their own cluster. “With these users, they’ve had the benefit of not coming from that world. That means they have a lot more free bandwidth that would have otherwise gone into managing their IT and infrastructure, and now their energy is spent on how they interact with the cloud.” This means the IT-oriented people at the company can shift focus away from just managing the metal to doing innovative new things on the applications, testing, and development fronts.
The other upside of not having cluster resources on-site is flexibility: as workloads change, so too do the computational demands. In-house infrastructure might be really good at one key application set, but that infrastructure (compute, memory, storage) is all fixed. “We are really opportunistic because we have a full buffet of choices from Amazon in terms of what processors we use, if we need a memory heavy approach, or need SSDs or more disk, for example,” Carroll explains. His team at DNAnexus runs each application that users work with on small samples of node types to determine the best operating environment, factoring in the users’ need to hit a solution on time and within budget. If a user wants a particular processor type or configuration, they have options to tweak it inside the system, but Carroll says most go with the tried-and-true defaults.
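The selection process Carroll describes, benchmarking on a sample of node types and then choosing within time and budget limits, might look something like the sketch below. Everything here is an assumption made for illustration: the node type labels, timings, and prices are invented, cost is modeled as simple per-hour billing for time actually used, and the real DNAnexus logic is not public in this article.

```python
import math
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class Benchmark:
    """Result of a small trial run on one node type (hypothetical values)."""
    node_type: str
    hours_per_sample: float
    dollars_per_hour: float


def pick_node_type(
    benchmarks: Iterable[Benchmark],
    n_samples: int,
    deadline_hours: float,
    budget_dollars: float,
    max_nodes: int,
) -> Optional[Tuple[str, int, float]]:
    """Return (node_type, nodes_needed, total_cost) for the cheapest
    option that meets the deadline and the budget, or None if none does."""
    best = None
    for b in benchmarks:
        # How many samples one node can finish, back to back, before the deadline.
        per_node = math.floor(deadline_hours / b.hours_per_sample)
        if per_node == 0:
            continue  # a single sample already misses the deadline
        nodes = math.ceil(n_samples / per_node)
        if nodes > max_nodes:
            continue
        cost = b.hours_per_sample * n_samples * b.dollars_per_hour
        if cost > budget_dollars:
            continue
        if best is None or cost < best[2]:
            best = (b.node_type, nodes, cost)
    return best


if __name__ == "__main__":
    trials = [
        Benchmark("cpu_heavy", hours_per_sample=2.0, dollars_per_hour=1.0),
        Benchmark("memory_heavy", hours_per_sample=1.0, dollars_per_hour=3.0),
        Benchmark("ssd_backed", hours_per_sample=4.0, dollars_per_hour=0.25),
    ]
    print(pick_node_type(trials, n_samples=100, deadline_hours=8,
                         budget_dollars=250, max_nodes=50))
```

The fastest node type is not necessarily the winner: in this made-up example the slowest, cheapest option still meets the deadline by running more nodes in parallel, which is the kind of trade-off a fixed in-house cluster cannot make.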
While the backend cloud hardware story and the build versus buy questions are interesting, at the end of the day, what these users care about, and what really drives their decision to consider DNAnexus, is the compliance, security, and application porting story. Carroll tells us that they have invested a great deal over the last few years in creating a system that can containerize (using LXC; Docker’s security issues were the limiting factor) and port around the custom environments that ensure HIPAA and other compliance guarantees, so that each machine is isolated and has a solid data provenance structure that allows all operations to be tracked and reported. While it’s true any company could go out and have its own engineers spin up EC2 clusters, when dealing with personal health data, it is no simple proposition, even though Amazon has a lot of this in place already to appeal to the life sciences set.
“It’s not a matter of whether they are HIPAA compliant at Amazon because that is just compliance and security of their machines. It has to happen at the data management level and when it’s petabytes of data we’re talking about, this matters even more at scale.” Carroll explains that if a company wanted to create its own cloud clusters on Amazon or another provider’s resources, it would take a team of skilled engineers several years to build what DNAnexus has constructed. And even then, he says, if they were able to do so, their teams would still be managing it. Going with a provider of genomics as a service like this allows the team at DNAnexus to focus on side elements that might otherwise be overlooked, including penetration testing and building new development tools to make application building and porting easier.
While the price is a sticky issue given the variability of hardware, applications, and data transfer (we did not explore this topic in depth), this is where the meat really is for users, at least when it comes to the build versus buy equation. We’ve covered this topic a bit here in the past, but for now, it’s safe to say that the domain-specific high-performance computing cloud is happening.