The first human genome, completed in 2003, took roughly $3 billion and thirteen years to sequence. Since then, the cost has dropped to around $1,000 and a genome can be sequenced in about a day. Aside from the sequencers and tooling required, the early efforts came with intense data demands, something that has not changed much over the years despite advances in compression and other techniques.
Considering that our DNA is 99.9 percent identical, this may come as a surprise, but finding that tiny fraction of a percent, the part that makes each of us unique, is still a demanding data wrangling task. As Jonathan Bingham, a product manager in the Google division that handles data science, startups, and scientific computing, notes, “At 100 GB per genome, if we wanted to read the DNA of every person in Moscow it would take 1.2 million terabyte hard drives” to handle the task.
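The arithmetic behind Bingham's figure can be checked with a short sketch. The 100 GB per genome and the one-terabyte drive size come from the quote. The population of roughly 12 million for Moscow is an assumption chosen to be consistent with the quoted result, as is the use of decimal units (1 TB = 1,000 GB).

```python
# Back-of-the-envelope check of the storage figure quoted above.
GB_PER_GENOME = 100              # from the quote
GB_PER_DRIVE = 1_000             # assumption: decimal units, 1 TB = 1,000 GB
MOSCOW_POPULATION = 12_000_000   # assumption: roughly 12 million residents


def drives_needed(people, gb_per_genome=GB_PER_GENOME, gb_per_drive=GB_PER_DRIVE):
    """Return how many drives are needed to hold one genome per person."""
    total_gb = people * gb_per_genome
    # Ceiling division: a partly filled drive still counts as a whole drive.
    return -(-total_gb // gb_per_drive)


if __name__ == "__main__":
    print(f"{drives_needed(MOSCOW_POPULATION):,} one-terabyte drives")
```

Under these assumptions the result is 1.2 million drives, which matches the figure in the quote.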
Like other large-scale datacenter operators, Google certainly knows how to handle data at massive scale. Interestingly, its most recent effort, called Google Genomics, uses existing frameworks and tools, most notably its SQL-driven BigQuery engine and MapReduce (the same tool used to build its search indexes), running on its own hardware infrastructure. As many in the genomics arena are already aware, Amazon Web Services, an early pioneer in bringing life sciences tools and high performance computing instance types to the fore, has been tackling the cloud-based genome analysis market for a number of years, meaning Google has some catching up to do, at least in terms of mind share and use cases.
In an effort to bolster its genomics cloud story (and well in advance of the announcement of Google Genomics), the search giant partnered with HPC middleware provider Cycle Computing to orchestrate the Broad Institute’s genomics work. It has only recently begun firming up the software stacks required for dense genomics pipelines, not to mention the range of other life sciences applications that the Google division is targeting.
While the investments are present and the efforts are well publicized, one can’t argue with Amazon Web Services’ relative strength in capturing the genomics and life sciences market and in retaining those customers. What is interesting about the AWS approach is that it quickly tackled security and privacy concerns, both legal ones and those rooted in company sentiment, and took swift action to bolster its regulatory and compliance model so that these became the center of its life sciences cloud story. AWS was also successful in lining up highly publicized use cases from genomics research centers, life sciences research hubs, pharmaceutical makers, and the medical community to prove by example that the cloud was a workable solution for at least some of their workloads. One can argue that Google, for once behind in a key technology area, will need to do similar footwork to ensure its cloud infrastructure has the same seals of official approval, and then make a big deal about big customer wins, as AWS has done successfully over the last five years in particular.
What is also worth noting about the differences between the two approaches is that AWS worked early and hard to carve out a niche with users who were far more accustomed to high performance computing hardware and tools than to general Java-based applications and familiar web interfaces. While Google’s familiarity to general developers is an advantage for wider adoption in more general enterprise segments, winning over the high performance computing set, particularly in life sciences (and not just because of regulatory and compliance concerns), will take some intensive effort. Looking at infrastructure alone, AWS has gone the extra mile to create specific hardware configurations for high performance computing users, including GPUs for scientific applications that require acceleration and a large number of instance types and networking options that reflect its recognition of the low latency requirements of such applications. These needs go beyond life sciences HPC, but again, this is something that Google’s cloud platform will need to focus on to get wide uptake from scientific computing users in life sciences research.
But to soften this for a moment: every market needs a competitive environment, and thus far, outside of some smaller cloud or HPC-as-a-service companies like Rescale, AWS stands alone in terms of the robust security, compliance, analysis, visualization, storage, and hardware options required to do true life sciences work in the cloud. Competition drives innovation, so on the flip side, one might expect that Google has watched and learned from how AWS tackled the market, especially when cloud fear (on security and privacy grounds) was at its height in the 2010 to 2013 timeframe. While these concerns have not diminished in some areas, the cloud is an increasingly trusted home for far more applications than ever before, and life sciences will not be an exception, especially given the range of tools that make it easier to port, analyze, visualize, and store genomics data. Microsoft has also made clear efforts to tap into the scientific computing market, particularly when many members of its technical computing team were moved under the cloud banner. Azure, like AWS and now Google, has its own suite of tools for genomics workflows and has had a few landmark partnerships and use cases, including work with the UC Santa Cruz Genomics Institute. That work also began early, although it has not received the same kind of attention or produced the number of use cases found on Amazon’s cloud.
For its part, Google has taken steps to make this transition easier by working toward a standard via its membership in the Global Alliance for Genomics and Health. The Google team is contributing to the definition of a new standard API for genomic data. With this, public data, including that of the 1000 Genomes Project, will be available through an API, with an open source software stack surrounding it to make that API easier to use. Further, simple tools like BigQuery will make analysis of existing datasets as easy as a SQL query, with more powerful tools like MapReduce (which are also more complicated for the uninitiated) available for more sophisticated analysis. Of course, Amazon Web Services is also part of the Global Alliance for Genomics and Health, contributing its pieces to the standardization effort that will make using the cloud, any cloud, a simpler process in the face of so many internal formats for genomics data. AWS also has its own suites of tools, very similar to Google’s, for analyzing existing genomics data.
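To give a sense of what analysis "as easy as a SQL query" can look like, here is a minimal sketch that uses Python's built-in sqlite3 module as a local stand-in for a hosted SQL engine such as BigQuery. The table name, columns, and rows are invented for illustration. They are not the schema of the 1000 Genomes data or of any Google or AWS service.

```python
import sqlite3

# Hypothetical toy rows: (sample_id, chromosome, position, reference, alternate).
TOY_VARIANTS = [
    ("sample_a", "17", 41000, "G", "A"),
    ("sample_b", "17", 41000, "G", "A"),
    ("sample_b", "17", 42500, "C", "T"),
    ("sample_c", "13", 32900, "T", "C"),
]


def count_samples_per_variant(rows):
    """Count how many distinct samples carry each variant in the toy table."""
    conn = sqlite3.connect(":memory:")  # local in-memory database, not BigQuery
    try:
        conn.execute(
            "CREATE TABLE variants ("
            "sample_id TEXT, chromosome TEXT, position INTEGER, "
            "reference TEXT, alternate TEXT)"
        )
        conn.executemany("INSERT INTO variants VALUES (?, ?, ?, ?, ?)", rows)
        query = """
            SELECT chromosome, position, reference, alternate,
                   COUNT(DISTINCT sample_id) AS n_samples
            FROM variants
            GROUP BY chromosome, position, reference, alternate
            ORDER BY n_samples DESC, chromosome, position
        """
        return conn.execute(query).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in count_samples_per_variant(TOY_VARIANTS):
        print(row)
```

The single GROUP BY query answers a typical first question about a variant table: which variants are shared across samples. A hosted engine would run the same style of query over far larger tables.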
Just as AWS has its own set of tools that do roughly equivalent things, Google has Bigtable, Spanner, BigQuery, and other frameworks designed for internal Google projects at scale. It is also extending its efforts into Cloud Datalab, its programmatic approach to providing simpler interfaces to large, complex datasets. Further, GATK on Google Genomics is opening new pipelines, allowing users to interact with Spark, Cloud Dataflow, or Grid Engine clusters.
One thing that Google has that AWS and Microsoft do not is an internal initiative, through other projects and investments, to push the capability of its own cloud infrastructure and tooling. Consider Verily, formerly Google Life Sciences, which is now the Alphabet Inc. research arm that extends to genomics and other life sciences areas. While genetic analysis is not the only thing Verily will focus on in coming years, it is one segment, and Google will need to keep bolstering its tools on its own cloud to continue pushing those research efforts to new heights.
While AWS had the head start in life sciences cloud computing, securing big-name users in research and industry, one can expect that Google won’t be far behind.
It is rare to see Google playing catch-up, but when it comes to getting external, large-scale research shops or major drug makers on board, it has work left to do. While Amazon had a head start in the overall public cloud space, that alone does not account for its adoption by life sciences companies and researchers. It has been a broad effort, one that started with an emphasis on the overall needs of scientific and technical computing and then narrowed in on specific segments. As of now, two reference customers are listed in the Google Genomics resources list. AWS has more than can be counted easily, many of whom have spoken at events and on the record about how they are using the cloud for their genomics workflows.
Of course, with AWS getting its footing in 2006 and Google’s cloud platform only reaching general availability in 2014 (although its App Engine services came onto the scene in 2008), differences are to be expected. We will be watching this year for companies that choose one cloud over another to find out why, if the price is approximately the same, such decisions are made.