Google Fires Up Spark, Hadoop Service On Its Cloud
September 23, 2015 Timothy Prickett Morgan
If Google knows a lot about one thing, it is that developers do not want to spend a lot of time setting up infrastructure just so they can create applications that chew on lots of data to run the business.
While many of us admire Google for its systems and datacenter designs, the thing that truly makes Google powerful is its ability to quickly develop new technologies and make them easy for programmers to make use of. This is the nature of the Borg cluster controller and the many frameworks that work in conjunction with it.
You don’t have to work for Google to be like Google, however. Ahead of the Strata + Hadoop World conference next week in New York, Google has announced Cloud Dataproc, which packages up the combination of Hadoop storage and Spark in-memory analytics and offers it as a service. Cloud Dataproc – which is short for data processing, of course, a very old school term for what we are all still doing – is in beta testing now; the timing of the commercial release of the service has not been revealed.
The pricing for Cloud Dataproc is sure to get the attention of people who are thinking about setting up Hadoop or Spark analytics clusters, which is a pain in the neck and which takes experts to maintain. Google is charging a penny per virtual CPU per hour in the virtual clusters it manages on behalf of Cloud Dataproc users. (This is in addition to charges for Compute Engine instances, network bandwidth, and storage that customers incur to set up the clusters.) Cloud Dataproc can run on regular reserved and on-demand instances, as well as the new preemptible instances that Google debuted a few weeks ago. Pricing is rounded to the nearest minute with a ten-minute minimum billing period.
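To make the billing rule concrete, here is a minimal Python sketch of how the managed-service fee might be computed from the figures above: one cent per billed unit per hour, runtime rounded to the nearest minute, and a ten-minute minimum. The function names, the half-minute-rounds-up choice, and the `billed_units` parameter are assumptions made for illustration, not Google's published billing logic, and the result excludes the Compute Engine, network, and storage charges.

```python
from decimal import Decimal

# Figures taken from the article; everything else here is illustrative.
FEE_PER_UNIT_HOUR = Decimal("0.01")  # dollars per billed unit per hour
MIN_BILLED_MINUTES = 10              # ten-minute minimum billing period


def billed_minutes(runtime_seconds: int) -> int:
    """Round runtime to the nearest minute (assumption: halves round up),
    then apply the ten-minute minimum."""
    nearest = (runtime_seconds + 30) // 60
    return max(MIN_BILLED_MINUTES, nearest)


def dataproc_fee(billed_units: int, runtime_seconds: int) -> Decimal:
    """Managed-service fee in dollars, excluding the underlying
    Compute Engine, network, and storage charges."""
    minutes = billed_minutes(runtime_seconds)
    return FEE_PER_UNIT_HOUR * billed_units * minutes / Decimal(60)


if __name__ == "__main__":
    # A hypothetical cluster with 12 billed units running for one hour.
    print(dataproc_fee(12, 3600))
    # A 90-second test run is still billed for the ten-minute minimum.
    print(billed_minutes(90))
```

Using `Decimal` rather than floats keeps the cent-level arithmetic exact, which matters when the fee is this small relative to the instance charges.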
Speed is also something that Google is trying to sell. In a blog post announcing the Spark/Hadoop service, James Malone, a product manager for the company’s Cloud Platform public cloud, says that Google can start a Hadoop or Spark cluster within 90 seconds, and scale it up or shut it down in around that time – or less. Google has chosen to run Hadoop and Spark atop the Debian 7 distribution of Linux inside Cloud Platform to create the service; presumably it is using the open source Apache versions of the Hadoop and Spark stacks.
“One of the big features of Dataproc is that we are trying to answer the needs of the customers that want more control over a cluster,” Malone tells The Next Platform. “Customers have their data and workflows on Hadoop and Spark, but they don’t want to spend hours setting up VMs, distributing the binaries, and so forth. Open source software is great, but it usually comes with hundreds of switches and knobs to finely tune and dial. The thing we tried to do with Dataproc is have one big switch and it is On, and when you are done you can turn it off.”
Malone says that the Cloud Dataproc service can scale from three to hundreds of virtual nodes, and integrates with Google Cloud Storage, the object storage service that works with Compute Engine. (The default replication for data is set at two, not the normal three used in Hadoop clusters.) This storage has a mix of disk drives and flash SSDs to ensure good performance, and importantly, if data is stored in Cloud Storage, even when you turn off the Cloud Dataproc cluster, this data will persist to be used again. So, in a sense, instead of the ephemeral storage that was common in the early days of cloud computing, this can be thought of as ephemeral compute with persistent storage. Malone tells us that customers can set up HDFS on local disk or flash storage in their compute instances, but once the cluster is turned off, the data is gone. Google recommends putting HDFS on Cloud Storage for this very reason.
That said, the Cloud Dataproc cluster will also persist as long as you pay for the Compute Engine instances underpinning this virtual Hadoop and Spark cluster. Google does warn that a Cloud Dataproc cluster is limited to a maximum of 24 CPUs and 240 VM instances, like other Compute Engine resources, and you have to ask for any capacity above and beyond that. If you need to goose performance, you can pick Compute Engine instances with local SSDs.
Cloud Dataproc doesn’t just support MapReduce batch jobs and Spark in-memory processing, but can also support applications that run atop the Hive data warehouse layer atop the Hadoop Distributed File System as well as applications that use the Pig scripting tool to parallelize their MapReduce queries. Any of the languages that are supported by Hadoop and Spark – Java, Scala, Python, and R – are supported with the Cloud Dataproc service. The service does not yet have an official service level agreement, but will when it becomes generally available. Cloud Dataproc is available across all regions and zones in the Google public cloud. Malone was not at liberty to say when Cloud Dataproc would be generally available.
The Cloud Dataproc stack has been in alpha testing for a few months, and launches in beta today supporting Hadoop 2.7.1 and Spark 1.5. Programmers can create and destroy Cloud Dataproc clusters from a set of APIs, from the Google Cloud SDK, or from the Google Developers Console.
Google does not expect customers to necessarily use Cloud Dataproc instead of its BigQuery and Cloud Bigtable services, which we reported on here. (Cloud Bigtable is a service that mirrors the internal database overlay for the Google File System that inspired HBase, and BigQuery is an ad-hoc query service for read-only work that Google created and that inspired, among other things, Apache Drill and Cloudera’s Impala.) Because Cloud Dataproc can be linked to Cloud Bigtable and BigQuery, it can be used, for instance, to do the data massaging before log files and other telemetry coming out of systems and applications are dumped into services like BigQuery and Cloud Bigtable and then perhaps visualized in tools like Tableau. Malone says that some customers in the alpha program for Cloud Dataproc were looking for more scale or lower costs for their Hadoop and Spark infrastructure, and some were looking to make it easier for developers to spin up a cluster to test out their data and algorithms.
Google rival Amazon Web Services put its Elastic MapReduce service out in beta to offer Hadoop as a service in April 2009, and it was available for production use shortly after that. AWS offers the Apache distribution of Hadoop on its virtual clusters, and also allows customers to use the Hadoop distribution from MapR Technologies. Amazon’s pricing is quite a bit higher for the EMR managed service than what Google is charging for Cloud Dataproc, varying along with the cost of the underlying EC2 instances that it makes use of. EC2 instances run from 4.4 cents per hour for an m1.small instance to a high of $5.52 per hour for a d2.8xlarge storage optimized instance; the EMR service adds 1.1 cents to 27 cents per hour on top of these EC2 fees. (Those are on-demand instances in the US East region; you can lower the cost with reserved instances.) Amazon just added support for Spark to EMR back in June, and in July revamped EMR with a 4.0 release that includes Hadoop 2.6.0, Spark 1.4.1, Hive 1.0, and Pig 0.14.
Microsoft similarly has a Hadoop service on its Azure cloud, which it calls HDInsight and which is based on the Hortonworks Data Platform distribution of that analytics platform. In addition to Hadoop, Microsoft supports Spark in-memory processing and Storm stream processing add-ons. Customers can deploy HDInsight on Windows or Linux on the Azure cloud, and Microsoft bundles the compute and the managed service into a single price, which ranges from 8 cents per hour for an A1 instance to $1.41 per hour for an A7 instance. If you buy these instances raw running Linux on Azure, they cost from 1.8 cents per hour for the A1 instance to $1 per hour for an A7 instance. So Microsoft is similarly charging a much higher premium for the managed Hadoop/Spark service than Google.
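The size of that premium can be worked out from the Azure figures quoted above. The short Python sketch below subtracts the raw Linux instance price from the bundled HDInsight price for the A1 and A7 instances; the function and variable names are illustrative, and the prices are simply the hourly figures cited in this article.

```python
from decimal import Decimal


def premium(bundled: Decimal, raw: Decimal) -> tuple[Decimal, Decimal]:
    """Return (surcharge in dollars per hour, surcharge as a percent of raw)."""
    surcharge = bundled - raw
    return surcharge, surcharge / raw * 100


# Hourly prices in dollars as quoted in the article: (HDInsight, raw Linux).
azure_prices = {
    "A1": (Decimal("0.08"), Decimal("0.018")),
    "A7": (Decimal("1.41"), Decimal("1.00")),
}

if __name__ == "__main__":
    for name, (bundled, raw) in azure_prices.items():
        surcharge, percent = premium(bundled, raw)
        print(f"{name}: +${surcharge}/hour ({percent:.0f}% over raw)")
```

On these figures the managed service adds 6.2 cents per hour to an A1 instance and 41 cents per hour to an A7.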