By the time Ashish Thusoo left Facebook in 2011, the company had grown to around 4,000 people, many of whom needed to access the roughly 150 petabytes of data—quite a hike from the 15 petabytes his team was trying to wrench from data warehouses in 2008.
Watching this evolution in volume (and the accompanying “V’s” in the large-scale data equation) got Thusoo and his team at Facebook thinking about accessibility and usability. After all, what good was all of that information when the number of internal gatekeepers was limited and the company had an ever-growing set of use cases and products to build from it? “When that fifteen petabytes or so was in the data warehouses, there were a total of six people who were actually making use of that data. By the time I left, between 20 and 30 percent of the company was hitting the data analytics infrastructure each month,” Thusoo explained to The Next Platform.
This widespread access and usability problem spurred the creation of Hive, which opened the possibility of using SQL across very large datasets, something much of Facebook’s existing workforce was already familiar with. The Hive effort spearheaded by Thusoo and a small team inside Facebook grew quickly, and development of Hive continued well after he left the company in 2011. He departed along with another Facebook data engineer, Joydeep Sen Sarma, to found the analytics-as-a-service company Qubole, which to date has gathered $20 million in investment to feed the data needs of companies like Pinterest, which uses Qubole as the management engine for its Hadoop stack.
Thusoo says he had a clear sense of what it meant to deliver large volumes of data to a wider set of users using cloud-based resources and a self-managing approach that removes some of the overhead of deploying and managing MapReduce, Hive, Spark, and other open source frameworks. Before Facebook he worked at Oracle as a technical lead for six years, moving next to Identity Engines, where he developed more usable database frameworks. By the time he got to Facebook, with its emphasis on data warehouses, the problem was not just one of user access; there were also hardware challenges.
“When we started most processing at Facebook, we had the standard Teradata and Oracle boxes, although there were some emerging companies like Vertica and Aster Data starting then. The approach for a long time was get more performance outside of single boxes—adding more memory, disks, more powerful CPUs. But by the time I left we went from those big $20,000 machines to $5,000 nodes because we had learned to scale on commodity hardware.”
The coupled lessons from moving to a commodity scale-out approach and to open software that emphasized accessibility and scalability got Thusoo and the early Qubole team thinking about how Amazon, Google, and Microsoft cloud offerings were primed for a new generation of data analytics users at scale. The problem was one of deployment and manageability, something they have worked on since 2011, including building their own middleware, based on the MIT-developed StarCluster package, to manage the various policies and jobs. From there, users are able to tap into Qubole’s framework to run Hive, Presto, MapReduce, and Spark jobs on a cloud infrastructure provider of their choice without dealing with the gooey middleware and management.
The value, too, Thusoo says, is that they are able to remove some of the cost overhead by using hands-off services like autoscaling and spot pricing, both of which are available in various forms on each of the cloud providers’ platforms. The demand now at Qubole, perhaps not surprisingly given the founders’ histories, is for Hive as a service. But Thusoo says that Spark is gathering use cases as more users seek to run complex machine learning applications in a scalable manner without dealing with the provisioning headaches of setting up their own on-site clusters.
One of their largest users, Pinterest, has a total of 3,000 nodes on site but is also a cloud convert that has been using Amazon’s Elastic MapReduce (EMR) for Hadoop workloads. The problem, however, is that as it scaled past several hundred nodes using EMR, it began hitting some issues. Pinterest looked to Hive to start to address some of those problems, in part because SQL was more familiar to a growing base of internal users, but Elastic MapReduce has a proprietary variant of Hive. According to Qubole, “The company had already built so many applications on top of EMR that it was hard for it to migrate to a new system. Pinterest also didn’t know what it wanted to switch to because some of the nuances of EMR had crept into the actual job logic. In order to experiment with other flavors of Hadoop, Pinterest implemented an executor abstraction and moved all the EMR specific logic into the EMRExecutor. This gave Pinterest the flexibility to experiment with a few flavors of Hadoop and Hadoop service providers, while enabling us to do a gradual migration with minimal downtime.”
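The executor abstraction described in that account can be illustrated with a minimal Python sketch. This is not Pinterest's code: only the name EMRExecutor comes from the quote, while HiveJob, Executor, OtherHadoopExecutor, the EXECUTORS registry, and run_job are hypothetical names introduced here, and the submit methods return placeholder strings instead of calling any real service.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class HiveJob:
    """Platform-neutral job description (hypothetical fields)."""
    name: str
    query: str


class Executor(ABC):
    """Interface that job logic depends on; platform details stay behind it."""

    @abstractmethod
    def submit(self, job: HiveJob) -> str:
        """Submit the job and return a description of what was submitted."""


class EMRExecutor(Executor):
    # All EMR-specific handling would live here. This placeholder only
    # formats a string; it does not call any real EMR API.
    def submit(self, job: HiveJob) -> str:
        return f"[emr] {job.name}: {job.query}"


class OtherHadoopExecutor(Executor):
    # Hypothetical second backend, standing in for another Hadoop flavor
    # or service provider being evaluated.
    def submit(self, job: HiveJob) -> str:
        return f"[other] {job.name}: {job.query}"


# Registry mapping a backend name to its executor class (assumed design).
EXECUTORS = {"emr": EMRExecutor, "other": OtherHadoopExecutor}


def run_job(job: HiveJob, backend: str) -> str:
    """Run a job on the named backend without the caller knowing its details."""
    executor = EXECUTORS[backend]()
    return executor.submit(job)


if __name__ == "__main__":
    job = HiveJob(name="daily_counts", query="SELECT COUNT(*) FROM events")
    print(run_job(job, "emr"))
    print(run_job(job, "other"))
```

Because callers depend only on the Executor interface, individual jobs could in principle be pointed at a different backend one at a time, which is one plausible way to get the gradual migration the quote mentions.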
“If you think about scale and management, there is really no reason why anyone would want to do this with on-site clusters. It is not a simple thing, and while it might be something users do when they are playing around or testing something, when they roll it out into production, they need the management and scalability capabilities that we can provide across a number of services, not just Hive or Presto or Spark.”
Thusoo agrees with IBM’s assertion that the next big thing to hit the analytics space beyond Hadoop and MapReduce is Spark, and he says that is where they are expecting to see momentum in the cloud-based analytics space in the coming years.