Teaching Grid Engine To Speak Mesos
April 21, 2015 Timothy Prickett Morgan
The HPC and hyperscale camps do not always get along, but they are often trying to solve similar problems when it comes to application scheduling and resource management on clusters. With the hyperscalers, one might say that there is a tendency to reinvent some wheels that have already been put on the road in the HPC market. But that does not mean that products aimed at one segment cannot work in conjunction with those of the other segment to better manage clusters and their distributed applications. So it is with Univa grafting support for Mesos frameworks onto the popular Grid Engine cluster workload scheduler.
Mesos is an open source management tool for Linux clusters that takes inspiration from the Borg cluster and workload management tool created by Google over the past decade to manage the jobs running on its vast fleet of servers. Mesos was initially created six years ago at the AMPLab at the University of California at Berkeley, which gets funding from Google. The lab was working on ways to do better job scheduling on multicore processors, and then reckoned the lessons that it learned on a single chip could be applied across clusters as well. Google nudged here, gave some advice there, and before too long the Mesos framework was sophisticated enough that Twitter and Airbnb could further develop it and put it into production for managing clustered analytics workloads. In 2012 Mesos became an Apache project, and last year Mesosphere, the entity that is spearheading development of Mesos and offering commercial-grade support for it, was founded by Tobi Knaup and Florian Leibert, software engineers at Airbnb and Twitter, respectively, who have lots of experience with clustered hyperscale applications. Mesosphere dropped out of stealth mode last year, has raised $48.8 million in three rounds of venture funding, and is quickly positioning itself at the hub of modern Linux clusters as it tries to peddle Mesos to the Global 2000. eBay, PayPal, Groupon, Netflix, OpenTable, HubSpot, Salesforce.com, Vimeo, Conviva, and Best Buy are all running Mesos in production.
Both Grid Engine and Mesos offer a kind of virtualization that is different from that provided by VMware. With a bare metal hypervisor like ESXi, VMware is carving up a giant wonking X86 server into virtual slices so multiple operating systems and applications can be loaded up onto that single machine to drive up the overall utilization. The idea is to increase efficiencies in the datacenter by having fewer machines running at as close to peak utilization as makes sense. (Most servers run out of memory bandwidth or I/O bandwidth long before they run out of compute, but don’t tell server makers that.) With Grid Engine, Mesos, and other distributed computing management tools, the applications themselves are designed to be distributed across many systems, so the workload scheduler has to allocate resources for multiple jobs running across server nodes in a cluster.
Grid Engine traces its roots back more than twenty years, and Fritz Ferstl, CTO at Univa, has been spearheading the development of various grid software tools since that time. At this point, the open source variant of Grid Engine has over 10,000 users worldwide, and Gary Tyreman, CEO at Univa, tells The Next Platform that about half of the licenses of the open source version of Grid Engine are in enterprise accounts rather than in supercomputing centers. Univa now has 400 paying customers around the world, and Tyreman says that about a quarter of the company’s revenues are coming from non-HPC shops. This is essentially the trajectory that The Next Platform would expect. A tool that was created for HPC or hyperscale shops gets widely distributed as open source, and eventually enterprises pick up on it and deploy it. Then, after they get to a certain level of sophistication, enterprises want technical support and perhaps other features that are not part of the open source distribution and that are, generally, aimed at enterprise customers with the budget to pay for those features. What Univa has accomplished with Grid Engine, Mesosphere is trying to do with Mesos.
Univa is fully aware of the growing interest in Mesos and the war chest that Mesosphere has amassed to commercialize its quasi-clone of Google’s Borg. Univa has one thing that Mesosphere does not have – decades of experience with enterprise customers running a broad mix of real-world workloads – but the Mesos community has also created a sophisticated set of frameworks for abstracting away the underlying cluster from MapReduce, Spark, Accumulo, and other distributed applications. And so Ferstl and the team at Univa have done something clever, and that is to grab the Mesos API stack that allows the Mesos frameworks to talk down to the Mesos tool for resource allocation and put it on top of the Grid Engine workload scheduler for clusters. What this means, in plain English, is that workloads that have been tweaked to run in a framework on top of Mesos will now run unchanged on top of Grid Engine.
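To see what it is that Univa had to reimplement, it helps to sketch the two-level scheduling handshake at the heart of Mesos: the resource-allocation layer hands each framework "offers" of spare node capacity, and the framework accepts or declines them. The toy below is a loose, self-contained illustration of that flow; the `resource_offers` callback echoes the shape of the real Mesos scheduler API, but the classes, names, and the stand-in dispatch function are ours, not libmesos.

```python
# Toy sketch of the Mesos resource-offer handshake that Universal
# Resource Broker reimplements on top of Grid Engine. Names here are
# illustrative; this is not the real Mesos binding.

class Offer:
    """A resource offer: what one agent node is willing to lend out."""
    def __init__(self, agent, cpus, mem_mb):
        self.agent, self.cpus, self.mem_mb = agent, cpus, mem_mb

class SparkLikeScheduler:
    """A framework scheduler: accepts any offer big enough for one task."""
    def __init__(self, task_cpus, task_mem_mb):
        self.task_cpus, self.task_mem_mb = task_cpus, task_mem_mb
        self.launched = []

    def resource_offers(self, offers):
        """Called by the allocation layer with a batch of offers."""
        for offer in offers:
            if offer.cpus >= self.task_cpus and offer.mem_mb >= self.task_mem_mb:
                self.launched.append(offer.agent)  # accept: launch a task here
            # otherwise the offer is implicitly declined and returned

def broker_dispatch(scheduler, cluster):
    """The allocation side: package the cluster's spare capacity as offers
    and hand them to the framework, as Mesos (or the broker) would."""
    scheduler.resource_offers([Offer(a, c, m) for a, c, m in cluster])
    return scheduler.launched

# A framework needing 4 cores / 8 GB per task sees three nodes; node2 is
# too small, so tasks land on node1 and node3.
placements = broker_dispatch(
    SparkLikeScheduler(4, 8192),
    [("node1", 8, 16384), ("node2", 2, 4096), ("node3", 4, 8192)])
```

Because the framework only ever sees offers and replies with launch decisions, anything that can generate credible offers – Mesos itself, or Grid Engine fronted by Universal Resource Broker – can sit on the other side of that API without the framework noticing.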
The layer of software that implements the Mesos APIs on top of Grid Engine is called Universal Resource Broker, and among other things it allows these new-fangled applications to run side-by-side with established HPC and data analytics applications that have long since been tuned to run on top of Grid Engine clusters. Grid Engine generally runs on Linux clusters as well, but it also supports Unix, and with the 8.2.0 release that came out last September, Windows Server became a supported deployment platform. Univa also added support to deploy and manage applications running in Linux containers (based on cgroups) to Grid Engine with that 8.2.0 release, and is working on support for deploying into Docker containers right now. (It looks like Universal Resource Broker will be the front end to manage Docker containers.)
The move by Univa to support Mesos frameworks is reminiscent of what happened with the open source MySQL database. The database comes with its own storage engine, but the code is pluggable, so storage engines can be swapped in and out without having to tweak the database management system itself or the applications that talk to it. This is not a perfect analogy, we realize. Perhaps it is more like companies adding NoSQL data stores or clustered file systems like Lustre and Gluster underneath the MapReduce layer of the Hadoop batch analytics platform. The MapReduce applications have no idea they are not talking down to the Hadoop Distributed File System, but end users sure realize that performance is a whole lot better.
“This is a way for us to bring more workloads into the Grid Engine cluster,” explains Tyreman, but the Universal Resource Broker add-on to Grid Engine is also a way to allow enterprise customers who are already running Grid Engine to more easily adopt applications that have been tuned up for Mesos frameworks and distributed computing across clusters. “We have implemented the Mesos API and have done a pretty substantial amount of work around it that tucks it into the dynamic partitioning in a Grid Engine cluster.”
Tyreman says that there are several benefits to using Grid Engine and Universal Resource Broker together to host applications written for Mesos frameworks. First, Grid Engine’s throughput is a proven, known quantity, and it is able to scale across several hundred thousand cores with many thousands of users banging away every day on a queue that can have as many as 5 million jobs a day in it. “We are used to starting, running, and stopping workloads,” brags Tyreman. “We have advanced policy control, so enterprises can be a little more structured in how they use shared infrastructure. And because Grid Engine runs all of the workload, we can do all of the accounting for the lifecycle of applications. We also start applications faster with a Mesos framework than Mesos can. So to net it all out, you get the benefits of a very mature, enterprise product for these Mesos workloads.”
The difference between running Mesos natively on a cluster and putting Mesos frameworks on top of Grid Engine is subtle. And Tyreman tips his hat to the work that the Mesosphere community has done. But Univa understandably thinks that Grid Engine is a more robust workload scheduler than the one at the heart of Mesos, and Universal Resource Broker will put that idea to the test.
“We think that what Mesos is doing is very interesting and we agree with some of the long-term vision about the world going distributed,” Tyreman explains. “And arguably the technologies like ours are distributed operating systems, but we tend to not call them that because it can mean different things to different people. You call it workload optimization because that is what the customer gets out of it. For instance, we are seeing customers rewriting applications from MATLAB to Spark, and they get an immediate performance improvement because they can distribute Spark across a cluster. They might reduce their wall clock time by a factor of 200. The challenge with Mesos today as an environment is that workloads can take over an entire cluster, and that is not going to be acceptable to any enterprise. We would get skinned alive if an engineer could take over 50,000 cores and do whatever he wants. With Grid Engine’s central management, we can take that ported MATLAB workload now running on Spark and limit it to only a few thousand cores. We can fire up multiple instances of Spark for multiple users so no single user can hog too much resource. So Grid Engine can fire up Mesos work faster, manage resources better, and enforce the business service level agreements.”
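The per-user core cap Tyreman describes maps naturally onto Grid Engine’s resource quota sets, which are edited with `qconf -mrqs`. A hypothetical quota along those lines, loosely following the `sge_resource_quota` syntax (the rule name, queue name, and slot count here are illustrative, not from Univa), might look like:

```
{
   name         spark_core_cap
   description  "Cap any one user's Spark jobs at 2,000 slots"
   enabled      TRUE
   limit        users {*} queues spark.q to slots=2000
}
```

The `users {*}` form applies the limit to each user individually rather than to all users in aggregate, which is the behavior Tyreman is describing: no single engineer can grab the whole cluster, but everyone gets their own allotment.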
Tyreman adds that the Mesos scheduler has issues when multiple frameworks ask for cluster resources at the same time, and says that Grid Engine long ago figured out how to resolve such contention on large clusters with a diverse mix of workloads and SLAs.
Universal Resource Broker is an add-on for Grid Engine, and customers who want to do bare-metal provisioning of nodes in the cluster to run applications tuned up for Mesos frameworks can do so with another Univa product called UniCloud, which leverages the open source Puppet configuration management tool to do this. Grid Engine costs $100 per core per year to license, and Universal Resource Broker costs an additional $50 per core per year. Pricing in practice is a bit more nuanced than that, though. “We are already running enterprise workloads with massive scale and throughput,” says Tyreman. “I don’t know many businesses that are going to run Spark or MapReduce or Accumulo at the full scale of their Grid Engine clusters.” The broker can be used on a subset of the cluster. (Volume pricing scales that per-core cost for Grid Engine and Universal Resource Broker way down as the core counts go up, obviously.) Universal Resource Broker is in production testing with early adopter customers and is available now. It requires Grid Engine 8.2.0 or higher to run.