Mesos Clusters Growing To Monster Sizes
March 24, 2016 Timothy Prickett Morgan
Scale is in the eye of the beholder, and it depends on the nature of the workload. The fault domain for particular kinds of work does not have to be large, but a general purpose cluster controller exists to span the enterprise, so it not only needs to scale, but companies have to trust it enough to actually employ that scale.
This seems to be happening with the Data Center Operating System, or DCOS, from Mesosphere, which is the commercialized version of the open source Apache Mesos cluster controller. Mesosphere has said from the beginning that it is targeting Global 2000 companies, which are the ones with the most complex workloads and, not coincidentally, the fattest wallets. And the strategy seems to be working, albeit there are always companies (such as Bloomberg) that will roll their own Mesos because they have the skills to do so. But most enterprises are not looking for that. They are more like Verizon, which wants to build infrastructure with the commercial DCOS in control of allocating work to it.
The idea behind Mesos is to bring Google-style, automated infrastructure to enterprises. The Mesos kernel abstracts compute, memory, storage, networking, and other resources on cluster server nodes and presents them as raw resources to many different kinds of application frameworks, much as Google’s Borg container management and cluster controller does. The Mesos kernel knows how to automatically scale applications as they require more resources and to free those resources up when they don’t, which allows for much tighter bin packing of applications on the cluster – pushing utilization 2X or 3X higher than is possible using other tools, and in some cases companies are seeing utilization as high as 75 percent to 95 percent on their clusters, depending on the workloads. This, in a world where 10 to 15 percent utilization on a server is the norm.
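To make the bin packing point concrete, here is a minimal sketch in Python. It is not how Mesos actually schedules work: it uses the textbook first-fit decreasing heuristic, and the task sizes and the eight-core node capacity are made-up numbers chosen only for illustration. It compares giving every task its own server with packing the same tasks onto shared nodes.

```python
def first_fit_decreasing(demands, capacity):
    """Return how many nodes a first-fit decreasing pass needs.

    This is a generic textbook heuristic, used here only as a stand-in
    for a real cluster scheduler.
    """
    free = []  # remaining capacity on each node opened so far
    for demand in sorted(demands, reverse=True):
        if demand > capacity:
            raise ValueError("task does not fit on any single node")
        for i, room in enumerate(free):
            if demand <= room:
                free[i] = room - demand  # place task on the first node with room
                break
        else:
            free.append(capacity - demand)  # no node had room, open a new one
    return len(free)


def utilization(demands, capacity, nodes):
    """Fraction of total cluster capacity that the tasks actually consume."""
    return sum(demands) / (capacity * nodes)


# Hypothetical workload: CPU cores needed by six tasks, on 8-core nodes.
tasks = [4, 3, 3, 2, 2, 1]
node_capacity = 8

dedicated_nodes = len(tasks)  # one task per server, the siloed approach
packed_nodes = first_fit_decreasing(tasks, node_capacity)

dedicated_util = utilization(tasks, node_capacity, dedicated_nodes)
packed_util = utilization(tasks, node_capacity, packed_nodes)

print(f"dedicated: {dedicated_nodes} nodes at {dedicated_util:.1%} utilization")
print(f"packed:    {packed_nodes} nodes at {packed_util:.1%} utilization")
```

With these invented numbers the same work fits on a third of the servers, which is the kind of gap the utilization figures above describe; real clusters have to juggle memory, storage, and network constraints as well, so the gains vary with the workload.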
Given the radical increase in efficiency of server use, the large cluster sizes that Mesosphere is seeing with commercial customers are perhaps initially surprising, but the point has always been for Mesos to become the operating system for the datacenter. (Something VMware wants to be, too, and that Microsoft and Red Hat do as well. You could argue that Google wants the same thing with Kubernetes and OpenStackers want it as well.)
“We have some smaller paid proof of concept work we do in advance of a commercial deal, but let’s call those the exception,” says Matt Trifiro, senior vice president of marketing at Mesosphere. “With commercial deals, the average is for thousands of nodes, and in general probably at least an order of magnitude bigger than your typical Hadoop deployment. And that is one of the reasons why our investors are so excited about us. The variety of workloads that can run on the DCOS encompasses almost every workload, and these are very large deployments.”
From our discussions with Hadoop customers and suppliers of commercial Hadoop distributions like Cloudera, Hortonworks, and MapR Technologies, a typical Hadoop proof of concept might have a few dozen nodes and maybe grow to 50 or 100 nodes in production, with a few outliers having 1,000 or even 2,000 nodes. The funny thing, of course, is that large enterprises – the very companies that make up the base that Mesosphere is aiming for with DCOS – have deployed Hadoop in production in many lines of business and now have five, ten, or fifteen siloed Hadoop clusters, each with its own datasets. Now they are wondering how to bring it all together and make better use of the hardware and data resources. This is precisely where a tool like DCOS, which allows for disparate bare metal and containerized applications to be run side by side, sharing the resources of a cluster, is most needed.
So to our way of thinking, if the average cluster size of commercial DCOS deployments was not in the thousands, this would be a pretty big indicator that companies were not understanding the value of Mesos or trusting it yet to manage diverse workloads across large clusters.
The other interesting thing is that, unlike with VMware server virtualization and OpenStack cloud controllers, companies are not just picking a greenfield part of their workloads to give this new approach a try when it comes to DCOS. They are using it to prop up and automate existing workloads. And thanks to tweaks to the Mesos stack that will allow stateful applications like databases and the legacy applications that talk to them to be encapsulated by Mesos, the size of clusters looks like it will grow even more. Up until now, Mesos has been generally used for stateless web-style applications where the data is not so tightly coupled to the applications.
“We see the DCOS increasingly embracing traditional workloads alongside the modern, stateless, containerized workloads,” says Trifiro. “You can deploy MySQL or PostgreSQL and other applications today, and that is just the start. We will increasingly be able to embrace traditional, legacy applications with this single control plane.”
The initial support for such stateful applications is coming with the 1.0 release of Marathon, which is an application framework and container orchestration layer created by Mesosphere to manage long running jobs on top of Mesos (as opposed to the Chronos framework, which is used for short-running jobs, batch workloads, and ETL pipelines). At this point, most of the applications running atop DCOS are Hadoop batch analytics and Spark real-time analytics with Kafka messaging and HDFS and Cassandra datastores. These comprise about 80 percent of the current workloads being deployed by enterprises, and the other 20 percent is a hodge-podge of things.
By the way, each of these frameworks for running Spark, Kafka, Cassandra, and so on will have its own scalability limits, both technical ones and tested ones that provide a certified support level that enterprise customers will pay for. Marathon is a general purpose container framework used for homegrown applications, and it has been scaled up above 10,000 nodes in production, according to Trifiro. (Twitter, which does not use the commercial grade DCOS but, as one of the creators of Mesos, has its own tweaked version, has Mesos spanning more than 80,000 nodes.)
This week, Mesosphere is rolling out the 1.0 release of Marathon, and it has taken about a year and a half to extend the framework so it could handle stateful applications like databases much as it supports Docker containers and Marathon’s own container format that is based on Linux cgroups. The other big change is multi-tenant authentication and authorization support in Marathon, which will not only allow mixed workloads to run on clusters, but will also restrict access so that only those authorized for specific workloads can be monkeying around with them on the cluster.
In addition to the rollout of the upgraded Marathon, Mesosphere is still working on getting its Infinity stack of Spark, Kafka, and Cassandra software, which allows autoscaling and concurrent running atop DCOS, to general availability; this software was previewed last summer and is almost ready for primetime. Mesosphere is also previewing a new tool called Velocity, which is a variant of the open source Jenkins continuous application development platform that has been tweaked so it runs on top of Marathon and can be scaled out horizontally or scaled up vertically on server nodes and integrated into the DCOS stack so development and production workloads can exist on the same cluster. Mesosphere is giving Velocity to early access customers now and will have it generally available later this year.
The other big news is that Mesosphere has closed its Series C round of funding, this time bringing in $73.5 million with Hewlett Packard Enterprise and Microsoft, two big partners who are pushing production-grade Mesos, leading the way. Microsoft has tapped Mesos to be the tool for its own container management system on the Azure cloud, and HPE is looking to push Mesos into enterprise datacenters and possibly HPC centers and certainly cloud builders across its vast base of server customers. To date, Mesosphere has raised $125.9 million in funding, including backing from Andreessen Horowitz, Khosla Ventures, Fuel Capital, A Capital, and Triangle Peak Partners.
Mesosphere has a little more than 150 people on its staff now, with development offices in San Francisco and Hamburg, Germany. The company has a field sales team, which is mostly spread around North America where the lowest hanging fruit is for DCOS. Trifiro says that most of the funds in the Series C round will be used for software engineering to further flesh out DCOS, but that some of it will be dedicated to building out its sales force and technical support teams as it takes on more and more enterprise customers. He adds that Mesosphere is “generating meaningful revenues” from DCOS licensing, which costs “single digit thousands of dollars per node.”
With something on the order of 35 million server nodes in the world, and maybe half of them at large enterprises, there is a very big total addressable market to chase. Indeed, using a tool like Mesos could be like skipping a server upgrade cycle or two – and that will be particularly appealing to penny-pinching large enterprises.
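For a rough sense of what that addressable market could look like, here is a back-of-the-envelope sketch in Python. It takes the 35 million node estimate and the half-at-large-enterprises guess from above, and it reads “single digit thousands of dollars per node” as a range of $1,000 to $9,000, which is our assumption and not a Mesosphere price list. It also assumes, unrealistically, that every one of those nodes gets licensed, so treat the result as a ceiling and not a forecast.

```python
# Figures from the article; the price range is our own reading of
# "single digit thousands of dollars per node".
TOTAL_NODES = 35_000_000
ENTERPRISE_SHARE = 0.5            # "maybe half of them at large enterprises"
PRICE_LOW, PRICE_HIGH = 1_000, 9_000  # assumed dollars per node

enterprise_nodes = int(TOTAL_NODES * ENTERPRISE_SHARE)

# Upper-bound license value if every enterprise node ran DCOS.
tam_low = enterprise_nodes * PRICE_LOW
tam_high = enterprise_nodes * PRICE_HIGH

print(f"{enterprise_nodes:,} enterprise nodes")
print(f"${tam_low / 1e9:.1f} billion to ${tam_high / 1e9:.1f} billion")
```

Whether that per-node price is annual or perpetual was not disclosed, and higher utilization means customers need fewer nodes for the same work, so the real opportunity is some fraction of this range.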