Verizon Satisfies Google Envy With Mesos
September 2, 2015 Timothy Prickett Morgan
No business can avoid being inspired, if not somewhat alarmed, by the vast infrastructure that the hyperscalers of the world have put together in the past decade and their ability to squeeze efficiencies out of their datacenters. Even Verizon, which like its telecom and hosting peers is not new to building vast fabrics of servers, storage, and networks and keeping them running under arduous conditions, is learning that building a more automated network is perhaps better than building a bigger one.
For a little more than a year, as we have seen the rise of cluster management and software container systems that are inspired by, if not derived from, the infrastructure created by search engine giant Google, Verizon Labs, the research and development arm of the $127 billion communications giant, has been paying close attention to the new cluster automation tools. Larry Rau, director of technology at Verizon Labs, has been weighing the company's options for building more modern infrastructure. After much examination and testing, Verizon Labs has adopted the Mesos cluster management and application framework and is beginning to roll it out underneath various services.
Telecom companies are a mix of conservative and adventurous when it comes to technology, and they always have been, because before the hyperscalers came along, the switched networks and billing systems for the phone networks were among the highest scale workloads on the planet. (There is a reason why the C compiler and the Unix operating system came out of AT&T Bell Labs.) But historically, telecom companies have had to over-engineer their networks to ensure high availability, and this locks up vast amounts of capital and produces silos of compute and storage that can have very low utilization and therefore not very high efficiency. Verizon wants to change that, and adopting Docker software containers and Mesos to manage those containers as well as the underlying server clusters is the path that the company has chosen. (This is a story that we expect to hear again and again at The Next Platform.)
“We end up with a stack that is basically Linux at the core, commodity hardware across the datacenter, all of which is very symmetrical and you can contain your hardware costs and maintenance costs.”
Rau works in the New Products Group at Verizon Labs, which has the mission of figuring out how to build better infrastructure to support the myriad workloads that run on Verizon’s network. In the smartphone and tablet era, telecom companies do not just provide voice and data services, they also host applications for users and have internal-facing applications that help them manage users and their applications. As you might expect, the applications can be as important to the company as the raw network providing communications. Verizon, Rau tells The Next Platform, needs to scale some of these applications at the same level as the other webscale companies that are its peers, and things had to change.
“I took a look at how we do things currently, and it is a very traditional telecom approach,” explains Rau. “You build an application, you size it out, you order a bunch of machines, you find datacenter space, you go through a lengthy process to install it, and every time you do updates to the application, there is a lot of intervention. We said that we have to change how we do all of that, we have to move quicker and automatically scale and reduce our operational costs and hopefully leverage all of our capital investments in systems and not run isolated silos.”
These are lessons that every datacenter operator of any scale eventually learns. But the process doesn't end at getting a better cluster management system that allows multiple workloads to run side-by-side, in a secure manner, on a single cluster, as Apache Mesos and the commercialized variant from Mesosphere, called the Data Center Operating System, allow. Once you start thinking about managing applications, you start thinking about software containers because you have to have some means of providing isolation, software packaging, and resource allocation for applications.
“We started down the path and realized that we have to start looking at the datacenter collectively, as a big set of common hardware resources,” says Rau. “We want to deploy applications and let the system find a place to run those applications, and the concept is to treat the datacenter as a computer. This set us down the path thinking about how we want to compose our applications, and that led us to container technology. We want to compose our applications on sets of containers and we will run them on bare metal hardware because we do not need server virtualization for multitenancy like a public cloud because these are for internal Verizon datacenters hosting Verizon applications. Needing to orchestrate this is what led us to Apache Mesos as an open source project. We end up with a stack that is basically Linux at the core, commodity hardware across the datacenter, all of which is very symmetrical and you can contain your hardware costs and maintenance costs.”
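To make the "treat the datacenter as a computer" idea concrete, here is a sketch of what handing an application to a cluster scheduler looks like in practice. The field names follow the app definition format of Marathon, the framework for long-running services that ships with Mesosphere's DCOS; the service name, image, and resource figures are hypothetical illustrations, not anything Verizon has disclosed.

```python
import json

# Instead of sizing and racking machines for one application, the operator
# hands the scheduler a declarative description of the app and lets the
# cluster find a place to run the containers.
app_definition = {
    "id": "/billing/rating-service",      # hypothetical service name
    "cpus": 0.5,                          # fraction of a core per instance
    "mem": 512,                           # MB of RAM per instance
    "instances": 8,                       # the scheduler places these anywhere
    "container": {
        "type": "DOCKER",
        "docker": {"image": "registry.example.com/rating:1.4"},
    },
}

# POSTing this JSON to Marathon's /v2/apps endpoint would ask the cluster
# to launch and supervise the eight container instances.
print(json.dumps(app_definition, indent=2))
```

The point of the declarative shape is exactly what Rau describes: the application says what it needs, and the system, not a human, decides where it runs.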
What Verizon is really after is a system that looks like what Google created for itself beginning around a decade ago, when it first deployed containers on its homegrown Linux operating system alongside the Borg cluster management system that is closely meshed with it.
“If you look at the Google model, how they have been deploying infrastructure for years and years, how they create their systems and how their developers deploy their applications to those systems, what we are trying to deploy is a very similar model,” Rau explains. “What we end up with is a large set of resources that I can treat as a cohesive unit and that allows us to deploy applications more rapidly. It is not a matter of finding hardware and datacenter space. We develop an application and get it out there. We can be more innovative and we can try things and deploy them a little quicker because you don’t have to sink in an 18-month project and a lot of capital just to put out an application or service and see how it might do. Now we can try a service and see how it takes off and we have a platform that can scale it. Once we get an application up and running, we can update it rapidly, too.”
How Much Money And Time Can Verizon Save?
As you might imagine, Verizon is pretty secretive about the kinds of efficiency gains it expects by adopting Mesos and Docker containers, but Rau says it will be “fairly significant based on anecdotal evidence and from what people who are using this type of platform say.”
The hardware savings alone could be huge for Verizon. The way applications get deployed now, business lines that want to roll out something new have to scope it out for a three-year span and also do their best to guess what their peak workload will be during the year and factor that in. Then, because this is telecom, you have to build redundancy into the infrastructure behind the application, and then you go one step further and add geographic redundancy to push the application availability up to five nines and higher. This means Verizon buys a lot more iron when a service first rolls out than it will ever consume. We can imagine that Google was no different when it was experiencing explosive growth in the late 1990s and early 2000s and learned all of these lessons, culminating in its work on Linux containers and the creation of Borg.
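The arithmetic behind that geographic redundancy requirement is worth spelling out. "Five nines" means 99.999 percent availability, and the standard back-of-the-envelope model (which assumes site failures are independent, a simplification) shows why two modestly reliable sites combined can beat it:

```python
# Availability math behind geographic redundancy. "Five nines" allows
# roughly 5.3 minutes of downtime per year. Two independent sites, each
# at a modest 99.9 percent, combine (assuming uncorrelated failures) to:
site_availability = 0.999
combined = 1 - (1 - site_availability) ** 2   # 0.999999, i.e. six nines

minutes_per_year = 365.25 * 24 * 60
downtime_single = (1 - site_availability) * minutes_per_year
downtime_combined = (1 - combined) * minutes_per_year

print(f"combined availability:  {combined:.6f}")
print(f"single-site downtime:   {downtime_single:.1f} min/yr")
print(f"geo-redundant downtime: {downtime_combined:.2f} min/yr")
```

The catch, as the article notes, is that this reliability is bought with duplicate hardware that sits mostly idle, which is exactly the capital Verizon hopes a shared Mesos cluster can reclaim.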
“When Verizon launches a new service, because we have spare capacity in the cluster itself, we don’t have to size it for the three-year outlook and the peak load we might expect in those three years,” says Rau. “This allows us to grow our hardware by looking at an entire set of applications, and we can get into a model where we expand the cluster quarterly. This is the other benefit this approach gets us: Our capacity planning becomes a bit more concrete and sensible in that we can look at real historical data in the cluster and look at what we are onboarding to and offloading from the cloud and make better hardware acquisitions.”
“What we end up with is a large set of resources that I can treat as a cohesive unit and that allows us to deploy applications more rapidly. It is not a matter of finding hardware and datacenter space. We develop an application and get it out there. We can be more innovative and we can try things and deploy them a little quicker because you don’t have to sink in an 18-month project and a lot of capital just to put out an application or service and see how it might do.”
This is precisely what the hyperscalers do, and incidentally, they say that capacity planning across a wide diversity of workloads is far easier than doing so for hundreds or thousands of separate workloads running on separate machines or smaller clusters. Moreover, the lesson from the hyperscalers is that it is far easier to drive up utilization on one large cluster than on several smaller ones. This is one of the reasons why hyperscalers try to minimize differences in their hardware and software stacks.
While people don’t talk specifics because it all depends, we know that server utilization is generally on the order of 10 percent to 20 percent in the enterprise datacenter, and with virtual machines or software containers, companies can deploy multiple workloads on a machine and maybe drive that utilization up to 50 percent to 60 percent. We ran these numbers by Rau, who said they were in the ballpark.
What Mesosphere says is that DCOS has increased cluster resource utilization by maybe a factor of 2X to 3X amongst its early adopters, and for some it has been as high as a factor of 5X. When you run that out across tens of thousands or even hundreds of thousands of machines, these are big numbers. Those efficiency gains mean companies can deploy a lot more infrastructure to expand their application base, contract their server footprint, or do a little of both. As we have previously reported, Mesos will eventually get oversubscription capabilities, which will allow it to push up utilization rates even higher, thanks to Project Quasar. What this means is that customers should be able to see cluster utilization of 75 percent to 80 percent without too much trouble.
As for hardware, Rau is not about to reveal Verizon’s plans, but says that the goal is to emulate the hyperscalers and get “commodity, symmetrical hardware in the datacenter.” That does not necessarily mean servers, storage, and switching that is compatible with the Open Compute Project founded by Facebook five years ago, but it almost certainly does mean hyperscale-inspired machinery like that which Dell is aiming to sell through its new Datacenter Scalable Solutions division, which was announced last week specifically for customers like Verizon. While Mesos is being deployed on some existing machines, the idea is to roll out new iron for new workloads on the Mesos clusters; over a timeframe of several years, as old systems are retired from the server fleet, that fleet will be completely upgraded and Mesos will take over everything.
Adopting Mesos is not just about saving hardware money, but saving time. At the recent MesosCon conference in Seattle, Verizon showed off that it could fire up 50,000 Docker containers on its cluster in 72 seconds. Verizon believes that with this kind of speed, and automated control of containers and the underlying cluster, it can speed up application deployment on its network by an order of magnitude.
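For a sense of what that demonstration implies, the launch rate works out as follows:

```python
# Launch rate implied by Verizon's MesosCon demonstration.
containers = 50_000
seconds = 72
rate = containers / seconds
print(f"{rate:.0f} containers per second")
```

That is roughly 694 container launches per second, a rate that is simply unreachable with the manual, ticket-driven deployment process Rau describes replacing.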
Verizon plans to have services up and running on its Mesos cluster this year. Speaking very generally, Rau says the first applications that will be moved to the Mesos cluster will be those that support its wireless network and the mobile applications that customers use as well as the FiOS network. Mesos will also be The Next Platform that will support IoT services that Verizon is working on now, and media services such as video streaming will also be hosted on the Mesos clusters. Verizon is also planning to move its Hadoop and Spark analytics workloads off of their dedicated clusters and plunk them down onto the Mesos clusters.
As for containers, Verizon has predictably chosen Docker, but it is not embracing the Kubernetes container podding technology that Google created and that is inspired by Borg. Rather, Verizon plans to use the base container function inside of Mesos and Docker daemons to manage containerized software packages. Verizon looked at CoreOS and its rkt containers and the Tectonic container management system (based on Kubernetes) created by CoreOS as an alternative to Mesos, and Rau concedes that Verizon could use rkt containers for some of its workloads or even raw LXC containers in the Linux operating system.
“Container technology is very mature, and it has been around for a long time, but the benefit I see from Docker is that it is now more easy to use, and I think the packaging format gives you a way to treat applications as a unit,” explains Rau in talking about the adoption of the Docker format for software containers. “Mesos has its own containers as well, where it can set up Linux containers and their namespaces and control groups and talk to the Docker daemon and launch that task on its behalf. That’s what we are doing right now, but I would not say that we will be using Docker exclusively forever. I see Docker’s image format and the standards that it has brought to the table as a critical component.”
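The two launch paths Rau describes map onto the two containerizers a Mesos agent can expose: the Docker containerizer hands the task to the Docker daemon, while the Mesos containerizer sets up the Linux namespaces and control groups itself. A sketch of how the choice surfaces in a Marathon-style app definition follows; the image name is a hypothetical illustration, and nothing here is a disclosed Verizon configuration.

```python
# The "type" field selects which containerizer launches the task.
via_docker_daemon = {
    "container": {
        "type": "DOCKER",                        # delegated to the Docker daemon
        "docker": {"image": "example/app:2.0"},
    }
}

via_mesos_containerizer = {
    "container": {
        "type": "MESOS",                         # Mesos manages cgroups/namespaces itself
        "docker": {"image": "example/app:2.0"},  # can still run a Docker-format image
    }
}

for name, spec in [("docker daemon", via_docker_daemon),
                   ("mesos containerizer", via_mesos_containerizer)]:
    print(f"{name}: type={spec['container']['type']}")
```

Rau's comment that Docker's image format is the critical component, rather than the daemon, fits this design: the same image can be launched either way, which is also what leaves the door open to rkt or raw LXC later.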
Incidentally, Kubernetes can be run as a framework on top of Mesos, so it is possible that Verizon will use Kubernetes, too, if the need arises.