How OpenStack Lassoes Yahoo’s 4 Million Server Cores

As one of the first hyperscalers in the world, Yahoo, part of the Oath conglomerate that also includes AOL and that is owned by telecom giant Verizon, knows a thing or two about running applications at extreme scale. Back in the day, this was done by hand, but these days, most of the server infrastructure at Oath is housed in OpenStack clusters that have radically transformed the way server capacity is deployed and used.

Yahoo is best known for its online portal, search engine, and email services – and that exclamation point that we will never use in its name and that crazy yodel – and it did not get its server infrastructure automated in one fell swoop. Like many other organizations, Yahoo had to go through a lot of trials and tribulations, for both virtualized and bare metal iron, to get to the point that it can deploy bare metal machines with the same ease as it can do with virtualized ones.

It took a while to get a workable approach to virtualization, which was a priority even at a company that still, to this day, relies on bare metal servers for a lot of its workloads.

Yahoo started out in much the same place that Google did back in 2005, with some experiments with chroot jails – a kind of early container virtualization in Linux. Then Yahoo did some further experiments and deployments using the ESXi hypervisor from VMware and the Xen hypervisor from Citrix Systems, but still wasn’t happy. After that, Yahoo built Odin, a cluster controller that was built underneath Red Hat’s KVM hypervisor and that was tied into a homegrown procurement system for hardware that did much of what it wanted. But Yahoo wants to be in the media business, not the software development business. Just like it has replaced the homegrown MObStor/DORA object storage that backends its many services with open source Ceph object storage, Yahoo saw the momentum building behind OpenStack and six years ago made a commitment to building its clouds on that platform.

The problem that Yahoo was trying to solve for so long is not solely one of technology, but of culture and business practices. All of the different parts of Yahoo – called properties inside the company – had their own unique infrastructure and they were operated as such. But everyone knew that everything needed to be run like a cloud, a centralized shared resource. To accomplish this, Yahoo not only dug into the code for the OpenStack cloud controller, it also completely gutted its networks at the same time so they could support a new, datacenter-scale compute environment shared by all of the properties at Yahoo.

To get the inside story on Yahoo’s cloud evolution, The Next Platform had a chat with James Penick, architect director at Oath, who was there as the company plotted its path to consolidated bare metal and virtualized clouds.

Timothy Prickett Morgan: What was it like managing the infrastructure at Yahoo before you decided to adopt OpenStack, and what did you gain from this move?

James Penick: That’s actually a great question, and what it gives you is this overall broader alignment to the organization.

We had these puddles of compute resources, and this is due to the old school networking design. And so we made a number of decisions as a company. We chose to modernize our network infrastructure, and our network architect came up with a new folded Clos leaf and spine architecture, and it was a Layer 3 environment and we are going to move to that. That meant as a company we had to shed some of our old assumptions. We had applications that depended on things like multicast, which is a Layer 2 construct, and in this new environment we can’t have it any more.

With these Layer 3 environments, we are getting to a point where we can build a network backbone that is something that gives us a large lake, and by doing that, by pushing our applications to use Layer 3, we are now able to take the next step and pre-allocate compute hardware in all of these network backplanes, and instead of these little puddles, we have this huge lake and we can stock thousands of machines there. And then if a property comes in and says it needs ten or a hundred more machines, we just send them to a portal and it grants them the quota.

The big effect is that in the old days it could take months to get hardware, and we changed that to minutes and that was a huge cultural change.

I am glossing over this cultural change. We had to work with our properties and help them understand what we were doing, why we were doing it, and how it would benefit them in the long run. We had to work with our supply chain organization and change how they did business, because this is no longer a completely demand driven model where we spec the hardware and deliver it. Instead, we create a list of ten types of hardware configurations that are our standard and that will cover almost every use case. We had to buy these ahead of time based on our forecasting. So the supply chain team understands what these properties need and how their applications and users are growing, so when the property actually needs it, they get it on demand.

We found out one neat thing. If you talk to companies about infrastructure, you will hear about hoarding. Groups get hardware, and it is not super easy to get, so they hold onto it. And then they start horsetrading it because they are afraid that if they let go of it, they will lose it forever. Once people figured out that they could get quota for servers through OpenStack, we very quickly found that people wanted the ability to give quota back, which blew me away. We didn’t prioritize that as a feature in the portal because we figured it would take a while to get people to the point where they would give it back. But very quickly people wanted to get rid of old hardware. We could actually see the benefits of streamlining the supply chain process and managing the fleet and fostering agility.
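The quota flow Penick describes, where a property draws machines from a pre-stocked pool and can later hand them back, can be illustrated with a minimal Python sketch. This is not Yahoo's portal code: the `CapacityPool` class, its method names, and the property name "mail" are all hypothetical, chosen only to show the bookkeeping involved.

```python
from dataclasses import dataclass, field


@dataclass
class CapacityPool:
    """Hypothetical model of a pre-stocked pool of machines shared by all properties."""

    free_machines: int
    quota_by_property: dict[str, int] = field(default_factory=dict)

    def grant(self, prop: str, count: int) -> bool:
        """Grant quota immediately if the pre-stocked pool can cover the request."""
        if count <= 0 or count > self.free_machines:
            return False
        self.free_machines -= count
        self.quota_by_property[prop] = self.quota_by_property.get(prop, 0) + count
        return True

    def give_back(self, prop: str, count: int) -> bool:
        """Return unused quota to the shared pool so other properties can use it."""
        held = self.quota_by_property.get(prop, 0)
        if count <= 0 or count > held:
            return False
        self.quota_by_property[prop] = held - count
        self.free_machines += count
        return True


pool = CapacityPool(free_machines=1000)
pool.grant("mail", 100)      # a property asks for a hundred machines
pool.give_back("mail", 40)   # and later returns forty it no longer needs
print(pool.free_machines, pool.quota_by_property["mail"])  # prints "940 60"
```

The point of the sketch is that a grant is a bookkeeping operation against capacity that already exists, which is why it can complete in minutes rather than months.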

That is not to say that managing bare metal is without its challenges.

TPM: How much is virtualized and how much is bare metal?

James Penick: Most of our infrastructure is bare metal, and we are trying to change that, but it is slow going. Our number of bare metal nodes has dropped – it is now 220,000 but it used to be closer to 300,000. I realized that the reason for this is a lot of applications have been virtualized. The majority of compute resources are still bare metal – probably 60 to 70 percent of the fleet is bare metal, with the rest being virtualized using the KVM hypervisor. It is about 4 million cores.

TPM: How is all of this infrastructure spread out? What is your management domain size and cluster blast area? It is relatively easy to throw together an OpenStack cluster, but it is hard to manage many of them.

James Penick: We have eight large datacenters, each with its own OpenStack bare metal cluster, and we are working on managing remote datacenters from them. So, for instance, we have a datacenter in Singapore and we have a remote edge site in Vietnam, and the site in Singapore would manage that outside compute.

For virtual machines, I am going to take a step to the right and describe our four environments. First we have provisioned VMs and provisioned bare metal, where you go to the well and pick your configuration and boot your instances.

We also have another environment called OpenHouse, which I think was a very important part of making our environment OpenStack managed. We created these OpenHouse clusters, and there are four – one on the US west coast, one on the US east coast, one in Asia, and one in Europe. There is also one in Australia that is a bit different. With OpenHouse, the tenancy is based on the user, not on the property, and every single person in the company, regardless of job role, is authorized for up to five compute resources on these OpenHouse clusters. One of the reasons we did this is that we wanted to drive and foster agility and innovation. We wanted a place where engineers could quickly go and spin up compute resources and try out their ideas and if it works, then they are ready to move it to a provisioned VM.

That is the main purpose. But it is also how we put the OpenStack API in front of users and that is how they take control of their compute resources. It helps people get comfortable that there is not a team creating infrastructure for them. The other thing is that we never increase quota on OpenHouse for anyone, ever. This helps people understand that compute resources are disposable and it is not something that you need to hand tune and keep forever – and also that they can’t use OpenHouse for production workloads. We have held very strongly to these rules.
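The OpenHouse rules described here, per-user tenancy with a hard cap of five compute resources that is never raised, can be sketched in a few lines of Python. The class and exception names below are hypothetical and the sketch does not use the OpenStack API; it only illustrates how a fixed cap pushes users to treat instances as disposable.

```python
from collections import defaultdict

# Cap taken from the interview: five compute resources per person, never increased.
MAX_INSTANCES_PER_USER = 5


class QuotaExceeded(Exception):
    """Raised when a user tries to boot more instances than the fixed cap allows."""


class UserTenancy:
    """Hypothetical per-user tenancy tracker (tenancy by user, not by property)."""

    def __init__(self) -> None:
        self._instances: dict[str, list[str]] = defaultdict(list)

    def boot(self, user: str, name: str) -> None:
        # The cap is a constant, so there is no code path that raises it for anyone.
        if len(self._instances[user]) >= MAX_INSTANCES_PER_USER:
            raise QuotaExceeded(f"{user} already has {MAX_INSTANCES_PER_USER} instances")
        self._instances[user].append(name)

    def delete(self, user: str, name: str) -> None:
        # Deleting an instance is the only way to free a slot.
        self._instances[user].remove(name)

    def count(self, user: str) -> int:
        return len(self._instances[user])
```

A user who hits the cap has to delete an instance before booting another, which is the behavior the fixed quota is meant to encourage.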

TPM: What do you do for containers? Oath is not like a large enterprise in that it probably does not have thousands and thousands of applications, even if it does have massive scale on dozens to hundreds of applications. I would think that containers on bare metal are the future, and virtual machines will eventually be phased out once the security issues with containers are fixed.

James Penick: As an organization, containers are done ad hoc. One organization has a container management system they have built that uses Docker containers, and otherwise, there is another team in the company that has been producing the Kubernetes stack and a reference architecture on how those containers are managed. Other teams are still responsible for spinning up their own Kubernetes clusters and control planes.

You bring up this interesting point, where people say that VMs are unnecessarily complex when it comes to containers, and containers are much simpler. There is this perception that VMs have a much higher overhead than they do. The truth is, a very common design pattern is to put a container in a virtual machine for the sake of process isolation, to add that layer of security. This will be the pattern until there is another way to solve the container security problem.

TPM: How is the OpenStack environment going to change over time? What percentage of your environment is on OpenStack?

James Penick: We don’t have everything managed by OpenStack right now, the reason being that we were enhancing our ability to manage remote sites, and for some of those sites, we are holding off until that is in place. We are also waiting for features relating to affinity and non-affinity to be added – such as boot compute resources and either make sure they are all in the same rack or spread out across different racks, as the case may be. We are also waiting to be able to boot a bare metal instance with a custom hard drive layout. With those features, that are getting deployed soon, we are looking at being 100 percent OpenStack before the end of the second quarter of 2019.
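The placement feature Penick mentions, booting a group of compute resources either all in the same rack or spread across different racks, can be illustrated with a small Python sketch. The `place` function, the policy strings, and the host and rack names are hypothetical; the spread-out case is labeled "anti-affinity" here, a common name for that policy, and this is not the OpenStack scheduler.

```python
from collections import defaultdict
from typing import Optional


def place(hosts: dict[str, str], count: int, policy: str) -> Optional[list[str]]:
    """Pick `count` hosts under a rack placement policy.

    `hosts` maps a host name to the rack it sits in. Returns the chosen host
    names, or None if the policy cannot be satisfied with the available hosts.
    """
    by_rack: dict[str, list[str]] = defaultdict(list)
    for host, rack in sorted(hosts.items()):
        by_rack[rack].append(host)

    if policy == "affinity":
        # All instances must land in one rack: take the first rack with enough hosts.
        for rack in sorted(by_rack):
            if len(by_rack[rack]) >= count:
                return by_rack[rack][:count]
        return None

    if policy == "anti-affinity":
        # Every instance must land in a different rack: need at least `count` racks.
        if len(by_rack) < count:
            return None
        return [by_rack[rack][0] for rack in sorted(by_rack)[:count]]

    raise ValueError(f"unknown policy: {policy}")


hosts = {"h1": "r1", "h2": "r1", "h3": "r2", "h4": "r3"}
print(place(hosts, 2, "affinity"))       # prints ['h1', 'h2']
print(place(hosts, 3, "anti-affinity"))  # prints ['h1', 'h3', 'h4']
```

Returning None rather than a partial placement reflects a strict reading of the policy; a real scheduler might also offer a best-effort variant.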

TPM: How do you go about securing the OpenStack environment?

James Penick: We have done something very interesting here in creating a tool called Athenz, which is an authentication and authorization system that we have open sourced. We have been able to use this to drive the concept of unique service identity and we are moving the company over to use this such that all of our compute resources are secure.

It is possible to boot up compute resources on OpenStack at Oath and have them come up with a unique identity in the form of an X.509 certificate. That certificate contains a number of things, including a service identification string. What this means is that if you boot all of your nodes with this security system, you can also use Athenz to define policy and solve a lot of security problems that are endemic to a number of infrastructures.

We have been able to solve the problem of how do you bring an instance up and give it a secret in such a way that it can attest that it is what it says it is. Every server has an identity that has been signed by a secure root of trust, and you don’t have to use network controls to define what applications can talk to one another. It is still early days with Athenz, and we are still getting everyone to enroll. It is actually baked into the OpenHouse developer environment.
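The idea of authorizing traffic by service identity rather than by network controls can be sketched as follows. This is a toy model and not the Athenz API: the `Identity` class stands in for the fields that would be read from a verified X.509 certificate, a plain issuer comparison stands in for real signature verification against a root of trust, and the issuer, service names, and policy entries are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Identity:
    """Stand-in for the identity carried in an instance's certificate."""

    service: str  # service identification string from the certificate
    issuer: str   # authority that signed the certificate


# Hypothetical root of trust; a real system verifies a cryptographic signature.
TRUSTED_ISSUER = "example-root-ca"

# Hypothetical policy: which caller service may talk to which target service.
POLICY = {
    ("mail.frontend", "mail.storage"),
    ("search.frontend", "search.index"),
}


def is_allowed(caller: Identity, target_service: str) -> bool:
    """Allow a call only for a trusted identity that policy permits to reach the target."""
    if caller.issuer != TRUSTED_ISSUER:
        return False
    return (caller.service, target_service) in POLICY


frontend = Identity(service="mail.frontend", issuer=TRUSTED_ISSUER)
print(is_allowed(frontend, "mail.storage"))  # prints True
print(is_allowed(frontend, "search.index"))  # prints False
```

Because the decision depends only on who the caller is, not on where it sits in the network, the same policy holds wherever an instance is booted.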

TPM: How do you gauge the success of the OpenStack deployment, aside from the time to deploy a server dropping from months or days before OpenStack to hours or minutes after?

James Penick: With OpenHouse, we have pretty much done away with people having development desktops. Those are just gone, and that is a considerable savings to the company. I think also that seeing that we have had such a dramatic drop in physical compute resources is a significant improvement as well.

TPM: Will your OpenStack clusters continue to contract as you virtualize, or will it stabilize?

James Penick: I think it is going to eventually plateau. I think our virtual machine environment will grow over time, but I expect that the bare metal environment will not shrink below a certain point because there are applications where bare metal makes more sense. Could that change? Possibly.

TPM: What kind of utilization can you drive within a server and across the nodes on the clusters? Google is doing something like 50 percent utilization on clusters, and sometimes they can drive it a bit higher to 65 percent and rarely up to 80 percent on analytics workloads. Driving utilization in a virtualized or containerized environment is easier, because you can schedule multiple workloads on a server and then cluster them as needed, at different scales, across the cluster.

James Penick: Our overall peak average utilization has definitely increased over the past four years, and for the virtualized workloads it is much higher, and for our analytics workloads it is very similar to those numbers you cited above, from 50 percent to 80 percent. Bare metal that is not enrolled in a big data analytics workload or is not running a dense collection of containers is going to be lower.

TPM: Where does Oath get its servers, and how many different configurations do you maintain? When you operate at scale, you want to cut the number of vendors and the number of distinct configurations down to get economies of scale. A lot of hyperscalers have maybe three or four server designs in the works at any given time, and maybe six, eight, or ten different ones in the fleet.

James Penick: We work with a number of different vendors. And as for different types, we are in that same ballpark. It really comes down to analyzing total cost of ownership and looking at all of the use cases, bucketing them, and working out how we narrow them down to specific bands. Depending on the year, we have six to ten configurations.

TPM: How hard was all of this?

James Penick: It was not easy. Managing bare metal is a challenge because what you are trying to do – and this is something that we are always improving at – is a very complicated thing: manage hardware across different vendors, where every technology has its own faults, and present an environment that is consistent to the users such that when they ask for a compute resource it just appears. And it is not easy to change an entire company in how it is doing business. It is very difficult but it is very rewarding. Sometimes, we have to look back and remember all of the change we have absorbed.
