Facebook Ops Director On Breaking Open The Switch
March 16, 2015 Timothy Prickett Morgan
Facebook and its peers in the hyperscale sector have been pushing the very closed network switch business to open up and embrace hardware designs that look more like X86 servers and that are based on a Linux operating system. The idea is to break the software free from the hardware and let both evolve independently from each other as happens on servers. Open switches also speed up innovation and allow companies to deploy precisely the networking stack they need for their specific applications – and no more than that.
Although his title of director of technical operations doesn’t explain it, Najam Ahmad is in charge of the networks at Facebook, and before joining the social network he was in charge of Microsoft’s global network. In the past few years, Ahmad and the networking team have been spearheading the development of open switching gear and cultivating an open source switch platform to run atop it. This open switch movement is shaking up the switching business, much as the combination of Linux and X86 servers has done on the compute side of the datacenter.
At the Open Compute Summit in San Jose last week, Ahmad sat down with The Next Platform to talk about why Facebook has been pushing so hard for open switching, how Facebook is rolling it out in its own environment, and what further changes will be necessary for networks to keep scaling to support hyperscale companies like Facebook.
Timothy Prickett Morgan: A lot of things have happened since Facebook started talking about open networking and driving changes in the industry through the Open Compute Project a little more than a year ago. I have seen Open Compute servers and storage being deployed in Facebook’s datacenters, but what is the status of the deployment for open networking? I visited the Facebook datacenter in North Carolina last year and when I was there that was still using Cisco switches. Since then you have launched the Wedge top of rack switch, the 6-pack modular switch, and the FBOSS switch operating system. What other things might you do on the networking front?
Najam Ahmad: Let me start with some context. We were looking at disaggregation as our primary motivation. That meant disaggregation of the network appliance because it is vertically integrated and all features and functions are very closely tied to each other and you essentially have to go from one appliance to another appliance if you need to do something different. If you go to Cisco or Arista or whoever, what you end up with is vertically scaling boxes and platforms that are very closed and we don’t have a lot of control.
So we tackled this in three layers. One is the topology, the architecture itself, and the idea of the architecture was to build it in a modular fashion that scales with the number of machines and the amount of bandwidth they provide so we could grow over time. And we released that architecture a few weeks ago, called Fabric.
Fabric disaggregates the large chassis-based platforms that we used to have in our previous architectures in a cluster environment and now that is more of a heavily modified CLOS-based architecture that employs smaller, pizza-box devices. To complete that architecture, we built two platforms: The Wedge and the 6-pack. If you look at the Wedge, it is actually a module of the 6-pack system. If you take twelve Wedges and put them into a 6-pack, eight of them are outward facing ports and four of them are used for the fabric inside. On top of that, we built two pieces of code: OpenBMC, which we are open sourcing and which runs on the chip in that Wedge platform and gives us the ability to manage that platform including its environmentals and so forth. OpenBMC runs on the microserver embedded in the Wedge switch, and that microserver is running an operating system as well as doing other things. FBOSS is the operating system we run on these switches, and we are making a bunch of FBOSS libraries open source, too.
There are a bunch of partners who are developing this. Cumulus and Big Switch have both ported their software onto the Wedge, so you can buy the Wedge from Accton running Cumulus or Big Switch software if you did not want to roll your own switch operating system. Broadcom has also been a partner and has built OpenNSL.
TPM: Where is Facebook in terms of its rollout of these technologies?
Najam Ahmad: If you go to our Altoona, Iowa datacenter, it is all built using Fabric and it is using Wedges. The 6-packs are in production but are still in limited quantities because we are still doing verification testing, and once we are done with our testing it will become our primary platform. The 6-pack will be replacing the spine switch, where we have used both Cisco and Arista up until now.
TPM: How do you backcast new technologies like Wedge, 6-pack, and Fabric across the older datacenters? What is the process here? Do you only use the new technologies in a new datacenter, or will you go back and put this in Prineville, Oregon and Forest City, North Carolina and Lulea, Sweden?
Najam Ahmad: We do a refresh cycle. Any new datacenter we are building gets the new technology. And then when we do a refresh cycle with the servers in the other datacenters, we refresh the network as well. Our refresh cycle tends to be much shorter than that of most companies, around two to three years.
This refresh is something that we take very seriously because if you have lots of different technologies in production, it gets harder to manage. So our goal is always to have no more than two generations of technology in production. We will actively replace stuff, and the most current two are what we end up with.
TPM: What triggers the refresh cycle in each datacenter? Do you start with older stuff first in Prineville or work back from Altoona to Lulea to Forest City to Prineville? Is it first in-first out?
Najam Ahmad: It all depends on where server refreshes are happening, and they happen for all sorts of different reasons. We might have an app in a datacenter that needs more capacity and we start adding more servers and at the same time decide to replace the old servers, and then we will add the network along with it. There is no specific timeline, it is all based on need.
If you look at Lulea, it is not just one building. There is Lulea 1, Lulea 2, and Lulea 3, and they are in different stages of lifecycle.
TPM: So you can add new gear to a new building and then move workloads from an old building to a new one and then add shiny new stuff to the original building.
What else do you have to do in terms of evolving the Facebook network? If you have a fabric that can scale across the datacenter and you have the disaggregated switches you wanted and soon multiple vendors who will be building them not just for you, but for other Open Compute adherents, what do you do for an encore?
Najam Ahmad: What I would like to see is more of an ecosystem where a lot of companies are building solutions and writing software that is open source and available for the rest of us to use. And that ecosystem is starting to develop. But we couldn’t get that ecosystem to start without the disaggregation.
Now Microsoft and Dell are leading the development of a hardware abstraction layer, called Switch Abstraction Interface or SAI, that provides an abstraction on top of Broadcom or other silicon inside the switch. And that SAI source code will be open as well.
TPM: Will you use SAI?
Najam Ahmad: Absolutely. Our intent was to build it ourselves, but then Microsoft and Dell said they wanted to build it so we said great, take the lead. That is actually a good thing to see: different companies taking the lead and solving different parts of the problem so we can build this ecosystem of network operators that are building solutions that apply to large-scale networks.
We still have a lot of other things to think about in terms of chipsets. There need to be more options, and there are different languages to program these network ASICs.
TPM: Do you use only Broadcom ASICs in your new gear?
Najam Ahmad: Primarily, but we are looking at other options. The way we have built the Wedge and 6-pack they are modular so within the same architecture you can have different boards or line cards. As more network chips get developed and more companies decide to build a Wedge board, then we and everyone else using these two switches will have more options available.
TPM: Do you have any interest at all in InfiniBand, or with RoCE running atop 10 Gigabit or 40 Gigabit Ethernet, is that low latency enough if you have a workload that requires remote direct memory access over the network between machines?
Najam Ahmad: We are all Ethernet.
TPM: Why have two standards when you can have one. . . .
Najam Ahmad: Exactly.
TPM: How far can Fabric scale in terms of switches and node count and bandwidth? The intent, I believe, was to create switches and a fabric that could span an entire Facebook datacenter, which might have hundreds of thousands of machines. Does it just go forever in terms of its expansion?
Najam Ahmad: Forever is a long way out. I try not to say “ever” or “forever” or “never” when I talk about technology. But we have tried to build an architecture that scales horizontally and scales to the size of the buildings that we have, and you end up with hundreds of thousands of machines behind that. Theoretically, there is no limit. At our scale, it is working in production today, and that is hundreds of thousands of machines on fabrics. We will see how far we go as we try to scale it.
TPM: You have had a $2 billion savings to date from the Open Compute effort. Imagine you had to go back in time and do what you are doing today in your datacenters, but you did it with commercial and closed switches. How much less cost and grief and hardware does this open network approach yield?
Najam Ahmad: Let me quantify the most important thing for me, which is scaling operations. If you look at the Fabric deployments, every aspect of it is automated, and that comes from disaggregating the software from the hardware and having the ability to write code right on the boxes themselves. All configuration is done much like we do with servers, where it is top down and where the management system determines what the topology of the network is, and based on that topology and where the machine is, it pushes a configuration out to the box and no human being is involved in that. This is rather than what happens with traditional networks, where an engineer logs into a device and sets up a configuration and goes on to the next device and the next and the next. That is a bottom-up approach.
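As an editorial illustration of the top-down approach Ahmad describes, here is a minimal Python sketch in which each switch's configuration is derived entirely from its position in a central topology description. This is not Facebook's tooling: the `Switch` record, the "rack" and "spine" role names, the pod numbering, and the `render_config` and `push_all` functions are all hypothetical, chosen only to show the shape of the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Switch:
    name: str
    role: str  # hypothetical roles: "rack" or "spine"
    pod: int   # hypothetical grouping of switches that connect to each other


def render_config(switch, topology):
    """Derive one switch's config purely from its place in the topology."""
    # Assumption: a rack switch peers with the spines in its pod, and vice versa.
    peer_role = "spine" if switch.role == "rack" else "rack"
    peers = sorted(
        s.name for s in topology if s.role == peer_role and s.pod == switch.pod
    )
    return {"hostname": switch.name, "role": switch.role, "peers": peers}


def push_all(topology):
    """Stand-in for the management system: build every config centrally."""
    # A real system would transmit each config to its switch; here we only
    # return the generated configs keyed by switch name.
    return {s.name: render_config(s, topology) for s in topology}


# Illustrative topology: one pod with two spines and two rack switches.
TOPOLOGY = [
    Switch("spine1", "spine", pod=1),
    Switch("spine2", "spine", pod=1),
    Switch("rack1", "rack", pod=1),
    Switch("rack2", "rack", pod=1),
]

configs = push_all(TOPOLOGY)
print(configs["rack1"]["peers"])
```

In this sketch, adding a switch means adding one entry to the topology; every affected configuration is regenerated from the same source of truth, and nobody logs into a device.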
We leverage our Linux-based deployment environment, and one of the reasons why Linux is so important to us on the switch is that we manage our server fleet with a lot of tools. This is all homegrown stuff and we have not officially released the names of these tools.
Our goal was to get this network in a state, from a monitoring perspective, so that only a single person was required to watch the network at any given time. We don’t have a network operations center, and I don’t intend to build one and that is by design. FBAR, our server remediation tool, does a lot of that stuff. We run a monitoring agent on the Wedge, and that gathers a lot of stats and pushes it up. We know what the norms are and we can use our systems that we have used to manage our large fleet of servers to deduce what is going on with the switches.
A traditional network operations workflow is that an application sees a problem, someone figures it must be the network, a network engineer gets called and logs into one device after another to try to figure out if there is a problem and often doesn’t see anything. That is a typical flow. What we do instead is we have an entire heatmap of the Fabric, and this gets stats from the entire network and we look at what is normal and if something is not normal it turns yellow or red and you can double-click and see what is going on and troubleshoot from there. So this doesn’t require that many engineers.
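As an editorial illustration of the heatmap idea, the Python sketch below colors each switch by how far its latest statistic sits from its own historical norm. The interview does not say how Facebook defines "normal"; the z-score test, the `warn` and `crit` thresholds, and the function and switch names here are all assumptions made for the example.

```python
from statistics import mean, stdev


def classify(history, current, warn=2.0, crit=4.0):
    """Return "green", "yellow", or "red" for one stat on one switch.

    Assumption: "normal" means within `warn` sample standard deviations
    of the historical mean; `crit` marks a severe deviation.
    """
    mu = mean(history)
    sigma = stdev(history)  # needs at least two historical samples
    if sigma == 0:
        # A perfectly flat history: any change at all is abnormal.
        return "green" if current == mu else "red"
    z = abs(current - mu) / sigma
    if z >= crit:
        return "red"
    if z >= warn:
        return "yellow"
    return "green"


def heatmap(histories, latest):
    """Map each switch name to a color based on its latest reading."""
    return {name: classify(histories[name], latest[name]) for name in histories}


# Illustrative data: the same baseline for three hypothetical switches.
baseline = [10, 12, 11, 9, 10, 12, 11, 9]
histories = {"rsw1": baseline, "rsw2": baseline, "rsw3": baseline}
latest = {"rsw1": 11, "rsw2": 13.5, "rsw3": 20}

colors = heatmap(histories, latest)
print(colors)
```

The design point is that the comparison is against each device's own baseline, so software can flag the outliers and an engineer only has to look at the cells that are not green.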
The whole point is that at scale, you need to be able to get to problems very, very quickly, and ideally through software, not human beings. Our tagline is, we want robots to run the network, we want people to build the robots. The people who used to do all of the monitoring are now writing more code to do more automation of remediation. For me, that is the real cost savings and the real impact of building this networking technology. That we can have one person watching the entire network.
TPM: Would it even be possible to run Facebook’s network the old way?
Najam Ahmad: Absolutely not. This is why we had to drive technology change, and this is why I am so passionate about this. Our whole motivation with Open Compute is to get more people building it, more people using it.
TPM: When do you not have a job, when the industry starts doing what you had to bootstrap yourself? At what point can Facebook just buy Cumulus Linux and a stack of applications, or do you never reach that point because you are Facebook and you have to scale so much further than most other companies?
Najam Ahmad: In a lot of ways, when you solve one problem, you are just moving the bottleneck. We do have the rest of the network. We have been talking about the datacenter network, but there is the backbone, there is the optical domain, and there is the edge network that connects to all of the Internet service providers of the world. There is a lot of other work that we are doing that we will start bringing into the fold. Datacenter was a place to start. There is a lot more networking left.
We feel that the datacenter architecture will scale for a number of years. Fundamentally, we think the architecture and topology of our network will remain the same. It will just require different components. Different chipsets, more ports, higher speeds – all of that will be required as we scale.
In the spine, we use a lot of 40 Gb/sec today, but the link to the NIC is still 10 Gb/sec. As that NIC goes to 25 Gb/sec and 50 Gb/sec, that forces increases upstream. We will have to go from 40 Gb/sec to 100 Gb/sec to 400 Gb/sec. What happens in the 400 Gb/sec timeframe?
Right now, we are trying to solve the 100 Gb/sec problem and optics are too expensive and there are several competing standards that people are pushing. We are trying to work with people to get that squared away so optics are not a hindrance so we can go to 100 Gb/sec on a massive scale. Then it becomes a 400 Gb/sec problem and how do we do that.
We have a lot of east-west traffic within the datacenters and across datacenters, so we build fairly large amounts of bandwidth into our networks. That is why we have been focused on reducing costs. And in the past few years we have put pretty significant dents in per-unit costs of bandwidth inside the datacenter. We have not officially released numbers there. Where we have to spend more time is on the optics. With this disaggregated architecture, you have hundreds of thousands of optical interconnects, and this becomes a much more significant part of your spending in the datacenter than switching is.
TPM: That’s just stupid.
Najam Ahmad: Yes, it is – and hence we need to work on the optical interconnects and reduce costs there.