New Protocol Targets Cloud Scalability

It is no accident that virtualizing the network has taken longer than virtualizing servers or storage, and it is similarly not a coincidence that the hyperscale datacenter operators and a small number of very large businesses have decided to build their own switches and routers, as well as the operating systems for these devices. Networking gear is overloaded with protocols to support every possible scenario and customer, and networks are about as difficult to scale as NUMA systems.

LANs and virtual LANs have their physical limits, and as always, customers want to expand beyond those limits, but this requires yet another layer of abstraction and more bit fiddling to accomplish. Networking giant Cisco Systems has just put its stamp on a relatively new protocol for virtual networks, called Border Gateway Protocol Ethernet Virtual Private Network (BGP-EVPN for short). Now the networking arena is trying to read the signs to see if this signals a big change in Cisco’s plans for software-defined networking as outlined in its Application Policy Infrastructure Controller (APIC), which was launched for its high-end Nexus 9000 switches back in November 2013.

By adopting BGP-EVPN, Cisco is essentially admitting that its APIC approach, which etches some of the SDN functionality into the chips inside the Nexus switches, is not for everyone. APIC itself was launched in part to counter the onslaught of VMware, Big Switch Networks, and others who are creating software-only network virtualization stacks. APIC was absolutely intended to offer a hardware-based way to virtualize and manage networks and the applications that interface with them, as an alternative to VMware’s NSX network virtualization software, which works hand-in-hand with its vSphere server virtualization tools. The sales pitch from Cisco when APIC was announced for the Nexus 9000 switches was that it could deliver virtualized networking for 1,000 nodes with 2,000 network ports at a cost of about $40 per port per month (amortized over several years), compared to about $154 per port per month for running a virtualized stack like the VMware vSphere-NSX combo on commodity switches and servers.
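To put the per-port figures in Cisco's pitch into monthly totals, here is a minimal back-of-the-envelope sketch. It uses only the numbers quoted above (2,000 ports, $40 and $154 per port per month); the constant and function names are illustrative, and the result is simple multiplication, not an independent cost model.

```python
# Figures quoted in Cisco's sales pitch; names are illustrative only.
PORTS = 2_000
APIC_COST_PER_PORT = 40             # dollars per port per month
SOFTWARE_STACK_COST_PER_PORT = 154  # dollars per port per month


def monthly_total(ports: int, cost_per_port: int) -> int:
    """Total monthly cost in dollars for a given port count."""
    return ports * cost_per_port


apic_total = monthly_total(PORTS, APIC_COST_PER_PORT)
software_total = monthly_total(PORTS, SOFTWARE_STACK_COST_PER_PORT)

print(apic_total)       # 80000
print(software_total)   # 308000
print(f"{software_total / apic_total:.2f}x")  # 3.85x
```

By Cisco's own numbers, then, the software-only stack would cost a little under four times as much per month on a network of this size.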

If customers in the vast Cisco installed base had just snapped up the APIC approach, there would be little need to adopt a new protocol that offers improved support for large, virtualized networks. But not all IT shops want to go all the way with APIC at Cisco, just like many customers do not want to go all the way with VMware with NSX. These technologies are too new for many risk-averse enterprises to adopt. And so Cisco has to offer alternatives for better scaling of large, virtual networks, and support for the new BGP-EVPN protocol on the Nexus line of switches is an example of Cisco being sensible about offering options even if it does undercut its APIC sales pitch a bit. (Cisco has over 1,000 customers using its Nexus 9000 switches and over 200 customers are using the APIC extensions instead of just running them as plain vanilla switches using a cut-down version of the NX-OS network operating system.)

One of the basic problems that customers running large virtualized clusters are wrestling with is that Layer 2 switching networks can only have so many VLANs (4,096, to be precise), and that is not very many for a large public cloud or hyperscale datacenter that needs isolated virtual networks for tens or hundreds of thousands of virtual machines on its virtualized server clusters. So two protocols were created to virtualize Layer 2 networks and stretch them over Layer 3 networks, one called Virtual Extensible LAN (VXLAN), pushed initially by VMware, Cisco, and Arista Networks, and the other called Network Virtualization using Generic Routing Encapsulation (NVGRE), started by Microsoft and adopted by many network chip providers since then.

With VMware being the dominant server virtualization stack in the enterprise, support for the VXLAN protocol was enthusiastically adopted by switch ASIC makers, especially because the 24-bit network identifier in its header allowed for as many as 16 million tenants on a virtualized Layer 2 network. In plain English, what that means is that companies could build server clusters with virtualized networks that would push up into the hyperscale realms, allowing for VMs to flit around across what would have been isolated Layer 2 networks before. But there was a problem, and one that the BGP-EVPN protocol helps solve.
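The gap between 4,096 and 16 million comes straight from the field widths: a VLAN ID is 12 bits and a VXLAN network identifier (VNI) is 24 bits. The sketch below works through that arithmetic and, as an illustration, packs a VNI into the 8-byte VXLAN header layout described in RFC 7348 (a flags byte with the I bit set, 24 reserved bits, the 24-bit VNI, and 8 more reserved bits). The function names are hypothetical, and this is a minimal sketch rather than a full encapsulation implementation.

```python
import struct

VLAN_ID_BITS = 12  # width of the VLAN ID field in an 802.1Q tag
VNI_BITS = 24      # width of the VXLAN network identifier


def segment_count(bits: int) -> int:
    """Number of distinct segment IDs a field of this width can express."""
    return 2 ** bits


def pack_vxlan_header(vni: int) -> bytes:
    """Build an 8-byte VXLAN header for the given VNI (layout per RFC 7348)."""
    if not 0 <= vni < 2 ** VNI_BITS:
        raise ValueError("VNI must fit in 24 bits")
    flags_word = 0x08 << 24  # I flag set, followed by 24 reserved zero bits
    vni_word = vni << 8      # 24-bit VNI, followed by 8 reserved zero bits
    return struct.pack("!II", flags_word, vni_word)


def unpack_vni(header: bytes) -> int:
    """Recover the VNI from an 8-byte VXLAN header."""
    _, vni_word = struct.unpack("!II", header)
    return vni_word >> 8


print(segment_count(VLAN_ID_BITS))  # 4096
print(segment_count(VNI_BITS))      # 16777216
print(unpack_vni(pack_vxlan_header(5000)))  # 5000
```

The extra 12 bits multiply the number of addressable segments by 4,096, which is what lets a single fabric give each tenant its own isolated Layer 2 segment.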

As it turns out, the VXLAN protocol as initially implemented has some issues, as Mike Cohen, the director of product management for Cisco’s contributions to the OpenStack, Open vSwitch, and OpenDaylight projects, explains to The Next Platform. (Cohen did the same product management job at network virtualizer Big Switch Networks a few years back, and has held positions at Google and VMware as well.) The addition of the BGP-EVPN control plane aims to fix these problems.

BGP-EVPN is a standard that is being developed under the auspices of the Internet Engineering Task Force, and the draft proposal was developed by Cisco, Alcatel-Lucent, Huawei Technologies, and Juniper Networks, along with input from network operators AT&T, Verizon, and Bloomberg. BGP is a routing protocol used to link service provider networks together and has been tweaked to perform similar switching and routing work inside large-scale datacenters and across distributed datacenters among the hyperscale elite. The BGP and EVPN protocols are already in use among Cisco customers today, and the combination of the two allows reachability information concerning the Layer 2 and Layer 3 parts of the network to be brought into a single control plane. VXLAN as initially implemented has what is called a flood-and-learn control plane, which is not a very efficient way to map out where different virtual machines are on the network.

“Traffic is flooded through the whole network, and that is how you learn where different endpoints are,” explains Cohen. “This is very inefficient from a bandwidth utilization standpoint, and it is resource intensive on the hardware tables in the network chips as this information is flooded at them. Eventually virtual machine A learns where virtual machine B is on the network on this VXLAN fabric. This protocol basically says if you don’t know, tell everybody, and everybody is screaming for everybody else, in a sense.”

The other problem with the initial VXLAN fabric, says Cohen, is that while it stretched the Layer 2 networks, it did not change the default gateway that was configured with particular VMs. The upshot is that when a VM was live migrated across the datacenter to a different physical Layer 2 network, traffic relating to that VM still had to be passed back all the way across the datacenter to the original gateway.

“This is a far more efficient way of sharing the reachability information,” says Cohen of BGP-EVPN, because the location information of virtual machines is known in advance thanks to the control plane and you do not have to use the flood-and-learn method to figure out where virtual machines are located. “The key thing is that there is no more flooding going on in your VXLAN network, eating up your bandwidth and limiting your scale.”
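The difference Cohen describes can be illustrated with a toy model. In the sketch below, each tunnel endpoint (a hypothetical `Vtep` class) keeps a table mapping MAC addresses to the endpoint that owns them. Under flood-and-learn, a frame for an unknown destination is copied to every other endpoint and tables fill in only as traffic is seen; under a control-plane approach, every endpoint announces its local addresses up front, so the first frame already goes to exactly one place. All names here are assumptions made for illustration, and the model ignores how BGP-EVPN actually encodes and distributes routes.

```python
from dataclasses import dataclass, field


@dataclass
class Vtep:
    """Toy tunnel endpoint with a MAC-to-owner lookup table."""
    name: str
    local_macs: set = field(default_factory=set)
    remote_table: dict = field(default_factory=dict)  # MAC -> owning VTEP name


def build_fabric(n_vteps: int, hosts_per_vtep: int) -> list:
    """Create n_vteps endpoints, each with its own set of host MACs."""
    return [
        Vtep(name=f"vtep{i}",
             local_macs={f"mac-{i}-{h}" for h in range(hosts_per_vtep)})
        for i in range(n_vteps)
    ]


def find_owner(vteps: list, mac: str) -> Vtep:
    """Return the endpoint to which the given MAC is locally attached."""
    return next(v for v in vteps if mac in v.local_macs)


def forward(vteps: list, src_mac: str, dst_mac: str) -> int:
    """Send one frame; return how many encapsulated copies cross the fabric."""
    src_vtep = find_owner(vteps, src_mac)
    if dst_mac in src_vtep.local_macs:
        return 0  # same switch, no tunnel needed
    owner_name = src_vtep.remote_table.get(dst_mac)
    if owner_name is None:
        # Unknown destination: flood a copy to every other endpoint.
        receivers = [v for v in vteps if v is not src_vtep]
    else:
        receivers = [v for v in vteps if v.name == owner_name]
    for v in receivers:
        v.remote_table[src_mac] = src_vtep.name  # learn from the data plane
    return len(receivers)


def advertise_all(vteps: list) -> None:
    """Control-plane model: each endpoint announces its MACs to all peers."""
    for owner in vteps:
        for peer in vteps:
            if peer is not owner:
                for mac in owner.local_macs:
                    peer.remote_table[mac] = owner.name


flood_fabric = build_fabric(n_vteps=4, hosts_per_vtep=2)
print(forward(flood_fabric, "mac-0-0", "mac-3-0"))  # 3 (flooded to all peers)

evpn_fabric = build_fabric(n_vteps=4, hosts_per_vtep=2)
advertise_all(evpn_fabric)
print(forward(evpn_fabric, "mac-0-0", "mac-3-0"))   # 1 (sent to the owner only)
```

In this model the cost of a first contact under flood-and-learn grows with the number of endpoints in the fabric, while the pre-populated tables keep it at a single copy regardless of fabric size.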

How much of a performance improvement the shift to the BGP-EVPN control plane for VXLAN delivers will depend on how many hosts there are and how VMs move around the network. If customers are careful about placing their workloads within a rack on the same top-of-rack switch, the performance improvement won’t be much. But on a cloud, where VMs are scattered all over the place and moving around all the time, the adoption of the BGP-EVPN control plane could make a big difference. Cohen did not have any benchmark test results to share on this.

BGP-EVPN is available now for the Nexus 9000 switches, and will be available for the Nexus 7000 switches and ASR 9000 routers in the second quarter of this year. The BGP-EVPN specification was written to work with other overlays and will, in fact, work with NVGRE and Multiprotocol Label Switching (MPLS) over GRE, the latter being yet another transport that happens to live in the interdimensions between Layer 2 and Layer 3 in the network protocol stack.

Cohen says that the advent of BGP-EVPN as the control plane for VXLAN and soon NVGRE network overlays does not in any way signal the end of APIC, which is a different kind of SDN beast entirely. It is merely a matter of different horses for different courses. And, since Cisco is running BGP-EVPN on the Trident-II ASICs from Broadcom embedded in its Nexus 9000 switches, it stands to reason that others will adopt BGP-EVPN as the control plane for VXLAN and NVGRE overlays if they are using Broadcom chips, too.

  1. I give this effort a B-. It doesn’t seem very cloudy or open that the solution to the VLAN scaling problem is to build a new protocol that needs specific Cisco hardware to implement. Certainly BGP is the way to go — it has years of reliability at the largest scale behind it and the idea of VMs coming online and being announced to the network via BGP is a very elegant one. However, that elegance should flow naturally throughout the fabric, and not need custom hardware.

    Project Calico is doing work in this area already and is open source to boot. Well worth a look.

    • EVPN and VXLAN don’t seem very open? What kind of FUD is that – these are both standards that have been released via the IETF, the same way IPv4, IPv6, BGP, IS-IS, OSPF, and all the other protocols that we use are defined. IETF is as open as it gets. More to the point – how is Calico open? I can go read the specification for EVPN and VXLAN easily, where is the specification for Calico? If it’s such a good idea, why aren’t they participating in the IETF towards standardization?

      And not very “cloudy”? Project Calico is about as far as you can get from a “cloud” as you can get. Inability to handle overlapping IP space and provider coordinated addressing? Is this an early 2000s colo or a modern “cloud” solution? And the solution to these problems is a kludge like 464XLAT when you’re talking about things like “elegance”?

      Calico has some good ideas, like router in the hypervisor, but how is that different from VXLAN with the hypervisor being the NVE that controls VNIs?
