The Tectonic Shift To Virtual Distributed Routing

In the IT sector, there is a constant flow of little changes to hardware and software that culminate in progress. And every now and then, there are tectonic shifts that mark a much more dramatic change and deliver the rare “great leap forward.” Virtual distributed routing, or VDR, is one such tectonic change – and one that datacenters will welcome, at that.

As previously reported in The Next Platform, the first virtual distributed router, delivering a massive 768 Tb/sec of aggregate bandwidth, was launched by Arrcus in July 2020. Others in the routing segment of the networking market may try to follow suit, but they have a lot of work ahead of them if they do.

For context relating to the perspectives outlined below on how transformative VDR will be, I have been a student of, and practitioner in, the networking industry since the dawn of the modern broadband communications era in the 1990s. I am fortunate to have been a founding team member of broadband Internet access pioneer Roadrunner (the key driver of the $79 billion acquisition of Time Warner Cable by Charter Communications in 2015), and an early stage frontier technology venture investor in networking pioneers such as Abrizio (acquired by PMC Sierra), Dune Networks (acquired by Broadcom), LVL7 (acquired by Broadcom), and others. I am privileged to be an invited guest as a founding investor and BoD member for Arrcus. The Arrcus team worked out of our Clear Ventures office until their rack space needs exceeded our wiring closet and HVAC capacity, and their espresso consumption exceeded our brewing capacity.

The Internet Today

Dramatic progress has been achieved over the past 30 years building and scaling a global Internet. The “Jetson Phone” that was once the domain of science fiction is now part of daily life via FaceTime and Zoom. Yet, today’s Internet is facing intense pressure from unprecedented growth in connected devices, unprecedented bandwidth and computational demands per device, and emerging use cases with divergent requirements for latency, performance, and ability to pay. According to Statista, by 2018 there were 22 billion Internet-connected devices in use around the world. This is expected to grow to over 50 billion connected devices by 2030.

With aggregate revenues exceeding $60 billion in 2019 (figures are not yet out for 2020), the networking industry is a crown jewel of the global information technology industry. Many of today’s market leading networking industry solutions still follow a decades-old paradigm of vertically integrated embedded systems. In the compute world, this model of purpose-built embedded systems has long since passed – with a progressive transition from mainframes to high-end servers and onto scale-out clusters of commodity servers. A new generation of networking solutions is needed now to meet the world’s burgeoning requirements for network scalability, installed first cost, reliability, and operational simplicity.

The building blocks that make today’s ubiquitous mobile broadband Internet possible are:

  • Powerful endpoints: smartphones; laptops; servers
  • Fast access networks: mobile/wireless; residential broadband for transmission
  • High-capacity backbone network: metro and long-haul fiber for transmission
  • Massive data centers: football-field-sized public and private clouds for shared compute and storage

Although invisible to most of us, networking technologies such as Ethernet switching and IP routing are truly foundational to our daily lives. The networking technologies that select an optimal path for our information, then forward it within, between, or across multiple networks to get it where it needs to go, are the “postal service” integral to all elements of the world’s information, communication, and entertainment infrastructure. Networking solutions are realized with a sophisticated collaboration between leading edge software and hardware innovations. There is a vast set of “protocols” such as BGP, OSPF, IS-IS, SRv6, IPv6, and more that define how the Internet functions. These protocols define how packets move from node to node and from end to end across many nodes, and help create the central nervous system for the Internet.
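
To make the “select a path, then forward” idea concrete, here is a minimal sketch of longest-prefix matching, the lookup rule IP routers use to choose among overlapping routes. The prefixes and next-hop labels are invented for illustration, and a production router would use a trie or hardware tables rather than a linear scan.

```python
import ipaddress

# Hypothetical forwarding table: prefix -> next-hop label (illustrative values only).
ROUTES = {
    "0.0.0.0/0": "upstream",
    "10.0.0.0/8": "core",
    "10.1.0.0/16": "edge-1",
    "10.1.2.0/24": "rack-7",
}


def lookup(destination, routes=ROUTES):
    """Return the next hop of the most specific prefix that contains destination."""
    addr = ipaddress.ip_address(destination)
    best = None  # (network, next_hop) of the longest match seen so far
    for prefix, next_hop in routes.items():
        network = ipaddress.ip_network(prefix)
        if addr in network and (best is None or network.prefixlen > best[0].prefixlen):
            best = (network, next_hop)
    return best[1] if best else None


if __name__ == "__main__":
    for dest in ("10.1.2.9", "10.1.9.9", "10.9.9.9", "8.8.8.8"):
        print(dest, "->", lookup(dest))
```

The more specific the matching prefix, the closer the packet is to its destination, which is why the /24 wins over the /16 and the /8 for an address inside all three.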

In the early days of the Internet, a common refrain was “switch when you can, route when you must.” These days, as the number of uniquely addressable IP endpoints has skyrocketed to tens of billions, this rule of thumb has been flipped on its head to “route when you can, switch when you must.” This is a theme developed and explored at length by The Next Platform two years ago, in The Switch-Router War Is Over, And The Hyperscalers Won. The point is that routing, which was primarily a core backbone technology, is now a ubiquitous presence all the way from a datacenter top of rack, to the edge, and on to multi-cloud environments.

Why Legacy Routers Must Be Reimagined

Core routers are the very definition of mission-critical core infrastructure. Ensuring that systems are up with more than “five nines” of reliability without compromising on the scale, speed, efficiency, or latency of operations is a daunting task. Today’s legacy routers meet this challenge with purpose-built, chassis-based, refrigerator-sized embedded systems. The historical analogy with mainframe computers is striking. The issues with this approach are:

  • Sub-optimal resource utilization: Operators need to buy eight-slot and sixteen-slot chassis upfront, which may never be fully utilized – the minimum (atomic) unit is large (a chassis) versus a pizza box
  • Linear CapEx and OpEx increase: A linear (not flat) licensing model drives a linear increase in OpEx, while custom (not merchant) silicon, without the flexibility of merchant parts, drives a linear increase in CapEx
  • Proprietary solutions and a closed ecosystem: Vendor lock-in limits the innovation needed to meet the current market demands and trends described earlier
  • Poor software quality: Ever-increasing network and business interruption from legacy software (generally bloated with creeping featurism) combined with relatively low feature velocity
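
For a sense of what the “five nines” bar mentioned above means in practice, the arithmetic can be sketched as follows. This is only an illustration: it assumes a 365-day year and treats availability as a simple fraction of wall-clock time.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def allowed_downtime_minutes(nines):
    """Yearly downtime budget for an availability of `nines` nines (5 means 99.999%)."""
    unavailability = 10 ** (-nines)
    return MINUTES_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in (3, 4, 5, 6):
        print(f"{n} nines: {allowed_downtime_minutes(n):.3f} minutes per year")
```

At five nines the budget is a little over five minutes per year, which leaves almost no room for a chassis-wide outage or a disruptive upgrade.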

We have seen this natural lifecycle before. There was a time when the apex of human networking achievement was the telco central office switching system. Products such as the AT&T/Lucent 5ESS and Nortel DMS switches formed the central nervous system of the global public switched telephone network. As those products evolved, their software bases grew in size, became untenably brittle, and ultimately died under their own sheer butt weight. They were disrupted by the move from TDM voice to routed IP-based packet backbones.

Why Virtual Distributed Routing Is A Huge Leap Forward

The virtual distributed router is born of domain routing expertise and customer relationships with those that build and operate the world’s Internet backbone. These Tier 1 communication service providers are faced with the intrinsic problems of a legacy network design. Every time they run out of slots on their refrigerator-sized core router chassis, they need to spend a fortune to buy a new chassis from the incumbent supplier. The move to a new generation chassis requires a forklift upgrade of their existing chassis-based infrastructure to a new platform to get new features or to enable a new set of speeds and feeds. VDR solves these problems with its distributed Clos model because it elastically scales across multiple generations of hardware and software. VDR eliminates the fate-sharing problem of a physical chassis, thereby meeting and exceeding network uptime and resiliency SLAs. By moving the control plane and analytics plane out of controller cards in a physical chassis to an independent compute cluster, VDR is able to decouple the network service from the physical chassis constraints. This enables the deployment of any network service, anytime, anywhere, at best-in-class installed first cost and total lifecycle cost points.
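
The fate-sharing point can be illustrated with a back-of-the-envelope sketch. It assumes, purely for illustration, that traffic is spread evenly across independent fabric nodes in the Clos, so losing one node removes only its share of fabric capacity. The count of 13 fabric nodes is the maximum quoted in this article; the even-spreading model is an assumption, not a description of how VDR actually balances load.

```python
def surviving_fabric_fraction(fabric_nodes, failed):
    """Fraction of fabric capacity left after `failed` fabric nodes are lost,
    assuming traffic is spread evenly across all fabric nodes (an assumption)."""
    if fabric_nodes <= 0:
        raise ValueError("fabric_nodes must be positive")
    if not 0 <= failed <= fabric_nodes:
        raise ValueError("failed must be between 0 and fabric_nodes")
    return (fabric_nodes - failed) / fabric_nodes


if __name__ == "__main__":
    for failed in range(0, 4):
        share = surviving_fabric_fraction(13, failed)
        print(f"{failed} of 13 fabric nodes down: {share:.1%} of fabric capacity remains")
```

Under this model a single fabric node failure degrades capacity by about one thirteenth, whereas a failure of a shared component in a monolithic chassis can take every slot down with it.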

As far as we know, Arrcus has delivered the world’s first software-based, virtualized, distributed and massively scalable router. (It is possible that one of the hyperscalers has done this, but they are so secretive that it is hard to be sure.) The Arrcus VDR consists of three major functional entities:

The first piece is an underlying data plane packet forwarding and switching fabric that runs on state-of-the-art merchant silicon chips such as the Ramon and the Jericho ASICs from Broadcom. A thoughtful hardware abstraction layer greatly simplifies porting the ArcOS network operating system to future generations of packet forwarding and switching hardware. This decoupling of merchant networking hardware and ArcOS software transitions the networking industry, for the first time, from vertically integrated embedded systems from legacy vendors to horizontally segmented best-in-class building blocks.

The second piece is a control plane cluster that manages the distributed fabric. Being able to independently scale control plane logic future proofs the next-generation Internet core to be able to scale to hundreds of billions of uniquely addressable IP endpoints with record setting routing convergence times.

And finally, the third piece is a deep-visibility cluster that ensures a healthy, highly available, and automated system that delivers self-provisioning and self-healing networks.

These three functional entities work together to deliver the following key benefits:

  • VDR is a distributed Clos topology that can scale to at least 7,680 100 Gb/sec ports (or 1,920 400 Gb/sec ports). Thus, it scales far beyond the existing refrigerator-sized, chassis-based core router systems in the market.
  • VDR is pay-as-you-grow – so an operator can start with an investment of as little as one 40 port 100 Gb/sec “line card” and grow the fabric based on business needs up to 48 “line cards” and 13 “fabric cards.”
  • VDR is open – it uses OpenConfig, YANG and standards-based APIs, so third-party operational support system platforms can integrate seamlessly with it.
  • VDR is simple to operate – it has a single point of management and control that appears as a single logical entity to the user, thus giving the user the look and feel of a traditional chassis.
  • By using Kubernetes for orchestration of the containerized system components, VDR enables on-demand scale-out with optimal resource utilization.
  • VDR runs ArcOS both in the fabric and in the control plane cluster. ArcOS is a highly scalable, 64-bit network operating system (NOS) that is capable of scaling to more than 100 million network paths and converges much more rapidly than all other vendors in the market.
  • VDR runs ArcIQ in the deep visibility cluster. ArcIQ is an intelligent monitoring and analytics engine, which provides not only deep visibility but also predictive analytics and actionable insights.
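
The port counts and the 768 Tb/sec aggregate figure quoted in this article can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers from the text; the function and constant names are made up for illustration. How the 48 line cards of 40 ports each relate to the 7,680-port figure (for example, through port breakout) is not stated here, so the code does not assume it.

```python
GBPS_PER_TBPS = 1000


def aggregate_tbps(ports, gbps_per_port):
    """Aggregate bandwidth in Tb/sec for `ports` ports at `gbps_per_port` each."""
    return ports * gbps_per_port / GBPS_PER_TBPS


MAX_LINE_CARDS = 48       # maximum "line cards" quoted in the article
PORTS_PER_LINE_CARD = 40  # ports per "line card" quoted in the article

if __name__ == "__main__":
    print("7,680 x 100 Gb/sec:", aggregate_tbps(7680, 100), "Tb/sec")
    print("1,920 x 400 Gb/sec:", aggregate_tbps(1920, 400), "Tb/sec")
    print("Ports at full build-out:", MAX_LINE_CARDS * PORTS_PER_LINE_CARD)
```

Both port configurations work out to the same 768 Tb/sec, and a full build-out of 48 line cards yields 1,920 physical ports.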

The benefits of this approach are manifold, but can be summarized with four Ss:

VDR is Simple: VDR has a single point of control and management, providing a single pane of glass from the user experience perspective. VDR is also standards based, so it is simple to integrate with northbound interfaces such as NETCONF, RESTCONF, and the CLI, and just as easy to integrate with third-party OSS models.
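
As a rough idea of what a standards-based northbound call looks like, the sketch below builds a RESTCONF (RFC 8040) request for one interface in the OpenConfig interfaces model. The hostname and interface name are hypothetical, the `/restconf` root is only the conventional default, and nothing here is taken from Arrcus documentation; it shows the general shape of such a request and sends nothing over the network.

```python
from urllib.parse import quote

# Media type defined by RFC 8040 for JSON-encoded YANG data.
RESTCONF_HEADERS = {"Accept": "application/yang-data+json"}


def restconf_interface_url(host, interface_name):
    """Build a RESTCONF data URL for one OpenConfig interface (hypothetical host)."""
    # List keys are percent-encoded so names containing "/" stay one path segment.
    key = quote(interface_name, safe="")
    return (
        f"https://{host}/restconf/data/"
        f"openconfig-interfaces:interfaces/interface={key}"
    )


if __name__ == "__main__":
    print(restconf_interface_url("router.example.net", "swp1/1"))
    print(RESTCONF_HEADERS)
```

Because the path is derived from a published YANG model rather than a vendor-specific schema, an OSS platform can reuse the same request shape against any device that implements that model.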

VDR is Scalable: VDR scales in terms of ports and also in terms of route scale. VDR runs ArcOS on all of its components. ArcOS supports a massive routing scale of more than a million routes along with rapid convergence. VDR augments this scalability with automation. VDR is an automated fabric built with cloud-native design principles in mind.

VDR is Seamless: VDR can be deployed at any point in the network – at the edge, the core, or the aggregation layer. VDR is also seamless in the sense that it exposes the same ArcOS programmatic APIs as when ArcOS runs on a fixed-form-factor switch or router, irrespective of the underlying silicon chipset.

Finally, VDR is Secure: It has a deep visibility capability that helps a network operator understand what is happening in the VDR network – at every component level, be it LC, FC, or CC, as well as at every use case level. VDR streams telemetry about what is happening in the network to external consumers in real time. This information can then be consumed by telemetry applications, which can act upon it based on the business need.
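
A telemetry application of the kind described here might, at its simplest, look like the sketch below. The sample format, component names, and threshold are entirely hypothetical and are not taken from ArcIQ or any Arrcus interface; the point is only the pattern of consuming a stream of samples and acting on those that cross a policy limit.

```python
def find_alerts(samples, max_utilization=0.9):
    """Return (component, utilization) pairs whose utilization exceeds the threshold.

    `samples` is an iterable of dicts such as
    {"component": "LC-3", "utilization": 0.95} (a made-up format).
    """
    return [
        (sample["component"], sample["utilization"])
        for sample in samples
        if sample["utilization"] > max_utilization
    ]


if __name__ == "__main__":
    stream = [
        {"component": "LC-1", "utilization": 0.42},
        {"component": "LC-3", "utilization": 0.95},
        {"component": "FC-2", "utilization": 0.91},
    ]
    for component, utilization in find_alerts(stream):
        print(f"ALERT {component}: utilization {utilization:.0%}")
```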

The table below summarizes how the Arrcus implementation of VDR compares and contrasts to traditional chassis-based “big iron” routers and various open source software efforts to re-imagine networking software:

Having said all of this, it is worth pointing out that you have to have tremendous respect for today’s networking industry leaders. They have been pioneers in connecting the world’s businesses and people together.

Cisco Systems was founded in December 1984 and has been an impressively durable franchise company. Juniper Networks was founded in February 1996 and succeeded in turning Cisco’s core router monopoly into a duopoly for nearly two decades. Powered by acquiring Silicon Valley edge router startup Timetra in May 2003, Alcatel continues to be relevant. Arista Networks was founded in October 2004 with the accurate thesis that merchant networking silicon had become performant enough that captive ASIC development was no longer needed, and that Cisco’s Ethernet switching franchise could be challenged with thoughtful systems engineering combining their proprietary software assets with merchant networking components from the likes of Fulcrum Microsystems (long since part of Intel) and Dune Networks (acquired by Broadcom some time ago). While facing significant headwinds now for various business practices, Huawei Technologies has captured consequential networking industry market share.

All of these vendors follow the same basic paradigm – purpose-built complex machines that command high gross margin, create extreme vendor lock-in, and have limited scalability due to basic architectural tenets. In the words of economist Joseph Schumpeter, this has made them vulnerable to “creative destruction.”

The beginning of every new chapter of disruptive technology innovation is some chapter’s end. Google’s innovation of building massively scalable distributed computers from commodity servers paved the way to the modern cloud. It marked the beginning of the end of muscle car server vendors like Sun Microsystems. History is repeating. After several years Google concluded traditional switching and routing solutions could not meet its business needs. And the company also recognized that networking industry leaders were building their products on readily available merchant networking ICs and adding a 3X to 4X markup on the hardware costs for their captive networking stacks. Google, then Facebook and Amazon Web Services, responded in three ways:

  • They built in-house networking software expertise, especially in the area of network protocols.
  • They adopted merchant networking ICs. This has allowed them to gain a new level of economies of scale. This degree of vertical integration is impractical for many, if not most, others who build and manage consequential IT infrastructure. Arrcus ArcOS, together with VDR, is the first and only truly viable merchant network operating system with a robust, field-proven L3 routing capability.
  • They deployed whitebox hardware built to their specs, using their captive NOS.

This has created an urgent market need for a robust merchant NOS. It is worth noting that the core router market has not been kind to many aspiring VC-backed companies with high technology barriers to entry. Ambitious – and capital intensive – core router efforts at Charlotte’s Web, Avici, Procket, Compass EOS and many others have come and gone. Most have failed to deliver a usable product particularly in their software offerings, never for lack of effort or resources. Core routing is hard, very hard. It has been a Game of Kings.

The exceptions like Juniper and Arista that have gotten into production deployment on the Internet backbone took many years to do so. At the risk of sounding like a proud uncle, the Arrcus team defined a compelling vision worth fighting for, worked tirelessly in classic startup fashion to marry their good idea with great execution, and delivered what comes next to scale the routed Internet backbone. . . in half the time of their historical peers. This was only possible by working smarter, with aggressive adherence to modern software development best practices and deep domain expertise. ArcOS is the first and only network operating system, available for anyone to use, with the modularity to scale up and down and truly unify networking from the datacenter to the edge to the multi-cloud on merchant hardware.

Chris Rust is founder and general partner of Clear Ventures. Rust got his bachelor’s degree and master’s degree in electrical engineering at the University of Lowell and a master’s degree in telecommunications engineering and another master’s degree in engineering from the University of Colorado. Rust held engineering and product management roles at Carrier Access, ComCore, US West, and MITRE Corp, and was a co-founder and lead architect of broadband pioneer Roadrunner. After that, Rust spent 14 years at Sequoia Capital and USVP as an early stage investor, and hung out his own shingle in 2014 to start Clear.

1 Comment

  1. It’s really not that complicated. The Cisco ASR 3xxx and some 4xxx routers are going EOL. And Cisco has been hard charging on their 9xxx series soft routers.
