Networks may not be the most expensive thing in the datacenter – they typically comprise about 10 percent to 15 percent of the cost of a distributed system, including cables, transceivers, switches, and routers – but they are without a doubt the most complex part of distributed systems. And anything that can cut down on both complexity and cost at the same time should have a fairly easy time selling in the datacenter.
This is, after all, how chip makers like Broadcom (eaten by Avago, which took its name), Fulcrum Microsystems (eaten by Intel and then largely ignored), and Mellanox Technologies (eaten by Nvidia and now the cornerstone of its datacenter-as-computer strategy) paved the way for merchant silicon for datacenter switching more than a decade ago, and this is how newer merchant silicon suppliers such as Innovium (not acquired yet) and Barefoot Networks (eaten by Intel last year) have been able to carve their niches as well.
While companies certainly want choice when it comes to the chips in their switches – and are increasingly demanding more open and less costly routing chips – when it comes to network operating systems, they are sick of making choices. Or more precisely, they are sick and tired of having choices thrust upon them. Switching is like the RISC/Unix operating system era, where vendors had their own silicon and a flavor of Unix that was just enough like the others that it could be called Unix and offer a certain degree of portability between platforms. But these RISC/Unix systems had enough differences when it came to their APIs and the way they were operated that it was nonetheless still hard to move from platform to platform. With routers, the situation is more like the proprietary minicomputers that predate the RISC/Unix revolution in servers in the late 1980s and early 1990s. The chip is closed and the operating system is closed and there really is no interoperability or portability.
What companies really want in networking is something akin to the situation with Linux, and we think companies would be willing to give up on having source code if they can get a unifying network operating system that can do switching and routing in the on-premises datacenter and also span the public clouds, bringing their networks under the same unifying control plane. And if they could smash the hegemony of Cisco Systems and Juniper Networks in the routing space and provide an equal or better operating system than the switch and router suppliers do individually while also driving down the cost of the NOS and supporting a wider array of switch/router ASICs, all the better.
Arrcus, an upstart network operating system supplier that we have gotten to know over the past three years, wants to be the NOS of choice for modern datacenters. And not just for the enterprises that want to emulate the hyperscalers and large public cloud providers, who have created their own NOS stacks, but also for those hyperscalers and large public cloud providers themselves. That’s a pretty bold move, but network operating systems are difficult and tricky and if hyperscalers and cloud builders could get out of creating this software, they would.
Like the hyperscale datacenter operators, Arrcus took a clean slate approach when it created ArcOS, its network operating system. And like the hyperscalers and cloud builders in the United States, Arrcus also took a routing-centric approach to datacenter networking.
Arrcus uncloaked from stealth mode two years ago, and its technical team is led by Keyur Patel, the company’s chief technology officer and formerly a distinguished engineer at Cisco for 14 years, and Derek Yeung, who is chief architect and formerly at Cisco in various engineering leadership roles for 25 years. Patel and Yeung are among the world’s experts on various routing protocols, including the Border Gateway Protocol (BGP) that is favored by hyperscalers and cloud builders for their hybrid switch/routing gear. So it is no surprise, then, that Arrcus has pitched itself as the substrate to bind switching and routing together on merchant silicon, leading us to quip in a subsequent analysis of the ArcOS platform that the switch-router war is over and the hyperscalers have won. In our coverage describing what is going on within the networks of the hyperscalers, we pointed out that the old network adage “switch when you can and route when you must” has been turned on its head, and now these massive datacenters route when they can with deep buffer ASICs and switch when they must with shallow buffer ASICs. The latter are much cheaper, and the whole shebang brings a unified switch/routing infrastructure inside the datacenter.
But all of this is even more complicated, even among the hyperscalers, which use a mix of their own network operating systems and control planes as well as those provided by switch makers such as Cisco, Arista Networks, Mellanox, Innovium, and sometimes Intel thanks to Barefoot Networks.
“We talk to customers in the enterprise and communications service provider portions of the market,” Devesh Garg, founder and chief executive officer at Arrcus, tells The Next Platform. “They have data and applications that they run on premises, and they also run at the edge of the network and are increasingly availing themselves of the cloud. But these are all very disparate, siloed, non-communicative, inefficiently connected environments. They want networking infrastructure that can provide any service, anytime, anywhere. We are able to do this because of the power of one simple network, one scalable architecture, and one seamless experience to connect billions of devices. If you get your switches and routers from the same vendor, or from a mix of them, they not only have different operating systems, but even within one company they have inconsistent data models and inconsistent ways that the same hardware equipment communicates. Adding the edge and the cloud to this hodge-podge only makes it worse for them.”
And, as it turns out, better for Arrcus, which is adding two new features, called Virtual Distributed Routing and Multi-Cloud Networking, to the stack, expanding not only its total addressable market, but the places where ArcOS and its adjuncts can run. VDR allows companies to create virtual routers out of fabrics of merchant switch/router silicon as Google and the other hyperscalers have done, and MCN allows for ArcOS to be deployed as a unifying substrate across on premises, cloud, and hybrid infrastructure that might consist of multiple public clouds or a mix of on premises and public cloud. (We will be drilling down into the capabilities separately.)
This transformation for ArcOS didn’t happen overnight, of course. It has been a steady march for Arrcus from the time it uncloaked in 2018 until today. But even though Garg and his colleagues did not initially talk about the full breadth of capabilities they wanted to bring to bear, this was, he says, always part of the plan.
“I have worked at a bunch of different startups, as well as advised them at Bessemer Venture Partners, and it is important to not try to boil the ocean during the inception of the company,” says Garg. “Amazon Web Services, Facebook, and Google have led the way, though, and they have showed us how to do it.”
And, we would add, perhaps Arrcus can teach some of the hyperscalers and cloud builders a thing or two about how to unify the software for their necessarily disparate hardware. At some point, having thousands of PhDs reinventing this wheel will not make much sense. But, then again, the hyperscalers and cloud builders have nearly infinite money to invest in whatever they feel like doing, they like having perfect control of their platforms, and they often reinvent wheels as a matter of pride as well as curiosity. Which is why they still create everything from their own Linux kernels (and with Microsoft Azure, their own Windows Server kernel) and everything else that runs on top. We are somewhat skeptical that AWS or Google will ever let go of their network operating systems, but Facebook could if it thought it could use a better one and so could Microsoft, despite its heavy investment in SONiC and SAI. In fact, Microsoft is the natural company to buy Arrcus and commercialize networking software as it has done for systems with Windows Server. Why not?
But none of that is necessary for Arrcus to be successful. With the annual cadence of updates to ArcOS, the company has been able to grow its total addressable market by around a factor of 7.5X, starting from the IP Clos fabric based on routing protocols that are reimplemented completely from scratch and multithreaded so they run well on any kind of modern iron – something that Garg contends no other NOS supplier can boast because even if they have expanded the threading, and therefore the performance, of certain parts of their NOS stack, they have neglected other parts which still run poorly. (We will be chasing down these issues in the future with all the major switch vendors to see how they stack up.)
ArcOS could be deployed on top of rack, leaf, spine, and spline layers of the fabric and initially could be supported on Broadcom switch ASICs in the “Tomahawk” shallow buffer and “Jericho” deep buffer families. Over time, switch ASICs from Barefoot Networks, Innovium, and Marvell have been added and equally importantly, network functions can also be offloaded into containers running on X86 or Arm processors as well.
Last year, Arrcus added capabilities to do route-reflecting with BGP as well as routing to the host and peering, which nearly tripled its TAM to around $22 billion. And with the capabilities that Arrcus is adding now, which include support for Ethernet Virtual Private Network (EVPN) and Virtual Extensible LAN (VXLAN) for stitching together Layer 2 switching networks over Layer 3 routing to scale them, plus segment routing and the new Virtual Distributed Routing and Multi-Cloud Networking capabilities, Garg says that the TAM has nearly tripled again to more than $60 billion.
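The TAM figures quoted above hang together arithmetically, and a quick back-of-the-envelope check shows how. The sketch below is ours, not Arrcus's: it assumes the 7.5X growth factor is measured against the $60 billion figure (treated here as a floor, since Garg says "more than"), and all of the variable and function names are made up for illustration.

```python
# Back-of-the-envelope check of the TAM figures quoted in the article.
# All values are in billions of US dollars. The starting TAM is NOT given
# in the article; it is inferred here under the stated assumption.

TAM_LAST_YEAR = 22.0    # "around $22 billion" after route reflection, routing to the host, peering
TAM_NOW = 60.0          # "more than $60 billion", treated as a floor
OVERALL_GROWTH = 7.5    # "around a factor of 7.5X"; assumed to be measured against TAM_NOW


def implied_initial_tam(current_tam: float, overall_growth: float) -> float:
    """Starting TAM implied by the current TAM and the overall growth factor."""
    return current_tam / overall_growth


def growth_factor(before: float, after: float) -> float:
    """Multiple by which the TAM grew between two points."""
    return after / before


initial = implied_initial_tam(TAM_NOW, OVERALL_GROWTH)
first_step = growth_factor(initial, TAM_LAST_YEAR)
second_step = growth_factor(TAM_LAST_YEAR, TAM_NOW)

print(f"Implied initial TAM: ${initial:.1f} billion")
print(f"First expansion:     {first_step:.2f}X")
print(f"Second expansion:    {second_step:.2f}X")
```

Under that assumption, the initial IP Clos fabric opportunity works out to roughly $8 billion, and each of the two expansions comes in a little under 3X, which squares with the "nearly tripled" language in both cases.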
To capitalize on that opportunity is going to take some expertise, some cash, and a whole lot of pushing. There will no doubt be some pulling on the part of some hyperscalers, cloud builders, and other large enterprises and service providers who have had it up to their ears dealing with this complexity – how do you think Linux and Windows Server came to dominate the datacenter? – and will be willing to give ArcOS a try in proofs of concept. And capturing more of the market and smashing the hardware silos and crossing the software moats that NOS suppliers have created will require more than just the freedom to choose switching and routing hardware and unifying the software substrate in the network. It is going to take a radically better price/performance argument – which is precisely how proprietary minicomputers ate share from mainframes, how RISC/Unix ate share from proprietary minicomputers and mainframes, how Linux and Windows Server ate share from all of the above, and how cloud providers are going to use virtualized servers running either Linux or Windows Server to pull some of those servers onto the cloud.
This is ever the way, and such transformation is inevitable. Why should networking somehow be immune? It isn’t, and the open sourcing of so many different NOSes over the past few years came to nothing, in the long run, just as RISC/Unix did in the datacenter. And it had very little to do with open source and more to do with the quality of the software being written. Linux was a threat to Microsoft and Microsoft responded by making Windows Server better. That is why these two platforms dominate. That one is community developed and that the other is created by experts is not precisely incidental – there are those who will only deploy open source software as a matter of almost religion. But, then again, look at all of those hyperscaler and cloud provider NOSes that are not open and that dominate the networks of the world (albeit among a handful of vendors) and all of the open source NOSes that utterly failed to move the needle. Arista’s Extensible Operating System (EOS) for switches might be based on Linux, but it is not open source any more than Cisco’s NX-OS or IOS are. And in the datacenter, these are the three dominant operating systems, even at some of the hyperscalers and cloud builders.
This battle of the NOS is only just getting started. Nokia entering the datacenter NOS market (which we will cover shortly), Nvidia buying Mellanox and then Cumulus Networks, and Microsoft pushing SONiC/SAI and others getting on board while at the same time buying Metaswitch Networks in May to attack the telco market and build out its 5G software strategy are only round two in this long fight. But in the end, Arrcus will be right, no matter how much share it does or doesn’t get. This moat, this final moat in the datacenter, will be filled in and the tying of hardware and software will not stand.
Provided the software is good, of course, and the right people can tell its story and help drive it.
Garg says that after ArcOS has only been in the field for a little more than a year, it has more than ten revenue-generating customers, with uses in the datacenter for enterprises and cloud service providers as well as edge use cases with telcos and other service providers. By their nature, these are not small deals. The company has more than 50 additional big customers that are evaluating ArcOS and the pipeline of opportunities is in excess of $100 million already. Garg is taking a “land and expand” strategy, starting with a particular use case in the datacenter or at the edge and then working to expand ArcOS usage within existing customers as it also seeks to keep adding new customers. Early customers are already in place for the VDR and MCN capabilities announced this week, which could potentially grow that pipeline as Arrcus gets its footing with these.
With networking being of such strategic importance to large enterprises, hyperscalers, cloud builders, and various service providers, Garg can’t say much about customers to date, but did share this:
Arrcus has been getting faster to support new merchant silicon and can get through the qualification process a lot faster than the incumbent switch makers when there is a big change. (This could get more difficult or less so depending on the nature of the next batch of customers that come in. It’s hard to say.) The key datapoints that jump out here are that the routing performance is blowing the incumbents – mostly Cisco and Juniper in routing – out of the water, and this is also leading customers to move down to the switching network and up to the wide area network and backbone as they get experience with ArcOS. The scale of the control plane, which is distributed and independently scalable from the switching and routing itself, is also a big differentiator for Arrcus, and so is the lower price tag that ArcOS has compared to those incumbent NOSes.
It also helps to have influential friends in high places. As part of this announcement, Arrcus is expanding its technical advisory board with three new members. Sumeet Arora, formerly senior vice president and general manager of the Service Provider business at Cisco and currently SVP and head of engineering at ThoughtSpot, is now helping out. Dave Ward, who is chief executive officer at PacketFabric and who was previously chief technology officer and chief architect of Cisco as well as a distinguished Fellow at both Cisco and Juniper, has joined the Arrcus advisory board. So has Vijay Gill, who is currently senior vice president of engineering and product at Twilio; Gill previously had that same job at Databricks (the commercializer of the Spark in-memory analytics platform) and held senior roles at Microsoft, Google, AOL, and Worldcom implementing their internal networks.
Up next, we will explain why the world needs a virtual distributed router and then follow that up with what it means to create a network stack that can run across multiple clouds that have their own ideas about how to do networking that are not always perfectly aligned with on premises networks and yet which need to be lashed together and taught to behave.