After Long Last, A Commercial-Grade SONiC Network Operating System

It is perplexing to us that the world’s largest distributor of client and server operating systems and also the creator of the Linux-based, open source SONiC network operating system – that would be Microsoft with its Windows and Windows Server franchises – did not see the benefit or the need to commercialize SONiC and lead the open networking revolution.

As Red Hat so aptly demonstrates, there is money in this game and there is first mover advantage.

To be fair, Dell has offered support of a kind for SONiC for two years now, and has more recently rolled up a distribution for selected switches in its own hardware portfolio. But that is not the same thing as creating the Red Hat of NOSes.

SONiC, which was created by Microsoft in 2016 and moved over to the Linux Foundation in 2020, is not the only disaggregated NOS out there. (Disaggregated means the NOS is not created and sold by the maker of the switch or router it runs on, which used to be the case with servers and is still largely the case with switches and routers excepting the hyperscalers and cloud builders, who roll their own or use SONiC.)

Hewlett Packard Enterprise, Dell, and Big Switch Networks all open sourced their NOSes many years ago, back when Cumulus Networks (now part of Nvidia) joined the Open Compute Project and it looked like the Red Hat Linux server story might repeat itself on the switch. But none of these NOSes got much traction, and therefore a commercial model was not viable. (It was the classic chicken and egg problem – there was no support model and no large customer base, so the software could not proliferate.)

Arrcus, which has created its own closed source ArcOS routing and switching software – analogous to the Windows of NOSes – has yet to become ubiquitous, and the company is now focusing on the edge use case, not datacenter switching and routing. (It was contemplating putting a SONiC layer on top of ArcOS, which is beginning to sound like a good idea these days.)

DriveNets, which focuses on routing, is growing but doesn’t support switching. It is not clear what Nvidia has planned for NOSes based on either Mellanox’s MLNX-OS or Cumulus Linux. Arista Networks bought Big Switch Networks, mostly for its network telemetry tools and not to have a companion open source NOS to its own EOS. Cisco Systems is allergic to open source NOSes unless it helps the company win a hyperscaler or cloud deal for its Silicon One switch and router ASICs, and even then it probably breaks out in hives.

It is now several years since SONiC started winning the war of the NOSes, and finally – finally – there is an independent company that has committed to providing a fully open source commercial SONiC distribution with enterprise-grade support. And the company is, absolutely logically, called Hedgehog.

Run, SONiC, Run

Hedgehog was founded in May of this year by Marc Austin, Mike Dvorkin, and Josh Saul. Austin, who is the chief executive officer at Hedgehog, was a scout platoon leader for the 4th Cavalry Regiment in the US Army three decades ago, joined the dot-com boom in the mid-1990s managing Internet shopping networks at IAC and portals for Infoseek – remember those companies? – and started a mobile ride-sharing company called Mobiquity in 2000, a decade before Uber. After the dot-com bust, Austin joined AT&T Wireless and ran its BlackBerry solutions business, did a stint at Amazon commercializing the Kindle in schools and governments, moved to Cisco to manage its IoT strategy, and finally was a managing partner at IoT Capital, a venture firm based in Seattle that (we presume) has invested in Hedgehog.

Dvorkin, who is the company’s chief technology officer, was the system management architect at Nuova Systems, the company that was formed in 2006 to create Cisco’s “California” converged server-switch platforms. After being a distinguished engineer at Cisco for a few years, Dvorkin joined Insieme Networks, the spinout that created Cisco’s Application Centric Infrastructure software-defined networking, launched in 2014 and not exactly taking the world by storm. (Dvorkin says that is more about the way Cisco handled the implementation of SDN than it was about the core tenets of ACI, which was “to bring plain boring switch OS into modern age, where nothing is synchronous, data is shared, and locking is not required,” as he puts it in his LinkedIn profile.)

Saul was a senior network engineer at a number of large enterprises, including GE Capital, Barnes & Noble, and NBC Universal before joining Cisco as a pre-sales systems engineer in 2006. He then was a consulting systems engineer at World Wide Technology, VMware, Cumulus Networks, and Dispersive before joining Apstra, the intent-based networking company that was founded in 2014 and acquired by Juniper Networks six years later.

The inevitability of SONiC is driven by three vectors, according to Saul. The first one is the ease of use of cloud infrastructure: You have a YAML file that represents the entire application, you upload it to a cloud, and it gets all of the resources it needs from the cloud services and just runs. The network is pre-plumbed and you don’t even have to think about it.
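To make that first vector concrete, here is a minimal sketch, in Python with PyYAML, of the kind of single-file, declarative application description Saul is pointing at. It renders a generic Kubernetes-style Deployment; the application name, container image, and replica count are invented for illustration and have nothing to do with Hedgehog specifically.

```python
# A sketch of the "one file describes the whole app" experience. The names,
# image, and replica count below are made up for illustration.
import yaml  # PyYAML

app_manifest = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "demo-app"},
    "spec": {
        "replicas": 3,
        "selector": {"matchLabels": {"app": "demo-app"}},
        "template": {
            "metadata": {"labels": {"app": "demo-app"}},
            "spec": {
                "containers": [
                    {
                        "name": "web",
                        "image": "example.com/demo-app:1.0",
                        "ports": [{"containerPort": 8080}],
                    }
                ]
            },
        },
    },
}

# Hand this one document to the cloud (for example, kubectl apply -f app.yaml)
# and the platform finds the compute and the pre-plumbed network it needs.
print(yaml.safe_dump(app_manifest, sort_keys=False))
```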

The second vector, which is in opposition, is that there are applications that don’t run well in the cloud. Data intensive applications out on the edge can’t ship all of that data back up to a cloud for processing because it would take too long and it would cost a ton of money to move that data around.

The third vector, which we have correctly called datacenter repatriation and that others have called cloud repatriation, is real and it is happening. Saul sums this up succinctly: “Today, you are crazy if you don’t start in the cloud, but at a certain scale, you are crazy if you stay in the cloud because of the high costs.”

When you come to that final point, you can virtualize your compute infrastructure with KVM and Kubernetes, like the hyperscalers and cloud builders do, but what are you going to do about networking? All of this switching gear has proprietary NOSes, with proprietary APIs and tooling, and network engineers still largely work through command line interfaces as if we were living in the computing Bronze Age of the 1970s and 1980s.

Unlike prior open source NOS efforts, the Hedgehog team has identified that not only does the whole NOS have to be open source, but that open source stack must also include a slew of automation to make setting up and running datacenter networks as easy as using the network services on one of the cloud builders. (Microsoft and Alibaba literally already use SONiC, but they don’t expose all of its features to end users, of course.)

In the long run, we think Hedgehog will have to provide remote network management services to customers who want to rely on Hedgehog’s expertise to monitor, secure, and manage their networks better, and that this, above and beyond providing technical support and rolling up patches into SONiC distributions as Red Hat does for Linux, will be the real value. And that is because there just isn’t enough SONiC expertise to go around if hundreds or thousands of enterprises try to adopt Hedgehog’s distribution at the same time.

Dvorkin made no commitments to such a strategy, but he didn’t say it was a wrong idea, either, when The Next Platform talked to him.

There is another factor at work, we think, that will help drive SONiC adoption, too, and one that the hyperscalers and cloud builders have had for more than a decade because they created their own NOSes to run on merchant switch and router ASICs: breaking the proprietary link between a piece of networking hardware and its NOS. If you run Cisco IOS or NX-OS, Arista EOS, or Juniper Junos in production in a world where supply chains are all messed up and switch deliveries are 52 weeks to 75 weeks out into the future, you are dependent on those particular vendors’ switches because you are dependent on their NOSes. If you use SONiC, you can buy any switch that runs SONiC, and there are over 100 of them today, with the number growing fast.

So why did it take so long for a SONiC distribution with enterprise-grade support to come into being? First of all, there is not a lot of appetite among venture capitalists to invest in the software portion of the datacenter switch market. It’s just too small for the big companies, and there have already been quite a few attempts with limited success. But it was eight years between when Linus Torvalds created the Linux kernel and when Red Hat went public, and there was more than two decades of open source and proprietary Unix in academia and then the enterprise that laid the foundation for Linux ahead of that time. It has been less than eight years since Microsoft opened up SONiC and its Switch Abstraction Interface (SAI) underpinnings, which allow it to run across diverse network ASICs.
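To see why SAI matters for breaking that hardware lock-in, here is a toy sketch of the decoupling idea. The real SAI is a C API maintained under the Open Compute Project; the Python class and method names below are invented purely to illustrate how a NOS can program forwarding state against one abstract interface while each ASIC vendor supplies its own implementation behind it.

```python
# Toy illustration of the abstraction idea behind SAI (not the real SAI C API):
# the NOS codes against one interface, the ASIC vendor supplies the driver.
from abc import ABC, abstractmethod


class SwitchAbstraction(ABC):
    """Vendor-neutral contract the NOS programs routes against."""

    @abstractmethod
    def create_route(self, prefix: str, next_hop: str) -> None: ...


class HypotheticalAsicDriver(SwitchAbstraction):
    """Stand-in for a vendor SDK shim; a real one would call the ASIC SDK."""

    def create_route(self, prefix: str, next_hop: str) -> None:
        print(f"programming {prefix} via {next_hop} into the ASIC tables")


def install_routes(asic: SwitchAbstraction) -> None:
    # This NOS-side logic does not change when the underlying ASIC does.
    asic.create_route("10.0.0.0/24", "192.168.1.1")


install_routes(HypotheticalAsicDriver())
```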

Dvorkin’s explanation as to why now is the time to commercialize SONiC, and why it will work now, makes perfect sense to us:

“What we have learned from the failure of Cumulus and others is that you need to have a platform transition to go successfully against Cisco,” Dvorkin explains. “Cumulus had really wonderful ideas. They built this thing for Amazon that was pure Layer 3, and then they go back to the enterprise customers, who want them to add in MLAG and all sorts of Layer 2 madness. And that basically pushes them into competing with Cisco and Arista because VMware was the platform and Layer 2 was driving everything. But now, there is a platform shift and all new applications coalesce around Kubernetes, which is, again, all Layer 3 stuff. It doesn’t have Layer 2 stuff. Now a lot of the value that Cisco and Arista have in their switches and network operating systems no longer applies. And the people who deploy the Kubernetes stack, they care about open source, they do not want any proprietary stuff, and they want the networking to fit into the rest of the operational stuff that they already have, such as Prometheus, Grafana, Elasticsearch, Kibana, and so on. For us to show that open networking is possible, it’s not just like we drop Hedgehog on GitHub and say knock yourself out. We need to provide the experience where SONiC is consumable and usable in a prescribed way.”
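As a concrete example of fitting networking into “the operational stuff that they already have,” here is a minimal sketch of exporting switch counters to a Prometheus and Grafana stack with the prometheus_client Python library. The metric name, labels, and values are invented for illustration; this is not Hedgehog’s or SONiC’s actual telemetry tooling.

```python
# Sketch of a switch telemetry exporter that Prometheus can scrape. The metric
# name and the values are invented; a real exporter would read SONiC counters
# (for example, from its Redis state database) instead of random numbers.
import random
import time

from prometheus_client import Gauge, start_http_server

PORT_RX_BYTES = Gauge(
    "switch_port_rx_bytes",
    "Received bytes per switch port",
    ["switch", "port"],
)

if __name__ == "__main__":
    start_http_server(9100)  # Prometheus scrapes http://host:9100/metrics
    while True:
        for port in ("Ethernet0", "Ethernet4"):
            PORT_RX_BYTES.labels(switch="leaf-01", port=port).set(
                random.random() * 1e9
            )
        time.sleep(15)
```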

And that is the plan with the Hedgehog NOS. The pricing details are still being worked out, but the idea is to charge not by the switch, but by the node count on the Kubernetes clusters. And this, says Dvorkin, will work because you are selling SONiC to the cloud architects, and speaking their language, instead of to the network team that is used to Cisco, Arista, et al. And with this pricing model, you don’t have to care about how many switches it takes to support your Kubernetes clusters. There is no nickeling and diming as happens with proprietary NOS features on switches. (Well, it is more like $5,000-ing and $10,000-ing, to be honest.)
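A back-of-the-envelope comparison shows why per-node pricing is easier for a cloud architect to reason about than per-switch pricing. Every number below is hypothetical, since Hedgehog has not published its prices; the point is only that the bill tracks the Kubernetes node count the architect already knows, and does not move if the fabric needs a couple more switches.

```python
# Hypothetical comparison of per-switch and per-node pricing. All figures are
# invented for illustration; Hedgehog's actual pricing is not yet public.
switches = 8                 # leaf/spine boxes backing the cluster
k8s_nodes = 96               # worker nodes in the Kubernetes cluster

per_switch_license = 10_000  # hypothetical proprietary NOS license per switch
per_node_subscription = 500  # hypothetical subscription per Kubernetes node

print("per-switch model:", switches * per_switch_license)      # 80,000
print("per-node model:  ", k8s_nodes * per_node_subscription)  # 48,000
```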

The Hedgehog distribution of SONiC will be in early field trials by the end of the year, and around Q1 2023 the Hedgehog automation features that layer on top of it will be out.

It will be truly funny if Microsoft someday buys Hedgehog and completes the circle.


2 Comments

  1. You are kidding yourself if you think the VM is going away. Pensando, acquired by AMD for $1.9B, with its SmartNIC, is partnered with VMware – more startups or enterprises will partner with VMware. Also, if a new useful protocol comes about, it is Cisco who implements the protocol first, e.g., LISP routing, so don’t give up on Cisco yet. SONiC is suited for web-scale data centers; I am not sure it is for the enterprise. If Nvidia is listening, they should add Layer 2 features to Cumulus and sell to the enterprise. Lastly, there is the telco data center, which uses proprietary NOSes from vendors – maybe Stratum will take off in telco now that Intel owns the team that developed Stratum: https://github.com/stratum/stratum

  2. Interesting proposition. Obviously, you are assuming a lot here: Kubernetes wipes out VMware? And therefore Layer 2 will not be needed, and since all your NOS connections will be high-end, there is no need for Layer 2 bundling.
    I see the value in hyperscale web services; all your proposed technologies fall in line with your hopeful outcome. I don’t see this happening in the enterprise. But then maybe you don’t need to address the enterprise market. We shall see how this plays out.
