Microsoft doesn’t just love running Linux workloads on the Azure cloud. It is running Linux workloads to make the Azure cloud. Specifically, Microsoft has created its own variant of Linux to run on homegrown switches – just like its hyperscale peers at Google, Amazon, and Facebook. Embracing Linux as a switch platform is obviously the easiest thing to do, and given that, it is not much of a surprise that Microsoft has Linux embedded inside its Azure Cloud Switch.
Back at the Open Compute Summit in March, The Next Platform sat down to have a chat about networking and server infrastructure in the Azure cloud at Microsoft. Bill Laing, corporate vice president for the Cloud and Enterprise Division at Microsoft, and Kushagra Vaid, its general manager of server engineering, had a wide-ranging conversation about all kinds of things. When the conversation turned to the Switch Abstraction Interface that Microsoft had created to provide consistent switch management across multiple vendors and switch ASICs, which the company donated to the Open Compute Project, we quipped that it seemed very unlikely that Microsoft would take the Windows kernel and make its own switch operating system.
Laing and Vaid didn’t laugh, but Laing did say that “the problem is that people have all of this functionality that you no longer want in the switch and you want software automation.”
Given Azure’s scale, both for public cloud computing and for supporting Microsoft’s consumer and business apps like Office 365, Bing, and Xbox Live, it seemed likely that Microsoft would embrace open networking similar to Facebook’s Wedge top-of-rack and 6-pack modular switches. Not just for the cheaper networking iron, but because open switches would allow Microsoft to support just the protocols and features that Azure needs and not one bit more. This is why Google created its own network hardware and software stack starting a decade ago, and why Facebook is following in its footsteps now.
So it was logical to expect that Microsoft was, in fact, hacking together its own network operating system. We now know, thanks to a blog post by Kamala Subramaniam, principal architect for Azure networking, that Microsoft has developed a Linux-based network operating system called Azure Cloud Switch.
“ACS allows us to debug, fix, and test software bugs much faster,” writes Subramaniam. “It also allows us the flexibility to scale down the software and develop features that are required for our datacenter and our networking needs. ACS also allows us to share the same software stack across hardware from multiple switch vendors.” This latter bit is done in conjunction with SAI, which Microsoft crafted at the end of last year, which switch ASIC vendors Broadcom, Mellanox Technologies, and Cavium Networks are lined up behind, and which has been accepted into the OCP as of July.
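To make the idea of one software stack running across several vendors' ASICs concrete, here is a minimal sketch in Python. It is not the real SAI API, which is a C interface; the class names, the `create_route` method, and the vendor command strings are all hypothetical and exist only to show the shape of the abstraction.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class SwitchAsic(ABC):
    """Hypothetical vendor-neutral interface (illustrative, not real SAI)."""

    @abstractmethod
    def create_route(self, prefix: str, next_hop: str) -> str:
        """Return the vendor-specific command that installs one route."""


class VendorAAsic(SwitchAsic):
    # Hypothetical vendor A: invented command syntax.
    def create_route(self, prefix: str, next_hop: str) -> str:
        return f"vendor-a: add {prefix} via {next_hop}"


class VendorBAsic(SwitchAsic):
    # Hypothetical vendor B: different invented command syntax.
    def create_route(self, prefix: str, next_hop: str) -> str:
        return f"vendor-b: route-install {prefix} nh={next_hop}"


def program_routes(asic: SwitchAsic, routes: Dict[str, str]) -> List[str]:
    """Upper-layer code: identical no matter which ASIC sits underneath."""
    return [asic.create_route(prefix, hop) for prefix, hop in routes.items()]


routes = {"10.0.0.0/24": "192.168.1.1"}
commands_a = program_routes(VendorAAsic(), routes)
commands_b = program_routes(VendorBAsic(), routes)
```

The point of the sketch is that `program_routes` never changes when the hardware does; only the thin vendor class underneath it is swapped out.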
Subramaniam says that the Azure Cloud Switch software is lean, making it easier to fix and then test fixes to network issues and to do so faster than “the current run rate,” presumably with switches from vendors like Cisco Systems, which have closed switch operating systems with every protocol under the sun embedded in them. The Microsoft switch OS is also modular, she says, which makes it easier to add features to it without impacting the entire OS software stack. Azure Cloud Switch integrates with Microsoft’s monitoring and diagnostic tools for Azure, which no doubt includes its Autopilot cluster and network manager and provisioning system and probably a slew of other things that feed into it. “By deviating from the traditional enterprise interactive model of command line interfaces, it allows for switches to be managed just as servers are with weekly software rollouts and roll backs thus ensuring a mature configuration and deployment model,” says Subramaniam.
Within the Azure Cloud Switch stack, applications include things like the Quagga network routing software or the pieces that hook into Autopilot. The switch state service inside the Microsoft switch has a subset of the global network state and is backed by a Redis key-value store that has broader network state and is used to shape traffic and get each switch – and ultimately the entire Azure network – to a desired state of operation as conditions change on the network. As they always do. All of this rides on top of the SAI layer, which in turn runs on top of the ASICs. To a certain way of looking at it, SAI is a hypervisor for switch ASICs.
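The desired-state idea can be sketched in a few lines of Python. This is only an illustration under loud assumptions: plain dictionaries stand in for the Redis-backed store and the switch's local state, and the key names, the `reconcile` function, and the `apply_plan` function are hypothetical. Microsoft has not published how its switch state service actually does this.

```python
from typing import Dict, List


def reconcile(desired: Dict[str, str], actual: Dict[str, str]) -> Dict[str, List[str]]:
    """Work out which keys must be written or removed to reach `desired`."""
    # Keys that are missing locally or hold a stale value.
    to_set = sorted(k for k, v in desired.items() if actual.get(k) != v)
    # Keys present locally that the desired state no longer contains.
    to_remove = sorted(k for k in actual if k not in desired)
    return {"set": to_set, "remove": to_remove}


def apply_plan(desired: Dict[str, str], actual: Dict[str, str],
               plan: Dict[str, List[str]]) -> None:
    """Mutate `actual` in place so that it matches `desired`."""
    for key in plan["set"]:
        actual[key] = desired[key]
    for key in plan["remove"]:
        del actual[key]


# Hypothetical state: the dict stands in for the key-value store.
desired_state = {"route:10.0.0.0/24": "port1", "route:10.0.1.0/24": "port2"}
switch_state = {"route:10.0.0.0/24": "port3", "route:10.0.9.0/24": "port4"}

plan = reconcile(desired_state, switch_state)
apply_plan(desired_state, switch_state, plan)
```

Running `reconcile` again after `apply_plan` yields an empty plan, which is the property that lets a controller rerun the loop whenever conditions on the network change.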
Microsoft dabbled with InfiniBand for some HPC use cases and behind Bing Maps, but Laing told The Next Platform back in March that the company was embracing Ethernet for the majority of its networking because the RDMA over Converged Ethernet (RoCE) protocol had significantly closed the gap between InfiniBand and Ethernet and that it was easier to have one protocol rather than two in the Azure network. And it will now be easier to have one network operating system, too.
The one thing that Microsoft will not have to do – but which it might do anyway – is open source the Azure Cloud Switch software. Speaking very generally, the way Linux licensing works, if you create modifications of the stack for internal use, you can keep them to yourself – as Google, Amazon, and so forth certainly do with their internal Linux distros. Facebook has open sourced its FBOSS open switching system, but this is really more a set of applications riding atop a Linux-based open networking operating system, much like the application layer in Microsoft’s Azure Cloud Switch stack.