Making Datacenter Networking as Consumable as Compute

It’s a cliché that the pandemic has changed the way we work forever, but it has certainly turned the spotlight on technologies, processes and practices that have reached the end of their useful life.

It has also reinforced the central role of the network, with data center and cloud networks experiencing tremendous traffic demands as newly home-based workers and students piled onto their platforms.

“How we bring up the network, how we debug it, how we make sure those outages are minimised, all of that became so much more important, because we didn’t have the fallback of ‘you can always do a truck roll or replace a switch’,” says Nokia senior director of product management, Bruce Wallis. 

In a sense, this makes data centers even more mission critical, he adds, because when operators do hit problems “the impact is an order of magnitude higher now, because their ability to react to it is lessened through the lack of resources.” The key resource in question is trained engineering personnel on the ground.

And yet the way the industry approaches the care and feeding of data center networking has changed little in the last couple of decades. Traditional network vendors have continued to push proprietary data center operating systems and black boxes which may expose some – but not all – of their workings, and which leave customers little choice as to which management tools they can use.

There’s an argument that this slow-moving approach simply reflects the importance of stability. The flipside is that it constrains more daring or innovative organizations from choosing other protocols or tool sets, or even building their own. In the worst case, trying to replace any element of a proprietary stack or fabric can result in customers being penalized by their vendors, which limits operators’ flexibility in managing workflows and processes.

This is in stark contrast to what has happened in compute, storage and software development, where virtualization, automation, open standards, DevOps and CI/CD have come together to allow self-service deployment, improved resilience and vastly accelerated delivery.

No more black magic box

That said, hyperscalers such as Facebook and the main cloud providers have been able to automate large chunks of their operations. What does this mean for everyone else? Well, you can benefit from their innovation, but only by moving onto their cloud platforms because, as Wallis explains, they haven’t rushed to open up their own solutions to benefit mainstream enterprises or service providers who need to run their own data centers.

Nokia’s response has been to build a Network Operating System (NOS) that is open by default: Service Router Linux (SR Linux). It forms part of Nokia’s Data Center Switching Fabric, alongside the Nokia Fabric Services System and Nokia’s switching hardware platforms.

Customers can choose to take the entire stack as a turnkey solution. But, as the Linux moniker suggests, the platform is also designed to be open, giving customers the opportunity to turn to third parties for specific elements or simply create their own.

The echoes of what is happening in mainstream software aren’t hard to pick up, though Wallis is wary of using the term “microservices”. He says the aim is to break down the NOS into modular pieces, “each of those pieces being its own functional block”, exposing their APIs and data models.

“So, the box still looks and feels like a single monolithic appliance to a northbound system, if the operating model requires that. Representing a group of functions as a single managed element has undeniable benefits, but under the hood the services making up the network – the protocols, network-instances and so on – are modular, and able to add their schemas to the system-wide schema exposed northbound.” The aim is to make the functions “as modular and decomposable at a software level as possible.”

This also gives networking folks the freedom to overhaul their processes and workflows, potentially following the sort of patterns that DevOps has opened up to software developers.

“The general idea of breaking things into modular pieces and driving change of those pieces through CI/CD, including deployment into production, and doing canaries or staggered rollouts, and all that good stuff, you’ll start to see in networking,” Wallis predicts.
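
To make the parallel concrete, a staggered rollout in that world might look something like the Python sketch below: push a candidate config to a couple of canary switches, check their health, and only then widen the blast radius. The deploy and health-check functions, switch names and wave sizes are invented placeholders for whatever push and verification mechanisms a given pipeline uses; nothing here is specific to SR Linux.

```python
# Sketch of a staggered (canary-style) config rollout across a fabric.
# deploy() and healthy() are invented placeholders for a real push mechanism
# (gNMI Set, NETCONF, a controller API) and post-change verification checks.

def deploy(switch: str, config: str) -> None:
    print(f"pushing candidate config to {switch}")  # stand-in for the real push

def healthy(switch: str) -> bool:
    return True  # stand-in for post-change telemetry and health checks

def staggered_rollout(switches: list[str], config: str) -> bool:
    # Canary pair first, then a small wave, then everything else.
    waves = [switches[:2], switches[2:12], switches[12:]]
    for wave in waves:
        for switch in wave:
            deploy(switch, config)
        if not all(healthy(s) for s in wave):
            print("health check failed - halting rollout for human review")
            return False
    return True

# Hypothetical fabric of 62 leaf switches.
staggered_rollout([f"leaf{i}" for i in range(1, 63)], "candidate-config")
```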

Automation for the network people

Of course, he adds, there are limits: “In a microservices world, if I have ten different endpoints, I don’t care if two of them are offline. However, if you have racks offline because your switches are down, because you’re doing upgrades, or you’re doing CI/CD and something goes wrong, you’re getting screamed at.”

Nevertheless, he says, network engineers are perfectly capable of writing Python code and producing applications that can improve their workflows. “What we’re trying to do is give people that have really cool ideas – for how they would automate or help drive change in their own environment to make their life easier – all the tools they need to do that.”

From Nokia’s perspective, “rather than implementing a management stack 20 times for 20 different applications, we implement it once and we provide clean APIs to it for those applications to use. So, our BGP (border gateway protocol) stack uses the same APIs that we would expose to a customer to be managed. This means a customer could take out our BGP, if they wanted to, and stick in their own.”

In parallel to this, Nokia has taken an open approach to telemetry for the platform. “We provide a gNMI interface for all applications in the system. You just have to give us your data model and publish data to that data model yourself, and we’ll handle on-change telemetry for you.”
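
For a feel of what consuming that looks like from the operator’s side, the sketch below subscribes to on-change telemetry over gNMI using the open-source pygnmi client. The switch address, credentials and path are invented for illustration, and the subscription call is an assumption that may vary between pygnmi versions, so treat it as an outline rather than a recipe.

```python
# Sketch: subscribe to on-change telemetry over gNMI with the pygnmi client.
# The target address, credentials and path below are illustrative placeholders,
# and the subscription API may differ between pygnmi versions.
from pygnmi.client import gNMIclient

TARGET = ("leaf1.example.net", 57400)  # hypothetical switch and gNMI port

subscription = {
    "subscription": [
        {
            # Watch interface operational state; the path syntax depends on
            # the YANG models the target exposes.
            "path": "/interface[name=*]/oper-state",
            "mode": "on_change",
        }
    ],
    "mode": "stream",
    "encoding": "json",
}

with gNMIclient(target=TARGET, username="admin", password="admin", insecure=True) as gc:
    # subscribe2() yields notifications as they arrive; in practice these would
    # be fed into a time-series database or an alerting pipeline.
    for notification in gc.subscribe2(subscribe=subscription):
        print(notification)
```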

And telemetry is crucial to data center networking automation, says Wallis: “In today’s operating model, the level of granularity we’re getting out of the network isn’t enough that we have confidence that we can drive it using machines.”

This means the reality of ongoing management remains the status quo: an operator sitting in a network operations center surrounded by alarm screens. That individual is effectively “sitting at the end of that stream of consciousness and is having to make a decision for every event.”

By opening up telemetry and giving operators the freedom to choose their own tooling, Wallis believes that will change: “I think we’re going to see operations head in the direction of starting to identify more and more patterns in the network to do with outages… We’re going to start to see those remediation pipelines be used.”

This will begin to close the gap between the vendor’s perception of what an acceptable level of error on a link might be and the operator’s experience of what might be a more serious problem – without having to check every error manually.

“But you can’t do any of that unless you have the underlying infrastructure to give you the information,” says Wallis. “All it really means is that I’m getting information at the rate I need it to make decisions.”
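
As a rough illustration of how that gap might be closed, the sketch below applies an operator-defined error budget to streamed interface counters and flags only the links that breach it. The counter source, interface names and threshold are invented placeholders; the point is simply that the policy lives with the operator rather than the vendor.

```python
# Sketch: apply an operator-defined error budget to streamed interface counters.
# The counters would arrive from a telemetry subscription; here they are passed
# in as plain dicts, and the names and threshold are illustrative only.

ERROR_THRESHOLD_PER_MIN = 50  # operator-chosen policy, not a vendor default
_last_seen: dict[str, int] = {}

def breaches(counters: dict[str, int]) -> list[tuple[str, int]]:
    """Compare successive in-error counters and return links breaching the policy."""
    flagged = []
    for interface, in_errors in counters.items():
        delta = in_errors - _last_seen.get(interface, in_errors)  # first sample sets the baseline
        _last_seen[interface] = in_errors
        if delta > ERROR_THRESHOLD_PER_MIN:
            flagged.append((interface, delta))
    return flagged

# Two successive one-minute samples from a pair of hypothetical leaf ports.
breaches({"ethernet-1/1": 100, "ethernet-1/2": 3})  # baseline, nothing flagged
for interface, delta in breaches({"ethernet-1/1": 900, "ethernet-1/2": 4}):
    # In a real pipeline this would trigger a remediation playbook or open a ticket.
    print(f"{interface}: {delta} new input errors in the last minute - escalate")
```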

Maximizing flexibility 

“SR Linux is the foundation to all of this,” he says. “You need the high-speed telemetry, you need the extensions, you need everything to be modular, not monolithic. Customers need the flexibility to take and leave what they want, they need the ability to add what they want.”

Nokia’s customers are already putting these principles into practice, he says. One early customer, for example, has produced five workflow optimizations for its SR Linux-based platform.

“To take a really simple example, they have a little tiny application that sits there and monitors telemetry for a config change. When that happens, it just does a Git add and a Git commit and a Git push. So it’s taking the config on the box and is actually managing it using Git,” he explains. The application publishes the last time it successfully pushed the config to the Git repo, meaning the agent’s work can itself be verified via gNMI.

“So they wrote a simple application that does that. So now all of their configuration is centralized. It’s all sitting in a Git repo somewhere. It is version controlled, because Git is giving them that.”
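
A minimal version of such an agent might look something like the sketch below, which writes each new config snapshot into a local clone of a Git repo and pushes it upstream. The watcher function, file paths, branch and remote names are invented for illustration rather than taken from the customer’s actual implementation.

```python
# Sketch: commit and push the running config to a Git repo on every config change.
# watch_config_changes() stands in for an on-change telemetry subscription; the
# repo path, file name, remote and branch are hypothetical.
import subprocess
from datetime import datetime, timezone

REPO_DIR = "/var/lib/config-backup"  # hypothetical local clone of the config repo
CONFIG_FILE = "leaf1.json"           # hypothetical per-switch config snapshot

def git(*args: str) -> None:
    """Run a git command inside the backup repo, raising if it fails."""
    subprocess.run(["git", "-C", REPO_DIR, *args], check=True)

def watch_config_changes():
    """Placeholder for a subscription to config-change events; yields one canned
    snapshot so the sketch runs end to end."""
    yield '{"interface": {"ethernet-1/1": {"admin-state": "enable"}}}'

def commit_config(config_text: str) -> str:
    """Write the latest snapshot, commit it and push it upstream."""
    with open(f"{REPO_DIR}/{CONFIG_FILE}", "w") as f:
        f.write(config_text)
    stamp = datetime.now(timezone.utc).isoformat()
    git("add", CONFIG_FILE)
    git("commit", "-m", f"config change at {stamp}")
    git("push", "origin", "main")
    return stamp  # publish this via telemetry so the push itself can be verified

for new_config in watch_config_changes():
    print(f"config pushed to Git at {commit_config(new_config)}")
```

Publishing the timestamp of the last successful push back into telemetry, as the customer does, closes the loop: the same stream that triggers the backup can confirm it happened.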

This may sound like a minor incremental improvement, but as Wallis points out, the cumulative effect is enormous because of the amount of manual work it potentially eradicates.

“We think that data centers are small, because they’re very dense, or we think of them as large but not as large as a global network. [But] data centers from a node standpoint are an order of magnitude larger than the global internet,” he explains.

“It’s a different kind of scale that people aren’t typically used to, and if you write a small process and all it does is look for that one specific issue, and apply some remediated fix, so someone doesn’t get a phone call in the middle of the night… times that by two thousand network elements and that’s a huge impact on your day-to-day work.”

Sponsored by Nokia