New Network Architecture Bridges Supercomputer, Cloud Divides

As we have discussed at length here at The Next Platform, the dividing lines between high performance computing datacenters and their cloud or web-scale compatriots are blurring, fed in part by increasing data sizes and the speeds at which data must be moved.

While there is still no singular, common platform that can handle requirements evenly from both ends of the high-end computing spectrum, there are efforts underway to tackle core elements of the stack, ranging from the data-centric approach from IBM to Intel’s push for a common framework that blends the HPC, large-scale data analytics, and web-scale worlds. There are others, all of which are seeking to reinvent the stack to fit a common base of needs at the top tier of computing – cloud operators, supercomputers, and web-scale or large enterprise datacenters.

This push certainly includes the network layer, where the drive is to create a balanced performance and scalability profile. At its simplest, for cloud operators at scale, the concerns of performance are too often removed in favor of efficiency and the various concerns around multi-tenancy. And at the extreme scale of computing, that blazing fast performance is achieved by sacrificing the efficiencies and application generalization structures for faster, larger simulations.

Bill Dally, chief scientist and senior vice president of research at GPU maker Nvidia as well as a professor at Stanford University, is working with fellow Stanford researcher Nicholas McDonald on a new distributed system architecture for secure, performance-aware computing called Sikker. The framework is meant to fit the needs of these diverse computing areas, at the network layer in particular.

Named after the Danish word for “safe,” the proposed Sikker architecture pulls from the best of high performance computing and matches it with the newer generation of hyperscale computing requirements. As the pair describe in detail, “current network technologies are unable to simultaneously provide high performance network access and robust application isolation and security. As a result, system designers and application developers are forced into making tradeoffs between these requirements.”

Under a single administrative domain, Sikker is based on a service-oriented security and isolation approach coupled with a network interface card (NIC) that the creators call the Network Management Unit (NMU), which serves as a traffic cop while emphasizing access and performance capabilities. In early experiments, Dally and McDonald show how this approach provides the “complex interaction policies of modern large-scale distributed applications” and how, even when tested against very large systems with highly demanding access patterns, the architecture can deliver message latency in the 52 nanosecond range on average and 66 nanoseconds at the 99th percentile. For smaller clusters and less demanding access patterns, they found network performance to be in the 35 to 45 nanosecond range.

“The unfortunate truth is that modern network technologies have not provided distributed systems that are capable of supercomputer-like network performance while simultaneously providing robust application security and isolation.”

As with any approach that seeks to bring down latency while keeping both performance and security intact, one might expect that it is still a matter of passing the bottleneck buck. But they note that Sikker's service-oriented security and isolation method removes the expensive overhead of software-based implementations and allows applications to operate in a secure environment with network performance that is equivalent to that of supercomputers.

The NMU, which is central to Sikker, is the enforcement agent for the permissions scheme Dally and McDonald outline. As they note, “working under the direction of a network operating system, the NMU provides network isolation through enforcing permissions at the sender and provides security through its inherent implementation of the principle of least privilege as well as source and destination authentication.”
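To make the sender-side enforcement idea concrete, here is a minimal Python sketch. It is only a toy model, not the NMU's actual design: the `ToyNMU` class, its `grant` and `send` methods, and the service names are all hypothetical, and the real NMU is hardware working under a network operating system. The sketch shows two of the properties described above: a message that was never granted permission is refused before it leaves the sender, and the source identity is stamped by the enforcement point rather than supplied by the application.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNMU:
    """Hypothetical sender-side enforcement point for one source service."""

    service: str
    # Permissions handed down by an assumed network OS:
    # the set of (destination service, port) pairs this sender may reach.
    allowed: set = field(default_factory=set)

    def grant(self, dest_service, port):
        """Record that this sender may reach one port of one service."""
        self.allowed.add((dest_service, port))

    def send(self, dest_service, port, payload):
        """Return the message to put on the wire, or None if refused."""
        # Least privilege: anything not explicitly granted is dropped
        # at the sender, before it ever reaches the network.
        if (dest_service, port) not in self.allowed:
            return None
        # The enforcement point, not the application, fills in the source,
        # so a receiver can trust the "src" field.
        return {
            "src": self.service,
            "dst": dest_service,
            "port": port,
            "payload": payload,
        }


nmu = ToyNMU("frontend")
nmu.grant("kvstore", 1)
permitted = nmu.send("kvstore", 1, b"get")
refused = nmu.send("kvstore", 2, b"put")
```

In this toy run, `permitted` is a message dictionary whose source is `"frontend"`, while `refused` is `None` because port 2 of the hypothetical `kvstore` service was never granted.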

Dally and McDonald note that even when compared to top-ranked supercomputers, what they have demonstrated reflects only a slight amount of extra overhead for network transactions. The goal is to create a new way of blending the best of all areas of scalable computing with the same secure, multitenant capabilities and same application performance from the network layer.

Aside from the NMU, several other design choices account for the latencies they demonstrate, and several more contribute to security while leveraging the best of supercomputing and cloud models alike. For instance, consider the well-known tactic from supercomputing of bypassing the operating system for network performance, which lowers CPU overhead. The tradeoff is that bypassing the kernel also removes the kernel’s ability to manage outgoing network traffic, something that in a cloud environment would open a security hole.

Instead of looking at this as a problem, Dally and McDonald exploit this feature that adds to high network performance in supercomputers by using a “best of cloud worlds” approach that taps network isolation mechanisms. At the application level, Sikker has a unique way of handling APIs and ports where each port represents a piece of a service’s functionality with respect to specific permissions. “Unlike TCP/UDP ports, the ports in Sikker are not used for multiplexing, are not shared, and are only used to specify a destination. Each service has its own port number space, thus, two services using the same port ID is acceptable.”
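The per-service port numbering described in the quote can be illustrated with a short Python sketch. This is an assumption-laden toy, not Sikker's implementation: the `PortRegistry` class, its methods, and the service and handler names are invented for illustration. The point it shows is that a port is identified by the pair (service, port ID), so two services can reuse the same port ID without conflict, while a port within one service maps to exactly one piece of functionality.

```python
class PortRegistry:
    """Hypothetical directory of ports, keyed by (service, port ID)."""

    def __init__(self):
        # Each service has its own port number space, so the key is
        # the pair (service, port_id) rather than the port ID alone.
        self._ports = {}

    def register(self, service, port_id, functionality):
        """Bind one port of a service to one piece of its functionality."""
        key = (service, port_id)
        # Ports are not shared: a port ID names one thing within a service.
        if key in self._ports:
            raise ValueError(f"port {port_id} already defined for {service}")
        self._ports[key] = functionality

    def resolve(self, service, port_id):
        """Look up what a destination (service, port ID) refers to."""
        return self._ports[(service, port_id)]


registry = PortRegistry()
registry.register("kvstore", 1, "get")
registry.register("logger", 1, "append")  # same port ID, different service
```

Here `registry.resolve("kvstore", 1)` and `registry.resolve("logger", 1)` return different functionality even though both use port ID 1, whereas registering port 1 of `kvstore` a second time raises an error.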

Ultimately, the Stanford duo say this will support the next generation of large-scale, multi-tenant computing platforms where architects and developers alike will be able to access remote data quickly without adding overhead at the system level—or in terms of running through time-consuming security checks that still do not eliminate the possibility of error.

It generally takes a while for research to be commercialized, if it happens at all. But given the low latencies and security that Sikker is showing, one could hope that it will be brought to market a bit faster than OpenFlow software-defined networking, which traces its roots back to work done at Stanford starting in 2006.
