When it comes to machine learning training, people tend to focus on the compute. We always want to know if the training is being done on specialized parallel X86 devices, like Intel’s Xeon Phi, or on massively parallel GPU devices, like Nvidia’s “Pascal” and “Volta” accelerators, or even on custom devices from the likes of Nervana Systems (now part of Intel), Wave Systems, Graphcore, Google, or Fujitsu.
But as is the case with other kinds of high performance computing, the network matters when it comes to machine learning, and it can be the differentiating factor in getting good performance out of whatever compute is acquired to run a machine learning framework on a cluster.
No one is going to suggest that the network is the computer – that was a very loose but useful metaphor back in the dot-com days. But when it comes to machine learning, the network certainly can be a coprocessor, much as it has been in InfiniBand networks for traditional HPC simulation and modeling applications. It helps to have an architecture that has useful math units and is programmable, not hard wired down to the transistors. Barefoot Networks, thanks to its Tofino switch ASICs and its P4 programming language, has both, and it is leveraging these capabilities to accelerate machine learning training.
“I would like to say that we had the foresight to add machine learning algorithms to the switch, but we did not,” concedes Ed Doe, vice president of product, business, and strategy, to The Next Platform. “And that is the beauty of it. These machine learning routines are just another P4 application, and we do not have to build specialized engines for these functions.”
Putting More Network Into Neural Nets
The reason you want a programmable network in the first place is so the network can adapt to different workloads that share the network and that change over time. The network – meaning the physical infrastructure for switching, not the neural network kind – can provide what Barefoot Networks is calling computational networking. First and foremost, the network is privy to all traffic that goes from one node to another, and not only can it see bottlenecks in the sharing of data, it can also process data and transmit it in a multicast fashion. A switch ASIC is designed to process billions of operations per second, so it has the compute capability and the bandwidth to help with machine learning algorithms.
“Machine learning training is incredibly network intensive,” explains Doe. “You are throwing vast amounts of data, sharding it and sending it to a lot of different compute elements for training the model, and then all of the different training engines need to coordinate their neural network weights and resynchronize with each other to get a cumulative benefit. That resynchronization is a performance bottleneck. We have found that you can operate on this data directly in the network, so instead of having each engine passing around data to do the averaging on the cluster, you can send it all to the common network node, have it do the operations, and then share that result with all of the nodes in a training network. By doing this, you have taken an N squared problem down to linear. This accelerates training workloads.”
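Doe's point about collapsing an N squared exchange down to a linear one can be sketched in a few lines. The Python below is only a toy model of the message-count arithmetic and of the switch-side averaging operation he describes; the real implementation runs as a P4 program on the switch ASIC, and the function names here are invented for illustration.

```python
# Toy model of the resynchronization step Doe describes; illustrative
# only, not Barefoot's P4 code.

def all_pairs_messages(n: int) -> int:
    # Every node sends its weights to every other node: O(n^2).
    return n * (n - 1)

def in_network_messages(n: int) -> int:
    # Every node sends once to the switch, which multicasts the
    # averaged result back out: O(n).
    return 2 * n

def switch_average(shards: list[list[float]]) -> list[float]:
    # The switch-side operation: accumulate incoming weight vectors
    # elementwise, then divide by the node count before multicasting.
    n = len(shards)
    total = [0.0] * len(shards[0])
    for shard in shards:
        for i, w in enumerate(shard):
            total[i] += w
    return [t / n for t in total]

print(all_pairs_messages(64), in_network_messages(64))  # 4032 128
print(switch_average([[1.0, 2.0], [3.0, 6.0]]))         # [2.0, 4.0]
```

At 64 nodes the all-pairs exchange costs 4,032 messages against 128 through the switch, and the gap widens quadratically as the cluster grows.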
As it turns out, there are many different ways to exchange the neural network weights, and different training algorithms use different exponent and mantissa widths for their weights, or different numbers of bits for floating point or integer data. Sometimes the training algorithms use moving averages, sometimes they want actual averages, and sometimes they want accumulate operations. So you need a lot of flexibility in the math, and that is why, with machine learning training moving so fast, it can be risky to lay all of those functions down as etchings in silicon.
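That menu of reduction semantics – true averages, moving averages, accumulates – can be sketched in plain Python. These are illustrative stand-ins only; a real deployment would express them as P4 actions on the switch, and the alpha decay value is an arbitrary choice for the example.

```python
# Three reduction semantics a training framework might ask of the
# network; pure-Python sketches for illustration only.

def true_average(values: list[float]) -> float:
    return sum(values) / len(values)

def accumulate(values: list[float]) -> float:
    return sum(values)

def moving_average(prev: float, new: float, alpha: float = 0.9) -> float:
    # Exponentially weighted: old state decays, new sample blends in.
    return alpha * prev + (1.0 - alpha) * new

grads = [0.25, 0.5, 0.75, 1.0]
print(true_average(grads))   # 0.625
print(accumulate(grads))     # 2.5
ema = 0.0
for g in grads:
    ema = moving_average(ema, g)
print(round(ema, 4))
```

The point is that these are three different arithmetic pipelines over the same stream of weights, which is exactly the kind of variation that is awkward to hard-code into silicon but trivial to reprogram.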
Barefoot Networks has not embedded a special CPU or AI processor in the Tofino ASIC to help goose machine learning training algorithms. As we explained back in June 2016 when Barefoot Networks dropped out of stealth mode with its Protocol Independent Switch Architecture, or PISA, circuit components, the Tofino chip has a kind of RISC design with very simple components that are aggregated to run the operations that are usually etched into more CISC-like portions of a typical switch ASIC. (That is an analogy with the transition that happened with CPUs in the 1990s, not something that should be taken literally.) But the fact remains, as we have said, that this is a very powerful idea. Rather than take a relatively static switch ASIC and mix it with an X86 processor, you make the switch ASIC itself a much more general and programmable device, and that makes the Ethernet protocol – and now some routines for averaging weights across a neural network training cluster – just another programmable data stream.
The atomic building block of the PISA architecture is the match action unit, or MAU, which, as the name suggests, can match on data as it comes in over the network and then perform some action on it: transforming the data based on the match, adding data to it based on lookup operations, or applying some other function. The same MAUs that process header information in a network packet can do operations on the data within a packet.
The MAU is analogous to the arithmetic logic units (ALUs) that are embedded in digital signal processors (DSPs). In a sense, the P4 programming language and runtime is akin to the CUDA environment for Nvidia GPU accelerators in that both open up the possibilities for programming their respective devices above and beyond their original purposes.
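As a rough mental model of a match-action stage – in Python for readability, since real Tofino pipelines are written in P4, and with a packet format and ethertype chosen purely for illustration – the unit walks an ordered table of match/action pairs and applies the first action whose match fires, whether that action rewrites a header or reduces payload data:

```python
# Toy match-action unit: an ordered table of (match, action) pairs.
# Illustrative only; a Tofino MAU is programmed in P4, not Python.

class MatchActionUnit:
    def __init__(self):
        self.table = []  # ordered (match_fn, action_fn) entries

    def add_entry(self, match_fn, action_fn):
        self.table.append((match_fn, action_fn))

    def process(self, pkt: dict) -> dict:
        # Apply the first action whose match fires; else pass through.
        for match_fn, action_fn in self.table:
            if match_fn(pkt):
                return action_fn(pkt)
        return pkt

mau = MatchActionUnit()
# Header operation: decrement the TTL on IPv4 packets.
mau.add_entry(lambda p: p.get("ethertype") == 0x0800,
              lambda p: {**p, "ttl": p["ttl"] - 1})
# Payload operation: reduce the weight vector carried in a
# hypothetical ML-resync packet type into a scalar accumulator.
mau.add_entry(lambda p: p.get("ethertype") == 0x88B5,
              lambda p: {**p, "acc": sum(p["weights"])})

print(mau.process({"ethertype": 0x0800, "ttl": 64}))
print(mau.process({"ethertype": 0x88B5, "weights": [0.5, 1.5]}))
```

The same machinery handles both entries; nothing about the stage knows that one action is "networking" and the other is "machine learning," which is the crux of the computational networking argument.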
The network offload model that Barefoot Networks has come up with for machine learning training is analogous to – but very different from – the offloading of scatter, gather, and other collective operations in the Message Passing Interface (MPI) stack to the network interface cards and switch ASICs in an InfiniBand network, what Mellanox Technologies and others refer to as an offload model. And as we have previously reported, with the current Quantum ASICs and ConnectX-6 server adapters, which support 200 Gb/sec InfiniBand, approximately 70 percent of the MPI stack work is offloaded from the nodes in the cluster onto the network.
Computational Networking 101
This computational networking approach does not require a specific node in the cluster to be assigned as the repository of truth that all other nodes look to for the latest weights for the machine learning algorithms. “The beauty of our approach is that since we are already doing operations on every packet that comes across the network, we can continue to accumulate those packets at line rate and at low latency, and then it becomes a multicast operation to disseminate that result to all of the other nodes in the network,” says Doe. “Since we are already programmatically handling all of the packets in the network, this is not really an additional burden.”
The network, however, is increasingly a burden and a bottleneck for these workloads, as it once was at a certain scale for traditional HPC workloads.
“Most of the models fit into a GPU or one of these specialty AI engines,” says Doe. “But as the models for training get larger and get beyond what can be held on any one node, this problem only gets compounded.”
Facebook published a paper earlier this year describing its machine learning hardware, and noted that when it moved from 64 nodes to 128 nodes for training, the network became the bottleneck and the social network (yes, that’s three kinds of network in one story. . . ) no longer got linear scaling for image processing.
The performance benefit that Barefoot Networks is seeing from computational networking depends on the machine learning framework and the nature of the applications that run atop it, so mileage will vary here. Adding a new shard of data to the training network with 64 nodes in a cluster could, for instance, take about 50 percent of the time during a training run, and the remaining 50 percent of the time could be the resynchronizing of the weights across a network of training engines. If that resynchronization could be reduced by an order of magnitude in terms of wall time, then the machine learning network would see roughly a 2X improvement in performance. (This was a contrived use case that Doe cooked up just to show how the math works.) On a 1,000-node machine learning training cluster, the calculating of new weights for the neural networks could take 10 percent of wall time and the resynchronization of those recalculations could take 90 percent of wall time on a training run, and in that case, pulling this work onto the network could produce as much as a 10X reduction in training times as the resynchronization overhead is driven toward zero. This would have the added effect of utilizing the CPUs and GPUs in the clusters more fully.
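Doe's back-of-the-envelope examples are a straightforward Amdahl's Law calculation, sketched below. The fractions and reduction factors come from the contrived cases above; the function itself is generic, not Barefoot's performance model.

```python
# Amdahl-style estimate: overall training speedup when only the
# resynchronization fraction of a run is accelerated.

def speedup(resync_fraction: float, resync_reduction: float) -> float:
    remaining = (1.0 - resync_fraction) + resync_fraction / resync_reduction
    return 1.0 / remaining

# 64-node case: 50% resync, cut 10X -> roughly 2X overall.
print(round(speedup(0.50, 10.0), 2))    # 1.82
# 1,000-node case: 90% resync, cut 10X -> just over 5X overall.
print(round(speedup(0.90, 10.0), 2))    # 5.26
# The gain approaches the 10X bound as resync time goes toward zero.
print(round(speedup(0.90, 1000.0), 2))  # 9.91
```

The asymmetry is the familiar one: the more the resynchronization phase dominates the run, the more headroom there is for the network to claw back.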
There are machine learning training networks that are much heavier on compute and much lighter on network, so Doe does not want to give the mistaken impression that this will be universally true.
“Most of the folks that we are talking to are looking at somewhere between a 2X to 10X improvement in machine learning training using computational networking,” says Doe. “And this is just the beginning. What we are doing here is a relatively simple thing. We have the computational capability in the network to preprocess or postprocess more complex operations and transmit the data.”