Poor utilization is not the sole domain of on-prem datacenters. Even the largest cloud providers, despite packing instances full of users, face similar problems.
However, just as the world learned to solve distributed computing problems across under-utilized PCs with SETI@home and similar efforts, we might be able to put spare cycles to work training large-scale AI models. A team from Rice University, along with ThirdAI, a company applying a SETI-style approach to AI training, has shown that large neural networks can be trained on unused cloud compute, even with low core-count CPU nodes and low-end internode networks.
The team was able to show that even with poor network communication and low-bandwidth interconnects, large networks (of almost one billion parameters) could be trained on simple four- to sixteen-core CPU-based nodes. The secret is leveraging sparsity inside the algorithm they developed, called D-SLIDE, a distributed spin on the existing SLIDE algorithm.
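The sparsity SLIDE exploits comes from locality-sensitive hashing: neurons are bucketed by hash signatures of their weight vectors, and for a given input only the neurons that land in the same bucket are computed. The sketch below illustrates the idea with SimHash (random hyperplane signatures), one common LSH family; the layer sizes and hash width are hypothetical, and this is a toy of the selection mechanism, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

D, N, K = 8, 64, 6                    # input dim, neurons in layer, hash bits (illustrative)
W = rng.standard_normal((N, D))       # layer weights, one row per neuron
planes = rng.standard_normal((K, D))  # random hyperplanes for SimHash

def simhash(v):
    # K-bit signature: the sign pattern of v projected onto the random hyperplanes
    return tuple((planes @ v > 0).astype(int))

# Build the hash table once: bucket signature -> ids of neurons whose
# weight vectors hash to that signature
table = {}
for n in range(N):
    table.setdefault(simhash(W[n]), []).append(n)

def sparse_forward(x):
    # Only neurons colliding with the input's bucket are activated; vectors
    # with high inner product share sign patterns with high probability,
    # so most neurons are skipped entirely
    active = table.get(simhash(x), [])
    out = np.zeros(N)
    if active:
        out[active] = W[active] @ x
    return active, out

x = rng.standard_normal(D)
active, out = sparse_forward(x)
```

The payoff is that the cost of a layer scales with the handful of active neurons rather than with N, which is what makes CPU-only training plausible.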
The end result is training time outcomes that are “on par with some of the best hardware accelerators, even with a severely restricted platform,” D-SLIDE creators claim.
There is quite a bit to contend with in creating a distributed training platform on idle cloud compute beyond the obvious constraint of limited, scattered CPUs. While there are plenty of nodes, they are low-end, limited-core machines, and finding the right distributed computing paradigm to grab them when they are available is a challenge. Further, the VMs themselves have limited memory, a constraint even the highest-end AI accelerators face, and distributed environments offer only limited parallelism to exploit.
The solution to these problems was to shard the high number of parameters across distributed nodes using a federated learning approach, producing small “model-lets” right-sized for modest compute capabilities.
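Sharding a layer across nodes can be pictured as splitting its output neurons into per-node slices, each small enough for a modest VM's memory; the slices compute locally and only their activations are combined. This is a minimal single-process sketch of that partitioning, with made-up sizes, not the team's actual federated setup.

```python
import numpy as np

rng = np.random.default_rng(1)
D, N, NODES = 16, 1024, 4   # input dim, layer width, node count (all hypothetical)

W = rng.standard_normal((N, D))

# Shard the layer's neurons evenly across nodes: each "model-let" holds
# only N // NODES rows, fitting a small VM's memory budget
shards = np.array_split(W, NODES)

def forward(x):
    # Each node computes its slice locally; only the (much smaller)
    # activation vectors would cross the wire in a real deployment
    return np.concatenate([s @ x for s in shards])

x = rng.standard_normal(D)
y_sharded = forward(x)
y_full = W @ x   # reference: the unsharded computation
```

The sharded and unsharded results agree, since partitioning rows of a matrix-vector product changes where the work happens but not the math.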
That gets at some of the compute challenge, but the networking issues are the most complicated. Low communication bandwidth between nodes is a major inhibitor in any kind of neural network training, and on the cloud resources the team benchmarked with, typical bandwidth was in the 1-100 Gbps range. Since using high-end networks isn’t an option in this case, they had to look to the model and execution itself to get around network bottlenecks, specifically setting aside matrix multiplication-based backpropagation as well as solutions that work elsewhere, like dramatic compression.
The result is a huge reduction in the amount of computation and memory access required, building on SLIDE, which emphasizes extreme sparsity using hash tables. It is an MPI-based implementation that “makes novel choices of determining how to shard neural networks and distribute hash table computation over multiple nodes to ensure load balancing and reduced communication.”
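Distributing the hash table computation can be pictured as each node building a local table over just its own neuron shard; a query signature is cheap to broadcast, and each node answers from its local table, so only small lists of neuron ids cross the network. The following is a single-process sketch of that partitioning under assumed sizes, again using SimHash for illustration rather than the authors' exact scheme.

```python
import numpy as np

rng = np.random.default_rng(2)
D, N, K, NODES = 8, 128, 5, 4   # hypothetical dims, neurons, hash bits, nodes

W = rng.standard_normal((N, D))
planes = rng.standard_normal((K, D))

def simhash(v):
    # K-bit random-hyperplane signature
    return tuple((planes @ v > 0).astype(int))

# Each node owns a contiguous shard of neurons and builds its own hash
# table over just that shard: tables stay small and load is spread evenly
shard_ids = np.array_split(np.arange(N), NODES)
node_tables = []
for ids in shard_ids:
    t = {}
    for n in ids:
        t.setdefault(simhash(W[n]), []).append(int(n))
    node_tables.append(t)

def query(x):
    # Broadcast the tiny K-bit signature; each node returns the matching
    # neuron ids from its local table, so only ids travel between nodes
    sig = simhash(x)
    return [t.get(sig, []) for t in node_tables]

per_node = query(rng.standard_normal(D))
```

Because the shards partition the neurons, no id is returned by more than one node, and the per-node tables are a quarter the size of a monolithic one.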
The team shows how D-SLIDE compares to data-parallel training approaches like Horovod (and its implementations in cloud-based tools like AWS SageMaker, Azure Databricks, and Databricks Spark).
“Deploying any reasonably sized model to multiple nodes with low internet bandwidth is prohibitive due to communication bottleneck. In contrast, with 99% communication compression, our method still shows reasonable speedup when deployed on multiple nodes, especially when the computation resource on each node is frugal. With 16 cores and 1 Gbps bandwidth, our method is 66 – 80x faster than Horovod when deployed on multiple nodes and is even an order of magnitude faster than Horovod when Horovod is deployed on a cluster with high bandwidth,” the team says.
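To make the “99% communication compression” figure concrete: a standard way to reach that ratio is top-k gradient sparsification, where only the largest 1% of gradient entries (as index-value pairs) are transmitted. The sketch below shows the arithmetic of that ratio; it is a generic illustration of the technique, not a claim about how the baseline in the quote was compressed.

```python
import numpy as np

rng = np.random.default_rng(3)
grad = rng.standard_normal(10_000)   # a dense gradient vector (illustrative size)

def compress(g, keep=0.01):
    # Keep only the largest 1% of entries by magnitude; ship the
    # (indices, values) pairs instead of the full dense vector
    k = max(1, int(keep * g.size))
    idx = np.argpartition(np.abs(g), -k)[-k:]
    return idx, g[idx]

idx, vals = compress(grad)
ratio = 1 - idx.size / grad.size     # fraction of entries never transmitted
```

With `keep=0.01`, 9,900 of the 10,000 entries are dropped, i.e. 99% of the communication volume (ignoring index overhead) never leaves the node.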
Detailed benchmark results can be found here.