Many of the largest retailers and social networks are seeing a narrowing path for scaling their AI training capabilities.
Hardware and software optimizations are no longer providing the dramatic improvements we saw only a few years ago, pushing companies like Facebook to dig into ever-lower levels of their ML stacks, as we described last week. Even Google is reaching into the lower layers of how it arranges its models to keep training efficient, and, as one might imagine, Amazon is also looking at where optimization can give AI performance greater lift.
For Amazon’s AI teams, the limits of optimization are clearest around graph neural networks (GNNs). These models are emerging as one of the best options for social networks and recommendations and, outside of Amazon, for drug discovery. Still, despite the strong use-case match, these models do not mesh cleanly with existing hardware, and the company is looking for ways to keep adding depth without blowing past computational (cost) limits.
As at other large companies pushing neural networks at scale, the barriers are getting higher and tougher to climb.
Unlike other classes of deep learning models, GNNs were not the subject of serious optimization work until relatively recently. They first appeared in research in 2014, and interest has exploded from 2018 through the present. Despite the use cases, getting GNNs to scale well on massive, complex graphs remains a challenge, especially for companies like Amazon, whose AI division is trying to balance the need for deeper graphs against expensive training runs.
GNNs are quite different from other neural networks in their computational demands, and many of the architectures that do well for convolutional neural networks (CNNs), for example, are not a perfect fit. GNNs need devices that are good at handling a lot of scatter-gather operations on sparse data with irregular access patterns. But for some use cases, graph neural networks are still the best option.
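To make the scatter-gather point concrete, here is a minimal NumPy sketch (the graph, features, and dimensions are invented for illustration) of the aggregation step at the heart of GNN compute: gathering each edge's source-node features and scatter-adding them into the destination nodes. The reads and writes follow the edge list rather than any regular stride, which is exactly the irregular access pattern CNN-tuned hardware struggles with.

```python
import numpy as np

# Toy graph: 4 nodes, edges as (src, dst) pairs -- a sparse,
# irregular structure typical of GNN workloads.
edges = np.array([[0, 1], [0, 2], [1, 2], [3, 2], [2, 0]])
src, dst = edges[:, 0], edges[:, 1]

# One 3-dimensional feature vector per node.
feats = np.arange(12, dtype=np.float64).reshape(4, 3)

# Gather: pull each edge's source-node features (irregular reads),
# then scatter-add them into the destination nodes (irregular
# writes). np.add.at performs the unbuffered in-place scatter.
gathered = feats[src]           # gather along edges
agg = np.zeros_like(feats)
np.add.at(agg, dst, gathered)   # scatter-add into destinations
```

On a GPU, the dense matrix work around this step vectorizes well; it is the index-driven gather and scatter that resist regular memory access.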
Graph neural networks are a family of neural networks that learn node, edge, and graph embeddings. An “ego-network” around each node is used to learn an embedding that captures task-specific information, drawing on both the structure of the graph and the features of the nodes and edges. These embeddings are learned in an end-to-end fashion; the predictions are a function of the target node’s ego-network.
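As a rough illustration of what an ego-network is, this sketch (over a hypothetical toy graph) collects every node reachable within a fixed number of hops of a target node; a k-layer GNN effectively computes the target's embedding from this k-hop neighborhood.

```python
from collections import deque

def ego_network(adj, node, hops):
    """Return the set of nodes within `hops` steps of `node` --
    the ego-network a GNN layer stack draws on to embed `node`."""
    seen = {node}
    frontier = deque([(node, 0)])
    while frontier:
        v, depth = frontier.popleft()
        if depth == hops:
            continue
        for nbr in adj.get(v, []):
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, depth + 1))
    return seen

# Hypothetical adjacency list for a small directed graph.
adj = {0: [1, 2], 1: [3], 2: [4], 3: [5], 4: []}
one_hop = ego_network(adj, 0, 1)   # {0, 1, 2}
two_hop = ego_network(adj, 0, 2)   # {0, 1, 2, 3, 4}
```

Each additional GNN layer widens this neighborhood by one hop, which is where the scaling tension described below comes from.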
Da Zheng and his team at Amazon AI are trying to sort out how to train most efficiently by dividing graphs into batches and evaluating what stays on the GPU and what has to train on CPU and GPU in concert due to memory constraints.
At Hot Chips last week he explained that full-graph training on GPUs, which delivers faster training times, only works for models whose graphs fit in GPU memory. Otherwise, they have to divide graphs into mini-batches, which run on both CPU and GPU and end up being faster than trying to train the full graph outside GPU memory.
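A stripped-down sketch of the neighbor-sampling step behind mini-batch GNN training, with an invented toy graph: each seed node keeps at most a fixed “fan-out” of randomly chosen neighbors, which bounds the batch size but is also the CPU-side sampling overhead Zheng mentions. (Amazon's actual pipeline is far more involved; this only shows the idea.)

```python
import random

def sample_neighbors(adj, seeds, fanout, rng):
    """For each seed node, keep at most `fanout` random neighbors.
    Bounds the mini-batch size at the cost of per-batch sampling
    work, which typically runs on the CPU."""
    block = {}
    for v in seeds:
        nbrs = adj.get(v, [])
        if len(nbrs) <= fanout:
            block[v] = list(nbrs)    # keep all neighbors
        else:
            block[v] = rng.sample(nbrs, fanout)  # subsample
    return block

# Hypothetical toy adjacency list.
adj = {0: [1, 2, 3, 4], 1: [0], 2: [0, 3]}
rng = random.Random(0)  # seeded for reproducibility
batch = sample_neighbors(adj, seeds=[0, 2], fanout=2, rng=rng)
```

Repeating this sampling per layer per batch is cheap dense compute on the GPU side, but the sampling itself is the kind of mixed CPU/GPU overhead the quote below refers to.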
“For GNNs sparse and dense operations are important and training methods matter,” he explains. “For full graph training, both sparse and dense operations account for 50 per cent of runtime. For mini-batch training, dense operations dominate computation. Mini-batch sampling, however, can cause significant overhead, especially in the case of mixed CPU/GPU training.
“The reason we can’t go deeper with graph models from a social network point of view is because if we extend too much we need to gather a larger number of neighbors [the surrounding nodes from which information is derived] and that makes training very difficult. Also, when we use more hops we encounter a problem called oversmoothing: basically the power of graph neural networks weakens and the more you make a prediction, the more you always predict the same thing for every node.”
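Zheng's neighbor-explosion point can be seen with back-of-the-envelope arithmetic; the average degree below is a hypothetical figure, not an Amazon number.

```python
# Hypothetical average degree for a social-style graph.
avg_degree = 50

# The number of nodes an L-hop ego-network touches grows roughly
# as degree**L, which is why each extra hop ("going deeper")
# multiplies the cost of gathering neighbors during training.
for hops in range(1, 4):
    touched = avg_degree ** hops
    print(f"{hops} hop(s): ~{touched:,} neighbors")
```

At three hops this already means on the order of 125,000 nodes per prediction, before accounting for duplicate neighbors or feature dimensions, which is why sampling and shallow graphs are the practical compromise.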
While Da Zheng and team have been able to show good results using a benchmark they devised and their mini-batch optimizations, what this tells us is that even the smartest companies doing the most challenging production ML work at scale are starting to see some hard limits.
Perhaps those limits aren’t preventing continued growth quite yet, but they are forcing ever-lower-level optimizations. While such alterations in batch sizes, device memory capabilities, and software tweaks used to provide 10–100x improvements for companies doing AI at scale, much of what we’ve seen is more on the order of 2x.
The point is, AI development is slowing down. Some models are blocked by hardware capabilities (most often memory), others simply by the cost of that hardware, and still others, like Amazon’s graph group, by the search for ways to let deep graphs spread out without creating so much software complexity that they become unusable.