Google has been at the bleeding edge of AI hardware development with the arrival of its TPU and other system-scale modifications to make large-scale neural network processing efficient and fast.

But just as these developments come to fruition, advances in trimmed-down deep learning could move many more machine learning training and inference operations out of the datacenter and into your palm.

Although it might be natural to assume that limited CPU power is what keeps neural networks off devices like smartphones, the real challenge lies in the sheer size of the models and in hardware memory constraints. This, coupled with privacy concerns for inference in particular, has largely kept deep learning services running over the network.

Sujith Ravi from Google Research sees a faster, more secure path to on-device training and inference that, if successful, could change the way Google and others build out datacenters to deliver machine learning-based services. It relies on a dual-training approach that emphasizes massive memory reduction: a smaller network, represented with as few bits as possible, learns from a much more comprehensive network that it is trained with in tandem using backpropagation.
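To make the idea of training two networks in tandem more concrete, here is a minimal sketch of what a combined objective could look like. It is an illustration under stated assumptions, not Ravi's published formulation: the three-term structure, the function names `cross_entropy` and `joint_loss`, and the weights `w_large`, `w_mimic`, and `w_small` are all hypothetical. The large network is scored against the true label, and the small network is scored both against the true label and against the large network's predictions.

```python
import math

def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a target distribution and a predicted one."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))

def joint_loss(label, large_probs, small_probs,
               w_large=1.0, w_mimic=0.1, w_small=1.0):
    """Hypothetical combined objective for two networks trained together.

    label       : one-hot ground truth
    large_probs : class probabilities from the large network
    small_probs : class probabilities from the small network
    The weights are illustrative assumptions, not values from the article.
    """
    large_term = cross_entropy(label, large_probs)        # large net vs. truth
    mimic_term = cross_entropy(large_probs, small_probs)  # small net imitates large net
    small_term = cross_entropy(label, small_probs)        # small net vs. truth
    return w_large * large_term + w_mimic * mimic_term + w_small * small_term

label = [0.0, 1.0, 0.0]
large_probs = [0.1, 0.8, 0.1]
agreeing_small = [0.1, 0.8, 0.1]     # small network matches the large one
disagreeing_small = [0.8, 0.1, 0.1]  # small network contradicts it

loss_agree = joint_loss(label, large_probs, agreeing_small)
loss_disagree = joint_loss(label, large_probs, disagreeing_small)
```

In a real system the gradients of such a loss would flow into both networks during backpropagation; here the point is only that the small network is penalized when it drifts away from the large one.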

The larger network is a full-sized trainer using feed-forward or LSTM recurrent neural networks, matched with the more pared-down “projection” network that can make “random projections to transform inputs or intermediate representations into bits.” As Ravi explains, “The simpler network encodes lightweight and efficient-to-compute operations in a bit space with a low memory footprint,” thus cutting memory requirements significantly.
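One common way to turn a real-valued vector into bits with random projections is to take the sign of its dot product with a set of fixed random vectors. The sketch below shows that scheme as an illustration only. The article does not specify the exact projection functions used, so the Gaussian vectors, the helper names `make_projections` and `project_to_bits`, and the sizes chosen are assumptions.

```python
import random

def make_projections(num_bits, dim, seed=0):
    """Draw num_bits random Gaussian vectors of length dim.

    These vectors are fixed up front and never trained (an assumption of
    this sketch); a seed is enough to regenerate them on demand.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]

def project_to_bits(x, projections):
    """Map vector x to one bit per projection: 1 if the dot product is >= 0."""
    bits = []
    for p in projections:
        dot = sum(pi * xi for pi, xi in zip(p, x))
        bits.append(1 if dot >= 0 else 0)
    return bits

projections = make_projections(num_bits=16, dim=8, seed=42)
x = [0.5, -1.2, 0.0, 3.1, 0.7, -0.4, 2.2, -0.9]
bits = project_to_bits(x, projections)  # 16 bits instead of 8 floats
```

Because only the sign of each dot product is kept, the output depends on the direction of the input rather than its magnitude, and the representation costs one bit per projection regardless of the input's dimensionality.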

“Unlike high performance clusters running on the cloud, user devices operate at low-power consumption modes and have significant memory limitations. As a result, running state-of-the-art deep learning models for inference on these devices can be very challenging and often prohibitive due to high computation cost and large model size requirements that exceed device memory capacity,” Ravi says. “Delegating the computation-intensive operations from device to the cloud is not a feasible strategy in many real-world scenarios due to connectivity issues (like when data cannot be sent to the server) or privacy reasons. In such scenarios, one solution is to take an existing trained neural network model and then apply compression techniques like quantization to reduce model size.” While this is becoming the de facto standard for on-device inference, doing this after training on a rich network erodes accuracy and is not always fast enough.
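For readers unfamiliar with post-training quantization, the following minimal sketch shows the basic idea on a handful of weights: each floating-point value is replaced by a small integer code, and the rounding introduces a bounded error. It is a simple uniform 8-bit scheme chosen for illustration; the weights are made up, and production toolchains use more elaborate variants.

```python
def quantize(weights, num_bits=8):
    """Uniformly quantize floats to integer codes in [0, 2**num_bits - 1]."""
    lo, hi = min(weights), max(weights)
    levels = (1 << num_bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo

def dequantize(codes, scale, lo):
    """Recover approximate float weights from the integer codes."""
    return [lo + c * scale for c in codes]

# Made-up weights standing in for one layer of an already trained model.
weights = [-0.82, 0.11, 0.47, -0.05, 0.93, -0.33]
codes, scale, lo = quantize(weights)
restored = dequantize(codes, scale, lo)

# Worst-case rounding error is about half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight now needs 8 bits instead of 32, but every restored value is slightly off. Those small errors accumulate across layers, which is one way accuracy can erode when compression is applied only after training.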

“The proposed learning method scales efficiently to large data sizes and high dimensional spaces. This is especially useful for natural language applications involving sparse high dimensional feature spaces. For dense feature spaces (e.g., image pixels), existing operations like fully-connected layers (or even convolutions) can be efficiently approximated for prediction without relying on a large number of parameters. Such operations can also be applied in conjunction with the projection functions to yield more complex projection networks while constraining the memory requirements.”

Once trained, the two networks are de-coupled and serve different purposes. As Ravi explains, “The trainer model can be deployed anywhere a standard neural network is used. The simpler projection network model weights along with transform functions are extracted to create a lightweight model that is pushed to device. This model is used directly on-device at inference time.”

Similar efforts to pare down memory and time requirements simply shrink the overall model, but accuracy is generally not high enough with that approach. Hence Ravi’s assertion that combining concepts from quantization (reducing bits for memory’s sake) with a best-of-both-worlds approach that jointly trains two models with different strengths is the best path to accurate, high performance inference.

There are other potential uses for Ravi’s group’s work at Google Research. For instance, he adds that it is also possible to apply this framework to train lightweight models in other types of learning scenarios. “For example, the training paradigm can be changed to a semi-supervised or unsupervised setting. The trainer model itself can be modified to incorporate structured loss functions defined over a graph or probabilistic graphical model instead of a deep neural network. The projection model training can also be further extended to scenarios involving distributed devices using complementary techniques.”

More details on this work, including an interesting discussion of just how many neural projection bits are needed for the kind of accuracy required, can be found here.

That’s really clever, the network trains to reduce the network basically. I wonder if a GAN-type approach would work as well.

That would be too much computational overhead, is my guess. This seems like an efficiency tradeoff that does some of the better parts of a GAN in a much smaller package.

Well, maybe; GANs are probably hard to tune, yes. But a good greedy learner could prune down a network until the point where it diverges too far from the big network. I think the key is not just playing with the weights and their bit precision; the real issue is to slim the network down in the first place. With fewer layers and simpler nodes you gain far, far more than by just reducing the number of bits for the weights, and you can do that in a second pass anyway.