Striking acceptable training times for GPU-accelerated machine learning on very large datasets has long been a challenge, in part because the options for working within constrained on-board GPU memory are limited.
For those training against massive volumes (in the many millions to billions of examples) on cloud infrastructure, the impetus to pare down training time is greater than ever, given per-hour instance costs and the premium for cloud GPU instances with more memory (the pricier Nvidia P100 with 16 GB versus a more standard 8 GB GPU instance). Since hardware limitations are not so easily won over, the tough work of trimming training time has fallen to clever algorithms that make the most of GPU memory and let the CPU do some of the heavy lifting.
This is exactly the approach a team from IBM Research is using to deliver a reported 10X speedup on limited-memory GPU-accelerated training for key machine learning algorithms. Note that we are not talking about neural networks here; the approach is focused on linear models, which are often the models of choice for the multi-terabyte, billion-example training runs that are sometimes required. Training times for logistic regression, support vector machines, and other such models still sit in the many-days camp, even with a GPU boost, but the researchers say that keeping only the most important data in the GPU's memory, selected by an automated measure of importance, can offer big improvements.
The IBM Research team’s work goes one step further. Developing and implementing the measure is one step toward using GPU memory intelligently, but as training runs its course, the set of training examples whose features make them more important than others changes over time.
In other words, the iterative process of training gradually pares down the differences between examples, which means the team’s measure needs to automatically pull the current best examples into GPU memory. Since the GPU is always busy training, the CPU can evict stale examples and signal the GPU to grab new data to fill its limited memory. This means faster training and convergence, even on GPUs that are not high-memory SKUs (the team used the M40 and a GTX 1080 card with 11 GB memory for its benchmarks).
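To make that loop concrete, here is a minimal sketch in Python of how such a scheme could be structured. The names (`score_importance`, `train_on_gpu`, `gpu_budget`) are hypothetical stand-ins rather than the team's actual implementation; the point is only the pattern of re-ranking the full dataset and refreshing the GPU-resident working set between training rounds.

```python
import numpy as np

def train_with_limited_gpu_memory(X, y, gpu_budget, num_rounds,
                                  score_importance, train_on_gpu):
    """Illustrative outer loop: keep only the 'most important' examples
    on the accelerator and refresh that working set every round.

    X, y             : full training set, resident in host (CPU) memory
    gpu_budget       : number of examples that fit in GPU memory
    score_importance : callable returning a per-example importance score
                       (e.g. a duality-gap-based certificate)
    train_on_gpu     : callable that runs the solver on the selected
                       subset and returns an updated model
    """
    n = X.shape[0]
    model = None
    # Start with an arbitrary subset that fits within the GPU budget.
    selected = np.random.choice(n, size=gpu_budget, replace=False)

    for _ in range(num_rounds):
        # GPU: train on the currently selected working set.
        model = train_on_gpu(X[selected], y[selected], model)

        # CPU: re-rank all examples under the updated model and pull
        # the highest-scoring ones into the next working set.
        scores = score_importance(X, y, model)
        selected = np.argsort(scores)[-gpu_budget:]

    return model
```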
“This scheme allows users to efficiently employ compute accelerators (GPUs and FPGAs) for the training of large-scale machine learning models, when the training data exceeds their memory capacity. Also, it provides adaptivity to any system’s memory hierarchy in terms of size and processing speed,” says Thomas Parnell, one of the lead authors of the full paper. He, along with colleague Celestine Dunner, tells The Next Platform that this technique is built upon novel theoretical insights regarding primal-dual coordinate methods and uses duality gap information to dynamically decide which part of the data should be made available for fast processing.
More specifically, the team explains that the speedups are achieved by deriving novel theoretical insights into how much information individual training samples can contribute to the progress of the learning algorithm. The measure relies heavily on the concept of duality gap certificates and adapts on the fly to the current state of the training algorithm; that is, the importance of each data point changes as the algorithm progresses.
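As an illustration of what such a certificate can look like (not necessarily the paper's exact formulation), the duality gap of an L2-regularized, hinge-loss SVM solved in the dual decomposes into one non-negative term per training example, and those per-example terms can serve as the importance scores. A hedged sketch:

```python
import numpy as np

def hinge_duality_gaps(X, y, alpha, lam):
    """Per-example duality gap certificates for an L2-regularized SVM
    trained in the dual (illustrative only, not the authors' code).

    X     : (n, d) feature matrix
    y     : (n,) labels in {-1, +1}
    alpha : (n,) dual variables, each in [0, 1]
    lam   : regularization strength lambda

    With w(alpha) = 1/(lam*n) * sum_i alpha_i * y_i * x_i, the total
    duality gap splits into one non-negative term per example:
        gap_i = max(0, 1 - y_i * m_i) + alpha_i * (y_i * m_i - 1),
    where m_i = x_i . w. A large gap_i means example i still has
    something to teach the current model.
    """
    n = X.shape[0]
    w = (X * (alpha * y)[:, None]).sum(axis=0) / (lam * n)
    margins = X @ w                               # m_i for every example
    hinge = np.maximum(0.0, 1.0 - y * margins)    # primal hinge loss
    gaps = hinge + alpha * (y * margins - 1.0)
    return gaps
```

Examples whose gap is already zero are fully explained by the current model and can safely sit in slower host memory; examples with large gaps are the ones worth keeping next to the accelerator.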
“This is not something the programmer has to touch or decide. While the algorithm is running on the GPU we have another part of the system that’s running on the CPU in parallel evaluating this theoretical measure of importance and dynamically ranking data points,” Parnell explains. The “dynamic” part of this is the shifting state of which data points are most important and thus pushed to GPU memory.
Dunner says that while it might be tempting to assume this comes with a high overhead cost, the team can hide the computation behind the GPU’s work so that none is introduced. Further, the scheme can be extended to any size of GPU memory or accelerator device, including FPGAs.
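A rough sketch of how that overlap might be arranged, again with hypothetical function names rather than the authors' implementation: a CPU thread scores the full dataset against a snapshot of the model while the accelerator keeps training, so the ranking cost is hidden behind the GPU work.

```python
import threading
import numpy as np

def train_with_background_ranking(X, y, gpu_budget, num_rounds,
                                  score_importance, train_on_gpu):
    """Illustrative sketch: overlap CPU-side importance ranking with
    GPU-side training so the ranking adds no wall-clock overhead."""
    n = X.shape[0]
    next_selected = {}

    def rank_on_cpu(snapshot):
        # Runs concurrently with GPU training against a model snapshot,
        # so it never blocks the accelerator.
        scores = score_importance(X, y, snapshot)
        next_selected["idx"] = np.argsort(scores)[-gpu_budget:]

    # Warm-up round on an arbitrary working set so the ranker has a
    # model to score against.
    selected = np.random.choice(n, size=gpu_budget, replace=False)
    model = train_on_gpu(X[selected], y[selected], None)

    for _ in range(num_rounds - 1):
        ranker = threading.Thread(target=rank_on_cpu, args=(model,))
        ranker.start()                          # CPU: re-rank in parallel
        model = train_on_gpu(X[selected], y[selected], model)  # GPU: train
        ranker.join()                           # ranking finished "for free"
        selected = next_selected["idx"]         # refresh the working set

    return model
```

The ranking works from a slightly stale model snapshot; that one-round lag is the trade-off, in this sketch, for keeping the accelerator fully busy.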
“We consider this scheme most valuable to applications with really huge datasets. At NIPS we will demo this scheme using a dataset from advertising click prediction with one billion training examples to predict if a user will click based on many things like time of day, their history, and other factors,” Parnell explains. “For these datasets, it’s hard to train neural network like models because it would take weeks to train with terabytes of data. In those situations people generally turn to generalized linear models as these are faster to train but still a long time with large scale data. That’s where we think our technique can come into play, where you want to use accelerators to train but the data doesn’t fit in the memory.”
Here is a video with the authors: https://youtu.be/10HRdZg48sA