As Google’s batch sizes for AI training continue to skyrocket, with some ranging from over 100k to one million, the company’s research arm is looking at ways to improve everything from efficiency and scalability to privacy for those whose data is used in large-scale training runs.
This week Google Research published a number of pieces around new problems emerging at “mega-batch” training scale for some of its most-used models.
One of the most noteworthy new items from the large-scale training trenches concerns batch active learning at batch sizes in the one-million ballpark. In essence, this cuts down on the amount of labeled training data (and thus compute resources and time) by choosing which examples are worth sending out for labeling, which is great for efficiency but comes with downsides in terms of flexibility and accuracy.
Google Research has developed its own active learning algorithm, called Cluster-Margin, to layer into training sets, which it says can operate at batch size scales “orders of magnitude” larger than other approaches to active learning. Using the Open Images dataset, with ten million images and sixty million labels across 20,000 classes, the team found Cluster-Margin needed only 40% of the labels to reach the same targets.
In active learning, labels for training examples are sampled selectively and adaptively to more efficiently train the desired model over several iterations. “The adaptive nature of active learning algorithms, which allows for improved data-efficiency, comes at the cost of frequent retraining of the model and calling the labeling oracle. Both of these costs can be significant. For example, many modern deep networks can take days or weeks to train and require hundreds of CPU/GPU hours. At the same time, training human labelers to become proficient in potentially nuanced labeling tasks require significant investment from both the designers of the labeling task and the raters themselves. A sufficiently large set of queries should be queued in order to justify these costs,” explain the creators of Cluster-Margin.
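To make the idea of selective sampling concrete, here is a minimal sketch of one batch-selection round that combines uncertainty (a small margin between the top two class probabilities) with diversity (spreading picks across clusters). It is an illustration in the spirit of the name Cluster-Margin, not the authors' implementation: the function names, the assumption that cluster assignments are precomputed and passed in as `cluster_ids`, and the deterministic lowest-margin-first pick within each cluster are all choices made for this example.

```python
import numpy as np

def margin_scores(probs):
    # Margin = top-1 probability minus top-2 probability; a small margin
    # means the model is uncertain about the example.
    ordered = np.sort(probs, axis=1)
    return ordered[:, -1] - ordered[:, -2]

def select_batch(probs, cluster_ids, n_candidates, n_select):
    # Hypothetical helper: choose n_select unlabeled examples to send for labeling.
    margins = margin_scores(probs)
    # Step 1 (uncertainty): keep the n_candidates lowest-margin examples.
    candidates = np.argsort(margins, kind="stable")[:n_candidates]
    # Step 2 (diversity): group candidates by their precomputed cluster,
    # preserving lowest-margin-first order inside each group.
    groups = {}
    for idx in candidates:
        groups.setdefault(int(cluster_ids[idx]), []).append(int(idx))
    # Visit clusters round-robin, smallest first, so that no single
    # dense region of the data dominates the labeling batch.
    queues = sorted(groups.values(), key=len)
    selected = []
    while len(selected) < n_select and queues:
        for queue in queues:
            if len(selected) == n_select:
                break
            selected.append(queue.pop(0))
        queues = [q for q in queues if q]
    return selected

# Toy usage: six unlabeled examples, two classes, two clusters.
probs = np.array([[0.50, 0.50], [0.55, 0.45], [0.60, 0.40],
                  [0.90, 0.10], [0.52, 0.48], [0.99, 0.01]])
cluster_ids = np.array([0, 0, 0, 1, 1, 1])
chosen = select_batch(probs, cluster_ids, n_candidates=4, n_select=3)
```

In the toy call, the confident examples (indices 3 and 5) never reach the labeling queue, and the round-robin pass makes sure the lone uncertain example from the second cluster is picked alongside those from the first.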
The efficiency payoff, especially at that scale, is not hard to imagine, but as Google paces through ever-larger scale training there are other, less tangible issues to contend with, especially when massive batches mean pulling (possibly personal) data for training.
Getting the language model behemoth BERT to scale using huge batch sizes has been its own uphill battle for Google and the few others operating at the million-plus batch size scale. Now the impetus is to keep efficient scale while adding in privacy measures that do not hinder performance, scalability, or efficiency.
Another Google Research team this week has shown they can scale BERT to batch sizes in the millions with a layer of privacy, called differentially private SGD, which is a heavy-duty step during pre-training. The implementation of this layer does sacrifice some accuracy, with masked language model accuracy in this BERT implementation at 60.5% on a batch size of two million. The non-private BERT models Google uses hit an accuracy rate of around 70%. They add that the batch size they used for their results is 32X larger than the one used for the non-private BERT model.
As the algorithm creators explain, “To mitigate these [privacy] concerns, the framework and properties of differential privacy (DP) [DMNS06, DKM+06] provide a compelling approach for rigorously controlling and preventing the leakage of sensitive user information present in the training dataset. Loosely speaking, DP guarantees that the output distribution of a (randomized) algorithm does not noticeably change if a single training example is added or removed; this change is parameterized by two numbers —the smaller these values, the more private the algorithm.”
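For readers unfamiliar with differentially private SGD, the textbook form of a single update is easy to sketch: clip each example's gradient to a fixed norm, add Gaussian noise to the sum, then average and step. The snippet below is a minimal NumPy illustration of that generic recipe, not the Google team's BERT implementation; the function name, the parameter names (`clip_norm`, `noise_multiplier`) and the toy numbers are assumptions made for this example.

```python
import numpy as np

def dp_sgd_step(params, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    # per_example_grads has shape (batch_size, dim): one gradient per example.
    batch_size = per_example_grads.shape[0]
    # Clip each example's gradient so its L2 norm is at most clip_norm,
    # which bounds how much any single example can move the model.
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = per_example_grads * scale
    # Add Gaussian noise calibrated to the clipping norm to the summed gradient.
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=params.shape)
    noisy_mean = (clipped.sum(axis=0) + noise) / batch_size
    return params - lr * noisy_mean

# Toy usage: two examples, two parameters.
rng = np.random.default_rng(0)
params = np.zeros(2)
grads = np.array([[3.0, 4.0],    # norm 5.0, gets clipped down to norm 1.0
                  [0.3, 0.4]])   # norm 0.5, left unchanged
new_params = dp_sgd_step(params, grads, lr=1.0, clip_norm=1.0,
                         noise_multiplier=1.1, rng=rng)
```

Because the noise is added once to the summed gradient and then divided by the batch size, its effect on the averaged update shrinks as the batch grows, which is one reason very large batches pair naturally with this kind of training.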
Accuracy and privacy are going hand in hand in other spheres for large-scale training at Google Research. Larger models and more massive batch sizes mean increasing difficulty managing the consistency of results and avoiding under- or overfitting. Google is working on developing new calibration techniques that can keep up with the scale of increasing training runs. Another Google Research team this week published results on soft calibration techniques that cut down on the calibration errors of existing approaches by 82%.
The team explains that a comparison of soft calibration objectives as secondary losses to existing calibration-incentivizing losses reveals that “calibration-sensitive training objectives as a whole (not always the ones we propose) result in better uncertainty estimates compared to the standard cross entropy loss coupled with temperature scaling.” They also show that composite losses obtain state-of-the-art single-model ECE in exchange for less than 1% reduction in accuracy on CIFAR-10, CIFAR-100 and ImageNet, which served as benchmarks.
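ECE, or expected calibration error, is the metric at stake here. In its standard binned form, predictions are grouped by confidence and the gap between average confidence and actual accuracy is averaged across bins, weighted by bin size. The sketch below computes that standard hard-binned metric only; it is not the team's soft calibration objective, and the function name, the choice of ten equal-width bins and the toy inputs are assumptions for illustration.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    # confidences: the model's top-class probability for each prediction.
    # correct: 1 if that prediction was right, 0 otherwise.
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Assign each prediction to one of n_bins equal-width confidence bins.
    bin_ids = np.digitize(confidences, edges[1:-1], right=True)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            # Gap between how often the model is right and how sure it claims to be.
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            # Weight the gap by the fraction of predictions in this bin.
            ece += mask.mean() * gap
    return ece

# Toy usage: a model that claims 90% confidence but is right 3 times out of 4.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
```

A perfectly calibrated model scores zero on this metric; the overconfident toy model above is penalized by the gap between its 90% stated confidence and its 75% accuracy.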
In the past, pure scalability of models was at the heart of what we saw coming out of Google Research on the training front. What we’re seeing more recently, including in just the last few days, is evidence that model scaling itself is giving way to more nuanced elements of large-scale training, from improving results to adding privacy. That means the models themselves are proving out at million-plus batch size scale, leaving room for making more efficient neural networks.