Nvidia DGX1-V Appliance Crushes NLP Training Baselines

A research team from Nvidia has offered useful insight into mixed-precision deep learning training on very large datasets, and into how performance and scalability are affected when training recurrent neural networks with a batch size of 32,000.

The large batch was parallelized across 128 Volta V100 GPUs in Nvidia DGX1-V appliances for unsupervised text reconstruction, completing 3 epochs of the dataset in four hours. This time to convergence is worth noting, but so is the complexity of scaling a recurrent neural network to such a large batch size, which has implications for the learning rate compared to other approaches to training models on large natural language datasets.

With the DGX1-V system comes NCCL2 (Nvidia Collective Communications Library) for intra-node and inter-node communication, which uses NVLink and InfiniBand connections to allow the GPUs to talk to one another.

Also noteworthy is that the team trained the recurrent models using mixed precision (FP16 and FP32), which sped up training on a single V100 GPU by 4.2x over training exclusively in FP32.

“The relationship between batch size and learning regime is complex and learning rate scaling alone is not always enough to converge a model. Also, even with the largest public text corpus available, it may not be feasible to satisfy the batch size requirement needed to effectively train with the largest batches that modern hardware allows.”

The mixed-precision model used a batch size of 32,000 across 128 Volta GPUs, which the team says produced a 109x increase in training data throughput relative to using a single GPU. The run did hit some delays with the very large batch size, bringing the total training time to four hours. That is still a major advance over the month or more of training that might have been required with other approaches, even those that used GPU acceleration.
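To put the 109x figure in context, the short calculation below compares it with ideal linear scaling. It is only a sketch built from the numbers quoted in this article; it assumes the per-GPU batch of 256 mentioned further down also applied to the 128-GPU run, which the article does not state explicitly.

```python
# Figures quoted in the article.
num_gpus = 128
throughput_speedup = 109.0   # throughput relative to a single GPU
per_gpu_batch = 256          # assumption: same per-GPU batch as the 8-GPU run

# Scaling efficiency: achieved speedup divided by ideal (linear) speedup.
scaling_efficiency = throughput_speedup / num_gpus

# Global batch implied by 128 workers at 256 samples each.
global_batch = num_gpus * per_gpu_batch

print(f"scaling efficiency: {scaling_efficiency:.1%}")
print(f"implied global batch size: {global_batch}")
```

Under that assumption the implied global batch is 32,768, which would be consistent with the rounded figure of 32,000 used throughout.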

Training in mixed precision and in single precision produces similar training curves and converges to similar numbers for both language modeling and transfer evaluation. The team found that moving to mixed precision not only achieves similar training results but also provides a 3x speedup in training. By taking advantage of the reduced memory footprint of FP16 storage, they could double the batch size to 256 per GPU, better saturating the GPU and achieving an additional speedup of 40% on top of the original speedup. Together this gives approximately a 4.2x speedup when switching from single-precision arithmetic to mixed precision. Overall, training time drops from one month to 18 hours using 8 Tesla V100 GPUs, a larger batch size, and mixed-precision arithmetic.
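The way the 3x and 40% figures combine into 4.2x can be checked with a few lines of arithmetic. This is a rough sketch using only the numbers in the paragraph above; treating "one month" as 30 days is an assumption made here for illustration.

```python
# Speedups quoted in the article for a single V100.
mixed_precision_speedup = 3.0    # FP32 -> mixed precision
larger_batch_speedup = 1.4       # additional 40% from doubling the batch

# Independent speedups multiply rather than add.
combined_speedup = mixed_precision_speedup * larger_batch_speedup

# End-to-end reduction, assuming "one month" means 30 days.
hours_before = 30 * 24
hours_after = 18
overall_speedup = hours_before / hours_after

print(f"combined single-GPU speedup: {combined_speedup:.1f}x")
print(f"overall wall-clock speedup:  {overall_speedup:.0f}x")
```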

“Crucial to reducing the necessary communication bandwidth, the library also supports communication of FP16 values natively with no FP16 emulation overhead when reducing FP16 parameter gradients across GPUs. FP16 is not only useful for reducing communication overhead, it also plays a key role in directly accelerating training on processors like the V100 that support higher throughput mixed-precision arithmetic,” the team notes.

“The V100 provides 15.6 TFlops in single precision, but 125 TFlops with mixed-precision arithmetic (FP16 storage and multiplication, FP32 accumulation). Using FP16 reduces the dynamic range and precision of the computations being performed. This presents a unique set of training difficulties, which, if not addressed, lead to convergence issues while training.” The full paper discusses in more detail how the team worked around these difficulties.
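The reduced range and precision of FP16 are easy to see in a small NumPy experiment. The sketch below is illustrative only: loss scaling is shown as one widely used workaround, and the scale factor of 1024 is a hypothetical value, not something taken from the article.

```python
import numpy as np

# 1) Narrow dynamic range: a small FP32 gradient underflows to zero in FP16,
#    because 1e-8 is below FP16's smallest positive value (about 6e-8).
tiny_grad = np.float32(1e-8)
underflowed = np.float16(tiny_grad)

# 2) Loss scaling (assumed remedy): scale up before the FP16 cast, then
#    divide the scale back out in FP32.
loss_scale = np.float32(1024.0)          # hypothetical scale factor
scaled_fp16 = np.float16(tiny_grad * loss_scale)
recovered = np.float32(scaled_fp16) / loss_scale

# 3) Limited precision: an FP16 running sum stalls at 2048 because 2049 is
#    not representable in FP16, while an FP32 accumulator keeps counting.
one = np.float16(1.0)
fp16_sum = np.float16(0.0)
fp32_sum = np.float32(0.0)
for _ in range(4096):
    fp16_sum = np.float16(fp16_sum + one)
    fp32_sum = fp32_sum + np.float32(one)

print("largest FP16 value:", np.finfo(np.float16).max)
print("1e-8 cast to FP16:", underflowed)
print("recovered with loss scaling:", recovered)
print("FP16 sum of 4096 ones:", fp16_sum)
print("FP32 sum of 4096 ones:", fp32_sum)
```

The third case is why the quoted description pairs FP16 storage and multiplication with FP32 accumulation.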

Nvidia researchers used the Amazon Reviews dataset as the basis, noting that given its size, training is extremely time consuming. They note that running this on a single GPU is not practical, since the models are large and only a modest training batch size fits on each GPU. To scale, they employ multi-GPU data parallelism, which means they partition the batch across multiple GPUs during training.

The team explains that they do “not use model parallelism, which partitions the neural network itself across multiple processors because it is less flexible and places more constraints on software,” although they add that it is a promising way to add more parallelism in other cases. In the synchronous data parallelism approach, the batch of 32,000 is distributed evenly across all the worker processes, with each worker process running forward and backward propagation, synchronizing its gradients with the other workers, and updating the model before fetching a new batch.
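A toy version of synchronous data parallelism can be written in a few lines of NumPy. This is a minimal sketch, not the team's implementation: the linear model, the worker count, and the helper names `local_gradient` and `all_reduce_mean` are all hypothetical, and the averaging function merely stands in for what a collective library such as NCCL does across real GPUs.

```python
import numpy as np

def local_gradient(w, x, y):
    """Gradient of mean squared error for a linear model on one shard."""
    residual = x @ w - y
    return 2.0 * x.T @ residual / len(y)

def all_reduce_mean(grads):
    """Stand-in for an all-reduce: every worker gets the average gradient."""
    return np.mean(grads, axis=0)

rng = np.random.default_rng(0)
num_workers, per_worker_batch, dim = 4, 8, 3
x = rng.normal(size=(num_workers * per_worker_batch, dim))
y = rng.normal(size=num_workers * per_worker_batch)
w = np.zeros(dim)

# Split the global batch evenly, one shard per worker.
x_shards = np.split(x, num_workers)
y_shards = np.split(y, num_workers)

# Each worker runs forward and backward propagation on its own shard.
worker_grads = [local_gradient(w, xs, ys) for xs, ys in zip(x_shards, y_shards)]

# Gradients are synchronized, then every worker applies the same update.
synced_grad = all_reduce_mean(worker_grads)
w_new = w - 0.1 * synced_grad

# With equal shard sizes this matches the gradient of the whole batch.
full_grad = local_gradient(w, x, y)
print("matches full-batch gradient:", np.allclose(synced_grad, full_grad))
```

Because the shards are equal in size, averaging the per-worker gradients gives the same result as computing the gradient over the full batch on one device, which is what keeps the workers' model copies identical after each step.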

Even with these advances, the team adds that training with very large batches leads to somewhat worse generalization, requiring more data to converge to a similar validation BPC (bits per character) and transfer accuracy as small-batch training. “Learning rate schedule modifications are necessary to help with convergence. Without such techniques evaluation quality begins to decline as batch size increases, or the model fails to converge if the learning rate is scaled too high.”
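The article does not spell out which schedule modifications the team used. As an illustration only, the sketch below shows one common large-batch recipe: scale the learning rate with the batch size, cap it, ramp up with a linear warmup, then decay linearly. The function name `large_batch_lr` and every numeric value are hypothetical.

```python
def large_batch_lr(step, base_lr, base_batch, batch,
                   warmup_steps, total_steps, lr_cap=None):
    """Learning rate at `step` with linear scaling, warmup, and linear decay."""
    # Linear scaling rule: grow the rate in proportion to the batch size.
    peak = base_lr * batch / base_batch
    # Optional cap, since scaling the rate too high can prevent convergence.
    if lr_cap is not None:
        peak = min(peak, lr_cap)
    # Warmup: ramp linearly up to the peak rate.
    if step < warmup_steps:
        return peak * (step + 1) / warmup_steps
    # Decay: fall linearly from the peak to zero at the final step.
    return peak * max(total_steps - step, 0) / (total_steps - warmup_steps)

schedule = [
    large_batch_lr(s, base_lr=1e-3, base_batch=128, batch=32768,
                   warmup_steps=100, total_steps=1000, lr_cap=0.05)
    for s in range(1001)
]
print("first step:", schedule[0])
print("end of warmup:", schedule[99])
print("final step:", schedule[-1])
```

The cap reflects the point made in the quote above: scaling the learning rate alone, without limits, is not guaranteed to converge.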

The full paper from Nvidia researchers can be found here.
