Running TensorFlow at Petascale and Beyond

TensorFlow, probably the most popular of the dozen or so deep learning frameworks, is typically used to develop neural networks on small or medium-sized clusters, and sometimes on just a single GPU-accelerated node. But supercomputing technology is now being used to train these models on thousands of nodes.

Back in September 2018, the National Energy Research Scientific Computing Center (NERSC) at Berkeley Lab announced a cosmological research project, known as CosmoFlow, which employed TensorFlow-based training using 8,192 nodes on Cori, one of the world’s most powerful supercomputers. The effort involved a three-way collaboration between Cray, Intel, and NERSC to develop a convolutional neural network (CNN) to determine some key cosmological parameters of the universe based on 3D N-body simulations.

The project represented the first instance of a large-scale science application that used TensorFlow technology to train a neural network on a leading-edge supercomputer. The trained model was able to utilize the results of the simulations to predict three key cosmological parameters that describe the nature of the universe with “an unprecedented level of accuracy.” According to the research team, the trained model could estimate the values of two of the parameters with the same degree of accuracy as existing experiments, and the third parameter significantly better than previous deep learning attempts.

The entire CosmoFlow run on the 8,000-plus Cori nodes took about 9 minutes, 8 minutes of which was used for training. Sustained performance during the training phase was 3.5 petaflops. Note that the Cori nodes contain no GPUs, relying instead on a single Intel “Knights Landing” processor, in this case, the 68-core Xeon Phi 7250.

Intel tweaked its Math Kernel Library for Deep Neural Networks (MKL-DNN) software to optimize Xeon Phi performance at the node level, while the Cray PE Machine Learning Plugin was used to improve the scalability of the TensorFlow training across nodes (primarily by replacing TensorFlow’s socket-based gRPC communications with an MPI-optimized one). I/O performance benefited from Cray’s DataWarp technology, which acts as a high-performance storage cache.
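To make the communication change more concrete, here is a minimal sketch of the kind of collective operation that MPI-style data-parallel training typically relies on: an allreduce that averages gradients across workers. It is plain Python with simulated workers, not the Cray plugin's API (which the article does not describe), and the function name `ring_allreduce_mean` and the toy gradient values are assumptions made purely for illustration.

```python
def ring_allreduce_mean(worker_grads):
    """Average equal-length gradient vectors across simulated workers.

    Illustrative only: mimics a ring allreduce (reduce-scatter followed by
    allgather). Each phase takes n - 1 steps for n workers, and in every
    step each worker sends one chunk to its right-hand neighbour.
    """
    n = len(worker_grads)
    length = len(worker_grads[0])
    if any(len(g) != length for g in worker_grads):
        raise ValueError("all workers must hold gradients of the same length")
    if length % n != 0:
        raise ValueError("this sketch assumes the length divides evenly by n")
    chunk = length // n
    bufs = [list(g) for g in worker_grads]  # each worker's private buffer

    def seg(c):
        return slice(c * chunk, (c + 1) * chunk)

    # Phase 1, reduce-scatter: partial sums travel around the ring.
    for step in range(n - 1):
        # Collect all messages first so the sends in a step are simultaneous.
        msgs = [((r + 1) % n, (r - step) % n, bufs[r][seg((r - step) % n)])
                for r in range(n)]
        for dst, c, data in msgs:
            bufs[dst][seg(c)] = [a + b for a, b in zip(bufs[dst][seg(c)], data)]

    # Worker r now holds the complete sum for chunk (r + 1) % n.
    # Phase 2, allgather: completed chunks travel around the ring.
    for step in range(n - 1):
        msgs = [((r + 1) % n, (r + 1 - step) % n, bufs[r][seg((r + 1 - step) % n)])
                for r in range(n)]
        for dst, c, data in msgs:
            bufs[dst][seg(c)] = data

    # Turn the sums into means; every worker ends with the same vector.
    return [[x / n for x in buf] for buf in bufs]


# Toy example: two workers, four gradient values each.
averaged = ring_allreduce_mean([[1, 2, 3, 4], [5, 6, 7, 8]])
print(averaged[0])
```

In a real deployment the exchange would be carried out by an MPI library over the system interconnect rather than by list copies, but the data flow is the same: every worker contributes its local gradients and every worker ends up with the global average.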

As we all know by now, Intel has jettisoned the Xeon Phi product line, so some of the CosmoFlow work may go by the wayside. But since Intel will presumably continue to support its MKL-DNN libraries on its future high performance processors – the Xeon Scalable Processor (SP) line and its Advanced Processor (AP) variant – HPC users who are partial to CPU-based training should continue to have an outlet.

A month after the CosmoFlow announcement, Lawrence Berkeley National Laboratory (LBNL), Oak Ridge National Laboratory (ORNL), and Nvidia revealed they had developed a TensorFlow-based deep learning model that was trained on ORNL’s Summit supercomputer. As we previously reported, the training achieved a sustained performance of 999 petaflops and a peak performance of 1.13 exaflops. This particular neural network was taught to identify extreme weather patterns from high-resolution climate simulations.

The training application used 4,608 of Summit’s compute nodes, but unlike Cori, each of these nodes is accelerated with six Nvidia V100 GPUs. And since each of the V100 processors can deliver 125 teraflops of raw machine learning performance, that pretty much explains why the Summit deep learning application was able to tap into so many flops.
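The arithmetic behind that claim can be sketched in a few lines. The node count, GPUs per node, per-GPU rating, and the sustained and peak figures all come from the article; the function name `aggregate_peak_flops` and the variable names are assumptions introduced for the example.

```python
TERA = 1e12
PETA = 1e15
EXA = 1e18


def aggregate_peak_flops(nodes, gpus_per_node, flops_per_gpu):
    """Theoretical aggregate rate if every GPU ran at its rated peak."""
    return nodes * gpus_per_node * flops_per_gpu


# Figures quoted in the article.
summit_nodes = 4608
gpus_per_node = 6
v100_ml_flops = 125 * TERA          # rated machine learning throughput per V100
sustained_flops = 999 * PETA        # reported sustained training performance
reported_peak_flops = 1.13 * EXA    # reported peak training performance

theoretical = aggregate_peak_flops(summit_nodes, gpus_per_node, v100_ml_flops)

print(f"GPUs used:            {summit_nodes * gpus_per_node}")
print(f"Theoretical peak:     {theoretical / EXA:.3f} exaflops")
print(f"Sustained / rated:    {sustained_flops / theoretical:.1%}")
print(f"Reported peak / rated: {reported_peak_flops / theoretical:.1%}")
```

By this estimate the 27,648 GPUs have a combined rated throughput of roughly 3.46 exaflops, so the reported sustained figure corresponds to a bit under 30 percent of what the hardware could deliver on paper.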

In this case, supercomputing scale was achieved by enhancing the underlying deep convolutional neural network software, dubbed DeepLabv3+. Those modifications centered around improving the software’s scalability and performance, which included enhancing its ability to extract pixel-level features and perform per-pixel classification.

Other enhancements that helped boost overall performance included an optimized data ingestion pipeline and multi-channel segmentation. The latter was used to expand the three-channel image segmentation (based on red/green/blue images) to 16 channels that incorporated additional climate dataset parameters like temperature, wind speed, and so on.
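One plausible reading of the multi-channel idea is that each climate variable becomes its own input channel, in the same way red, green, and blue are separate channels of an ordinary image. The sketch below stacks per-variable grids into a single height x width x channels structure. The helper `stack_channels`, the variable names, and the tiny 2 x 2 grids are all hypothetical; the article does not list the 16 variables or describe the researchers' actual data layout.

```python
def stack_channels(fields):
    """Stack named 2D grids into one channels-last grid (H x W x C).

    `fields` maps a variable name to a 2D list of values. Channel order
    follows the insertion order of the dict.
    """
    names = list(fields)
    height = len(fields[names[0]])
    width = len(fields[names[0]][0])
    for name in names:
        grid = fields[name]
        if len(grid) != height or any(len(row) != width for row in grid):
            raise ValueError(f"grid {name!r} does not match {height} x {width}")
    return [[[fields[name][i][j] for name in names]
             for j in range(width)]
            for i in range(height)]


# Hypothetical 2 x 2 grids for three variables (the real model used 16).
climate_fields = {
    "temperature": [[280.0, 281.5], [279.2, 283.1]],
    "wind_speed": [[5.0, 7.5], [12.0, 3.2]],
    "pressure": [[1012.0, 1009.5], [998.0, 1015.3]],
}

stacked = stack_channels(climate_fields)
print(len(stacked), len(stacked[0]), len(stacked[0][0]))  # height, width, channels
print(stacked[0][1])  # all variables at grid cell (row 0, column 1)
```

A segmentation network fed this way sees every variable at every grid cell, so widening the input from 3 to 16 channels changes the depth of the input while the per-pixel classification output keeps the same spatial shape.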

According to the researchers, all the software tweaks “dramatically improved the accuracy of the models.” It’s noteworthy that this work garnered the ORNL, LBNL, and Nvidia researchers the 2018 ACM Gordon Bell Prize.

While it might be tempting to think that instilling machine learning with this level of scalability is merely an exercise in computer science, being able to train complex models in minutes rather than days or weeks has potentially profound ramifications for how this technology can be used. Of course, not every data scientist is going to have a supercomputer at his or her disposal, but they will have access to public clouds that are increasingly being outfitted with HPC-capable processors, networking, and I/O. And it’s here where we would expect this research work from the national labs to find wider application.

Jeff Layton says:

February 4, 2019 at 5:55 pm

I always think there is a little irony in new technologies. Many of them don’t look at the past, or even current, state. Instead, they invent their own until people discover it’s not as good as hoped. Case in point from the article:

“… Cray PE Machine Learning Plugin was used to improve the scalability of the TensorFlow training across nodes (primarily by replacing TensorFlow’s socket-based gRPC communications with an MPI-optimized one.)”

Maybe they should have looked at MPI to begin with and saved themselves lots of trouble (Uber did that with Horovod).
