China Pushes Breadth-First Search Across Ten Million Cores

There is increasing interplay between the worlds of machine learning and high performance computing (HPC). The relationship began with a shared hardware and software story, since many supercomputing tricks of the trade carry over to deep learning, but as we look to next-generation machines, the bond keeps tightening.

Many supercomputing sites are figuring out how to work deep learning into their existing workflows, either as a pre- or post-processing step, while some research areas might eventually do away with traditional supercomputing simulations altogether. While these massive machines were designed with simulations in mind, the strongest supers have architectures that parallel the unique requirements of training and inference workloads. One such system in the U.S. is the future Summit supercomputer coming to Oak Ridge National Lab later this year, but many of the other architectures especially well suited to machine learning are in China and Japan, and they feature non-standard processing elements.

The Sunway TaihuLight supercomputer, the most powerful on the planet according to the Top 500 rankings of the world's top systems, might be a powerhouse in China for traditional scientific applications, but the machine is also primed to move the country along the bleeding edge in deep learning and machine learning.

Back in June 2016, we described the wide range of applications set to run on China's top system, noting significant progress in adding deep learning libraries and tooling. Since then, other efforts to spur machine learning development on the system among Chinese researchers have cropped up, with similar emphasis on machine learning for current and future exascale systems in Japan as well. In short, supercomputing in Asia has taken a turn toward AI, and as it turns out, the Sunway TaihuLight system might be the right tool for doing double duty on both scientific and machine learning applications.

Last summer, when we described the architecture of the Sunway TaihuLight supercomputer, we noticed a few interesting things about real-world application performance. It turns out its architecture is well-suited (as large supercomputers go, anyway) to graph and other irregular algorithms, something that makes it prime for the next generation of neural network and other non-traditional HPC applications. This observation was confirmed by the system's performance on the Graph 500 benchmark, which measures the performance and efficiency of graph traversals, something important for data-intensive HPC workloads.

While the results of the benchmark were published last year, researchers and system engineers have just released a detailed paper describing the optimizations on the unique architecture that achieved those results. In doing so, they also provided more insight into the architecture than we have seen to date, along with a rationale for why the machine performs well on workloads with irregular accesses and other features well-aligned with machine learning.

The Sunway TaihuLight machine has a peak performance of 125.4 petaflops across 10,649,600 cores and sports 1.31 petabytes of main memory. To put the peak performance figure in context, recall that the top supercomputer until TaihuLight arrived had been Tianhe-2, with a 33.86 petaflop capability. One key difference, beyond the clear gap in peak potential, is that TaihuLight came out of the gate with demonstrated high performance on real-world applications, some of which are able to utilize over 8 million of the machine's 10 million-plus cores.

As a refresher, the processors in the Sunway TaihuLight system have a highly heterogeneous manycore architecture and memory hierarchy. Each processor in TaihuLight contains four general purpose cores, each of which is paired with 64 on-chip accelerator cores. Each accelerator core has a 64 KB on-chip scratchpad memory. The four general purpose cores and the 256 accelerator cores all access 32 GB of shared off-chip main memory. The system designers say that, for chip design and manufacturing reasons, the machine does without cache coherence, opting instead to spend more of the chip's area on computing. This creates challenges for implementing BFS across over ten million cores with accelerators added to the mix, but it does bode well for future machine learning workloads on TaihuLight.
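
To make that trade-off concrete, below is a minimal sketch in plain C of the copy-in, compute, copy-out discipline that a software-managed 64 KB scratchpad imposes when there is no cache coherence to fall back on. The `dma_get` and `dma_put` helpers, the buffer sizes, and the kernel itself are illustrative assumptions standing in for the Sunway toolchain's vendor-specific DMA intrinsics, not the actual API.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for the vendor DMA calls that move data between
 * off-chip main memory and an accelerator core's scratchpad. On the real
 * chip these are asynchronous, architecture-specific intrinsics. */
static void dma_get(void *scratch, const void *main_mem, size_t bytes) {
    memcpy(scratch, main_mem, bytes);   /* placeholder: blocking copy */
}
static void dma_put(void *main_mem, const void *scratch, size_t bytes) {
    memcpy(main_mem, scratch, bytes);
}

#define SCRATCHPAD_BYTES (64 * 1024)   /* 64 KB per accelerator core */
#define TILE (SCRATCHPAD_BYTES / (2 * sizeof(int64_t)))  /* room for in + out */

/* Each accelerator core works through its slice of a large array in tiles
 * that fit in the scratchpad: copy in, compute locally, copy the result
 * back. With no cache coherence, all sharing is explicit and software-managed. */
void scale_slice(int64_t *data, size_t n, int64_t factor) {
    int64_t in[TILE], out[TILE];        /* stand-ins for scratchpad buffers */
    for (size_t base = 0; base < n; base += TILE) {
        size_t chunk = (n - base < TILE) ? (n - base) : TILE;
        dma_get(in, data + base, chunk * sizeof(int64_t));
        for (size_t i = 0; i < chunk; i++)
            out[i] = in[i] * factor;    /* all operands already on chip */
        dma_put(data + base, out, chunk * sizeof(int64_t));
    }
}
```

The same buffering pattern applies to any data the accelerator cores touch, frontier and adjacency lists included, which is the root of the BFS challenge mentioned above.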

The researchers who implemented BFS across the system note that this is the best-performing heterogeneous architecture on the Graph 500, a distinction that brings both benefits and challenges. The algorithm itself has many characteristics that sit poorly with traditional supercomputing approaches: frequent and random data accesses that plug up I/O, heavy data dependencies, and irregular data distribution. Even so, with accelerators in the mix, the system achieved 23,755.7 giga-traversed edges per second (GTEPS), the best result among accelerated systems on the list.
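
For readers who want the shape of the algorithm behind that GTEPS number, here is a serial, single-node sketch in C of the textbook level-synchronous BFS that Graph 500 is built around, using a compressed sparse row (CSR) graph. It is only an illustration of where the random accesses come from, not the distributed, accelerator-offloaded implementation described in the paper.

```c
#include <stdlib.h>

/* Graph in compressed sparse row (CSR) form: the neighbors of vertex v
 * are adj[row[v] .. row[v+1]-1]. */
typedef struct {
    long nv;      /* number of vertices            */
    long *row;    /* nv + 1 offsets into adj       */
    long *adj;    /* concatenated adjacency lists  */
} Graph;

/* Level-synchronous BFS from a single source: expand the current frontier,
 * mark unvisited neighbors, and swap frontiers each level. The loads of
 * adj[] and parent[] jump unpredictably through memory, which is exactly
 * the irregular access pattern Graph 500 stresses. */
void bfs(const Graph *g, long source, long *parent) {
    long *frontier = malloc(g->nv * sizeof(long));
    long *next     = malloc(g->nv * sizeof(long));
    for (long v = 0; v < g->nv; v++) parent[v] = -1;

    parent[source] = source;
    frontier[0] = source;
    long fsize = 1;

    while (fsize > 0) {
        long nsize = 0;
        for (long i = 0; i < fsize; i++) {          /* every frontier vertex */
            long v = frontier[i];
            for (long e = g->row[v]; e < g->row[v + 1]; e++) {
                long w = g->adj[e];                 /* data-dependent, irregular load */
                if (parent[w] == -1) {              /* first visit wins */
                    parent[w] = v;
                    next[nsize++] = w;
                }
            }
        }
        long *tmp = frontier; frontier = next; next = tmp;  /* advance one level */
        fsize = nsize;
    }
    free(frontier);
    free(next);
}
```

At TaihuLight's scale, the parent array and the adjacency lists are partitioned across nodes, so those data-dependent loads become fine-grained remote traffic that any distributed implementation has to batch and aggregate to keep ten million cores busy.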

While the Graph 500 benchmark is not necessarily a measure of a system's capability to do large-scale machine learning, it is an indicator. The current top spot on the Graph 500 list is held by the K Computer at RIKEN in Japan, which will be getting its own upgrade at a time when the system's leaders are talking a great deal about how deep learning might fit into traditional HPC realms.

In the supercomputing world, the dominant benchmark is still the Top 500, which ranks the top machines by double-precision floating point performance. A companion benchmark, HPCG, is getting more attention for its focus on how actual applications behave on these machines; the TaihuLight system performed well on it in absolute terms, but with abysmally low system utilization relative to the other top-tier machines. In short, what the machine might offer in exceptional machine learning performance on a unique supercomputing architecture, it might lack in efficiency. Of course, this is true of almost all of the machines on all of the benchmarks listed: efficiency and utilization figures pale in comparison to the headline performance numbers.

What might be needed is yet another benchmark for the supercomputing community. Dr. Jack Dongarra (behind the Top 500, HPCG, and an emerging ranking that looks at single precision performance) might weep at the thought of yet another metric, but with deep learning entering the HPC sphere in such a dramatic way, adding a machine learning-centric set of baselines could make sense. Such a metric could be based on a classification problem or an existing neural network benchmark and provide results for both training and inference. With so many GPU-accelerated machines on the Top 500 (and more coming that feature the latest Pascal and future Volta GPUs), it would be interesting to gauge their performance at scale and, more importantly, get a sense of how well these models can actually scale across some of the world's largest machines. Multi-GPU scaling is one problem the HPC world has figured out, but what about scaling across custom accelerators, as the Chinese engineers demonstrated on the TaihuLight system, or across ARM, SPARC, and other architectures?

The point is, Graph 500 has worked well for measuring data-intensive computing performance on top supercomputers to date. It arose during the wave of interest in "big data" a few years ago. That interest has now given way to machine learning as the next level of analytics, which deserves a benchmark that applies to the largest systems so their performance can be compared and assessed.
