Future Cray Clusters to Storm Deep Learning
April 18, 2016 Nicole Hemsoth
Supercomputer maker Cray might not roll out machines for deep learning anytime in 2016, but like other system vendors with deep roots in high performance computing, which leverages many of the same hardware elements (strong interconnect and GPU acceleration, among others), the company is working out how to fold its expertise into a future where machine learning rules.
“Deep learning and machine learning dovetails with our overall strategy and that overall merger between HPC and analytics that has already happened,” Cray CTO Steve Scott tells The Next Platform. “HPC is not just one single application or set of them; there are many, all with different needs with respect to interconnect, compute, memory systems, storage, and IO. Deep learning is just another set of applications and market, with a lot of dense computation that GPUs are good at, and a lot of communication globally during training. So this is right in our wheelhouse.”
Scott has shifted with the architectural tides since beginning his career at Cray in 1992 and has been one of the technical leads behind some of the company’s most successful supercomputers, including the Cray X1, the Cascade machines, and the Cray XC series of systems. The rising tide of machine learning interest has lifted the internal boat at Cray engineering-wise, with R&D teams looking at everything from how to choose and integrate core deep learning packages onto Cray platforms to how the architecture might evolve to fit the needs of both the supercomputing set and the new batch of machine learning users. As it turns out, these requirements are similar in many ways for the heavy compute side of deep learning, which is training. High bandwidth for multi-GPU systems is something the company has already been delivering, so it’s a matter of firming up the software environment and retailoring around future GPUs, including Pascal with Nvidia’s intra-node interconnect, NVLink.
The emerging machine learning market is in the company’s crosshairs as well because Cray already has a product designed for the high bandwidth needed for compute-intensive applications. Scott points to the Cray CS Storm as an architecture that fits both machine learning training and high performance computing applications, and says that in many ways it is not so dissimilar to Nvidia’s DGX-1 boxes, which have been targeted at deep learning training (and can also serve as ready-to-roll HPC nodes, according to the GPU maker’s VP of Solutions Architecture and Engineering).
“If you look at the DGX-1 from Nvidia, you’ll see it looks a lot like our Storm blades, very much like the next generation Pascal version of the CS Storm, and while we haven’t announced the follow-on to Storm, you can imagine how it might look similar,” Scott says. The key difference he highlights is that while Nvidia has emphasized the DGX-1 in a single-node context, with Storm, especially in the future with Pascal and NVLink, Cray can put this together as a fully engineered, configured, and tested system with many nodes, each sporting eight GPUs (and perhaps more eventually). “Many of our systems on the Top 500 that are highest placed are Storms. We can scale from a single box like DGX-1 to a system with racks and racks of those, scaling up to very large aggregate systems.”
Aside from using Pascal or Nvidia Tesla K80s to outfit such boxes for HPC, there is also the possibility that Cray might make a dedicated system that looks a lot like a CS Storm, but uses the new M40 GPUs for training at lower price and thermal points and with a GPU that’s available now. Whatever the case, Scott says if they tackle the machine learning market, “it won’t be a far departure architecturally, but there will be more software that fits well on well-engineered GPU and CPU systems.”
Although there are companies that have rolled out 16-GPU systems already, some of which have found their way into deep learning shops, the question is whether Cray would focus on this to suit both the machine learning and HPC crowds. Scott says that Cray is indeed considering this in the lead-up to Storm’s follow-on, noting that the key to doing it the right way is the interconnect. The mix of workloads is pushing this forward as well, with some Storm users now running a combination of capacity workloads, where jobs are kept on a single node, and scalable workloads, where all GPUs are used against a capability problem. In short, developments toward a 16-GPU follow-on to Storm could benefit core users of those systems in both HPC and machine learning.
With NVLink and Pascal comes the opportunity for very high bandwidth among the GPUs in a node. The CS Storm now has 8 GPUs and 2 CPUs connected by PCIe 3.0, which means lower bandwidth between those 8 GPUs, something that will be dramatically altered for Cray and other OEM partners when Pascal becomes widely available later this year. “In Pascal with multi-GPU nodes hooked with NVLink, we expect to see a lot of applications for those; some will be distributed across GPUs so that each kernel just runs within a single GPU and others that need to just communicate data between the GPUs in a node. We will have a mix, but deep learning will be one of those that can take advantage of this.”
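To get a rough feel for why intra-node bandwidth matters for training, the sketch below estimates how long an idealized ring all-reduce of gradients would take across the GPUs in one node. None of this comes from Cray or Nvidia benchmarks: the function names are hypothetical, the 1 GB gradient payload is an arbitrary illustration, the PCIe figure is the nominal x16 Gen3 peak, and the 20 GB/s per direction figure for a single first generation NVLink link is an assumed nominal peak. Real systems will land below these numbers, and a GPU with several NVLink links would do better than the single-link case shown.

```python
# Back-of-the-envelope sketch; every figure is a nominal peak assumption,
# not a measurement of a CS Storm, a DGX-1, or any shipping system.

def pcie3_x16_gb_per_s():
    """Nominal PCIe 3.0 x16 bandwidth per direction, in GB/s."""
    gigatransfers_per_lane = 8.0      # PCIe 3.0 signals at 8 GT/s per lane
    lanes = 16
    encoding = 128.0 / 130.0          # 128b/130b line encoding overhead
    return gigatransfers_per_lane * lanes * encoding / 8.0  # bits to bytes


def ring_allreduce_seconds(payload_gb, num_gpus, link_gb_per_s):
    """Idealized ring all-reduce time for one payload.

    Each GPU sends (and receives) 2 * (N - 1) / N times the payload over
    its link; latency and software overhead are ignored.
    """
    if num_gpus < 2:
        return 0.0
    traffic_gb = 2.0 * (num_gpus - 1) / num_gpus * payload_gb
    return traffic_gb / link_gb_per_s


if __name__ == "__main__":
    gradients_gb = 1.0                # hypothetical gradient payload
    gpus = 8                          # GPUs per node, as in today's CS Storm
    nvlink_single_link = 20.0         # assumed GB/s per direction, one link

    for name, bw in (("PCIe 3.0 x16", pcie3_x16_gb_per_s()),
                     ("NVLink, one link", nvlink_single_link)):
        t = ring_allreduce_seconds(gradients_gb, gpus, bw)
        print(f"{name}: {bw:.2f} GB/s, all-reduce in about {t * 1000:.1f} ms")
```

The model is deliberately simple: it treats every GPU as having one dedicated link to its ring neighbor, which is more generous to PCIe than a real topology where GPUs share switches and root ports.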
For future Storm machines, “When we build with Pascal, those systems will have NVLink-based intranode interconnects for more bandwidth. The system interconnect will continue to be InfiniBand and first generation OmniPath,” Scott says. For the XC machines aimed squarely at HPC, which are based on the Aries interconnect, nothing will change, although with its “Shasta” systems in 2019 Cray will have the second generation of OmniPath, with higher network bandwidth, and more flexibility because, as Scott explains, “it will be possible to choose either high-density scale optimized cabinets or standard rackmount cabinets for the high performance interconnect and the system software stack. In other words, you can build high performance systems with either type of cabinet with a high performance interconnect, which means it will be possible to build high-memory nodes, multi-GPU nodes, and others, offering more flexibility than what we have with the XC system.”
The short version here is that with an architecture like Storm, which already has a strong customer base in HPC and all the right plumbing (and soon, software stacks) to support its base as well as deep learning, Cray could turn a tried and true supercomputer line into a deep learning powerhouse, both by swapping in less expensive M40 GPUs for training and, at the high end, with Pascal plus NVLink for something that looks like a system version of the single-node DGX-1 boxes Nvidia launched at GTC last month.
And just as Cray saw a market opportunity to roll its supercomputing expertise into what was an emerging “big data” space in 2010 and beyond, the company is seeing a multi-billion dollar chance to move its engineering into deep learning as well. The thing to note is that Cray won’t be the only company to take existing systems and push new software onto them. But like the handful of other HPC-specific system vendors (SGI, HP, Dell, and others) that know HPC and are definitely eyeing deep learning, it has the potential to take existing HPC customers and show them how deeper insight might be sifted from their massive wells of data using machine learning approaches on architectures such centers are already familiar with. Either way, it will be an interesting year for both the GPU computing space and the HPC system ecosystem as both converge to prop up new applications.