Future Cray Clusters to Storm Deep Learning

Supercomputer maker Cray might not roll out machines for deep learning in 2016, but like other system vendors with deep roots in high performance computing, which leverages many of the same hardware elements (strong interconnects and GPU acceleration, among others), the company is working out how to apply its expertise in a future where machine learning rules.

“Deep learning and machine learning dovetails with our overall strategy and that overall merger between HPC and analytics that has already happened,” Cray CTO Steve Scott tells The Next Platform. “HPC is not just one single application or set of them; there are many, all with different needs with respect to interconnect, compute, memory systems, storage, and IO. Deep learning is just another set of applications and market, with a lot of dense computation that GPUs are good at, and a lot of communication globally during training. So this is right in our wheelhouse.”

Scott has shifted with the architectural tides since beginning his career at Cray in 1992 and has been one of the technical leads behind some of the company’s most successful supercomputers, including the Cray X1, the Cascade machines, and the Cray XC series of systems. The rising tide of machine learning interest has lifted the internal boat at Cray engineering-wise, with R&D teams looking at everything from how to choose and integrate core deep learning packages onto Cray platforms to how the architecture might evolve to fit the needs of both the supercomputing set and the new batch of machine learning users. As it turns out, these requirements are similar in many ways for the heavy compute side of deep learning, which is training. High bandwidth for multi-GPU systems is something the company has already been delivering, so it is a matter of firming up the software environment and retailoring around future GPUs, including Pascal with Nvidia’s intra-node interconnect, NVLink.

The emerging machine learning market is in the company’s crosshairs as well because they already have a product designed for the high bandwidth needed for compute-intensive applications. Scott points to the Cray CS Storm as the architecture that is both a fit for machine learning training and high performance computing applications and says that in many ways, it is not so dissimilar to Nvidia’s DGX-1 boxes that have been targeted at deep learning training (and can also serve as ready-to-roll HPC nodes, according to the GPU maker’s VP of Solutions Architecture and Engineering).

“If you look at the DGX-1 from Nvidia, you’ll see it looks a lot like our Storm blades, very much like the next generation Pascal version of the CS Storm, and while we haven’t announced the follow-on to Storm, you can imagine how it might look similar,” Scott says. The key difference he highlights is that while Nvidia has emphasized the DGX-1 in a single-node context, with Storm, especially in the future with Pascal and NVLink, Cray can put this together as a fully engineered, configured, and tested system with many nodes, each sporting eight GPUs (and perhaps more eventually). “Many of our systems on the Top 500 that are highest placed are Storms. We can scale from a single box like DGX-1 to a system with racks and racks of those, scaling up to very large aggregate systems.”

Aside from using Pascal or Nvidia Tesla K80s to outfit such boxes for HPC, there is also the possibility that Cray might make a dedicated system that looks a lot like a CS Storm but uses the new M40 GPUs for training, at lower price and thermal points and with a GPU that is available now. Whatever the case, Scott says that if Cray tackles the machine learning market, “it won’t be a far departure architecturally, but there will be more software that fits well on well-engineered GPU and CPU systems.”

Although there are companies that have rolled out 16-GPU systems already, some of which have found their way into deep learning shops, the question is whether Cray would focus on this to suit both the machine learning and HPC crowds. Scott says that Cray is indeed considering this in the lead-up to Storm’s follow-on, noting that the key to doing it the right way is the interconnect. The mix of workloads is pushing this forward as well, with some users of the Storm now running a combination of capacity workloads, where jobs are kept on a single node, and scalable workloads, where all GPUs are used against a capability problem. In short, developments toward a 16-GPU follow-on to Storm could benefit the core users of those systems in both HPC and machine learning.

With NVLink and Pascal comes the opportunity for very high bandwidth among the GPUs in a node. The CS Storm now has 8 GPUs and 2 CPUs connected by PCIe 3.0, which means lower bandwidth between those 8 GPUs, something that will be dramatically altered for Cray and other OEM partners when Pascal becomes widely available later this year. “In Pascal with multi-GPU nodes hooked with NVLink, we expect to see a lot of applications for those; some will be distributed across GPUs so that each kernel just runs within a single GPU and others that need to just communicate data between the GPUs in a node. We will have a mix, but deep learning will be one of those that can take advantage of this.”
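To make the bandwidth point concrete, here is a minimal back-of-envelope sketch in Python. It is not anything Cray or Nvidia has published. It assumes gradients are exchanged with a ring all-reduce and ignores latency entirely; the function name, the 100-million-parameter model, and both link rates are hypothetical round numbers chosen only for illustration.

```python
def ring_allreduce_seconds(payload_bytes, num_gpus, link_bytes_per_s):
    """Bandwidth-only estimate of one ring all-reduce (latency ignored)."""
    if num_gpus < 2:
        return 0.0  # a single GPU has nothing to exchange
    # In a ring all-reduce each GPU sends (and receives) a total of
    # 2 * (N - 1) / N times the payload size.
    traffic = 2 * (num_gpus - 1) / num_gpus * payload_bytes
    return traffic / link_bytes_per_s


# Hypothetical workload: 100 million float32 gradients per training step.
GRADIENT_BYTES = 100e6 * 4
# Eight GPUs per node, as in the CS Storm described in the article.
NUM_GPUS = 8

# Assumed per-GPU link rates in bytes/s. Illustrative placeholders only,
# not vendor specifications or measurements.
SLOW_LINK = 16e9  # stand-in for a PCIe-class link
FAST_LINK = 80e9  # stand-in for a faster intra-node link

for label, rate in (("slow link", SLOW_LINK), ("fast link", FAST_LINK)):
    seconds = ring_allreduce_seconds(GRADIENT_BYTES, NUM_GPUS, rate)
    print(f"{label}: {seconds * 1e3:.1f} ms per gradient exchange")
```

Because the estimate is bandwidth-only, the ratio of the two times simply equals the ratio of the assumed link rates. Real systems add latency, topology, and software overheads, so this is only a way to reason about why intra-node bandwidth matters for training.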

For future Storm machines, “When we build with Pascal those systems will have NVLink based intranode interconnects for more bandwidth. The system interconnect will continue to be InfiniBand and first generation OmniPath,” Scott says. For the HPC-focused XC machines, which are based on the Aries interconnect, nothing will change, although in 2019 with its “Shasta” systems, Cray will have the second generation of OmniPath with higher network bandwidth and more flexibility because, as Scott explains, “it will be possible to choose either high-density scale optimized cabinets or standard rackmount cabinets for the high performance interconnect and the system software stack. In other words, you can build high performance systems with either type of cabinet with a high performance interconnect, which means it will be possible to build high-memory nodes, multi-GPU nodes, and others, offering more flexibility than what we have with the XC system.”

The short version here is that with an architecture like Storm, which already has a strong customer base in HPC and all the right plumbing (and soon, software stacks) to support its base as well as deep learning, Cray could turn a tried and true supercomputer line into a deep learning powerhouse, both by swapping in less expensive M40 GPUs for training and, at the high end, with Pascal plus NVLink for something that looks like a system version of the single-node DGX-1 boxes Nvidia launched at GTC last month.

And just as Cray saw a market opportunity to roll its supercomputing expertise into what was an emerging “big data” space in 2010 and beyond, the company sees a multi-billion dollar chance to move its engineering into deep learning as well. Cray won’t be the only company to take existing systems and push new software onto them, but like the handful of other HPC-focused system vendors (SGI, HP, Dell, and others) that are eyeing deep learning, it has the potential to show existing HPC customers how deeper insight might be sifted from their massive wells of data using machine learning approaches on architectures such centers are already familiar with. Either way, it will be an interesting year for both the GPU computing space and the HPC system ecosystem as both converge to prop up new applications.

bill says:

April 18, 2016 at 11:40 am

What is Cray’s direction for sustainable differentiation? They superficially appear to be dangerously close to becoming just another integrator, with worse economics than their competitors.

E.g. the future machines sound like Intel boxes or a not-Intel option.

Intel Xeon or Xeon Phi with Intel Omni Path (Cray having sold off their interconnect team and IP to Intel). Cray provides custom sheetmetal and cooling infrastructure?

Alternatively, POWER9 + Mellanox IB + Nvidia GPUs. Again Cray provides “just” dense packaging and a commoditizing software stack? Also, on that path, one wonders how long it’ll take Dally et al to extend NVLink beyond the node and bypass Mellanox.. And whether they harbor ambitions to bypass IBM if they can ever produce a fast enough ARM processor to use as the host CPU.

On the storage side, Cray Sonexion looks a lot like repackaged Seagate/Xyratex.

On the custom analytics side, I strongly suspect Cray won’t do another custom processor in the MTA/XMT family.

So, to belabor the question, whither Cray differentiation? Given the work ODMs are doing to build ML oriented clusters for clouds, things seem brighter for them than Cray in that market..

- BlackDove says:
  
  April 23, 2016 at 10:00 pm
  
  Someone (I don’t remember who) from Cray said that they will definitely not be making another Threadstorm for their future Urika-like system.
  
  What you said about Cray seems to be true for much of the industry these days too. The never talked about Fujitsu PrimeHPC and NEC SX(and Aurora) custom architectures are about the only really interesting things being extensively developed. Western media almost never covers them though.
  
  When they hit the useful (not just HPL) exaflop first, they might get some coverage.
  
  Shasta is probably going to be a huge seller for Cray, and at a large scale it will be good for converged HPC and analytics.
