One of the first Knights Landing, Omni-Path supercomputers will hit the floor in Colorado in the coming months. One of the lead decision-makers for the system says he expects it to arrive in May, well ahead of when Knights Landing and Omni-Path were expected to appear even for early ship programs. That buffer provides a chance to make the tweaks and optimizations needed to ensure the scientific computing software stack is primed and ready for the changes Omni-Path in particular will bring.
According to Peter Ruprecht, senior HPC analyst and lead for the Knights Landing, Omni-Path machine at the University of Colorado, Boulder, where the new cluster will live, much of the system architecture was developed with the goal of suiting as many varied user applications as possible. While the majority of the system’s floating point capability will come from 380 Dell-built Haswell blades, ten nodes will sport Knights Landing processors and another ten will each carry two Nvidia Tesla K80 GPUs for acceleration. To round out the standard Haswell, Knights Landing, and GPU-backed nodes, the center also invested in a small number of four-socket, 2 TB high-memory nodes.
This will mark a significant performance improvement over the previous system, which will be retired in the next few months. That machine, “Janus,” sported 1,360 dual-socket Westmere nodes, was also put together by Dell, and served well when it debuted over four years ago (it ranked in the top 35 of the Top 500 then). But as data demands drove the need for a more balanced architecture, the team realized a new approach would be required.
The twenty Knights Landing-based nodes will arrive as an add-on to the system in summer or fall of this year and will use the socketed version of the chip with dual on-package Omni-Path connections. Although the teams expect application speedups, it is still too early to say how large they will be, since the lack of test hardware has meant no benchmarking efforts yet. “We had earlier versions of the Xeon Phi, but we are interested in that socketed version versus having a co-processor,” Ruprecht says.
When the machine arrives, the team expects a “trial by fire,” as Ruprecht puts it. Many of the changes to the software stack cannot be prepared for before Omni-Path arrives, and since the old cluster will be ripped out, the show must go on. The team plans to continue using SLURM and does not anticipate changing anything in the stack that it doesn’t need to, so it will be a harried time in Boulder, albeit one Ruprecht is looking forward to. Asked where the greatest concern lies with the changeover, he says how GPFS will interact with Omni-Path is a big unknown. The forthcoming storage environment, built around an SFA14K appliance, is GPFS-based. While IBM has been working hard to fully integrate GPFS with Omni-Path (much as it integrated GPFS with InfiniBand years ago), there are still a few unknowns. It might seem natural to consider switching to Lustre, which will likely have much tighter out-of-the-box integration with Omni-Path, but Ruprecht says that although this makes sense in the abstract, they chose GPFS because it delivers better underlying file operation performance for their workloads, a difference they have gauged by running both GPFS and Lustre at various times in the past.
The university is home to one of Intel’s Parallel Computing Centers, which might explain the architectural direction, but as Ruprecht tells The Next Platform, Knights Landing and Omni-Path would have interested the teams even without that influence. Given the general-purpose nature of the application base they serve, specialized architectures were not given weighty consideration, and the decision to go with Dell for the machine (most of the other major vendors also bid, he says) ultimately came down to the level of integration and, of course, price. As a side note, he says the bids they received in response to the RFP were all quite close on price, generally no more than 10 to 15 percent apart.
The forthcoming cluster has a name, by the way. It’s called “Summit,” which is not to be confused with the grand-scale system coming to Oak Ridge National Laboratory. Ruprecht took it in good humor when we said that for reader purposes we’ll be calling this one “Lil’ Summit,” but he wanted the final word: “we may not be the biggest, but we’ll be the first.” He then hinted at the size differences between Colorado’s Rockies and Oak Ridge’s Smokies. Ahem.