Accelerating HPC Investments In Canada
March 5, 2018 Jeffrey Burt
Details about the technologies being used in Canada’s newest and most powerful research supercomputer have been coming out piecemeal over the past several months, but now the complete story is available.
At the SC17 show in November, it was revealed that the HPC system will use Mellanox’s Dragonfly+ network topology and an NVM Express burst buffer fabric from Excelero as key parts of a cluster that will offer a peak performance of more than 4.6 petaflops.
Now Lenovo, which last fall won the contract for the Niagara system over 11 other vendors, is revealing this week that it is supplying 1,500 high-density ThinkSystem SD530 nodes as the foundation for the supercomputer. Niagara is one of four new HPC systems that SciNet is bringing into its network of computers, which helps power much of the scientific research that happens in Canada. The other systems are designed as essentially general purpose supercomputers for small and midsize workloads; by comparison, Niagara will be SciNet’s system for larger parallel jobs that need 1,000 or more nodes to get the work done.
Niagara will replace two older IBM systems, including the General Purpose Cluster (GPC) – where most of the computations for workloads in the country have been done – that was built from 3,780 nodes of IBM’s iDataPlex dx360-M2 systems that held 30,240 Intel Xeon “Nehalem” E5540 cores. Niagara, which will be housed at the University of Toronto, will also replace the Tightly Coupled System (TCS), a 104-node IBM Power6 cluster with 3,328 cores that ran its first job in December 2008. The TCS system was taken down last fall, and the computing center will keep about half of the GPC in place while Niagara gets up and running.
In a February presentation, Scott Northrup, an analyst at SciNet, said Niagara was becoming operational.
“Most of the configuration and setup is pretty much done,” Northrup said. “The contractor, Lenovo, is just finishing up the last bit of testing . . . and we’re going to have the next couple of weeks to have our configuration set up, and those hopefully have some early-user access so you can get in and actually start working on it.”
According to Lenovo, the system is now available to researchers throughout Canada and is aimed at big data workloads for such tasks as artificial intelligence (AI), climate change modeling, and astrophysics. Niagara will provide more than 3 petaflops of normal performance through Lenovo’s SD530 compute nodes and DSS-G storage combined with Mellanox’s high-speed InfiniBand in a Dragonfly+ network topology. An SD530 node comes with two of Intel’s “Skylake” Xeon SP processors, each with 20 cores, bringing the total to 60,000 cores for the Niagara cluster. Each node has 192 GB of RAM and 3 teraflops of performance.
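The node and core figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers from this article; the variable names are illustrative, and the per-node teraflops figure is rounded, so the aggregate it produces is approximate.

```python
# Figures as quoted in the article; variable names are illustrative.
NODES = 1_500
SOCKETS_PER_NODE = 2
CORES_PER_SOCKET = 20
TFLOPS_PER_NODE = 3       # rounded per-node figure from the article
RAM_GB_PER_NODE = 192

# Total cores: nodes x sockets per node x cores per socket.
total_cores = NODES * SOCKETS_PER_NODE * CORES_PER_SOCKET

# Aggregate compute from the rounded per-node figure, in petaflops.
aggregate_pflops = NODES * TFLOPS_PER_NODE / 1_000

# Aggregate memory in decimal terabytes.
total_ram_tb = NODES * RAM_GB_PER_NODE / 1_000

print(f"total cores: {total_cores:,}")
print(f"aggregate compute: {aggregate_pflops:.1f} petaflops")
print(f"total memory: {total_ram_tb:.0f} TB")
```

The core count comes out at 60,000, matching the figure cited. The rounded per-node number gives roughly 4.5 petaflops in aggregate, which is in line with the peak of more than 4.6 petaflops quoted for the cluster.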
At only one row of systems, Niagara will use less space than the GPC, which was housed in three rows, according to Northrup. It will run the CentOS 7 Linux operating system and use the SLURM job scheduler.
Lenovo has become a larger player in the HPC space since buying IBM’s x86 business for $2.1 billion in 2014. In the latest Top500 list released in November, Lenovo had 81 systems on the list – behind only Hewlett Packard Enterprise and its 122 – including two in the top 20 and four in the top 100. Most recently, Lenovo last month talked about the growing market for water-cooled HPC systems as the demand for greater density, performance and power and cost efficiencies grows. Lenovo’s 6U NeXtScale n1200 Direct Water Cooling enclosure is aimed at HPC and uses warm water to keep the systems inside cool.
In his presentation, SciNet’s Northrup said what separates Niagara from other HPC systems is the use of Mellanox’s Dragonfly+ network topology and the burst buffer fabric from Excelero. (We have detailed the Excelero NVM-Express storage fabric here.) The Dragonfly+ topology is inspired by the “Aries” interconnect from Cray, which is used with its XC40 and XC50 systems that are deployed in various clusters, he said. Mellanox has taken the Dragonfly topology used by Cray and incorporated it into its own InfiniBand offering, he said. Dragonfly+ is designed to give users like those running Niagara an efficient and highly scalable solution that can run dense workloads. It can dynamically manage network traffic and bandwidth and, paired with Lenovo’s ThinkSystem servers, can provide up to 600 kilowatts of energy savings on cooling needs, according to Mellanox. This Dragonfly+ topology uses only edge switches; it does away with core switches.
Niagara will include 10 nodes running Excelero’s NVMesh software that Northrup said can be used for temporary shared storage when more capacity is needed for particular workloads. The nodes will hold 80 NVM Express flash drives that form a unified, distributed pool of flash storage, a petascale system that provides about 148 GB/s of burst write throughput, 230 GB/s of read throughput, and more than 20 million random 4K IOPS.
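To put the burst buffer numbers in perspective, the aggregate figures can be divided across the drives and nodes. This is a rough sketch using only the numbers quoted above; the per-drive and per-node values are simple averages derived here, not vendor specifications, and the variable names are illustrative.

```python
# Figures as quoted in the article; the derived per-drive values are
# plain averages, not vendor specifications.
STORAGE_NODES = 10
DRIVES = 80
WRITE_GBPS = 148              # approximate aggregate burst write, GB/s
READ_GBPS = 230               # aggregate read throughput, GB/s
RANDOM_4K_IOPS = 20_000_000   # "more than 20 million", so a lower bound

drives_per_node = DRIVES // STORAGE_NODES
write_per_drive = WRITE_GBPS / DRIVES
read_per_drive = READ_GBPS / DRIVES
iops_per_drive = RANDOM_4K_IOPS // DRIVES

print(f"drives per node: {drives_per_node}")
print(f"average write per drive: {write_per_drive:.3f} GB/s")
print(f"average read per drive: {read_per_drive:.3f} GB/s")
print(f"average random 4K IOPS per drive: {iops_per_drive:,}")
```

On these averages, each node holds eight drives, and each drive contributes a little under 2 GB/s of writes, a little under 3 GB/s of reads, and at least 250,000 random 4K IOPS.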
Lenovo built Niagara in conjunction with not only SciNet and the University of Toronto, but also Compute Ontario and Compute Canada, organizations that were created to help coordinate, accelerate, and promote supercomputing efforts in Ontario and around the country. Compute Canada is overseeing the deployment of the four new systems, which include not only Niagara at the University of Toronto but also the “Arbutus” system at the University of Victoria, which is an OpenStack cloud designed to host virtual machines and other cloud workloads. It is also a Lenovo system with 6,944 cores running across 248 nodes. The system became operational in 2016 as an expansion of Compute Canada’s “Cloud West” system.
The other two systems, the 3.6 petaflops “Cedar” supercomputer at Simon Fraser University and “Graham,” at the University of Waterloo, went online last year and offer similar designs, including small OpenStack partitions and local storage on nodes.