It is hard to say for sure, but there is probably as much aggregate computing capacity in the academic supercomputing centers of the world as there is in the big national labs. It is just spread out across more institutions, and it is also supporting a wider array of workloads across a larger number of domains. This presents its own challenges, and is analogous to the difference between what happens at large enterprises compared to hyperscalers.
The University of Michigan is the largest public research institution in the United States, with a stunning $1.48 billion in research funded across all domains of science in 2017. Among all universities, it is second only to Johns Hopkins University, which shells out about $1 billion more on research. (You can see trend data on research funding in this National Science Foundation study.) Driving this research takes a certain amount of supercomputing horsepower. Up until now, the main cluster at Michigan, called “Flux,” has grown organically over the course of eight years and has a mix of different kinds of compute, Brock Palen, director of the Advanced Research Computing – Technology Services division of the university, tells The Next Platform. But starting next year, with the installation of the “Great Lakes” supercomputer, the machine will be based entirely on a single generation of compute, namely the “Skylake” Xeon SP processors from Intel, with a little help here and there from some Tesla GPU accelerators.
The final configuration of the Great Lakes machine has not been set yet, but interestingly, even though the new system will have only approximately 15,000 cores when it is done, it is expected to have significantly more performance than the hodge-podge of machinery in the current Flux cluster. Palen has not done a calculation to figure out the aggregate peak petaflops of the Great Lakes machine yet, so we have to guess.
If the Great Lakes system uses middle bin Xeon SP parts and reasonable core counts (say 16 or 18 per chip) and frequencies, it will probably come in at between 1 petaflops and 1.5 petaflops. We doubt very much that Michigan is shelling out the dough for top bin 28-core parts that cost $13,011 a pop. The processors alone at list price would cost close to $7 million, and even with a 25 percent discount, the Xeon SP-8180M Platinum processors would be $5.2 million – more expensive than the $4.8 million that Michigan is spending on the PowerEdge servers in the cluster from Dell, the InfiniBand networking and InfiniBand-to-Ethernet gateways from Mellanox Technologies, and the flash-enhanced storage from DataDirect Networks.
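For those who want to check our back-of-the-envelope math, here is a minimal sketch of it in Python. The core count, chip price, and discount come from the paragraph above. The clock speeds are our own assumptions, as is the figure of 32 double precision flops per clock per core, which holds only for Skylake parts with two AVX-512 fused multiply-add units. The function names are hypothetical and this is a rough estimate, not a configuration that Michigan or Dell has confirmed.

```python
import math

# Figures taken from the article.
TOTAL_CORES = 15_000        # approximate core count of Great Lakes
TOP_BIN_CORES = 28          # cores per Xeon SP-8180M Platinum chip
TOP_BIN_PRICE = 13_011      # list price per chip, in dollars
DISCOUNT = 0.25             # hypothetical volume discount

# Assumption: Skylake cores with two AVX-512 FMA units can retire
# 8 doubles x 2 ops (multiply + add) x 2 units = 32 flops per clock.
DP_FLOPS_PER_CLOCK = 32


def processor_cost(total_cores, cores_per_chip, price, discount=0.0):
    """Return (chip count, total cost in dollars) to reach total_cores."""
    chips = math.ceil(total_cores / cores_per_chip)
    return chips, chips * price * (1.0 - discount)


def peak_petaflops(total_cores, clock_ghz, flops_per_clock=DP_FLOPS_PER_CLOCK):
    """Theoretical peak double precision petaflops at a given clock speed."""
    return total_cores * clock_ghz * 1e9 * flops_per_clock / 1e15


if __name__ == "__main__":
    chips, list_cost = processor_cost(TOTAL_CORES, TOP_BIN_CORES, TOP_BIN_PRICE)
    _, discounted = processor_cost(
        TOTAL_CORES, TOP_BIN_CORES, TOP_BIN_PRICE, DISCOUNT
    )
    print(f"{chips} chips, ${list_cost:,.0f} list, ${discounted:,.0f} discounted")

    # Assumed clock speeds bracketing plausible AVX-512 frequencies.
    for ghz in (2.1, 3.0):
        print(f"{ghz} GHz -> {peak_petaflops(TOTAL_CORES, ghz):.2f} petaflops")
```

With the assumed 2.1 GHz and 3.0 GHz clocks, the estimate brackets the 1 petaflops to 1.5 petaflops range we guessed at above. Sustained AVX-512 clocks are usually lower than nominal base clocks, so treat the result as a ceiling.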
The resulting relatively homogeneous substrate and the new network topology will allow ARC-TS to better scale larger jobs on the system, even if the average job on the machine currently runs on a single node or less of capacity, according to Palen. Up until now, Michigan researchers have been encouraged to run large jobs on the XSEDE network of supercomputers that are funded by the National Science Foundation. The university has over 2,500 active users pushing the computing needs of over 300 research projects through Flux and the other systems in its datacenter. (More on those in a moment.)
The Flux machine is truly diverse, which is counterbalanced somewhat by the fact that the Message Passing Interface (MPI) parallel processing protocol can be set up with one rank per core and balance the work out over the cluster somewhat. The mix of servers on the standard Flux part of the system includes 5,904 “Nehalem” Xeon cores from 2009; 1,984 “Sandy Bridge” Xeon cores from 2012; 2,520 “Ivy Bridge” Xeon cores from 2013; and 3,912 “Haswell” Xeon cores from 2014. If you do the math on the Flux configuration shown here on the ARC-TS site, you get 14,936 cores against 927 server nodes. The Nehalem and Sandy Bridge Xeon nodes are based on Dell’s PowerEdge C6100 hyperscale-inspired servers, which cram four two-socket server nodes into a 2U chassis. The Ivy Bridge Xeon nodes are based on IBM/Lenovo’s NextScale M4 nodes in the n1200 chassis, which also pack four two-socket nodes into a 2U space. It is unclear who manufactured the Haswell Xeon nodes, but the fat memory nodes are a mix of Dell PowerEdge R910 and PowerEdge R820 machines and IBM/Lenovo X3850 servers.
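The core counts above are easy to tally, and a short Python sketch makes the bookkeeping explicit. The per-generation figures and the 14,936-core and 927-node totals are from the paragraph above. The four standard generations add up to fewer cores than the total, and the remainder is not itemized here. Our guess, and it is only a guess, is that it sits in the large memory nodes and other parts of the system.

```python
# Standard Flux compute cores by Xeon generation, as listed in the article.
FLUX_CORES = {
    "Nehalem (2009)": 5_904,
    "Sandy Bridge (2012)": 1_984,
    "Ivy Bridge (2013)": 2_520,
    "Haswell (2014)": 3_912,
}

# Totals derived from the configuration on the ARC-TS site, per the article.
REPORTED_TOTAL_CORES = 14_936
REPORTED_NODES = 927

# Cores accounted for by the four standard generations.
itemized_cores = sum(FLUX_CORES.values())

# Cores in the reported total that the generation list does not cover.
unitemized_cores = REPORTED_TOTAL_CORES - itemized_cores

# Average cores per node across the whole machine.
avg_cores_per_node = REPORTED_TOTAL_CORES / REPORTED_NODES

if __name__ == "__main__":
    for generation, cores in FLUX_CORES.items():
        share = 100.0 * cores / itemized_cores
        print(f"{generation:<20} {cores:>6,} cores ({share:.1f}% of itemized)")
    print(f"Itemized cores:   {itemized_cores:,}")
    print(f"Unitemized cores: {unitemized_cores:,}")
    print(f"Average cores per node: {avg_cores_per_node:.1f}")
```

The tally shows that the largest single slice of the standard Flux cores is the nine-year-old Nehalem iron, which goes a long way toward explaining why a Skylake machine with a similar core count can deliver much more performance.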
This is a pretty sizable setup, mind you, but it ain’t 27,000 cores as some portions of the ARC-TS site say. (That may be the aggregate number of cores across all HPC systems.) No matter. What seems clear is that by moving to Skylake Xeons with even 15,000 cores, Michigan is going to get a lot more computing oomph. Our rough guess is that Flux is around 500 teraflops at double precision and that Great Lakes will have about 3X the performance. None of this was made clear in the announcement by Dell and Michigan, but the performance jump matters. When the configuration settles out, we will be able to do some math on the price/performance, which is why we bother.
All of the Flux nodes are linked to each other using Voltaire GridDirector 4700 switches, which predate the acquisition of Voltaire by Mellanox Technologies in 2011. These are 40 Gb/sec QDR InfiniBand director switches, which are based on Mellanox InfiniBand switch ASICs. The Flux setup also has an Ethernet network for accessing storage.
According to Palen, a big chunk of the Flux setup was owned and funded by different faculty members and their projects, and ARC-TS just managed it for them. But looking ahead to the Great Lakes system, Michigan wants to be able to drive up the utilization across the cluster by doing more timesharing across the cluster, and also allowing some jobs to grab more computing than was possible when Flux was all partitioned by faculty and research. It means changing it from a collection of clusters doing their own thing to a true shared compute utility.
Palen says that the vendors who pitched machines for the Great Lakes cluster put together bids for both Intel Skylake and AMD Epyc processors, and that the competition was close between the two architectures on the suite of benchmarks that Michigan uses to buy machines. Ultimately, the Skylake machines won out. But don’t think for a second that Michigan is somehow a Xeon shop exclusively. The university has a slew of different systems, including a 75-node IBM Power8-Nvidia Tesla CPU-GPU hybrid called Conflux, which is exploring the intersection between HPC and machine learning, all linked by a 100 Gb/sec InfiniBand fabric from Mellanox. The university also has a cluster with 4,600 cores based on Cavium’s ThunderX processors that has 3 PB of disk capacity and that is set up to run the Hadoop data analytics platform.
The Great Lakes machine will be composed of three different types of nodes, just like the Flux machine that preceded it, including a large number of standard nodes plus some large memory machines and some GPU-accelerated machines. The main compute will be based on the Dell PowerEdge C6420 machines, which put four two-socket nodes into a 2U enclosure as well, with huge memory nodes being based on the PowerEdge R640 and GPU nodes based on the PowerEdge R740.
For storage, Michigan is opting for DDN’s GridScaler 14KX arrays and their Infinite Memory Engine (IME) cache buffer, which will weigh in at 100 TB. The university is shifting away from Lustre to IBM’s Spectrum Scale (formerly GPFS) parallel file system, and Palen explained that the university was not interested in doing a separate burst buffer sitting between the parallel file system and the cluster. ARC-TS is using InfiniBand-to-Ethernet gateways from Mellanox to link the Great Lakes cluster to the GPFS clustered storage and to other storage available at the university, starting with 160 Gb/sec pipes and expanding to 320 Gb/sec pipes at some point in the near future.
“We support a large variety of workloads, and this is what drove us to an IME buffered memory setup,” says Palen. “Even if individual applications are doing nice with I/O, if you have hundreds of workloads trying to use the scratch storage, then scratch will essentially see a random load and thus performs horribly compared to its peak. The goal here was to have 5 percent of the scratch disk capacity available in flash buffers in an intelligent way so that hot data and random writes are absorbed by the flash for current activity. Burst buffers that you have to consciously activate the flash and then transfer it back to primary storage, that was not something we are interested in. The big labs have a few people only running a handful of workloads, but I have thousands of people with thousands of workloads, I need to train them. And they come and go every three or four years, which means the training never ends. So we are looking for a technical solution rather than a perfect one.”
As for the network used to link the compute nodes on the Great Lakes cluster, it is true that Michigan is going to be the first academic research institution to install 200 Gb/sec HDR InfiniBand, using Quantum switches and ConnectX-6 adapters from Mellanox. But here is the twist on that. Instead of looking for very high bandwidth per port, Michigan was looking to cut down on the cost and the oversubscription of the network, and so it is using cable splitters on the 40-port Quantum switches to turn them into a higher-radix virtual 80-port switch, allowing for one switch to have downlinks to all 80 servers in a rack. Prior InfiniBand switches had 32 ports per switch, so you needed three of them to cover a rack, and you had 16 stranded ports, which is wasteful. The fat tree network will be set up so Michigan can add 50 percent to the server node count without having to rewire the physical InfiniBand network.
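The switch arithmetic above can be sketched in a few lines of Python. The 80 servers per rack, the 32-port and 40-port switch sizes, and the two-way cable split are from the paragraph above. The function name is hypothetical, and like the paragraph, the sketch counts only downlinks to servers and ignores whatever ports are reserved for uplinks into the fat tree.

```python
import math

SERVERS_PER_RACK = 80       # servers that need a downlink, per the article
OLD_SWITCH_PORTS = 32       # ports on prior switches, as cited in the article
HDR_SWITCH_PORTS = 40       # physical ports on a Quantum switch
SPLIT_FACTOR = 2            # each HDR port split in two with a splitter cable
HDR_PORT_GBPS = 200         # bandwidth of an unsplit HDR port


def leaf_switches(servers, ports_per_switch):
    """Return (switches needed, stranded ports) to give every server a port."""
    switches = math.ceil(servers / ports_per_switch)
    stranded = switches * ports_per_switch - servers
    return switches, stranded


# Splitting doubles the effective radix and halves the bandwidth per port.
virtual_ports = HDR_SWITCH_PORTS * SPLIT_FACTOR
gbps_per_split_port = HDR_PORT_GBPS / SPLIT_FACTOR

if __name__ == "__main__":
    old = leaf_switches(SERVERS_PER_RACK, OLD_SWITCH_PORTS)
    new = leaf_switches(SERVERS_PER_RACK, virtual_ports)
    print(f"32-port switches: {old[0]} per rack, {old[1]} stranded ports")
    print(f"Split HDR switch: {new[0]} per rack, {new[1]} stranded ports")
    print(f"Bandwidth per split port: {gbps_per_split_port:.0f} Gb/sec")
```

The trade is plain enough: each server gets half of an HDR port, but the rack needs one switch instead of three and leaves no downlink ports idle.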
The Great Lakes machine will be installed during the first half of 2019. About 70 percent of the users on the system are doing hardcore engineering – combustion, materials, and fluid dynamics simulations; the GPU-accelerated workloads tend to be molecular dynamics.