Every supercomputing center in the world is wrestling with the issues of power, cooling, and compute density, but some have tighter constraints than others and need more energy efficient machines than they can get with standard clusters of rack servers. In a blast from the mainframe and supercomputing past, water cooling is coming back into vogue, and the HPC division of Eurotech, an Italian maker of IT equipment, is ramping up its super-dense Aurora Hive hybrid compute machines with a wider selection of processing and networking options.
The company is also releasing a single node development machine based on the Hive architecture to give tire kickers a chance to see how the machine performs as they evaluate the high-end Eurotech system against competitors.
Like many modern machines, the Aurora Hive system assumes that customers will be using accelerators for the bulk of the computation inside the system. The system launched at last year’s SC14 supercomputing conference with Intel’s single-socket “Haswell” Xeon E3-1200 v3 processor as the compute side of the hybrid compute complex, with promises of support for faster X86 and also ARM processors coming in the future. At this year’s SC15 supercomputing conference, Eurotech rolled out support for Intel’s “Broadwell” Xeon E3-1200 v4 processors, which launched in June. Both Xeon E3 chips have four cores and a modest amount of L3 cache memory (8 MB for the Haswell and 6 MB for the Broadwell, and yes, it went down) and have variants that run at 3.6 GHz or 3.5 GHz, respectively, making them suitable as host processors where the CPU side of an application needs fast clocks and not a lot of threads because it is really focused on serial compute work. The Xeon E3 compute node has 64 GB of memory on it and a 256 GB, 1.8-inch flash SSD for local storage, and any Xeon E3 chip that runs at 84 watts or lower can be used in the node.
The HPC and hyperscale communities are both experimenting with 64-bit ARM server chips, as we have discussed at length recently, and Eurotech is among those doing so. As of the SC15 conference, the Hive compute nodes now support the X-Gene 1 processor from Applied Micro as an alternative to the Xeon E3. The “Storm” X-Gene 1 is a 64-bit ARM processor based on a custom core designed by Applied Micro, and it provides eight cores running at a more modest 2.4 GHz, which is not too shabby for the 40 nanometer chip manufacturing process that it employs, but which is not as impressive as the now sampling “Shadowcat” X-Gene 2 and the just-announced “Skylark” X-Gene 3 that Applied Micro told The Next Platform about last week. It would be interesting to see Applied Micro disable four of the eight cores on the X-Gene 2 and crank up the clocks, if it can, to meet the Xeon E3 head on, clock for clock and core for core. But given that the vast majority of computing in a Hive machine is done by accelerators, the performance of the CPU doesn’t matter quite as much. The precise configuration of the X-Gene 1 node is not known, but what we can tell you is that Eurotech had planned to launch its ARM support with X-Gene 2 but Applied Micro was delayed in its rollout. The X-Gene 1 system board has 64 GB of memory and the same 256 GB flash drive.
Peeling Back The Aluminum Skin
The Hive node, which is sometimes called a brick, is a truly minimalist design. “You don’t get stuff that you don’t use,” Fabio Gallo, HPC business unit managing director at Eurotech, tells The Next Platform.
The node enclosure is 3U high, which is 130 millimeters, and is 105 millimeters wide and 325 millimeters deep. The Aurora Hive rack has 64 compute nodes in the front (four across and sixteen up and down), and 64 in the back, with a central midplane for connecting to switching, power, and cooling. The node has a single system board that has five riser card slots that come out of it for supporting network interface cards and coprocessors.
All of the components in the Hive node are equipped with aluminum cold plates that extract heat from the components on the compute and networking modules and have it whisked away by the laws of thermodynamics, which like transferring heat to metal and water a whole lot more than to air. The Hive system can be cooled with water that is as warm as 50 degrees Celsius (that’s 122 degrees Fahrenheit, or about as warm as you ever want a hot tub to get). The coprocessors and network interfaces are linked to the system board through an integrated PCI-Express switch made by PLX Technology, a unit of Avago Technologies and the company that is also trying to buy ARM server chip wannabe and network ASIC juggernaut Broadcom.
This time last year, the only available coprocessors for the Hive system were a “Knights Corner” Xeon Phi 7120X from Intel and a slightly modified Tesla K40 coprocessor from Nvidia, both of which were tweaked to work with the Hive cold plates. Despite the fact that the dual-GPU Tesla K80 cards have a power connector on the top of the card, Gallo says that Eurotech can make modifications to the coprocessor card that allow it to be used in the Hive node. Eurotech is also supporting the Nvidia Tesla M60 coprocessor and the GeForce GTX980 graphics card, as well as the AMD FirePro S9150 and S9170 graphics cards, as accelerators. The company is also looking at supporting the new Tesla M40 coprocessors, which Nvidia announced two weeks ago, for customers looking for single-precision floating point oomph. (The M40, at 250 watts, offers 7 teraflops of single-precision floating point math compared to 5 teraflops for the K40, at 235 watts; the K80 does a much more impressive 8.74 teraflops, but at 300 watts. The K40 and K80 have respectable double precision capability, but the M40 does not.)
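To put those single-precision figures on a common footing, here is a rough Python sketch that divides the quoted peak teraflops by the quoted wattage for each card. The numbers are the ones cited above; the function and variable names are made up for illustration, and peak flops per watt of board power is only a crude proxy for real application efficiency.

```python
# Peak single-precision teraflops and board power in watts, as quoted in the article.
CARDS = {
    "Tesla M40": (7.0, 250),
    "Tesla K40": (5.0, 235),
    "Tesla K80": (8.74, 300),
}


def sp_gflops_per_watt(teraflops, watts):
    """Convert peak teraflops and watts into peak gigaflops per watt."""
    return teraflops * 1000.0 / watts


def rank_by_efficiency(cards):
    """Return (name, gigaflops per watt) pairs, most efficient first."""
    scored = [(name, sp_gflops_per_watt(tf, w)) for name, (tf, w) in cards.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for name, eff in rank_by_efficiency(CARDS):
        print(f"{name}: {eff:.1f} peak SP gigaflops per watt")
```

On these peak numbers the K80 and M40 land within a few percent of each other per watt (roughly 29 versus 28 gigaflops per watt), with the K40 trailing at about 21, so the K80's edge is mostly in raw throughput per slot rather than in efficiency.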
Basically, anything that can plug into a PCI-Express 3.0 x16 slot can be dropped into the Hive module. Up to four of these coprocessors can be put in a single module, and two-port 56 Gb/sec InfiniBand ConnectX adapter cards drop in for networking. If customers want 100 Gb/sec EDR InfiniBand adapters and switches, Gallo says Eurotech can swap these in.
If you totally load up a Hive node with the Xeon E3 system board, four Tesla K80 accelerators, and InfiniBand networking, it draws about 1.5 kilowatts. But here is the interesting thing that you won’t find in the spec sheets. The Hive node and its hot water cooling system (yes, that sounds like a paradox, but it isn’t) are designed to draw off the heat from electronic components that in the aggregate consume 3 kilowatts of juice. What that means is that Eurotech could easily accommodate Hive compute nodes using beefier processors and coprocessors, should customers desire this. It would not be surprising to see nodes with some modest compute and an NVM-Express fabric linking multiple flash drives together to create a fast storage node.
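A quick back-of-the-envelope check on those power figures, sketched in Python. The 1.5 kilowatt draw, the 3 kilowatt cooling design point, the 128 nodes per rack, and the 300 watt K80 rating all come from the article; treating every node in a rack as fully loaded, and the names used here, are assumptions for illustration only.

```python
# Figures quoted in the article.
NODE_DRAW_KW = 1.5        # fully loaded node: Xeon E3 board, four K80s, InfiniBand
NODE_COOLING_KW = 3.0     # heat load the node cooling is designed to remove
NODES_PER_RACK = 64 + 64  # 64 nodes in the front of the rack, 64 in the back
K80_WATTS = 300
K80_PER_NODE = 4


def cooling_headroom_kw(draw_kw, capacity_kw):
    """Unused cooling capacity per node, in kilowatts."""
    return capacity_kw - draw_kw


def rack_draw_kw(nodes, node_draw_kw):
    """Total rack draw, assuming (hypothetically) every node is fully loaded."""
    return nodes * node_draw_kw


def accelerator_share(cards, card_watts, node_draw_kw):
    """Fraction of the node's draw attributable to the accelerator cards."""
    return cards * card_watts / (node_draw_kw * 1000.0)


if __name__ == "__main__":
    print("Headroom per node (kW):", cooling_headroom_kw(NODE_DRAW_KW, NODE_COOLING_KW))
    print("Fully loaded rack (kW):", rack_draw_kw(NODES_PER_RACK, NODE_DRAW_KW))
    print("Accelerator share of node draw:",
          accelerator_share(K80_PER_NODE, K80_WATTS, NODE_DRAW_KW))
```

Under these assumptions a fully populated rack would draw on the order of 192 kilowatts, the four K80s would account for about 80 percent of each node's draw, and each node would be using only half of the heat removal capacity it was designed for.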
At the moment, the big deal that Eurotech has landed is the QPACE 2 system at Universitaet Regensburg, which has a total of 15,872 cores across its 64 nodes and which delivers 316.6 teraflops peak and 206.4 teraflops sustained on the LINPACK Fortran benchmark test, all within a 78 kilowatt power envelope. That makes it number 500 on the Top500 rankings of supercomputers, but it is number 15 on the Green500 rankings of the most energy efficient supercomputers. And, as it turns out, that 64-node QPACE 2 machine is the most energy efficient machine using the “Knights Corner” coprocessors from Intel, at 4,059 megaflops per watt.
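The QPACE 2 numbers can be cross-checked with a little arithmetic, sketched here in Python. The performance, power, node, and core figures are the ones quoted above, with the Green500 figure of 4,059 read as megaflops per watt, the unit that list reports in. The per-node core breakdown (one four-core Xeon E3 plus four 61-core Xeon Phi 7120X cards) is an assumption that happens to match the quoted core count, and the function names are illustrative.

```python
# Figures quoted in the article.
RPEAK_TFLOPS = 316.6
RMAX_TFLOPS = 206.4
POWER_ENVELOPE_KW = 78.0
GREEN500_MFLOPS_PER_WATT = 4059.0
NODES = 64
TOTAL_CORES = 15872

# Assumption: one four-core Xeon E3 plus four 61-core Xeon Phi 7120X cards per node.
CORES_PER_NODE_ASSUMED = 4 + 4 * 61


def linpack_efficiency(rmax_tflops, rpeak_tflops):
    """Sustained LINPACK performance as a fraction of theoretical peak."""
    return rmax_tflops / rpeak_tflops


def mflops_per_watt(rmax_tflops, power_kw):
    """Megaflops per watt for a given sustained performance and power draw."""
    return (rmax_tflops * 1e6) / (power_kw * 1e3)


def implied_power_kw(rmax_tflops, mflops_per_w):
    """Power draw implied by a sustained performance and an efficiency rating."""
    return (rmax_tflops * 1e6) / mflops_per_w / 1e3


if __name__ == "__main__":
    print("LINPACK efficiency:", round(linpack_efficiency(RMAX_TFLOPS, RPEAK_TFLOPS), 3))
    print("MF/W at full envelope:", round(mflops_per_watt(RMAX_TFLOPS, POWER_ENVELOPE_KW)))
    print("Implied kW at Green500 rating:",
          round(implied_power_kw(RMAX_TFLOPS, GREEN500_MFLOPS_PER_WATT), 1))
    print("Core count consistent:", CORES_PER_NODE_ASSUMED * NODES == TOTAL_CORES)
```

Dividing sustained performance by the full 78 kilowatt envelope gives roughly 2,650 megaflops per watt, well short of the Green500 figure, which suggests the ranking was based on a measured draw of around 51 kilowatts during the LINPACK run rather than on the envelope. That reading is an inference from the arithmetic, not something Eurotech stated.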
“This is something that we are particularly proud of because Knights Corner gives 1.2 teraflops double precision and whatever its gigaflops per watt as a processor, and we cannot change that,” says Gallo. “But being the highest ranked Knights Corner system on the list means that everything else around it works particularly well in this system. To make a long story short, you could plug Tesla K80s into the Aurora Hive, and you would probably get the most efficient K80 system on the planet. A processor or coprocessor will give you what it gives you, and on top of this you need to make sure you have an extremely efficient system. The design is really minimalistic, built around acceleration.”
Gallo says that Eurotech is interested in adding “Knights Landing” Xeon Phi compute to the Hive modules and is also working on support for Intel’s Omni-Path interconnect. The company will also be adding Broadwell Xeon E5 processors and Knights Landing processors to its classic Aurora blade server HPC systems next year, and is working with Nvidia on future “Pascal” accelerators and with both Altera (soon to be part of Intel) and Xilinx (now aligned with the OpenPower collective) on FPGA accelerators for both families of Aurora systems.
A “Knights Landing” Xeon Phi processor is clocked high with a relatively small number of FP/AVX units, compared to Nvidia’s or AMD’s GPU accelerators, which have large numbers of FP units clocked low, with the GPUs making up the difference in clocks with much larger numbers of parallel FP units. With current server/HPC GPUs on a 28nm node giving Knights Landing something to aim for but not completely match, what do you think will happen when Arctic Islands (for AMD) and Pascal (for Nvidia) are released at 14nm/16nm? The GPU accelerators will win over the many-core CPU accelerators because the GPU cores will have so many more parallel FP units, even with Intel’s AVX able to break its AVX-512 units into 32-bit registers.
Intel would do better simply developing its own GPU-like accelerators for the heavy FP work, which can be clocked lower like GPU cores, and simply providing massive numbers of FP units. Those faster clocks are going to eat more power, and shed more heat at the square of the amount of power and frequency applied, so only massive parallelism on the order of many times the Xeon Phi’s is going to offer the performance per watt needed for exascale systems.
Sure, the Xeon Phi will have some use running algorithms that require CPU cores, but with the ACE units in AMD’s GCN GPU micro-architecture taking on more CPU-like tasks, Intel needs to look into developing a more robust GPU micro-architecture that can compete with AMD’s GPUs and their HSA abilities. Intel’s current GPU micro-architecture lacks the much higher FP unit counts of AMD’s or Nvidia’s GPU accelerators!
Intel’s GPUs may be good enough for gaming’s low polygon count mesh requirements, but for raw compute and ultra high resolution gaming, AMD’s and Nvidia’s much larger numbers of FP and other parallel execution resources have a definite advantage. There is simply no replacing the massive FP unit counts of GPU accelerators with the smaller FP unit counts of Xeon Phi systems, which have to be clocked higher to make up the difference and so waste and leak more power.