Not every HPC or analytics workload – meaning an algorithmic solver and the data that it chews on – fits nicely in a 128 GB or 256 GB or even a 512 GB memory space. Sometimes the dataset is quite large and runs best in a larger memory space, rather than being carved up into smaller pieces and distributed across nodes with the same amount of raw compute.
And that is why Lawrence Livermore National Laboratory, which has its share of very big iron and has been at the top of the Top 500 list many times during the long history of that supercomputer ranking, is also building a memory-intensive machine called “Mammoth” that has fairly modest compute but a big memory footprint.
The Mammoth system, which was announced this week by Lawrence Livermore, has been running for a while. As you might expect given the substantial momentum that AMD is enjoying with its second generation “Rome” Epyc 7002 processors in the HPC space, it is based on the top-end 64-core variant. Thanks to the eight memory controllers on the Rome Epycs, the memory footprint is larger than is possible on a Xeon SP processor from Intel, which tops out at six controllers per socket. This distinction is not just important for memory bandwidth, which is 33 percent higher at the same DRAM clock speed, but also for memory capacity at a given bandwidth: more memory controllers mean skinnier – and therefore cheaper – memory sticks can be used to reach a desired capacity, and the increased bandwidth can be had at a lower clock speed and, presumably, lower heat. All of this leaves more room to buy more compute or, in the case of Mammoth, we suppose, to buy more nodes than you otherwise might.
According to information provided to The Next Platform by Lawrence Livermore, each Mammoth node has a pair of AMD Epyc 7742 processors. This is the original top bin Rome part from when these chips were launched in August 2019 and, interestingly, is the same processor that Nvidia has chosen for its DGX A100 hybrid CPU-GPU systems. In February of this year, AMD jacked up the clock speeds on the Rome chips, putting out a 64-core variant called the Epyc 7H12 that clocks at 2.6 GHz but burns 280 watts and costs $9,000. The Epyc 7742 that Lawrence Livermore and Nvidia have both chosen spins at a slower 2.25 GHz but only burns 225 watts and only costs $6,950. That savings can be plowed back into the system to boost the main memory capacity, which is of course the whole point of the Mammoth system.
To be precise, each of the Mammoth nodes has 2 TB of DDR4 memory (1 TB per socket) and delivers around 410 GB/sec of peak theoretical memory bandwidth across the pair of Epyc chips. In contrast, with two dozen 2.93 GHz DDR4 memory sticks across a pair of 18-core “Cascade Lake” Xeon SP processors running at 2.6 GHz – a low bin part from Intel – the Xeon SP machine delivers about 140 GB/sec of memory bandwidth on the STREAM Triad benchmark test, while peak theoretical memory bandwidth is 282 GB/sec for a pair of Intel chips when 2.93 GHz DDR4 memory is used on higher bin parts, as we previously wrote about. Clearly there is a big memory bandwidth advantage for the AMD option.
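The peak bandwidth figures above fall out of simple arithmetic: sockets times memory channels per socket times transfers per second times 8 bytes per transfer. Here is a minimal sketch of that calculation. It assumes the Epyc sockets run their DDR4 at 3,200 MT/sec (the fastest speed Rome supports; Lawrence Livermore did not specify the memory speed) and reads the "2.93 GHz" Xeon memory as 2,933 MT/sec. The function name is ours, purely for illustration.

```python
def peak_bandwidth_gbs(sockets, channels_per_socket, mega_transfers_per_sec,
                       bytes_per_transfer=8):
    """Peak theoretical DRAM bandwidth in GB/sec (decimal gigabytes).

    Each DDR4 channel is 64 bits wide, so it moves 8 bytes per transfer.
    """
    megabytes_per_sec = (sockets * channels_per_socket
                         * mega_transfers_per_sec * bytes_per_transfer)
    return megabytes_per_sec / 1000.0

# Mammoth node: two Rome sockets, eight channels each, ASSUMED DDR4-3200.
epyc_node = peak_bandwidth_gbs(2, 8, 3200)

# Xeon SP node: two sockets, six channels each, DDR4-2933.
xeon_node = peak_bandwidth_gbs(2, 6, 2933)

# Channel-count advantage at the same memory speed: 8 / 6.
same_clock_advantage = 8 / 6 - 1

print(f"Epyc node peak:  {epyc_node:.1f} GB/sec")
print(f"Xeon node peak:  {xeon_node:.1f} GB/sec")
print(f"Advantage at equal DRAM speed: {same_clock_advantage:.0%}")
```

Under those assumptions the sketch reproduces the roughly 410 GB/sec and 282 GB/sec peaks cited above, as well as the 33 percent edge that eight channels have over six at the same memory speed.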
For local scratchpad capacity, each Mammoth node has 3.84 TB of flash storage. Each node also has a 100 Gb/sec Omni-Path network interface card from Cornelis Networks, which recently took over the Omni-Path InfiniBand networking business from Intel. Each node has 4.6 teraflops of peak double precision performance from the 128 cores on the node. (Remember when a teraflops per CPU seemed like a dream and required exotic architectures?)
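For those who want to check the per-node math, peak double precision throughput is cores times clock speed times floating point operations per core per clock. The sketch below assumes 16 double precision flops per clock per core, which is what two 256-bit fused multiply-add units per Zen 2 core work out to, and uses the 2.25 GHz base clock cited above. The variable names are ours.

```python
CORES_PER_NODE = 128          # two 64-core Epyc 7742 sockets
BASE_CLOCK_GHZ = 2.25         # Epyc 7742 base clock
# ASSUMPTION: 2 FMA pipes x 4 doubles per 256-bit vector x 2 flops per FMA.
DP_FLOPS_PER_CLOCK = 2 * 4 * 2

node_gigaflops = CORES_PER_NODE * BASE_CLOCK_GHZ * DP_FLOPS_PER_CLOCK
node_teraflops = node_gigaflops / 1000.0

print(f"Peak FP64 per node: {node_teraflops:.2f} teraflops")
```

That lands on the 4.6 teraflops per node figure, or about 2.3 teraflops per socket.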
The Mammoth system has an Omni-Path network that runs at 100 Gb/sec in a fat tree topology with a 2:1 taper. This is a measure of how many links are between the top of rack and aggregation switches in the network. Speaking very generally, having fewer links, and therefore fewer physical switches and a less expensive network, does not usually affect performance by that much as long as small messages are being transmitted between the nodes in a distributed compute system. But for large message sizes, the effective bandwidth of a tapered network will be lower than for a fully provisioned fat tree network. (Here’s a good paper on this.) If you think a 2:1 oversubscription on the network is a lot, consider that at the hyperscalers, which use Clos architectures, a 3:1 oversubscription is common. But that Clos architecture (it’s more of a spirograph ring than a fat tree) generally runs at maybe 15 percent utilization of network bandwidth, so any spikes that cause network congestion can be easily accommodated and applications respond instantaneously for all but the longest tails. (This is what happens when your networks make money instead of having to pay for a low latency network that has to help make progress.)
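To put a number on what a taper costs, here is a rough sketch of the worst case, where every node under a top of rack switch pushes large messages off the switch at the same time. The port splits are hypothetical (we do not know how Mammoth's switches are actually cabled), and the function name is ours.

```python
def worst_case_per_node_gbps(link_gbps, downlinks, uplinks):
    """Per-node bandwidth when all nodes on a leaf switch send off-switch at once.

    The taper (oversubscription) ratio is downlinks to nodes divided by
    uplinks to the next switch tier. A node can never exceed its own link.
    """
    taper = downlinks / uplinks
    return min(link_gbps, link_gbps / taper)

# HYPOTHETICAL port splits on a 48-port leaf switch, for illustration only.
full_fat_tree = worst_case_per_node_gbps(100, downlinks=24, uplinks=24)  # 1:1
two_to_one = worst_case_per_node_gbps(100, downlinks=32, uplinks=16)     # 2:1
three_to_one = worst_case_per_node_gbps(100, downlinks=36, uplinks=12)   # 3:1

print(f"1:1 -> {full_fat_tree:.1f} Gb/sec per node")
print(f"2:1 -> {two_to_one:.1f} Gb/sec per node")
print(f"3:1 -> {three_to_one:.1f} Gb/sec per node")
```

This is a ceiling on the damage, not a prediction. Traffic that stays on the same switch, or that is bursty rather than sustained, never sees the reduced figure, which is why small-message workloads tend to shrug off a taper.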
Add it all up, and Mammoth has 8,192 cores and delivers 294 teraflops of peak double precision performance, which is balanced out by 128 TB of main memory across those fat nodes and 245 TB of flash storage. As far as we know, the memory sharing, such as it is, is being done by the Message Passing Interface (MPI), and there is no additional layer such as TidalScale or ScaleMP that is ganging up the memory into larger aggregates or a single shared memory space. (But that would be cool, now that we think about it.)
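The system totals are just the per-node figures multiplied across the 64 nodes in the machine. A quick sketch, using only the numbers cited in this story (the variable names are ours):

```python
NODES = 64

# Per-node figures as cited in this story.
CORES_PER_NODE = 128
TERAFLOPS_PER_NODE = 4.6      # peak FP64, rounded
MEMORY_TB_PER_NODE = 2
FLASH_TB_PER_NODE = 3.84

total_cores = NODES * CORES_PER_NODE
total_teraflops = NODES * TERAFLOPS_PER_NODE
total_memory_tb = NODES * MEMORY_TB_PER_NODE
total_flash_tb = NODES * FLASH_TB_PER_NODE

print(f"Cores:       {total_cores:,}")
print(f"Peak FP64:   {total_teraflops:.1f} teraflops")
print(f"Main memory: {total_memory_tb} TB")
print(f"Flash:       {total_flash_tb:.2f} TB")
```

The totals match the figures above once the teraflops and flash numbers are truncated to whole units.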
The cost of the Mammoth system was not announced by Lawrence Livermore because that information was “procurement sensitive and not releasable,” as the lab put it to us, but we did find out that Mammoth was part of the $8.7 million in CARES Act funding that Lawrence Livermore received to help fight the coronavirus pandemic earlier this year. These funds were used to build out the “Corona” supercomputer, which we wrote about here last month and which was named long before the pandemic even started, to pay for a chunk of another machine, and to expand the file systems based on Vast Data’s Universal Storage NFS flash system as well as Green Data Oasis ZFS file systems. If you go out to the major OEMs and configure a node similar to those used in Mammoth, you are in the ballpark of $40,000 per node after some pretty hefty 35 percent discounts for online sales. (We are citing Dell specifically here because their configurators are public and broad as well as deep.) That would be somewhere around $2.5 million for the nodes, assuming Dell’s discounted prices are representative; there would probably be another $300,000 to $400,000 for the networking, depending on the switches and adapters used.
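For the record, here is the back-of-envelope math behind that estimate. To be clear, this is our guess built from public OEM configurator pricing and the machine's 64 nodes, not anything Lawrence Livermore has confirmed, and the variable names are ours.

```python
NODES = 64
NODE_PRICE = 40_000                     # our ballpark discounted price per node
NETWORK_LOW, NETWORK_HIGH = 300_000, 400_000   # our guess for the networking

node_cost = NODES * NODE_PRICE
total_low = node_cost + NETWORK_LOW
total_high = node_cost + NETWORK_HIGH

print(f"Nodes:  ${node_cost:,}")
print(f"System: ${total_low:,} to ${total_high:,}")
```

That puts a Mammoth-like machine at a bit under $3 million, which would be a healthy slice of the $8.7 million in CARES Act money.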
As it turns out, the Mammoth nodes are based on components brought in from motherboard and whitebox server maker Supermicro and were integrated by a family-run small system integrator business with expertise in HPC systems called MNJ Technologies, which is based in the Chicago suburb of Buffalo Grove, Illinois.
Here is the important thing. We are fighting COVID-19 with these machines at Lawrence Livermore, and the idea is to accelerate the performance of genomics workloads – specifically drug screening simulations and graph analytics – without having to split datasets in pieces and run them across multiple nodes. This has boosted the performance of Rosetta Flex calculations by a factor of eight since a single node can run 128 calculations instead of the 16 calculations possible on skinnier memory nodes. But as for overall COVID-19 research, the spending that Lawrence Livermore has done with its CARES Act funding is really about balancing the whole workflow across its systems and storage, getting the right amount of CPU and GPU compute, big memory compute, and ZFS and NFS storage. Having 64 fat nodes on Mammoth balances out the raw compute on the skinnier Corona nodes, some of which have AMD Instinct GPUs as well as a mix of “Naples” Epyc 7001 and Rome Epyc 7002 processors that, all told, provide 11 petaflops of 64-bit number crunching after the upgrade.