If you look back at it now, especially with the advent of massively parallel computing on GPUs, maybe the techies at Tera Computer and then Cray had the right idea with their “ThreadStorm” massively threaded processors and high bandwidth interconnects.
Given that many of the neural networks created by AI frameworks are themselves graphs – the kind with vertices holding data and edges showing the relationships between that data, not something generated in Excel – or output what amounts to a graph, maybe, in the end, what we need is a really good graph processor. Or, maybe, millions of them.
Gasp! Who speaks such heresy in a world where the Nvidia GPU and its wannabes are the universal unguent to solve – salve, surely? – our modern computing problems? Well, we do. While GPUs excel at dense matrix high precision floating point math that dominates HPC simulation and modeling, a lot of the data that underpins AI frameworks is sparse and lower precision to boot. And given this, maybe there are better ways to do this.
The US Defense Advanced Research Projects Agency, the research and development arm of the Department of Defense, explores just such cutting edge questions, and has been looking into creating a massively parallel graph processor and interconnect since establishing the Hierarchical Identify Verify Exploit (HIVE) project back in 2017. Intel was chosen to make the HIVE processor and Lincoln Laboratory at MIT and Amazon Web Services were chosen to create and host a trillion-edge graph dataset for a system based on such processors to chew on.
At Hot Chips 2023 this week, Intel was showing off the processor it created for the HIVE project, which was originally codenamed “Puma” in reference to the Programmable Integrated Unified Memory Architecture (PIUMA) that underpins it. Intel gave an update on the PIUMA chip at DARPA’s ERI Summit in August 2019. Then, at the IEEE’s High Performance Extreme Computing event in September 2020, Intel researchers Balasubramanian Seshasayee, Joshua Fryman, and Ibrahim Hur gave a presentation called Hash Table Scalability on Intel PIUMA, which is behind an IEEE paywall but which gave an overview of the processor, along with a paper called PIUMA: Programmable Integrated Unified Memory Architecture, which is not behind a paywall. Both were vague about the architecture of the PIUMA system. But this week, Jason Howard, a principal engineer at Intel, gave an update on the PIUMA processor and system, including a photonics interconnect that Intel has created in conjunction with Ayar Labs to lash an enormous number of processors together.
In the IEEE paper, the PIUMA researchers made no bones about the fact that they were absolutely inspired by the Cray XMT line. That line culminated a decade ago in a massive shared memory thread monster that was perfect for graph analysis: up to 8,192 processors, each with 128 threads running at 500 MHz and plugging into the AMD Rev F socket used by the Opteron 8000 series of X86 CPUs, all lashed together with a custom “SeaStar2+” torus interconnect. That delivered 1.05 million threads and 512 TB of shared main memory for a graph to stretch its legs upon. As far as Linux was concerned, it all looked like a single CPU.
What’s old is new again with the PIUMA project, and this time the processor is more modest but the interconnect is much better. And presumably the price/performance is as well, and for the love of all that is holy in heaven, perhaps Intel will commercialize this PIUMA system and really shake things up.
Taking Smaller Bytes Of Memory
When Intel started designing the PIUMA chip, according to Howard, the researchers working on the HIVE project realized that graph jobs were not just massively parallel, but embarrassingly parallel, which means there were probably ways to exploit that parallelism to drive up the performance of graph analytics. When running on standard X86 processors, graph databases had very bad cache line utilization: more than 80 percent of the time a graph database was running, only 8 bytes or less of a 72 byte cache line were actually used. Lots and lots of branches in the instruction flow put stress on the CPU pipelines, and long chains of dependent loads put a lot of pressure on the memory subsystem, thrashing the caches on the CPUs.
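To make that cache line number concrete, here is a small Python sketch of the access pattern – a synthetic random edge list standing in for a sparse graph, not Intel’s actual benchmark, and the sizes are illustrative. Each dependent neighbor load lands on an effectively random memory line, so only one 8-byte value out of each 72-byte fetch is consumed.

```python
# Sketch of sparse-graph memory behavior (synthetic data, illustrative sizes).
import random

NUM_VERTICES = 1 << 20
LINE_BYTES = 72      # memory access granularity cited in the article
VALUE_BYTES = 8      # one vertex property read per neighbor visit

# A random edge list stands in for a sparse graph's neighbor IDs.
random.seed(42)
neighbors = [random.randrange(NUM_VERTICES) for _ in range(100_000)]
values = list(range(NUM_VERTICES))

# The dependent-load chain: each access lands on a different, random line,
# so caches and prefetchers get almost no reuse out of the fetched lines.
total = sum(values[v] for v in neighbors)

# Fraction of each fetched line that the workload actually consumed.
utilization = VALUE_BYTES / LINE_BYTES
print(f"bytes used per line: {VALUE_BYTES} of {LINE_BYTES} "
      f"({utilization:.0%} utilization)")
```

That roughly 11 percent utilization of fetched memory is the waste PIUMA’s 8-byte-granularity memory controllers are designed to eliminate.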
The PIUMA chip has some big-little ideas embedded within it: four pipelines with 16 threads each (called MTPs) and two single-threaded pipelines (called STPs) that deliver 8X the performance of one of the threads within the MTPs. The cores are based on a custom RISC instruction set, which Howard did not identify and neither have his research peers at Intel or Microsoft, which has also had a hand in the PIUMA effort.
“All the pipelines use a custom ISA, it’s RISC-like, it’s a fixed length,” explained Howard in his Hot Chips presentation. “And each of the pipeline has 32 physical registers available to it. We did this so that you could easily migrate compute threads between any of the pipelines. So maybe I start out executing on one of the multithreaded pipelines and if I see that it’s taking too long, or maybe it’s the last thread available, I can migrate that quickly over to my single thread pipeline to get better performance.”
The STP and MTP units are connected by a crossbar and have a combined 192 KB of L1 instruction and L1 data cache, and they link to a shared 4 MB scratchpad SRAM memory that is simpler than an L2 cache.
Each PIUMA chip has eight active cores, and each core has its own custom DDR5 memory controller with 8 byte access granularity instead of the 72 byte granularity of normal DDR5 memory controllers. Each socket has 32 GB of that custom DDR5-4400 memory.
Each core has a pair of routers that link the cores in a 2D mesh to each other, to the eight memory controllers, and to four high speed Advanced Interface Bus (AIB) ports. AIB is a royalty-free PHY for interconnecting chiplets that Intel announced back in 2018. There are 32 optical I/O ports, eight per AIB, that come off the die courtesy of Ayar Labs, which deliver 32 GB/sec of bandwidth in each direction.
Here is a drilldown on the on-chip routers that implement the 2D mesh on the PIUMA package:
This is a ten port, cut-through router. The 2D mesh runs at 1 GHz, and it takes four cycles to traverse the router. It has ten virtual channels and four different message classes, which Howard says avoids any deadlocks on the mesh, and it yields 64 GB/sec for every link within the router.
The router and core packaging on the PIUMA chip is a bit more complex than you might expect. Take a look:
It is more like there are sixteen core/router units on the die, and only eight of them have the cores activated because the on-die mesh needs twice as many routers to feed into the AIBs, which in turn feed into the Ayar Labs silicon photonics. The silicon photonics links are only used as a physical layer, and they are used specifically to extend the on-die network between multiple sockets.
And when we say multiple, we mean a freaking huge number. Like this:
A sled of sixteen of these PIUMA chips can be linked together in a 4×4 grid in an all-to-all configuration using the silicon photonics links. Each PIUMA chip burns around 75 watts at nominal voltage and workloads, which means a sled burns around 1,200 watts. That is more than a Xeon SP socket, but not more than three of them.
Building The Perfect Graph Processing Beast
The PIUMA chip has 1 TB/sec of optical interconnect coming out of it, and aside from the on-sled links, some of that bandwidth can be used to connect up to 131,072 sleds together to make a massive shared memory graph processing supercomputer. The routers are the network: within a rack, which has sixteen sleds, everything is directly connected all-to-all, and beyond that everything is connected by a HyperX topology.
Let’s walk through this. A single sled with sixteen sockets has 128 cores with 8,448 threads and 512 GB of memory. The first level of the HyperX network has 256 sleds, 32,768 cores, 2.2 million threads, and 128 TB of memory. Step up to level two of the HyperX network, and you can build out a PIUMA cluster with 16,384 sleds, 2.1 million cores, 138.4 million threads, and 8 PB of shared memory. And finally, up at level three of the HyperX network, you can expand out to 131,072 sleds, 16.8 million cores, 1.1 billion threads, and 64 PB of shared memory.
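Those figures all fall out of the per-chip numbers quoted earlier: eight cores per chip, 66 threads per core (four 16-thread MTPs plus two STPs), 32 GB per socket, and sixteen sockets per sled. A quick back-of-the-envelope script rebuilds the table:

```python
# Rebuild the PIUMA scale-out math from the per-chip figures in the article.
CORES_PER_CHIP = 8
THREADS_PER_CORE = 4 * 16 + 2      # four 16-thread MTPs plus two STPs
MEM_GB_PER_CHIP = 32
CHIPS_PER_SLED = 16

# Sled counts at each scale: one sled, then HyperX levels one through three.
levels = {"sled": 1, "level 1": 256, "level 2": 16_384, "level 3": 131_072}

for name, sleds in levels.items():
    chips = sleds * CHIPS_PER_SLED
    cores = chips * CORES_PER_CHIP
    threads = cores * THREADS_PER_CORE
    mem_tb = chips * MEM_GB_PER_CHIP / 1024
    print(f"{name}: {cores:,} cores, {threads:,} threads, {mem_tb:,.1f} TB")
```

One sled works out to 128 cores and 8,448 threads, and each HyperX level simply multiplies the sled count.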
Admit it. You want to see what one of these beasts could do. The US National Security Agency and Department of Defense and their peers around the world, who have funded a lot of the AI research of the past fifteen years, are no doubt interested, too.
As you chew on that scale for a minute, let’s go over a few more things. First, the latency of that optical network:
The PIUMA nodes are linked together with single-mode optical fiber, and it is interesting that the achieved bandwidth of the PIUMA network design, at 16 GB/sec per direction, was only half of the theoretical design point. But even still, this is a huge bandwidth beast, with a theoretical 16 PB/sec of uni-directional bisection bandwidth across the full HyperX network.
The PIUMA chip is implemented in a 7 nanometer FinFET process from Taiwan Semiconductor Manufacturing Co, and it has 27.6 billion transistors, with only 1.2 billion of them dedicated to the relatively small cores. The AIB circuits take up a lot of the transistor count, apparently.
Here is what the PIUMA chip package looks like:
And here is what the package and test boards look like:
So far, Intel has built two boards with a single PIUMA chip each and interlinked them to run its tests and to prove to DARPA that it works.
The question now is, what will such a machine cost at scale? Well, at $750 per node, which ain’t much at all, it is $3.1 million for a system extended to HyperX level one with 4,096 PIUMA chips, close to $200 million for one with 262,144 chips at HyperX level two, and $1.57 billion for one with 2.1 million chips stretched to HyperX level three.
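The arithmetic behind those figures is just the chip count at each HyperX level times that $750 per node, which a few lines of Python confirm:

```python
# Price out a PIUMA system at $750 per node (chip), per the estimate above.
PRICE_PER_NODE = 750

# Chip counts at each HyperX level, from the scale-out figures in the article.
chip_counts = {
    "HyperX level 1": 4_096,
    "HyperX level 2": 262_144,
    "HyperX level 3": 2_097_152,
}

for name, chips in chip_counts.items():
    print(f"{name}: {chips:,} chips -> ${chips * PRICE_PER_NODE:,}")
```

That yields $3.07 million, $196.6 million, and $1.57 billion, respectively.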
As the generative AI explosion shows, there are several dozen companies and then another several dozen government agencies in the world that won’t even blink at $1 billion for a system anymore. The number didn’t even raise my pulse when I wrote it and then read it.
That’s just the time we live in now.