HPC Gets A Reconfigurable Dataflow Engine To Take On CPUs And GPUs

No matter how elegant and clever the design is for a compute engine, the difficulty and cost of moving existing – and sometimes very old – code from the device it currently runs on to that new compute engine is a very big barrier to adoption.

This is a particularly high barrier with the inevitable offload approach that was new in the supercomputing racket when the “Roadrunner” supercomputer at Los Alamos National Laboratory was developed in the early 2000s by IBM under the auspices of the US Department of Energy’s National Nuclear Security Administration. Roadrunner was an architectural trailblazer, and gave Nvidia and others the idea that they, too, could build offload engines with powerful calculating capabilities.

Roadrunner’s architectural design, lashing four of IBM’s “Cell” PowerXCell floating point vector engines with a pair of AMD Opteron X86 server CPUs, broke through the petaflops barrier in 2008 and set the pace for future hybrid compute architectures that are familiar today for both HPC and AI applications.

But no one ever said that programming for these hybrid devices was easy, and Roadrunner was particularly hard to program because it was the first of its kind. There was much complaining, and it took some of the smartest people on Earth to get work out of the machine. (Luckily, Los Alamos has a fairly large number of the smartest people on Earth.) Nvidia GPUs are only relatively easy to program today because of the enormous amount of work that Nvidia has done to create the CUDA programming environment and a vast trove of libraries, frameworks, and algorithms. Remember: 75 percent of Nvidia’s employees work on software, even if the lion’s share of its revenues – is it 90 percent, is it 95 percent, is it 99.5 percent? – comes from hardware.

What if none of this was necessary? What if you just threw your C++ or Fortran code at a massive dataflow engine that could reconfigure itself for your code, and do so automagically as it runs, constantly tuning and retuning itself as different chunks of code are activated?

That is the dream of Elad Raz and the team he has built at NextSilicon, which is dropping out of stealth mode this week with its second generation Maverick-2 dataflow engine and taking on the HPC market with a new approach to both hardware and software.

It is hard to believe, isn’t it? How many innovative and novel architectures with “magic compilers” have we heard about over the years? More than we can count. But like the HPC market itself, we remain hopeful that with the right level of abstraction and the right amount of automation, the job of executing code across a complex of different kinds of compute engines can get easier. Perhaps this is the time. It is either this or leave the job of porting code and creating new codes to GenAI bots, because there just are not enough people in the world who want to do this very difficult task even if you pay them hundreds of thousands of dollars a year.

NextSilicon was founded in 2017, way before the GenAI craze but just as it was becoming apparent that HPC and AI compute engine architectures were going to diverge – and not in favor of the HPC simulation and modeling crowd that is focused on 64-bit and 32-bit floating point calculations. And even without an initial plan to go after the AI market directly, as Cerebras Systems, Graphcore, Groq, Habana Labs, Nervana Systems, SambaNova Systems, and others have done, NextSilicon has been able to raise $202.6 million in funding across three rounds, with its Series C round coming in June 2021 at $120 million.

At that time, that gave NextSilicon a valuation of around $1.5 billion, and the funds and the prototyping work completed meant that the US Department of Energy could listen to what NextSilicon was up to. Sandia National Laboratories and NextSilicon collaborated on the design and testing of the Maverick-1 dataflow engine, and Sandia is now building a novel architecture supercomputer nicknamed “Spectra” as part of its Vanguard-II program. Presumably this will be built using the Maverick-2 dataflow engine that is being revealed today – Sandia has not said, and NextSilicon is not at liberty to say. We expect Spectra to be installed in Q1 2025, and we expect to do a deeper dive on the Maverick-2 chip and the systems using it at that time.

What Raz can say is that the Department of Energy and the Department of Defense are working with it, as are a number of other organizations in the United States and Europe.

Why We Need Another HPC Accelerator

The good news for the HPC centers of the world is that Maverick-2 is aimed at them, and NextSilicon is not going to try to chase the AI training and inference market just yet.

“There isn’t an accelerator just for high performance computing,” Raz tells The Next Platform. “We have hundreds of companies doing acceleration for AI and machine learning, and most of the big vendors are pivoting away and going into AI machine learning. You can see what the big supercomputers mean for them – they just build a new GPU cluster that is twice as expensive, has twice as much power consumption, and you get the same FP64 flops. NextSilicon is an HPC-first company.”

Raz adds that in the long run, NextSilicon will create compute engines suitable for AI work, which makes sense because you cannot ignore a market that will drive more than half of system sales in the coming years if all of the prognostications are correct. (Here are Gartner’s most recent take and IDC’s most recent forecasts, both with our riffs on their data.)

For the moment, NextSilicon is not saying much that is specific about the internals of the compute engine it has created, and that is intentional. The company wants to get people to focus on the software problem it is solving first, and then early next year get into the precise feeds and speeds of the guts of the Maverick-2 dataflow engine.

“The point of NextSilicon is to use software to accelerate your application,” explains Raz. “At the heart of this is a sophisticated software algorithm that understands what matters in the code and accelerates that. By contrast, most CPUs and GPUs are banks of processor cores in one form or another. They are getting instructions and they are trying to build a sophisticated pipeline and vector instruction set, with out of order execution, to reduce latency. We think this is the wrong approach. The better approach is to apply the Pareto Principle and look where 20 percent of your code is taking up 80 percent of the runtime. Why aren’t we applying the 80/20 rule for compute and memory? Why can’t we automatically identify the computational kernels that matter and try to focus solely on them?”

The quick answer is that this is hard, but that is the secret sauce of the Maverick platform. The dataflow engine etched in transistors is just how it is manifested physically once a graph of the flow of the data and operations embodied in a program is created by the Maverick compiler and scheduler.

And Raz then describes the secret sauce: “The application starts running on the host, and then we automatically identify those compute intensive portions of the code. We stay in the intermediate representation of the computational graph. We don’t change the graph to instructions. You need to think about this as a just in time compiler for hardware. We are keeping the graph for the program and we are placing it on the dataflow hardware. We are getting telemetry from the hardware, and we are doing that in a recursive way so we always keep optimizing compute and memory as the program is running.”
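
NextSilicon has not published its intermediate representation or its tooling, so treat the following as a purely illustrative sketch of the idea Raz is describing – every type and function name in it is hypothetical: keep a graph of the program, attach execution counts to its edges, and mark the paths that soak up most of the runtime as the likely flows.

```cpp
// Illustrative only: a toy program graph with per-edge execution counts, from
// which the "likely flows" (hot paths) are selected. Hypothetical names; this
// is not NextSilicon's actual IR or tooling.
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

struct Edge {
    std::string from;
    std::string to;
    uint64_t count = 0;   // execution count gathered from telemetry
};

struct FlowGraph {
    std::vector<Edge> edges;

    // Mark an edge as "likely" if it accounts for at least the given share of
    // all observed traversals -- a crude stand-in for the 80/20 observation.
    std::vector<Edge> likely_flows(double share_threshold) const {
        uint64_t total = 0;
        for (const auto& e : edges) total += e.count;
        std::vector<Edge> hot;
        for (const auto& e : edges) {
            if (total > 0 &&
                static_cast<double>(e.count) / static_cast<double>(total) >= share_threshold) {
                hot.push_back(e);
            }
        }
        return hot;
    }
};

int main() {
    // A toy program graph in which one inner loop dominates the runtime.
    FlowGraph g{{
        {"entry", "setup", 1},
        {"setup", "loop_body", 1},
        {"loop_body", "loop_body", 999'000},   // the hot, "likely" flow
        {"loop_body", "error_path", 3},        // cold, "unlikely" flow
        {"loop_body", "exit", 1},
    }};
    for (const auto& e : g.likely_flows(0.20)) {
        std::cout << "likely flow: " << e.from << " -> " << e.to << "\n";
    }
}
```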

Conceptually, this is what the process looks like for converting a C, C++, or Fortran program that ran on a CPU host or a GPU accelerator to a Maverick dataflow engine. The first step is to identify what Raz calls the “likely flows” in the graph of the running program:

The likely flows and unlikely flows in the code are projected down onto the grid of processing and memory elements in the Maverick dataflow engine, like this:

As the code runs on the dataflow engine, performance bottlenecks in the first pass of on-the-fly compilation are identified and, with telemetry passed back up to the Maverick compiler, the flows are iteratively rebalanced in an asymmetric way, creating more “hardware” to favor the likely flows and giving less “hardware” to the unlikely flows. Certain kinds of serial work are offloaded to local and more traditional cores on the Maverick dies, and very heavy serial work is passed back to the host where it can run faster. In this case, the host CPU is actually an offload serial engine for the parallel Maverick dataflow engine.
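
The rebalancing step can be caricatured the same way. This sketch – again with hypothetical names, and bearing no relation to NextSilicon’s actual scheduler – hands out a fixed budget of processing elements in proportion to the cycles each flow consumed in the previous pass, keeps a floor allocation for the unlikely flows, and leaves the serial work for the traditional local cores or the host.

```cpp
// Illustrative only: divide a fixed budget of processing elements between
// flows in proportion to the runtime telemetry says they consumed, keep a
// floor of one element for cold flows, and send serial work elsewhere.
// Hypothetical names; not NextSilicon's actual scheduler.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

struct Flow {
    std::string name;
    uint64_t cycles_observed;  // telemetry from the previous pass
    bool serial;               // true if the flow has no useful parallelism
    int elements_assigned = 0; // processing elements granted for the next pass
};

void rebalance(std::vector<Flow>& flows, int element_budget) {
    uint64_t total = 0;
    for (const auto& f : flows) {
        if (!f.serial) total += f.cycles_observed;
    }
    for (auto& f : flows) {
        if (f.serial) {
            f.elements_assigned = 0;  // serial work goes to traditional cores or the host
            continue;
        }
        // Proportional share of the budget, with a floor of one element so
        // the unlikely flows still have somewhere to run.
        double share = total ? static_cast<double>(f.cycles_observed) / total : 0.0;
        f.elements_assigned = std::max(1, static_cast<int>(share * element_budget));
    }
}

int main() {
    std::vector<Flow> flows = {
        {"dense_inner_loop", 9'000'000, false},
        {"boundary_fixup", 50'000, false},
        {"io_and_logging", 200'000, true},
    };
    rebalance(flows, 224);  // e.g. the 224 compute elements we count on the Maverick-2 die
    for (const auto& f : flows) {
        std::cout << f.name << ": " << f.elements_assigned << " elements\n";
    }
}
```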

At the point where the Maverick compiler has fully optimized the configurations of the hardware to run the likely and unlikely flows, the system creates what is called a mill core:

The mill cores aim to use as much of the dataflow engine’s resources as possible given the application snippets being offloaded to it, and they are, in essence, software-defined cores, created on the fly to run the most probable portions of the HPC code and accelerate them. The mill cores are optimized for throughput more than latency, and emphasize power efficiency over brute force. Importantly, a mill core can run hundreds or thousands of data streams in parallel, and mill cores can be replicated across the dataflow engine to do work in parallel, just like real CPU cores and real GPU streaming processors are in physical devices. Thusly:

When you replicate hundreds to thousands of data streams over hundreds of mill cores, you get massively parallel processing that can improve runtimes by orders of magnitude.
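
There is no public programming interface for mill cores, but the effect Raz describes maps onto a familiar pattern: one kernel replicated across many independent data streams. Here is that pattern rendered as a rough analogy in plain C++ threads – on Maverick, replicated mill cores rather than software threads would be doing this work.

```cpp
// Rough analogy only: software threads stand in for replicated mill cores,
// each running the same kernel over an independent data stream.
#include <iostream>
#include <numeric>
#include <thread>
#include <vector>

// The hot kernel a mill core would be specialized for -- here just a stream
// reduction standing in for the offloaded HPC inner loop.
double kernel(const std::vector<double>& stream) {
    return std::accumulate(stream.begin(), stream.end(), 0.0);
}

int main() {
    const int num_streams = 256;  // a mill core handles hundreds of streams at once
    std::vector<std::vector<double>> streams(num_streams, std::vector<double>(1024, 1.0));
    std::vector<double> results(num_streams, 0.0);

    // Replicate the kernel across all streams and run them in parallel.
    std::vector<std::thread> workers;
    workers.reserve(num_streams);
    for (int i = 0; i < num_streams; ++i) {
        workers.emplace_back([&streams, &results, i] { results[i] = kernel(streams[i]); });
    }
    for (auto& w : workers) w.join();

    std::cout << "sum of first stream: " << results[0] << "\n";
}
```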

The other central idea is to get the work and data onto the Maverick dataflow engine and keep as much of it there as possible to minimize data movement, which is the killer in any hybrid architecture. If you do that, you get a distribution of likely flows, reasonably likely flows, and unlikely flows that looks like a futuristic skyscraper across the three tiers of compute:

Pretty, isn’t it?

So is this spider graph, which shows how the salient characteristics of the Maverick dataflow engine stack up against CPUs, GPUs, and FPGAs:

The idea is to get the flexibility, portability, and programmability of a CPU and better power efficiency and throughput than a GPU or an FPGA, while sacrificing single-threaded performance, which can be handled by the Maverick embedded cores (E-cores) or the host CPU (likely an Arm or X86 device these days).

At this point, Raz is not revealing what the E-cores on the Maverick-2 chip are, but we are pretty much certain that they are not an Atom-inspired E-core of the same name from Intel, and we are pretty sure that they are either a licensed Arm core or a licensed or homegrown RISC-V core. There are not really other practical options in 2024. (No one is going to use a heavy Power core from IBM here.)

Now that we have turned our whole way of looking at the world on its head by talking about the software first, let’s finally take a look at the Maverick-1 and Maverick-2 hardware. Here is the Maverick-1 die:

We don’t know a lot about the Maverick-1 chip, but we count what look like banks of 24 E-cores – two of them, or maybe four, it is hard to tell from the die shot – and what look to our eye like 256 compute elements: four elements in a block, four blocks in a row, and sixteen rows in a chip. We do not know the process used to make the Maverick-1 chip, but we presume it is either a 16 nanometer or a 7 nanometer process from Taiwan Semiconductor Manufacturing Co.

Here is the Maverick-1 PCI-Express card:

The Maverick-2 chip is etched in 5 nanometer processes from TSMC, and like many startups, hyperscalers, and cloud builders that design their own chips, NextSilicon is using a third party to shepherd the compute engine through the TSMC foundry and its packaging partners to a finished product. The chip has an area of 615 mm², which is not reticle busting but which is not small, either:

The Maverick-2 has a total of 32 E-cores and what look to our eye like 224 compute elements – four blocks, with each block having a grid of seven by eight elements.

The frequency of the device, according to the spec sheet below, is 1.5 GHz, and we presume that both the E-cores and the dataflow processing elements are running at the same speed. (This could turn out to not be true. You might want to run the E-cores a lot faster, say 3 GHz, to tackle the serial work and knock it down. If we were designing the E-cores on the Maverick-2, they would have a base frequency of 1.5 GHz and turbo up to 3 GHz when they are running.)

As you can see from the specs below and the package shot above, the Maverick-2 chip has four banks of HBM memory, in this case HBM3E memory.

The dual-chip version of Maverick-2 supports the Open Compute Accelerator Module (OAM) form factor created by Microsoft and Meta Platforms to get a universal accelerator socket for datacenter accelerators. Intel and AMD use the OAM socket for their accelerators; Nvidia does not. The OAM version only exposes a total of 16 lanes of PCI-Express 5.0 I/O to the outside world instead of 32 lanes, too.

Each chip has a thermal design point of 300 watts, so the OAM unit specs out at 600 watts and is only available in a liquid cooled version. This is the new norm for supercomputing, so no big D.

Here is a conceptual diagram showing these two form factors:

Don’t get hung up on the physical layout in the diagram above. We do not think it is the same as the actual chip shot shown above. What you will note is that each Maverick-2 chiplet has a 100 Gb/sec Ethernet port to link the accelerators together.

It is not clear how Maverick-2 devices can be linked in a shared memory cluster. This is something we expect to learn more about early next year.

The Maverick-2 chip supports C, C++, and Fortran applications with the OpenMP and Kokkos frameworks. Raz says that NextSilicon will eventually add support for the Nvidia CUDA and AMD HIP/ROCm environments as well as popular AI frameworks.
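
The pitch, in other words, is that code like the following – a bog-standard daxpy-style OpenMP loop with nothing Maverick-specific in it – is what you hand to the toolchain unmodified. How the compiler is actually invoked for the dataflow engine has not been disclosed, so this is only meant to show the kind of portable code in question.

```cpp
// A garden-variety OpenMP kernel of the kind NextSilicon says runs unmodified.
// Nothing here is specific to Maverick; build with any OpenMP-capable compiler
// (e.g. -fopenmp) and the pragma parallelizes the loop across threads.
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    const std::size_t n = 1 << 20;
    std::vector<double> x(n, 1.0), y(n, 2.0);
    const double a = 3.0;

    // A plain OpenMP worksharing loop -- the kind of portable, directive-based
    // parallelism the Maverick toolchain is supposed to take as-is.
    #pragma omp parallel for
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }

    std::printf("y[0] = %f\n", y[0]);
}
```

A Kokkos parallel_for or a Fortran loop decorated with OpenMP directives would presumably be handled the same way, but NextSilicon has not yet shown the toolchain publicly.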

The peak theoretical performance of the scalar, vector, and tensor units on the dataflow engine are not being revealed until Q1 2025, and Raz doesn’t put a lot of stock in these numbers anyway given that CPUs and GPUs do not come close to their peak performance in the field.

“Peak performance doesn’t matter when you are seeing a lot of hardware vendors adding lots of teraflops, but you cannot reach them because it’s only in the GEMM, only in the matrix multiplication, only when you have local data,” says Raz. “The point with real applications is to utilize the hardware more efficiently rather than to add a lot of floating point that no one can reach.”

Could not have said that better ourselves. And given this, we would not be surprised to see a Maverick-2 chip come in with peaks in the tens of teraflops for vector and tensor FP64 and more fully utilize them on actual codes than GPUs can. This is, in fact, the idea that NextSilicon has based its company upon.

In its backgrounder document, NextSilicon says that the Maverick-2 will deliver 4X the performance per watt over the Nvidia “Blackwell” B200 GPU. We know that the B200 comes in at somewhere between 1,000 and 1,200 watts, and that a single Maverick-2 comes in at 300 watts and a pair in an OAM package comes in at 600 watts.

The document goes further and says that on a mix of HPC simulations, the Maverick-2 delivers over 20X the performance per watt of a 32-core Intel “Ice Lake” Xeon SP-8352Y Platinum processor, which is rated at 1.8 teraflops at FP64 precision on its AVX-512 vector units and which burns 205 watts.
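
Do the back-of-the-envelope math on that Xeon comparison – with the caveat that we are mixing the Xeon’s peak rating with a delivered performance per watt ratio, so take it as a rough bound rather than a spec: 1.8 teraflops at 205 watts is about 8.8 gigaflops per watt, and 20X that is roughly 175 gigaflops per watt, which at the 300 watt rating of a single Maverick-2 works out to something on the order of 50 teraflops of delivered FP64-equivalent throughput on that simulation mix. That squares with the tens-of-teraflops peak we guessed at above, if NextSilicon’s numbers hold up.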

But as Raz points out, in this case in particular, you are testing the strength of the Maverick compiler and its ability to create and replicate mill cores laid down on the dataflow engine as much as you are testing the raw FP64 and FP32 oomph of the chip. What we will want to know as benchmark results appear is what share of the device’s total capability is actually being used as it runs at a given rate – and then adjust that for cost.

We look forward to seeing results out of Sandia on the Spectra supercomputer. This could be a lot of fun, and hopefully it will upturn a lot of apple carts. HPC compute could use some love and not the floppy seconds of the AI crowd.

7 Comments

  1. I’m looking forward to seeing what types of sample codes they have that work well on this. My concern is that so many of the codes that perform poorly on GPUs and CPUs, do so because of memory latency, or because existing optimizations can’t be used due to ordering/correctness constraints (often false, but unclear due to the language). Often taking care of the ‘unlikely flow’ cases is very very costly to the likely flow.

  2. Way to go NextSilicon! And I’m glad to see Sandia’s Vanguard program testing out this emerging tech for NNSA viability ( https://www.sandia.gov/research/news/sandia-partners-with-nextsilicon-and-penguin-solutions-to-deliver-first-of-its-kind-runtime-reconfigurable-accelerator-technology/ ).

    It seems that there’s a lot of interest in using reconfigurable connections between computational units to allow systems to adapt flexibly to workloads that range from “standard flow” (or even no flow) to dataflow. Google’s reconfigurable optical interconnect for TPUs would be a large scale example, while SambaNova’s Reconfigurable Dataflow Units (RDUs), Groq’s Software-defined Scale-out TSP/LPU, or RipTide’s Coarse-Grained Reconfigurable array (CGRA) would be finer scale examples. In my mind, such reconfigurability should mean (as a goal) that the resulting machine performs as well on dense matrix-vector workloads as on graph-oriented workloads (hopefully), through those in-between (HPCG?), by reconfiguring itself accordingly ( https://www.nextplatform.com/2018/08/30/intels-exascale-dataflow-engine-drops-x86-and-von-neuman/ ).

    That the compiler (software) is an important aspect of this was certainly stressed by Tenstorrent’s Jim Keller in an excellent interview last year ( https://www.nextplatform.com/2023/08/02/unleashing-an-open-source-torrent-on-cpus-and-ai-engines/ ): “the graph needs to be lowered with interesting software transformations and map that to the hardware”. Here, with NextSilicon’s Maverick, the compiler additionally raises the (reconfigurable) hardware to the graph, which demands some cool extra sophistication (for extra performance), if I understand well.

    In time, the reconfigurable NoC between processing units might be advantageously implemented using high-bandwidth reconfigurable optical interposers, with integrated controllers (eg. https://www.eetimes.com/lightmatter-raises-400-million-series-d/ ) … and it should prove worthy to evaluate RAM amounts and distribution, as suggested in the Cerebras article from the day before yesterday, where SwarmX and MemoryX are used to provide supplemental memory where needed (to reap the full benefits of near- and in-memory computing).

    Cool stuff!

  3. “And now for something completely different” (as Monty Python may have titled this article)…Seems that others have tried to move beyond von Neumann’s fetch/execute/write-back loop in various ways over the years (1984 Yale-infused Multiflow with its very wide VLIW and a decade later with Intel’s light VLIW Itanic)…Always looking to a “super compiler” to draw out the latent parallelism in the Universe’s serial IF/THEN/ELSE code base…Hmmm and yet in a half-century+ only the GPU has risen to be on par with the serial CPU…Well maybe this time will be different.

  4. Thanks a lot Mr. Morgan for this very interesting insight.
    Still curious what Tachyum’s approach looks like in comparison…perhaps not much different, as they also talk about a general “faster in everything at lower power consumption” – in Germany this is called “Eierlegende Wollmilchsau” (a pig that produces eggs, milch and wool)?
    Let’s wait and see.

    • I would be happy to talk about Tachyum as soon as they show me a real chip. I still see this ICA, as NextSilicon calls it, as an accelerator. My understanding is that Prodigy is an actual host processor plus accelerator wrapped in one, and like NextSilicon, it has a compiler that accelerates the most common routines in code. That was the idea, anyway.

  5. …at least Tachyum says, it has now the last FPGA emulator before tape-out next year…time will tell…
    (btw. typo in my text…should be “milk” and not “milch” of course, unless “milch” should have made it (without my knowledge) into the English vocabulary as “Kindergarten”, “Angst” or “Sauerkraut” did somehow 🙂

    • Oddly enough, I have been to a kindergarten event where I experienced angst because my child is too chatty sometimes and also had sauerkraut on a frankfurter. HA!
