It has taken eight years and $303 million in seed and three rounds of venture funding, but NextSilicon is today delivering several incarnations of its 64-bit dataflow engine, called Maverick-2, which was revealed this time last year when the company dropped out of stealth mode.
The company is also unveiling a very interesting homegrown RISC-V processor called Arbel that, we presume, will be paired with Maverick-2 to create “superchip” host-accelerator combinations (as Nvidia calls them), representing a truly novel architecture that will be appealing to the HPC centers of the world, which still care very much about 64-bit floating point computing. First up to deploy a production Maverick-2 system will very likely be Sandia National Laboratories, which assisted with development of the Maverick-1 proof of concept processor that launched in 2022.
As we pointed out in our deep dive on the company last October, NextSilicon is interesting for a number of different reasons.

First, it is unabashedly an HPC-first company, which is something we have not seen in a compute engine maker for a long, long time. Second, NextSilicon has created a multi-tier computing architecture that has a reconfigurable dataflow engine at the central and most important tier where the bulk of computing for any HPC simulation or model is expected to run. (And by the way, nothing is preventing anyone from running AI applications on this processor.) And third, the Maverick architecture has automated the way code is ported, run, and continuously optimized as it is moved off of a CPU and run across NextSilicon’s own host processor, the special RISC-V cores embedded on the Maverick socket, and the banks of arithmetic units that do the bulk of computing and that represent most of the transistor count in a Maverick chip.
This last bit of the hardware architecture, which enables the software architecture to work, is the new revelation as the Maverick-2 chip comes to market today. So let’s start with the hardware and work our way up to the software.
Ilan Tayari, NextSilicon co-founder and vice president of architecture and formerly director of software at Mellanox (now the networking arm of Nvidia), walked through the differences between the Von Neumann architecture of a classical CPU and the new dataflow engine, which, quite frankly, makes a lot more sense for the visual thinkers in the crowd and is visually pleasing. With a dataflow engine such as the one NextSilicon has invented, the hardware literally maps itself to the software rather than trying to be a short order cook with too many orders but with 27 arms and eyes in the back of the head, flailing around to make up for the difficulties of getting the right data and the right instructions to collide billions of times per second.
In terms of area, Tayari says that around 2 percent of the typical CPU is dedicated to arithmetic logic units, or ALUs – ya know, the things that actually do math to transform data. (When Tayari points this out, it does sound a little stupid.) Everything else on the chip is just to stage the instructions and the data so they collide way down in the ALU.
With the Von Neumann architecture, created back in the 1940s with the first vacuum tube systems, there is a single, unified memory space that holds instructions and data. An instruction fetch unit reads instructions from memory in an order determined by a program counter, and an execution unit does the computation. Memory is accessed using registers and a memory management unit translates memory addresses. Over the years, various levels of SRAM cache have been added to processors, all to increase the odds that the right instructions and data can be found when needed, and circuits for branch prediction, speculative execution, and out of order processing have been added even as instructions moved from CISC (fat) to RISC (skinny) to push more bits through the goose.
“These solutions can increase performance, but they also have high costs,” Tayari explained at the Maverick-2 launch. “They take a lot of silicon real estate, and the mechanism to revert mispredictions has a large negative impact on performance. Here is the shocking reality. Today’s high end processors have become complicated and chunky, both physically and practically. They dedicate 98 percent of their silicon to overhead, traffic management, data shuffling – not actual computation – and add more complexity. Clock speeds suffer, and slower clocks means slower execution. Some GPUs devote up to 30 percent of their silicon to compute, but because the blocks are mutually exclusive, only a few can run at once. All of these efforts to improve processor design are merely reactive solutions – workarounds. End users pay three penalties. The chip costs more. Power consumption increases. Cooling requirements grow. Everything becomes more expensive.”
In the Intelligent Computing Architecture, or ICA, approach developed by NextSilicon, the idea was to create logic blocks composed of hundreds of interlinked ALUs, with instructions called in an application literally mapped to each ALU. (In this sense, an ALU is akin to an instruction.) NextSilicon is not releasing the specific number of ALUs per compute block, but we do know a few things. First, look at this chip shot of the monolithic Maverick-2 die:
There are four compute regions, with the 32 RISC-V E-cores on the outside edges on the left and right of the chip. By our count, each region has a grid of seven columns of eight compute blocks, for 56 blocks per region and a total of 224 compute blocks on the die. At hundreds of ALUs per compute block, you can easily get many tens of thousands to close to a hundred thousand ALUs. This does not seem unreasonable for a Maverick-2 chip that weighs in at 54 billion transistors and is etched in 5 nanometer processes from Taiwan Semiconductor Manufacturing Co. If you do a 14 by 14 grid as NextSilicon shows in its charts, then there are 196 ALUs per compute block; we do not know how many floating point units are in a compute block, but it would make sense that every ALU had an FPU.
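If both of those counts hold, the arithmetic is simple: 224 compute blocks times 196 ALUs per block works out to 43,904 ALUs on the die, which is squarely in that many-tens-of-thousands range.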
The “Ampere” A100 GPU from Nvidia was etched in 7 nanometer processes from TSMC and had 54.2 billion transistors and 6,912 FP32 CUDA cores, while the “Hopper” H100 and H200 GPUs were made in 4 nanometer processes, have 80 billion transistors, and have 18,432 FP32 cores. The Blackwell B200 socket has two chiplets, each with 104 billion transistors but only 16,896 CUDA cores, made in a 4 nanometer process. We surmise that the ALUs are smaller than CUDA cores and that there are more of them on a Maverick-2 die than there are CUDA cores on an Nvidia GPU.
Ultimately, the ALU count is not as important as the thread count that a collection of mill cores can support. Tayari says that a typical CPU has two threads, a GPU has between 32 and 64 threads, but a mill core can support hundreds of threads at once. The size and shape of the mill cores change, of course, but with maybe tens of mill cores per compute block and 224 compute blocks per Maverick-2, you are easily up to thousands of threads, all running at 1.5 GHz – about the clock speed of a slow CPU or a normal GPU – and all linked to HBM3E memory for high bandwidth.
What we really wanted to know is what the average utilization rates for the ALUs and FPUs were as applications start running and then as the optimizations are done and the resources on the Maverick chip get more fully utilized. We think that the utilization rate doesn’t have to be crazy high to match a CPU doing branch prediction, speculative execution, out of order processing, and other kinds of unnatural acts to make a serial processor run faster. Still, after tuning and across a wide variety of HPC applications, we would not be surprised if the ALU and FPU utilization got up to 75 percent or 80 percent of the potential blocks, with thousands of threads flitting in and out of the chip as mill cores come and go.
Anyway, you pour data into the flow at the top of the Maverick-2 chip, it flows through all of the transformations and calculations, and out pops the answer at the bottom.
This main logic unit, as you see in the chart above on the right, is attached to a memory bus, which has a reservation station to temporarily store data before an ALU calls for it. (It is this combination of reservation station, dispatcher, and dataflow compute block that NextSilicon has patented.) Like regular CPUs, the Maverick ICA uses memory management units and a translation lookaside buffer, but these are used sparingly and only when an ALU calls for specific data. There is no speculation or prediction, just fetching.
“NextSilicon’s dataflow architecture allows us to achieve significantly lower overhead compared to traditional CPUs and GPUs,” brags Tayari. “We pivot the silicon allocation ratio. We dedicate the majority of the resources to actual computation rather than control overhead. Our approach uniquely eliminates instruction handling overhead. We minimize unnecessary data movement, and the result is compute units stay fully utilized. We’re not trying to hide latency, but to tolerate and minimize it by design.”
When an application is compiled for the dataflow engine, it is literally mapped onto the hardware as something called a mill core, which looks like the intermediate representation graph of a program before it is compiled down to machine code and which is laid down on the ALUs. Many mill cores can be laid down on the same compute block, Tetris style, and mill cores can be loaded up and deleted in a matter of nanoseconds as the workload requires, according to Elad Raz, co-founder and chief executive officer at NextSilicon.
While this is all interesting, that is not the magic of NextSilicon’s ICA. (Wait for it – this is the magic compiler moment.) You can take existing C, C++, or Fortran code, grab its intermediate representation, and plunk that down onto the ICA, and Maverick-2 will not only map it to its ALU blocks and thus compile it, but it also has algorithms that constantly analyze how the resulting dataflow is functioning and change it on the fly, without human intervention, to improve it. The longer the code runs, the better it gets, and if you change the code at a higher level, you just run it and let it self-tune for a while. There is no porting to Nvidia CUDA-X or AMD ROCm or anything else – just an automated way to take CPU code and make it run on a massively parallel dataflow engine without you having to figure it out.
As we explained a year ago, the entire application is not ported to the Maverick chips – just the parts of the code that are commonly used and represent 80 percent or so of the instruction runtime. Any code that does not benefit from running on the ALU blocks in dataflow mode can run on the 32 RISC-V E-cores that are embedded on the Maverick-2 package, and for those parts of the code that would benefit from a beefier set of cores and memory, execution will take place on the CPU host processor, which for now is an unnamed X86 chip.
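To make that split concrete, here is a minimal sketch – the kernel and names are ours, not NextSilicon’s – of the kind of hot loop that, as we understand the flow, would be lifted into a mill core on the ALU blocks while setup and I/O stay on the host CPU or the E-cores:

```cpp
#include <cstdio>
#include <vector>

// Hot loop: in a real HPC code, something like this is where 80 percent or so
// of the runtime lives, so it is what would be mapped into a mill core on the
// ALU blocks. The kernel is illustrative, not from NextSilicon.
void smooth(std::vector<double> &y, const std::vector<double> &x, double alpha) {
    for (size_t i = 1; i + 1 < x.size(); ++i)
        y[i] = alpha * x[i] + 0.5 * (x[i - 1] + x[i + 1]);
}

// Cold path: setup, argument handling, and I/O would stay on the host CPU or
// the RISC-V E-cores rather than being lowered onto the dataflow fabric.
int main() {
    std::vector<double> x(1'000'000, 1.0), y(1'000'000, 0.0);
    smooth(y, x, 0.5);
    std::printf("y[1] = %f\n", y[1]);
    return 0;
}
```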
Here is what the entire workflow looks like:
One of the tricks in the Maverick architecture is to know what code to leave on the host CPU, what code to leave on the RISC-V E-cores (short for embedded cores), and what code must be moved to the ALU blocks. The other trick is that if there is a particular routine that could be parallelized and replicated to speed up the overall application, then the Maverick compiler can just plunk down more copies of the mill cores created for that specific routine. The compiler is, in effect, creating software cores in the shape and number that you need for this application at this time rather than taking a specific number of static compute units with specific integer and floating point precisions – and there are many combinations of static compute elements on CPUs and GPUs these days. Thus the dataflow ICA is almost as malleable as an FPGA in this regard, but it programs itself. (As many FPGA tools claim to do, by the way, converting C, C++, or Fortran code to HDL automagically.)
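As a hedged illustration of the kind of routine that invites such replication (again, the code and names are ours, not NextSilicon’s): a loop whose iterations are independent of one another, so the toolchain could in principle stamp down several copies of the corresponding mill core and hand each one a slice of the index range.

```cpp
#include <vector>

// Illustrative only: an embarrassingly parallel loop. Because no iteration
// depends on another, several copies of the same mill core could each work
// on their own slice of the index range.
void scale_slice(std::vector<double> &v, double alpha, size_t begin, size_t end) {
    for (size_t i = begin; i < end; ++i)
        v[i] *= alpha;
}

int main() {
    std::vector<double> v(1'000'000, 1.0);
    // Conceptually, each call below could be a separate copy of the mill core
    // working on its own half of the data.
    scale_slice(v, 2.0, 0, v.size() / 2);
    scale_slice(v, 2.0, v.size() / 2, v.size());
    return 0;
}
```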
This spider graph helps illustrate some things about Maverick-2:
The story this chart tells is that the dataflow processor created by NextSilicon has the compatibility, portability, programmability, and flexibility of a CPU, better power efficiency and throughput than a GPU, but about the single threaded performance of an FPGA. Nothing has the single threaded performance of a CPU, which is why you put CPU cores inside of FPGAs and now DFPs and also let host CPUs pick up the slack on serial tasks.
Here is an updated salient characteristics table for the single-die and dual-die implementations of the Maverick-2 chip:
The chip max power is higher than what was expected when we did our story last year, with the TDP on the single chip Maverick-2 now at 400 watts (up from 300 watts) and on the dual-chip version for OAM sockets now at 750 watts (instead of 600 watts). Everything else is the same.
What is also different this time around is that we have peak flops performance at different floating point precisions, as you can see in the table above. We are not sure how to configure the Maverick-2 as a matrix/tensor unit, but that sounds neat. Clearly it can be done and it does boost performance.
These peak numbers for Maverick-2 are not very high compared to recent GPUs. With an Nvidia H100, for instance, the peak FP64 performance on dense data is 33.5 teraflops on the vector units and 67 teraflops on the tensor units. A single Maverick-2 is about a third of that. But Raz wrote about the gaming that happens with peak theoretical performance and the fact that sustained performance is what matters in the real world back in February of this year in a column in The Next Platform. And we agree.
Still, peak flops also gives us a ceiling to measure against, so it is important to know both sustained and peak performance. That way, we can judge the computational efficiency of a compute or networking engine. The limited evidence we have suggests that Maverick-2 is very efficient, and GPUs are not. Take a look:
The GUPS benchmark, short for Giga Updates Per Second, is designed to stress test the bandwidth and latency of the memory subsystem of a compute engine, and frankly it is not one that we are very familiar with. On the GUPS test, the Maverick-2 was rated at 32.6 GUPS running at 460 watts. This was, according to NextSilicon, 22X faster than a CPU and nearly 6X faster than a GPU, but we have no idea what CPU or GPU was tested. (This should be in the presentation notes, as AMD, Intel, and Nvidia do.)
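For those who have not run it, GUPS measures how many random read-modify-write updates to a large table in memory a machine can sustain per second. Here is a minimal, single-threaded sketch of the kernel – not the official HPC Challenge RandomAccess code, and the table size and update rule are merely illustrative:

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    const size_t table_size = 1ull << 27;              // 128M 64-bit words, about 1 GB
    const size_t num_updates = 4 * table_size;         // four updates per table word
    std::vector<uint64_t> table(table_size);
    for (size_t i = 0; i < table_size; ++i) table[i] = i;

    uint64_t x = 1;                                    // cheap pseudo-random index stream
    auto start = std::chrono::steady_clock::now();
    for (size_t i = 0; i < num_updates; ++i) {
        x = x * 6364136223846793005ull + 1442695040888963407ull;
        table[x & (table_size - 1)] ^= x;              // read-modify-write at a random address
    }
    auto stop = std::chrono::steady_clock::now();

    double secs = std::chrono::duration<double>(stop - start).count();
    std::printf("GUPS = %.3f (check: %llu)\n",
                num_updates / secs / 1e9,
                (unsigned long long)table[1]);
    return 0;
}
```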
The STREAM benchmark is the better known memory bandwidth test, and it is commonly in the suite of benchmarks used by HPC shops. The chart seems to suggest that Maverick-2 is getting peak bandwidth as measured bandwidth, but we do not believe this is possible. It could be close, but it can’t be perfect. In any event, in different notes, NextSilicon says it delivered 5.2 TB/sec – which implies it was the dual-chip OAM module being tested – and that this is 83.9 percent of peak and 1.86X better performance per watt than a GPU. (Again, which one?)
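Working backwards from NextSilicon’s own numbers, 5.2 TB/sec at 83.9 percent of peak implies a peak of roughly 6.2 TB/sec for the OAM module. For reference, here is a minimal sketch of the STREAM triad kernel, the most commonly quoted of the four kernels STREAM reports; the array size is illustrative and the official benchmark takes the best of several trials:

```cpp
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    const size_t n = 1ull << 27;                       // about 128M doubles (1 GB) per array
    std::vector<double> a(n, 1.0), b(n, 2.0), c(n, 0.0);
    const double scalar = 3.0;

    auto start = std::chrono::steady_clock::now();
    for (size_t i = 0; i < n; ++i)
        c[i] = a[i] + scalar * b[i];                   // triad: two reads and one write per element
    auto stop = std::chrono::steady_clock::now();

    double secs = std::chrono::duration<double>(stop - start).count();
    double gbytes = 3.0 * sizeof(double) * n / 1e9;    // bytes moved: two arrays read, one written
    std::printf("Triad bandwidth = %.1f GB/sec (check: %f)\n", gbytes / secs, c[n / 2]);
    return 0;
}
```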
Here is what the Maverick-2 OAM package looks like:
On the HPCG test, which shames all HPC systems large and small and is the best indicator of how a system will work on very tough HPC problems, a single Maverick-2 was rated at 600 gigaflops running at 600 watts (which makes us presume it was actually a pair of Maverick-2 chips in the OAM socket). NextSilicon says that this was “matching the leading GPU performance” while consuming half the power.
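HPCG runs a preconditioned conjugate gradient solver whose time is dominated by sparse matrix-vector products and symmetric Gauss-Seidel smoothing, which is why it is so brutally memory bound. As a reminder of what that inner kernel looks like, here is a minimal CSR sparse matrix-vector product sketch – illustrative only, not the benchmark code:

```cpp
#include <cstdio>
#include <vector>

// Minimal sketch of a CSR sparse matrix-vector product, the kernel that
// dominates HPCG's runtime (the real benchmark wraps it in a multigrid
// preconditioner and Gauss-Seidel smoothing).
void spmv_csr(const std::vector<int> &row_ptr, const std::vector<int> &col_idx,
              const std::vector<double> &values, const std::vector<double> &x,
              std::vector<double> &y) {
    for (size_t row = 0; row + 1 < row_ptr.size(); ++row) {
        double sum = 0.0;
        for (int k = row_ptr[row]; k < row_ptr[row + 1]; ++k)
            sum += values[k] * x[col_idx[k]];          // irregular, memory-bound gathers from x
        y[row] = sum;
    }
}

int main() {
    // A toy 3x3 tridiagonal matrix in CSR form.
    std::vector<int> row_ptr = {0, 2, 5, 7};
    std::vector<int> col_idx = {0, 1, 0, 1, 2, 1, 2};
    std::vector<double> values = {2, -1, -1, 2, -1, -1, 2};
    std::vector<double> x = {1, 1, 1}, y(3, 0.0);
    spmv_csr(row_ptr, col_idx, values, x, y);
    for (double v : y) std::printf("%f\n", v);
    return 0;
}
```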
And finally, on the PageRank graph analytics benchmark based on the web page ranking algorithm created by Google, Maverick-2 did 10X better than “leading GPUs.”
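For the curious, here is a minimal power-iteration PageRank sketch on a toy graph; the benchmark NextSilicon cites presumably runs on far larger graphs, and the edge list and damping factor here are merely illustrative:

```cpp
#include <algorithm>
#include <cstdio>
#include <utility>
#include <vector>

int main() {
    const int n = 4;
    // A toy directed web graph as an edge list (from -> to).
    std::vector<std::pair<int, int>> edges = {{0, 1}, {0, 2}, {1, 2}, {2, 0}, {3, 2}};
    std::vector<int> out_degree(n, 0);
    for (const auto &e : edges) out_degree[e.first]++;

    const double d = 0.85;                              // damping factor
    std::vector<double> rank(n, 1.0 / n), next(n, 0.0);

    for (int iter = 0; iter < 50; ++iter) {
        std::fill(next.begin(), next.end(), (1.0 - d) / n);
        for (const auto &e : edges)                     // scatter each page's rank along its out-links
            next[e.second] += d * rank[e.first] / out_degree[e.first];
        rank.swap(next);
    }
    for (int i = 0; i < n; ++i)
        std::printf("page %d: %.4f\n", i, rank[i]);
    return 0;
}
```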
We want to get more clarification on this, and look forward to fuller performance and price/performance benchmark comparisons. We also want to understand how NextSilicon is going to be able to scale performance beyond a single socket with scale up and scale out networks.
A Dataflow Engine Still Needs A CPU Host
That leaves us with the Arbel RISC-V processor, also announced by NextSilicon today.
NextSilicon is calling Arbel a “test chip,” but we think that the company is looking for a companion to its Maverick dataflow processors (DFPs?) that allows customers to have a fully integrated CPU that is not just an off-the-shelf X86 or Arm CPU from someone else – and does not come with a licensing fee owed to Arm Ltd.
Here are the feeds and speeds of the Arbel CPU:
The Arbel chip has a completely homegrown RISC-V core, just as the Maverick-2 does. (It is not clear how similar these cores are, but don’t assume they are the same.)
The Arbel core has a 10-wide issue decoder and six ALUs on the integer side and four 128-bit FPUs on the vector side. The core can support 16 scalar instructions in parallel. The core has 64 KB of L1 instruction cache and 64 KB of L1 data cache close to the ALUs and a 1 MB L2 cache close to the FPUs. (Both caches are obviously cross-linked to all the compute elements.) There is a 2 MB cache per core, but again, we don’t know how many cores are on the Arbel chip.
What we do know is that NextSilicon says that the Arbel core can “stand toe-to-toe” with Intel’s “Lion Cove” Xeon core and AMD’s “Zen 5” Epyc core.


A question to reply to: Is NextSilicon the answer to HPC’s lunatic electricity consumption (all those megawatt-hours)? A Houston HPC company made an investment ($12 million) with priority purchase rights on the company’s networking business. Now, was that decision common sense or a gross error?
NextSilicon hasn’t shown their performance per Watt on any real-world HPC applications yet so it remains to be seen how they compare to GPUs and CPUs in this regard. A fair test would be to measure the power and performance of SPECaccel 2023 and SPEC CPU FP 2017 for Maverick-2 and the best available GPUs and CPUs when Maverick-2 becomes available. The results for each individual test in these suites should be provided so that it’s clear where Maverick-2 is a good solution. SPEC has developed a power measurement methodology that could be used.
The GUPS and HPCG benchmarks only stress the DRAM system. These tests don’t provide any information about computationally intensive applications that are not limited by the DRAM system. HBM3E is much higher performance and more power efficient than GDDR DRAM. It’s ridiculous to compare an accelerator with HBM3E to a GPU with GDDR DRAM on a test that only stresses the DRAM system because that just shows the difference between HBM3E and GDDR DRAM, not the difference between Maverick-2 and a GPU with HBM. Nvidia and AMD are introducing new GPUs every year. I wouldn’t believe any claim that’s not backed up by hard data. NextSilicon should provide hard data showing where they are better or potential customers will just assume they are not better.
I agree that sustained performance, rather than peak performance, is what matters. I would like to know the sustained performance of Maverick-2 on some real-world HPC applications and how the price/performance ratio in real-world HPC applications compares to the best available GPUs and CPUs. A startup’s product needs to be significantly better than products from established companies to make up for the risk that the startup will go bankrupt.
I hope NextSilicon will be competitive in supercomputers but there is not much profit to be made there. In smaller systems, I’d like to know how the price/performance ratio of a Maverick-2 system compares to a DGX Station. A DGX Station with a Xeon CPU having an NVLink connection to an Nvidia GPU will be able to run existing commercial x86 software. The CPU and GPU in a DGX Station will have cache coherent access to each other’s DRAM. The description of Maverick-2 on NextSilicon’s website has no mention of CXL type 2. Someone should make a liquid-cooled workstation, like a DGX Station, but with one OAM socket for a module like Maverick-2 or an AMD Instinct accelerator.
My fear is that this sounds a lot like the case for VLIW that led to Itanium. “The compiler can find the parallelism” was hard then, and it’s probably hard now. At least it’s targeted at HPC, which is also where Itanium was relatively competitive.
I heard those echoes, too.