The AI model makers of the world have been waiting for more than a year to get their hands on the Trainium3 XPUs, which have been designed explicitly for both training and inference and which present a credible alternative to Nvidia’s “Blackwell” B200 and B300 GPUs as well as Google’s “Trillium” TPU v6e and “Ironwood” TPU v7p accelerators.
But the minute that Matt Garman, chief executive officer of Amazon Web Services, started talking about the future Trainium4 XPUs that are expected to be delivered in maybe late 2026 or early 2027, everyone who was queuing up to buy EC2 Capacity Blocks based on Trainium3 was bracing for a bad case of buyer’s remorse. Because as good as Trainium3 is compared to prior generations of the Inferentia and Trainium XPUs put together by the company’s Annapurna Labs chip design arm, Trainium4 looks like it is going to bust the size of a socket wide open and present not just a very powerful device, but much more scalable UltraServer clusters that will be much better at running mixture of experts, chain of thought reasoning models.
Before drilling down into what Trainium4 might be, let’s take a moment and actually review what the Trainium3 XPUs are, especially since many of the technical specifications of that chip and its predecessor, the Trainium2, have only just recently been made available. And let’s start with the spec chart for Trainium3 that Garman spoke to during his keynote address at the opening of the re:Invent 2025 conference in Las Vegas this week:
This is an update of the slide that AWS showed off this time last year previewing the three points of data it was willing to talk about. Trainium3 uses a 3 nanometer process node from Taiwan Semiconductor Manufacturing Co, a shrink from the 5 nanometer technology that most of us think Annapurna Labs used for the Trainium2 chip. The Trainium3 was expected to deliver 2X the compute (and that can mean a lot of different things) and offer 40 percent more energy efficiency (which is not a particularly useful metric since no one seems to know what the wattages are for Trainium1, Trainium2, or Trainium3). But clearly, the process shrink was used more to cut back on power than to cram new features into the chip, and the socket was made larger to boost performance, with a net gain in performance per watt of 40 percent.
Amazon sells regular server instances based on Trainium2 as well as UltraServer cluster configurations with a total of sixteen Trainium2 sockets in a shared memory domain, but thus far has only delivered Trainium3 UltraServers with 64 Trainium3s in a single memory domain.
The 4.4X improvement in compute throughput for the Trn3 UltraServers, as the instances are called on AWS, therefore makes sense compared to the Trn2 UltraServers, which have four times fewer XPUs. The latest UltraServers, according to Garman, have 3.9X the aggregate HBM memory bandwidth of the Trn2 UltraServers and – most importantly for those worried about the cost of inference, which is the gating factor for the commercialization of GenAI – can generate 5X the number of tokens per megawatt.
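A quick back-of-the-envelope check shows how much of that 4.4X is simply socket count. The 16-socket and 64-socket figures are the UltraServer configurations described above; the per-socket uplift is our inference, not an AWS-published number.

```python
# Hedged sanity check of the UltraServer scaling claim in the text.
# Socket counts (16 for Trn2, 64 for Trn3) come from the article;
# the per-socket figure is derived arithmetic, not AWS data.

trn2_sockets = 16
trn3_sockets = 64

socket_ratio = trn3_sockets / trn2_sockets       # 4.0x more XPUs per domain

claimed_compute_uplift = 4.4                     # Garman's aggregate figure
per_socket_uplift = claimed_compute_uplift / socket_ratio
print(f"Implied per-socket compute uplift: {per_socket_uplift:.2f}x")  # ~1.10x
```

In other words, most of the gain is scale-out, with roughly a 10 percent per-socket improvement left over at constant precision.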
Here are the Pareto Curves that Garman shared for this performance claim, which shows the interplay of output tokens per megawatt on the Y axis against the interactivity of the output, expressed in tokens per second per user:
Shifting that curve up and out is the whole game of getting inference business in 2025 and beyond. This particular set of charts compared a Trn2 UltraServer cluster against a Trn3 UltraServer cluster running OpenAI’s GPT-OSS 120B model.
What this chart also shows, but which Garman did not talk about, is that you can get about an order of magnitude more interactivity for the same amount of energy if that is important to your inference workload.
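Picking an operating point off a Pareto curve like this is a constrained optimization: maximize tokens per megawatt subject to a minimum interactivity your users will tolerate. The sketch below illustrates the selection logic with made-up (interactivity, tokens-per-MW) pairs; none of these values are AWS measurements.

```python
# Illustrative only: choosing an operating point off an inference Pareto
# curve. The (tokens/sec/user, output tokens per MW) pairs below are
# hypothetical, not the data behind Garman's chart.

pareto_points = [
    (25, 9.0e6),
    (50, 7.5e6),
    (100, 5.0e6),
    (200, 2.5e6),
]

def best_point(points, min_interactivity):
    """Maximize tokens/MW subject to an interactivity floor."""
    feasible = [p for p in points if p[0] >= min_interactivity]
    return max(feasible, key=lambda p: p[1]) if feasible else None

print(best_point(pareto_points, min_interactivity=50))   # (50, 7500000.0)
print(best_point(pareto_points, min_interactivity=300))  # None: infeasible
```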
Somewhere along the way when the Trainium2 instances were ramping on its cloud, AWS updated the specifications for this XPU, and we also found some specs for the Trainium3 that remove some of the mystery and fill in a lot of the blanks about how the components are stacked up in the Trainium sockets to make each successive XPU.
Let’s start with the NeuronCores and work our way out.
All of the NeuronCore designs put four different kinds of compute into the core, much as CPU cores have long since mixed integer (scalar) and vector units and occasionally (Intel Xeon 5 and 6 and IBM Power10 and Power11) have tensor units as well. And starting with the Trainium line, Annapurna Labs added collective communications cores, or CC-Cores, to the architecture to handle the processing specific to collective operations common in HPC and AI workloads, so that really makes five.
With the NeuronCore-v1 architecture, which was only used in the Inferentia1 chips, there is a Scalar Engine for integer math (two integer inputs and a single integer output), a Vector Engine for vector math (two floating point inputs, one floating point output), and a Tensor Engine for tensor math (multiple matrix floating point inputs and a single matrix floating point output).
According to AWS documentation, the Scalar Engine in the NeuronCore-v1 could process 512 floating point operations per clock cycle and handled FP16, BF16, FP32, INT8, INT16 and INT32 data types. (We think AWS meant to say it handled 512-bit data). The documentation also says the Vector Engine could handle 256 floating point operations per cycle (and again, we think it is 256 bits of data) and also worked with FP16, BF16, FP32, INT8, INT16 and INT32 data formats. You can calculate the operations per cycle based on the width of the data and how many you can pack into each unit.
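The calculation the last sentence describes is simple lane arithmetic: if an engine moves a fixed number of bits per cycle, the operations per cycle at a given precision is just the datapath width divided by the element width. The 512-bit and 256-bit widths below are our reading of the AWS documentation, as discussed above.

```python
# Ops-per-cycle from data width, as described in the text: lane count is
# datapath width divided by element width. The 512-bit (Scalar Engine)
# and 256-bit (Vector Engine) widths are the article's interpretation.

def lanes_per_cycle(engine_width_bits, element_bits):
    return engine_width_bits // element_bits

# NeuronCore-v1 Scalar Engine, read as a 512-bit datapath:
for fmt, bits in [("FP32/INT32", 32), ("FP16/BF16/INT16", 16), ("INT8", 8)]:
    print(f"Scalar Engine, {fmt}: {lanes_per_cycle(512, bits)} ops/cycle")

# NeuronCore-v1 Vector Engine, read as 256 bits wide:
print("Vector Engine, FP32:", lanes_per_cycle(256, 32), "ops/cycle")
```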
The dimensionality of the TensorEngine for NeuronCore-v1 was never revealed, but we do know that it handled FP16, BF16, and INT8 inputs and FP32 and INT32 outputs and delivered 16 teraflops of FP16 or BF16 tensor processing.
With that first NeuronCore-v1 design discussed, let’s lay them all out side by side up to where we think Trainium4 might be:
AWS started talking about Trainium1 in December 2020 at re:Invent and took two years to fully ramp it, which is understandable given the fact that this was Amazon’s first homegrown, datacenter-class training accelerator. Trainium1 was etched, we think, with TSMC 7 nanometer processes; we know it had 55 billion transistors and that it ran at 3 GHz. This chip used the same NeuronCore-v2 architecture as the Inferentia2 chip that followed it later to market in April 2023 with a shrink to 5 nanometer processes and about the same transistor count but with some tweaks for inference-specific workloads, such as half as many NeuronLink chip interconnect ports.
With Trainium2, divulged in November 2023 and shipping in volume in December 2024, AWS moved on to the NeuronCore-v3 architecture and stopped making Inferentia chips because inference started becoming more like training. The number of cores per socket was quadrupled with Trainium2, and the total number of NeuronCores in a single memory domain went up by a factor of 16X as the number of sockets per instance also went up by a factor of four. As far as we can tell, AWS also boosted the clock speed on the Trainium2 with the shrink to 5 nanometers from 7 nanometers with Trainium1. Interestingly, the peak scalar and vector performance of each NeuronCore went down with v3 by about 60 percent, and peak tensor core throughput went down by 12 percent. But AWS added 1:4 sparsity support to the chip for tensor operations, and that combined with the higher number of cores boosted the effective throughput of Trainium2 by 3.5X compared to Trainium1 at FP16 or BF16 precision. In fact, NeuronCore-v3 supports a bunch of different sparsity patterns: 4:16, 4:12, 4:8, 2:8, 2:4, 1:4, and 1:2.
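To make those N:M patterns concrete: in an N:M structured sparsity scheme, at most N of every M consecutive weights may be nonzero, which lets the hardware skip the zeroed multiplies. The checker below is a minimal sketch of the constraint itself, not Annapurna Labs' hardware logic.

```python
# Minimal illustration of N:M structured sparsity: every aligned group
# of m consecutive weights may contain at most n nonzero values. This is
# the constraint the listed patterns (4:16, 2:4, 1:4, etc.) express.

def satisfies_n_of_m(weights, n, m):
    """True if every aligned group of m weights has at most n nonzeros."""
    for i in range(0, len(weights), m):
        group = weights[i:i + m]
        if sum(1 for w in group if w != 0) > n:
            return False
    return True

w = [0.5, 0.0, -1.2, 0.0,   0.0, 0.3, 0.0, 0.9]
print(satisfies_n_of_m(w, 2, 4))   # True: each group of 4 has <= 2 nonzeros
print(satisfies_n_of_m(w, 1, 4))   # False: groups have 2 nonzeros apiece
```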
The SRAM memory for the NeuronCore-v3, shared by the three compute units, was boosted to 28 MB per core, but we do not know from what amount. HBM memory was finally boosted to 96 GB, a 3X improvement, and bandwidth was increased by 3.5X to 2.9 TB/sec. This was, arguably, the first competitive Trainium chip, and it is not a coincidence that Anthropic has been using these Trainium2 devices for its model development and inference, and that most of the inference from the AWS Bedrock model service has been done with Trainium. We suspect that most of the millions of units of Trainium that Garman spoke of in his keynote were Trainium2 devices.
That brings us all the way to Trainium3, which is now shipping in volume in UltraServer instances. With the NeuronCore-v4 architecture that is the heart of the Trainium3 device – yes, it would have been better if the core name synced up with the device name – there are a few big changes. First, the Vector Engine has been tweaked to do fast exponential function evaluation, which is part of the self-attention algorithms of GenAI models and which runs at 4X the performance of the Scalar Engine doing this job. And second, the FP16 and BF16 data formats can be quantized into the MXFP8 format, which AWS says is useful for data quantization between multi-layer perceptron (MLP) layers in a GenAI model. The NeuronCore-v4 design also boosted SRAM to 32 MB per core. Clock speed seems to have changed nominally but not significantly between Trainium2 and Trainium3, but the big changes in the device are the doubling of the bandwidth of the NeuronLink-v4 XPU interconnect ports to 2.5 TB/sec, the 1.5X increase in HBM memory capacity to 144 GB, and the 1.7X increase in HBM bandwidth to 4.9 TB/sec.
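MXFP8 belongs to the OCP Microscaling (MX) family of formats, in which each block of 32 values shares one power-of-two scale and the scaled values are stored in 8 bits. The sketch below shows the block-scaling idea only; the element rounding is simplified and real FP8 (E4M3/E5M2) rounding behaves differently.

```python
import numpy as np

# Rough sketch of MX-style block quantization (the family MXFP8 belongs
# to): each block of 32 values shares one power-of-two scale chosen so
# the block maximum fits in the FP8 range. Element rounding to actual
# 8-bit floats is omitted for simplicity.

BLOCK = 32
FP8_MAX = 448.0  # E4M3 maximum normal value

def mx_quantize(x):
    x = np.asarray(x, dtype=np.float32)
    pad = (-len(x)) % BLOCK
    x = np.pad(x, (0, pad))
    blocks = x.reshape(-1, BLOCK)
    # Shared power-of-two scale per block so |value|/scale <= FP8_MAX.
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    scale = 2.0 ** np.ceil(np.log2(np.maximum(amax, 1e-30) / FP8_MAX))
    return blocks / scale, scale  # scaled payload + per-block scales

vals = np.random.randn(64).astype(np.float32)
q, scales = mx_quantize(vals)
print(q.shape, scales.shape)        # (2, 32) (2, 1)
print(np.abs(q).max() <= FP8_MAX)   # True
```

Reconstructing the original values is just `q * scales`, which is why the format is cheap to pack and unpack between MLP layers.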
We think the changes in the Trainium3 design are meant to get compute, memory, and interconnect back into a better balance to boost not the peak theoretical performance but the effective performance of the Trainium3 socket. The memory domain of the Trn3 Gen1 UltraServer stayed at 64 devices, the same as Trainium2, but with Trn3 Gen2 UltraServers, which are shipping now, the domain size was increased to 144 sockets. That yields a 2.25X increase in the number of cores that can be thrown at an AI training or inference job.
Which brings us all the way to Trainium4, which is expected to start rolling out about this time next year.
With what we presume will be called the NeuronCore-v5 architecture, AWS will be adding proper FP4 support to the Trainium processing, not just stuffing MXFP4 into its FP8 slot in a tensor and leaving a lot of empty space. Garman said in his keynote that Trainium4 would have 6X the performance of Trainium3 through this adoption of native FP4 formats, which implies that FP8 processing will go up by 3X. Garman said further that Trainium4 will have 2X the HBM memory capacity and 4X the HBM bandwidth of Trainium3.
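The implied 3X comes from a simple assumption: halving the precision from FP8 to FP4 doubles the lanes, so the remaining uplift must be raw per-socket improvement. That reading, sketched below, is our inference from Garman's claim, not an AWS-stated breakdown.

```python
# Back-of-the-envelope reading of the 6X claim: assume FP4 runs twice as
# fast as FP8 (half the bits, twice the lanes), and the rest of the gain
# is raw FP8 throughput. This decomposition is the article's inference.

claimed_fp4_uplift = 6.0     # Trainium4 FP4 vs Trainium3, per Garman
fp4_vs_fp8_speedup = 2.0     # assumed: half-width data doubles throughput

implied_fp8_uplift = claimed_fp4_uplift / fp4_vs_fp8_speedup
print(f"Implied FP8 uplift, Trainium4 vs Trainium3: {implied_fp8_uplift:.0f}x")
```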
In the monster table above, we have tried to suss out what this Trainium4 might look like and how the memory domain might be extended further for a coupled set of Trainium4 devices.
There are many different ways to get there, and we think AWS will either move to a 2 nanometer process and save some power or stick with the 3 nanometer process and save some money while making slightly larger and hotter XPUs. It is a tough call, but we think AWS will lean into 2 nanometer etching for Trainium4.
If you look at Garman’s chart above, you will note that it says Trainium4 will support both NVLink and UALink ports on the device – Nvidia made a big deal about AWS picking up NVLink technology, but we have a hunch that AWS will be making variants of its Graviton family of chips with NVLink ports and getting something that Nvidia has been loath to talk about: the ability to glue custom CPUs and custom XPUs into a giant shared memory domain with NVLink ports and NVSwitch memory fabric switches. Thus far, Nvidia has been happy to let customers have custom CPUs that link to Nvidia GPUs or custom XPUs that link to Nvidia GPUs, but it has not allowed this third option.
We think that AWS buys enough GPUs that it can ask for – and receive – such a thing, and at a fair price. We also think that AWS will support the Nvidia NVFP4 data format as well as the MXFP4 format for FP4 processing, and that this was probably part of the exchange to make it easier for work on Trainium4 chips to be moved to “Blackwell” and “Rubin” GPUs from Nvidia. These are just hunches, of course. We also think that AWS wants to be able to plug this into its own racks, which will essentially be clones of Nvidia’s racks.
But it is interesting to note that UALink is also in the slide above from Garman. AWS is keeping its options open, and no doubt wants a chiplet architecture for the Trainium4 package that will allow it to switch out NVLink ports for UALink ports, and a rack design that allows for NVSwitch switches to be swapped out for UALink switches when they come to market perhaps later next year. It could turn out that NeuronLink-v5 is tweaked to be compatible with UALink 2.0 and these will be switches from Annapurna Labs, not from Astera Labs, Upscale AI, Marvell, Cisco Systems, or others who are offering scale-up interconnect ASICs.
The easiest way to get 3X the performance in the same or a slightly smaller thermal envelope is to just triple up the cores and keep the clock speeds about the same along with the move to 2 nanometer processes. If the transistor shrink is larger (the 1.6 nanometer A16 process from TSMC), then the thermals can be taken down a bit or the clocks cranked a tiny bit. Our advice would be to take the thermal advantage and keep everything else the same, just as AWS did between Trainium2 and Trainium3, and just add 3X the cores.
If you triple the cores to 24 per socket, perhaps spread across four chiplets, that gets you 3X at constant precision, and if you shrink from FP8 to FP4, that gets you 6X more oomph per socket.
Now here is where it gets interesting. If you also double up the number of devices to 288 per system (matching what Nvidia is doing), you can get 6,912 NeuronCores in the Trainium4 UltraServer cluster all in a single memory domain with 1,944 TB of HBM memory.
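The core count above is straightforward to check. Both inputs are the article's speculation (24 cores per socket, 288 sockets per UltraServer), not AWS specifications.

```python
# Checking the Trainium4 UltraServer core-count arithmetic above. Both
# figures are the article's speculation, not AWS-published specs.

cores_per_socket = 24          # speculated: 3x the cores of Trainium3
sockets_per_ultraserver = 288  # speculated: double the Trn3 Gen2 count

total_cores = cores_per_socket * sockets_per_ultraserver
print(f"NeuronCores per Trainium4 UltraServer: {total_cores}")  # 6912
```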
That is nowhere near the 9,216 Ironwood TPU v7p XPUs that Google can bring to bear in a single memory domain, of course. But it is 13.5X better than the Trn3 Gen1 UltraServer clusters that are being sold today.