AMD has finished its acquisition of Xilinx, which ended up costing close to $49 billion instead of the original $35 billion projected when the deal was announced in October 2020 thanks to the rise of AMD’s shares over the past year and a half.
And now, with AMD getting the greenlight from regulators and having spent all of that “money” – diluted market capitalization is not the same thing as actual cash, but you can buy stuff with it – it is natural enough to wonder what the CPU and GPU designer will do with what it has acquired. Not only the FPGA programmable logic that is at the heart of Xilinx devices, but also the hard blocks of transistors that have become common on all FPGA hybrids, things such as DSP engines, AI accelerators, memory controllers, I/O controllers, and other kinds of interconnect SerDes.
It would take AMD a very long time to build up a team of engineers with the expertise that Xilinx has garnered with programmable logic and within its aerospace, defense, telco/comms, industrial, and broadcast/media business sectors. That, combined with the Vitis software stack, is what makes Xilinx worth more, above and beyond the value of acquiring a company that has revenue and profit streams in sectors with little overlap with the core AMD business. It immediately translates into a wider total addressable market that AMD chief executive officer Lisa Su now pegs at $135 billion, which is quite a bit larger than the $79 billion addressable market Su said AMD had six months before the Xilinx deal was announced.
An increasing TAM is vital for AMD's growth (and indeed, for any semiconductor designer's), and adding the Xilinx revenue and profit streams – $3.68 billion and $929 million, respectively, in the trailing twelve months – to the AMD revenue and profit streams – $16.34 billion and $3.16 billion, respectively, in 2021 – has its own inherent value, too.
But to realize that value — and the reason Su & Company spent so much to get ahold of Xilinx in the first place — AMD will need to do a bunch of things to maximize that investment and drive revenues higher than would be possible through the combination alone, which affords some scale with foundries and the consolidation of some back office functions and physical offices.
What is not clear with AMD, and indeed any of the major chip designers in the datacenter, is just how many IP blocks they license from third parties. This could turn out to be more of a cost than many of us are aware of, and assuming that Xilinx actually creates its own memory controllers, I/O controllers, network controllers and more generic SerDes, and on-chip interconnects, then AMD may be able to save some dough by shifting over time to Xilinx IP blocks. If the Xilinx IP blocks are better than the AMD alternatives or missing from the AMD stack entirely, there are all kinds of possibilities here for improving what AMD is putting into CPU and GPU sockets and how it might create its own new IP from it.
For instance, imagine a datacenter-scale Infinity Fabric switch fabric based on Xilinx SerDes and a packet processing engine co-created by the converged AMD and Xilinx teams. Imagine something akin to the memory area network that IBM has created for its Power10 processors, but running across racks and racks and rows and rows of Epyc CPUs and Instinct GPU accelerators. Imagine not caring at all about Ethernet or InfiniBand, except as entry points into the cluster. How cool would that be?
Take a look at a Xilinx FPGA hybrid device in the “Everest” generation of the Versal family:
Those AI matrix engines for machine learning inference processing and DSP engines for various kinds of signal processing are hard blocks that used to be implemented in programmable logic – what Xilinx has been calling adaptable engines in its Versal line – but because of space, thermal, and performance issues, it was far more efficient to implement these blocks as an ASIC and use a high-speed interconnect on the chip to connect all of these blocks to each other and the programmable logic.
Every one of those hard blocks, including the Arm cores, is available to AMD’s engineers to play with as they contemplate how to architect compute engines, systems, and clusters. And every computing device AMD designs, whether it is a monolithic chip or a collection of chiplets in a package, can have a smear of programmable logic added as AMD sees fit.
So what will AMD do with Xilinx, aside from running the business largely unchanged? It has not said yet, other than to say that AMD was already licensing some Xilinx IP before the deal went down and that whatever that IP is – and don’t assume it was programmable logic – is set to appear in an AMD chip sometime before the end of next year.
Let’s look at some of the possibilities, and if you have some ideas of your own, pipe up.
First, we think that single die hybrid implementations of whole CPUs and whole FPGAs are very unlikely, but there is a chance that co-packaged CPU-FPGA hybrids could happen.
This is something that Intel was working on back in 2014 with FPGA maker Altera even before it acquired the company, and then announced as a product mixing a “Skylake” Xeon SP processor with an Arria 10 FPGA in a single package in 2018. We don’t think these took off in the datacenter, and the reason is the same as why we don’t see CPU-GPU hybrids in a single package in the datacenter except for very specific cases, such as when PC chips with integrated graphics are repurposed as media processing server engines, as both AMD and Intel have done in the past with their embedded product lines.
In its frankensocket CPU-FPGA complex, Intel put a full-blown 20-core Xeon SP Gold 6138P at 125 watts in the same package as a full-blown Arria 10 GX 1150 FPGA rated at 70 watts. They were connected by UltraPath Interconnect (UPI) links, the same ones that are used to make shared memory NUMA configurations with CPUs, which means Intel grafted UPI controllers onto the Arria 10. (It seems unlikely that this UPI controller was implemented in the programmable logic, but it is possible that the UPI protocol was implemented on top of the hard-coded SerDes geared to the timing of UPI, with programmable logic filling in the gaps.) That Arria 10 GX did not have Arm cores activated on the FPGA complex (they might have been physically there; Intel was never clear on that).
The target application for the FPGA part of this frankensocket was to run Open vSwitch virtual switching on the programmable logic, making it run more than 3X faster and allowing the Xeon CPU to host 2X as many virtual machines because it was not running Open vSwitch in software on the Xeon cores. We estimated the combined device cost $6,500, with the Xeon portion costing around $2,600 at the time. As far as we know, this idea did not take the market by storm, and the conversation has switched to offloading virtual storage, virtual networking and switching, and encryption/decryption to DPUs (a kind of glorified SmartNIC, depending on what definitions you want to use).
AMD has been thinking about this hybrid CPU-GPU computing approach with its Heterogeneous Systems Architecture for more than a decade, and it even implemented the idea in a few server parts and has obviously done it for PC and custom game console chips at high volume. To a certain extent, the Infinity Fabric interconnect is one implementation of HSA.
AMD could do integrated packages combining whole CPUs and whole FPGAs. Such a frankensocket – comprised of chiplets for CPU compute, chiplets for FPGA programmable logic, and a shared memory and I/O hub serving the two of them – is interesting, since it would provide coherently shared memory across the CPU and FPGA capacity within the socket. And with Infinity Fabric links, it could be done across sockets, too. And with Infinity Fabric switching, as we suggest, it could be done across racks and maybe even rows. Which is a powerful idea.
The problem with any of this is locking down the configuration within any socket. The ratio of CPU to FPGA programmable logic will be different by application, industry, and customer use case. And if you throw GPUs into the mix, you have many different variables to sort for and, in effect, every chip becomes a custom part for a specific customer in time. You can do that for the hyperscalers and cloud builders, because the volumes warrant it, but if AMD wants to sell this to other service providers and large enterprises it would have to pick a few SKUs and whatever it does will probably be suboptimal.
Nvidia has no use for FPGAs except maybe for simulating its own chips (and maybe not even there if it does all of its simulations and verifications on its “Selene” supercomputer), and Jensen Huang, the company’s co-founder and chief executive officer, has not been shy about saying this. But the fact that Intel bought Altera and now AMD has bought Xilinx shows at the very least that FPGAs continue to be appealing in the borderlands between programming languages running on off-the-shelf CPUs and custom ASICs for implementing certain functions or software stacks. We have always been of the opinion that a balanced system would include all three compute engines, as, for instance, a modern switch does. You need CPUs for fast serial processing and large memory footprints, GPUs for fast parallel processing and high memory bandwidth, and FPGAs for accelerating hard-coded algorithms beyond what is possible in a software implementation on, say, an X86 or Arm processor – but at a volume that does not warrant a custom ASIC, because those algorithms change too much or because you cannot pay the heat or cost premiums.
We think it is definitely interesting to have FPGA programmable logic embedded in every CPU socket and maybe even every GPU socket as a kind of scratchpad for these devices so they can have hashing algorithms, encryption algorithms, security protocols, or elements of virtual switches being done (or partially done) in FPGA instead of in logic blocks on a CPU or GPU chip, in separate chiplets added to the CPU or GPU socket, or in higher level software running on the CPU. IBM has added such scratchpads (not implemented with FPGA logic, mind you) to its System z and Power processors over the years, allowing them to implement new instructions, or create composite instructions, that were added on the fly to the architecture long after the chips taped out. This would not be a big part of the chip/socket real estate.
We definitely think that there will soon be Versal FPGA hybrids that will be delivered using Zen X86 cores, and we think that the Vitis stack will be tweaked to be able to compile code to those cores as well as to the other elements of that Versal compute complex. We think it is not likely that AMD will pull X86 or Arm cores onto its GPUs, but we do think that the company could create a line of SmartNICs and DPUs that have a mix of FPGA and X86 cores – and maybe even baby GPUs if it makes architectural sense. AMD is new to SmartNICs, but Xilinx is not, particularly after its Solarflare acquisition in April 2019.
That leaves us with one more thought in this thought experiment, and it is something that we have been encouraging compute engine makers to do since the beginning of this hybrid journey. What seems clear is that we are going to have chiplet components within a socket or across sockets with some kind of interconnect between it all. With AMD and Xilinx, it will be Infinity Fabric. Many, many generations of it, and maybe supporting the CCIX or CXL protocol on top of it, which should be possible if Infinity Fabric is indeed a superset of PCI-Express with AMD HyperTransport features woven into it. Don’t get hung up on that. There are good latency reasons for wanting to package up many things into a hybrid compute engine and make a big socket. But maybe the best answer, in the post-Moore’s Law era, is to stop wasting so much silicon on functions that are not fully used.
So, what we would like to see AMD do is this. Create a high performance Zen 4 core with all of the vector engine guts ripped out of it, and put more cores on the die or fatter, faster cores on the die. We opt for the latter because on this CPU, we want screaming serial performance. We want HBM3 memory on this thing, and we want at least 256 GB of capacity, which should be possible. And a ton of Infinity Fabric links coming off the single socket. Top it at 500 watts, we don’t care. Now, right next to that on the left of the system board we want a killer “Aldebaran” Instinct GPU, and half of an MI200 might be enough – the Instinct MI200 has two logical GPUs in a single package – or a full MI300, due next year with four Aldebaran engines, might be needed. It will depend on the customer. Put lots of HBM3 memory around the GPU, too. To the right of the CPU, we want a Versal FPGA hybrid with even more Infinity Fabric links coming off of it, the Arm cores ripped out, the DSP engines and AI engines left in, and all of the hard-block interconnect stuff also there. This is an integrated programmable logic engine that can function like a DPU when needed. Infinity Fabric lanes can come off here to create a cluster, or directly off the GPUs and CPUs, but we like the idea of implementing an Infinity Fabric switch right at the DPU.
Now, take these compute engine blocks and allow customers to configure the ratios they need on system boards, within a rack, and across rows. Maybe one customer needs four GPUs for every CPU and two DPUs for every complex with a single Infinity Fabric switch. In another scenario, maybe the GPUs are closer to the DPUs for latency reasons (think a modern supercomputer) and the CPUs hang off to the side of the GPUs. Or maybe CPUs and GPUs all spoke out from the DPU hub. Or maybe the CPUs are in a ring topology and the GPUs are in a fat tree within the rack. Make it all Infinity Fabric and make the topology changeable across Infinity Fabric switches. (Different workloads need different topologies.) Each component is highly tuned, stripped down, with no fat at all on it, with the hardware absolutely co-designed with the software. Create Infinity Fabric storage links out to persistent memory, pick your technology, and run CXL over top of it to make it easy.
There is no InfiniBand or Ethernet in this future AMD system except on head nodes into the cluster, which are just Epyc CPU-only servers.
If we were AMD, that’s what we would do.