A system is more than its central processor, and perhaps at no time in history has this been more true than right now – except, perhaps, in the decade or more ahead, until CMOS technologies finally reach their limits. Looking ahead, all computing will be hybrid, using a mix of CPUs, GPUs, FPGAs, and other kinds of ASICs that run or accelerate certain functions in applications.
Obviously, connecting these disparate, distinct, and discrete compute components requires some kind of data interconnect – a bus, in the old lingo, though sometimes called a link or an interconnect – so they can share data and, in many cases, offload work from the CPUs to the accelerators and consolidate answers back onto the CPUs when they are done.
IBM started opening up the bus on its Power8 and Power9 chips with the CAPI protocol riding atop PCI-Express, and then created its own “Bluelink” SerDes for running the NVLink and OpenCAPI protocols to link to Nvidia Tesla GPU accelerators and other types of accelerators or even flash storage, respectively. Nvidia added NVLink to gang up its own GPUs into a shared memory cluster of sorts, and Xilinx with a bunch of friends (notably AMD and the Arm collective) put forth CCIX, another protocol with memory-style operations to glue accelerators to processors and, in the case of several Arm server chips, glue CPUs to CPUs in NUMA fashion within a chassis. The Gen-Z memory-centric fabric is more about linking multiple nodes across racks, rows, and datacenters, but there is certainly some overlap here as well in the way it can be implemented to link elements within a single system.
Over the past several years, as these protocols all emerged and were specified by their promoters, we have been wondering if the Bus Wars from days gone by would repeat themselves. There were several of those wars back in the late 1980s and early 1990s, when each system maker controlled its own system bus and several alternatives for linking compute complexes to peripherals – including the ISA, MCA, EISA, VLB, and PCI buses as well as InfiniBand (originally intended as a switched bus fabric), PCI-X, and PCI-Express – emerged from the carnage.
There was a lot of fighting back in those days, and Jim Pappas, director of technology initiatives at Intel, remembers them all because he was in the trenches, fighting. If there had already been a universal interconnect for linking two CPUs together, none of this would have been necessary, but each processor has its own NUMA interconnect scheme and there is no changing that now, although we will point out that there is a universe where CXL could have, technically speaking, become a standard that vendors all implemented in future chips for both CPU NUMA and accelerator interconnects.
This seems unlikely to happen, but as we have pointed out in detailing the Compute Express Link, or CXL, protocol that Intel has put together to link processors to accelerators, there is definitely a rapidly evolving consensus with regard to processor-to-accelerator links. We talked about this quite a bit during The Next I/O Platform event we held last September in San Jose, when all of the key people behind these protocols were on stage with us and, in fact, had just that week formed an independent consortium and expanded its board of directors to include not only chairman Pappas, but Barry McAuliffe of Hewlett Packard Enterprise as president and Kurtis Bowman of Dell as secretary. Other notable board members include Nathan Kalyanasundharam of AMD, Steve Fields of IBM, and Gaurav Singh of Xilinx, who head up their respective Infinity Fabric, OpenCAPI, and CCIX initiatives; Dong Wei of Arm Holdings is also present among the chip designers, as is Alex Umansky at Huawei Technology. Facebook, Alibaba, Microsoft, and Google are also present, and what we have heard through the grapevine is that these hyperscalers and cloud builders have been leaning on Intel pretty heavily to provide something akin to CCIX and OpenCAPI and to open it up so the entire industry would get behind it – and relatively quickly at that.
Now, Pappas tells The Next Platform, a total of 96 companies are members, and this includes some pretty important additions such as Nvidia, Cisco Systems, Fujitsu, Inspur, Lenovo, Marvell, Supermicro, Wistron, Jabil, H3C, and Broadcom. Those are key OEMs and ODMs plus compute engine makers Marvell, Nvidia, and Fujitsu.
“People were really expecting a reprise of the Bus Wars, and they were not expecting singing around the campfire,” says Pappas. “But this has come together very well, and we don’t need all of these other initiatives to fail for CXL to succeed. This is about getting the ecosystem together to make CXL grow.”
The CXL 1.1 specification has been available since July last year, and it is for directly attached devices running over the PCI-Express 5.0 bus, which is not expected to come out in processors until either late this year or early next year. The PCI-Express 5.0 protocol was only finalized in early 2019, and the PCI-Express 6.0 specification is moving through its subreleases toward a 1.0 release and ratification, maybe in early 2021. While CXL is on an annual cadence, more or less, it seems likely that it will eventually slide into phase with the PCI-Express roadmap, which itself is trying to get onto a reasonable and steady cadence. At seven years in the field, PCI-Express 3.0 was the top bus speed for far too long, and now that systems are going hybrid, the PCI-Express bus and all of these protocols really matter for performance. In any event, both the target and host sides of the CXL 1.1 interface have been published and companies are building to that specification now, according to Pappas. The 2.0 specification will come out in the second quarter of 2020, and the expectation is to have a 3.0 specification in discussion soon by all the new consortium members.
The important thing as far as industry adoption and innovation are concerned is that CXL rides on top of PCI-Express and that the PCI-Express roadmap is back on a proper iterative cycle after the long delay in bringing PCI-Express 4.0 into the field. Being based on PCI-Express means that system makers will have more flexibility, so long as processor makers keep goosing the PCI-Express controllers they embed on their chips or, more likely, put into the I/O hubs of the multichip modules that will comprise the processor socket of the future. There are coherent and non-coherent ways to use CXL as well, and this also provides flexibility because sometimes cache coherency is overkill for the job. That’s why Intel intentionally created an asymmetric coherent protocol for CXL, but does not require that it be used.
“Some customers won’t need any kind of coherent interface,” explains Pappas. “Maybe they are building cold storage devices and all that they want is as many PCI-Express lanes as they can get to attach as many SSDs as they can.”
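That choice between coherent and non-coherent operation maps onto the three sub-protocols the CXL 1.1 specification multiplexes over a single PCI-Express 5.0 link: CXL.io (mandatory, PCIe-style I/O), CXL.cache (a device coherently caching host memory), and CXL.mem (the host addressing device-attached memory). The sketch below models which mix a device might negotiate; the function name and device-type strings are illustrative only, not any real CXL API.

```python
# Illustrative sketch, not a real CXL API: which CXL 1.1 sub-protocols
# a device negotiates depends on what it needs from the link.
#   CXL.io    - mandatory; PCIe-style discovery, configuration, and DMA
#   CXL.cache - lets an accelerator coherently cache host memory
#   CXL.mem   - lets the host address memory attached to the device

def negotiated_protocols(device_type: str) -> set:
    """Return the hypothetical sub-protocol mix for a device type."""
    protocols = {"CXL.io"}  # every CXL device carries CXL.io
    if device_type == "caching_accelerator":        # "Type 1" device
        protocols.add("CXL.cache")
    elif device_type == "accelerator_with_memory":  # "Type 2" device
        protocols |= {"CXL.cache", "CXL.mem"}
    elif device_type == "memory_expander":          # "Type 3" device
        protocols.add("CXL.mem")
    # A plain shelf of SSDs (Pappas's cold-storage example) negotiates
    # nothing beyond CXL.io -- coherency would be overkill there.
    return protocols
```

This mirrors Pappas's point: the cold-storage customer simply stays at the CXL.io layer and gets its PCI-Express lanes, while accelerators and memory expanders opt into coherency as needed.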
The interesting bit to watch is how CXL and Gen-Z, which is a very different kind of interconnect beast, will interplay in system designs. The way that Pappas sees it, a CXL port, which will support a kind of memory semantics as do OpenCAPI and CCIX, potentially gives Gen-Z fabrics a universal mounting point inside of systems.
I think Intel would have naturally had a CXL five years ago or more except they were so afraid of NVIDIA. The result was Cray trying to figure out how to effectively lash together huge numbers of single processor Xeon Phi nodes. Now Intel are making a GPU so they have no choice but to release an accelerator bus. But maybe they wouldn’t have opened it up to industry standard five years ago, so maybe it all turned out OK.
Bus wars: Also see VME, VME64, Sun’s MBus and SBus, SGI’s GIO and XIO, Motorola’s NuBus, HP’s SGC, DEC’s Futurebus, and SCI. Lots of innovation in those days, but all proprietary with vendor lock-in.
No question, I had the PC Server on my mind when I was making my list. But clearly, every machine had its own bus, and nothing could plug into anything else, and small wonder that this business used to be so profitable.
As open software opens hardware to operate across open system interfaces, Intel will have no choice but to open its own x86 CPU bus to platform co-development.
In terms of the future: concede or be left out.
In terms of the past, secondary market capital investment in Intel processor reuse will necessitate Intel opening its own CPU bus to independent development. This corrects the FTC v. Intel Docket 9288 error of regulating the closure of Intel’s CPU bus to Intel alone while disregarding competitive and complementary participation in standards contribution.
Opening the Intel x86 CPU bus, whether for Xeon or Core products, is now a worldwide industrial and financial necessity. It preserves open market investment in Intel processors by supporting new evolutionary system logic and board development by independent design producers. It puts back to use back-in-time Intel processors that are not obsolete for many ‘utility’ applications, whether industrial or consumer.
Keeps the world at work and capitalism humming along.
Old CPUs can be salvaged with new, independently designed – that is, evolutionary – system logic (northbridge/control hub) for new board production. World market compute needs are very different from the leading edge of platform innovation, whether commercial or consumer.
The secondary x86 processor market, on volume and value, is today the primary market of exchange financing new procurements. It drives innovation for economic renewal across traditional design manufacturers and design producers, including Intel.
The back nine years of Core and Xeon CPU surplus alone – not including their remaining system enclosures (multiply by 5X) – held as accumulated capital value by secondary ‘open market channels’ is valued by this analyst in excess of $3.7 trillion at Intel 1K retail, in excess of $1.2 trillion at Intel wholesale, and in excess of $300 billion at CPU fire sale prices. These are conservative estimates based on Xeon v2/v3/v4 volume totaling 1.5 billion units of supply, which represents a 15-year supply at today’s Scalable volume. Core easily adds another 2.7 billion units over the back nine years of production volume.
Whether on surplus CPU value or on 5X (some might choose 3X) system surplus value, it’s easy to see why secondary market holdings, on their capital value, are the primary financial enabler of new production on new procurements. Selling one results in the purchase of the other.
Intel will open the x86 CPU bus for back-in-time processor reuse, because to refrain would send the world technical economy into financial chaos by eliminating the primary source of funding for today’s innovations: the secondary market’s primary procurements.
Mike Bruzzone, Camp Marketing
Will be interesting to watch interconnect over coming decades.
We have an entrenched model of what a server is, evolved and standardized from the IBM PC of 35 years ago, with a feedback loop which has hyper-optimized the components from which that model is deployed. CXL is an evolution of PCI in support of that model, normalizing and standardizing the asymmetric coherency of CCIX and CAPI, and providing an attach point to the CPU’s memory pipeline.
Various attempts to significantly cost-reduce (and back down on some diseconomies of scale in) the best practice server model have failed over the past 20 years. Generally these eliminate the I/O system of the best practice server, replacing it with a fabric connection through which other nodes, “real” networking, and storage are reached. This concept is not new: see IBM mainframe ESCON channels. But applying it to improve the price/performance of a large pool of servers never got market acceptance – not as the original intent of what became InfiniBand, not as an evolution of Xeon Phi for HPC around Omni-Path (which never materialized), and not as microservers like AMD SeaMicro and HP/HPE Moonshot.
The CPU-centric model, which is a basic design assumption of PCI inherited by CXL, technically imposes the complexity of each CPU being the center of an autonomous world, and economically imposes a minimum node size to amortize the costs of the I/O system and other devices necessary to make that world autonomous.
I think it’s awesome that Intel is opening up its memory pipeline to outside devices via CXL. The gap is now one of symmetric vs asymmetric access, that is, the CPU centric model vs the network centric model.
The network-centric model (which Sun correctly articulated the vision for 40 years ago) allows a node to be as small as a DIMM, or more practically a single low-end CPU package with a few stacks of HBM and vertical flash mounted on that package. The CPU is a peripheral of the network. If Intel yielded to this model they’d be yielding most of the architectural control at the core of their business success in servers over the last 30 years.
I’m retired now (minus helping a particular startup); touched on the bus wars and had tiny impact on some of the survivors but had no major hand in any of the major winners of the past 30 years (yet); think that for hyperscale a network-centric model of much smaller nodes would carry a much lower CPU tax and be noticeably more cost effective than the local optimum which is today’s best practice; and think that enabling that network centric model requires not just what was done for the microservers which failed in the market because they did not fit in Cisco’s and Intel’s world of a decade ago, but also a rethreading of low level networking and addressing to connect user space software endpoints together regardless of location (ie a real solution to container networking…but containers and VMs aren’t needed in a world of bare metal microservers).
Steve Chalmers predicts a move away from CPU-Dominated processing and yearns for the open Internet-Like user-spaces of network-wide computing. Proposals for that sort of networking were once two-a-penny but all remained embryonic because the chunky-CPU model was too successful; it kept getting faster. Perhaps it IS time to put it to bed, but how to now replicate its vast social universe?
A global system-replacement would demand:
$ A new software philosophy – but not as drop-dead as universally assumed;
$ PC-Efficient local (node) inter-processing – much more than just a block of PCs;
$ Massively-Parallel interconnects out of chips, modules, racks and warehouses – hierarchical software helps a lot there;
$ Internet-Efficient global processing – well, at least we’ve got that, though NOT as a uniform bottom-to-top-ology;
$ No worse aggregated power-dissipation – now there’s a thing, though the Great-and-Mighty reckon they can suffer the power-consumption itself;
$ Maybe hardest of all, how to seamlessly software-scale from local computational resource through to the whole global society (upwards and downwards), while recognising secured segments like a corporation or the US mainland;
$ Finally, how do I stop my minding YOUR business? This may be the opportunity to hardware-solve that thorny, data-burglary disaster. (Obviously, I really mean: how do I stop your minding MY own business, but I’m sure we all have this shared interest)
Intel hinted at ‘Exascale MPA FPGAs’, and that started out along the right lines, but there was no systems-level network topology to go with it; they may have their own broader view by now. Steve should be right – it makes sense – global networking IS the Next Platform. Is anyone out there heads-up on all this, or only heads-down on the interconnect spaghetti?
This all begs for a Systems view. Sure, good interconnects help though . . .