Cisco Systems may still be the biggest supplier of switches and routers in general, but it has long since been surpassed by Broadcom when it comes to supplying the silicon that does the switching, and sometimes even a little bit of routing, in the datacenter in particular. (Meaning not including the campus and the edge.)
There is plenty of competition in the merchant Ethernet switch chip market, with Nvidia (Mellanox), Innovium, Intel (Barefoot Networks), Marvell, and a handful of other upstarts in there, and these rivals are getting an increasingly large share of switching, particularly among the hyperscalers and cloud builders that set the networking pace.
But it is Broadcom, not Cisco, that is the switch chip designer to beat, regardless of all of the noise that Cisco has been making about its Silicon One router and now switch chips, which are aimed at the top end of the merchant chip space and mark that company’s entry into the merchant market. Cisco could have chosen to make merchant silicon anytime since the late 2000s to blunt the attack coming from Broadcom, but it did not do so until last year. We will be diving into the Cisco Silicon One devices soon so we can get a better handle on them.
As 2020 is starting to wind down – thank heavens – Broadcom is rounding out the “Tomahawk” family aimed largely at the hyperscalers and big public cloud builders and the “Trident” family of chips aimed at enterprise switching, taking on the six new Silicon One chips that Cisco revealed back in October as well as the ASICs coming out of the rivals mentioned above.
The first thing that Broadcom has done is widen the Tomahawk 4 lineup. Its initial member, announced in December last year, is a 25.6 Tb/sec device that can drive 64 ports running at 400 Gb/sec and is now shipping in volume, Peter Del Vecchio, product manager for the Tomahawk and Trident lines at Broadcom, tells The Next Platform. This is Broadcom’s second switch ASIC etched in 7 nanometers (at Taiwan Semiconductor Manufacturing Co), the first being the Trident 4 chip that was launched in June 2019. Both of these are monolithic chips, and Broadcom is not going to switch to chiplet designs until it absolutely has to, according to Del Vecchio, because of the performance implications of trying to tie multiple chip blocks together. (Others, such as Barefoot Networks, have implemented SerDes circuits and switch engines in different blocks and assembled them into a package using high speed links between the chips. This increases complexity and package costs, but it also increases yields on the chiplets and therefore lowers costs for the aggregate chippery in the package.)
As we have pointed out before, Broadcom is trying to address the switching and sometimes routing needs of a wide variety of customers, and the company believes that it cannot cram everything into one ASIC and be done with it. It bears repeating what these distinct markets are, what their needs are, and which Broadcom chips address them. Here is the breakdown by customer set:
Enterprises don’t generally push the bandwidth limits like service providers and hyperscalers/cloud builders do, according to Del Vecchio. But they have a lot more devices and a plethora of device types on their networks, and because users move around in ways that servers don’t, there needs to be a lot more policy-based control of access and security.
Service providers, by contrast, tend to have a lot of long-haul, backbone networking and as such they need deep buffering in their switches and routers to help mask the latencies in those backbones. Service providers also tend to have more oversubscription in their networks to keep the switch count down, which controls the networking budget to a certain extent. This is what Del Vecchio calls a “hyper shared” network, and quality of service – including the hardware like large access control lists and switching or routing tables and the software to deliver it – is key.
With the hyperscalers and the cloud builders, most of the traffic is between devices within a datacenter – so-called East-West traffic – and the bandwidth requirements are very high as a means of being able to link many microservices together across on the order of 100,000 servers in a datacenter to compose the applications they run. They need low latency, but predictable latency across a wide variety of network conditions is what matters most, which is also why hyperscalers and cloud builders tend to have very overprovisioned networks, rather than oversubscribed ones that assume that at any given time most users (or application components) won’t need much bandwidth.
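To put a number on the difference between those two approaches, here is a minimal sketch of how an oversubscription ratio falls out of a leaf switch configuration. The port counts and speeds are hypothetical, picked only to illustrate the arithmetic:

```python
# Hypothetical leaf/top-of-rack switch configuration -- illustrative only.
server_ports = 48          # downlinks to servers
server_port_gbps = 50      # speed of each server-facing port
uplink_ports = 8           # uplinks toward the spine/aggregation layer
uplink_port_gbps = 100     # speed of each uplink port

downlink_bw = server_ports * server_port_gbps   # 2,400 Gb/sec toward servers
uplink_bw = uplink_ports * uplink_port_gbps     # 800 Gb/sec toward the fabric

# A 3:1 oversubscribed design bets that most servers are not bursting at the
# same time; a non-blocking or overprovisioned fabric, which hyperscalers
# favor, would size uplink_bw to be at least equal to downlink_bw.
ratio = downlink_bw / uplink_bw
print(f"Oversubscription ratio: {ratio:.0f}:1")   # prints 3:1
```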
Here is how the different Broadcom ASICs map to the different customer sets:
These are not hard boundaries, but general behaviors. For instance, the “Jericho” ASICs, which came from Broadcom’s acquisition of Dune Networks a zillion years ago, have very deep buffers but not as much aggregate bandwidth per ASIC as the Trident or Tomahawk devices. And hyperscalers like to use Jericho switches at the core and backbone of their networks, perhaps linking regions or datacenters within a region and making good use of the deep buffers, while using Trident ASICs at the edge, where programmability is more important, and Tomahawk chips in the main fabric within the datacenter.
The important thing is that all of these ASICs support Broadcom’s SDK and API stack as well as the Switch Abstraction Interface (SAI), which was created by Microsoft and picked up by the open source community. SAI provides a layer of API virtualization across switch ASICs made by different vendors so that network operating systems can run on any of them. (For many use cases within Microsoft’s Azure cloud, its SONiC network operating system, which is also open sourced, runs atop SAI; in other cases, Microsoft uses a different NOS.)
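To make that layering a little more concrete, here is a toy sketch of the idea behind SAI: the NOS programs the switch through one vendor-neutral interface, and each ASIC vendor supplies a backend that maps those calls onto its own SDK. To be clear, the real SAI is a C header specification, and the class and method names below are invented purely for illustration; they are not the actual SAI or SONiC APIs.

```python
from abc import ABC, abstractmethod

class SwitchAbstraction(ABC):
    """Stand-in for a vendor-neutral switch API such as SAI (illustrative only)."""

    @abstractmethod
    def create_port(self, lane_list, speed_gbps): ...

    @abstractmethod
    def add_route(self, prefix, next_hop_port): ...

class HypotheticalVendorBackend(SwitchAbstraction):
    """Made-up backend that would translate the neutral calls into a vendor SDK."""
    def create_port(self, lane_list, speed_gbps):
        print(f"vendor SDK: map lanes {lane_list} to a {speed_gbps} Gb/sec port")
        return hash(tuple(lane_list)) & 0xFFFF   # fake port handle
    def add_route(self, prefix, next_hop_port):
        print(f"vendor SDK: program {prefix} -> port {next_hop_port}")

# A NOS only talks to the abstraction, so the same code can run on any ASIC
# that has a backend implementing the interface.
def bring_up(switch):
    port = switch.create_port(lane_list=[0, 1, 2, 3], speed_gbps=400)
    switch.add_route("10.0.0.0/24", port)

bring_up(HypotheticalVendorBackend())
```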
As best as we can figure, there are more than a dozen network operating systems in use at any appreciable volume, available from the switch makers or the open source community or developed in-house by a hyperscaler or cloud builder and kept proprietary. In the past couple of years, there has been a lot of development with these NOSes, but it remains to be seen if the market will accept so many choices in the long run. Servers surely didn’t. When we started out in this racket there were probably two dozen server architectures and three or four dozen operating systems for them in the corporate datacenters of the world, and now we are basically down to X86 architectures running Linux or Windows Server for the vast majority of the compute installed.
We don’t think the hyperscalers and cloud builders will stop building their own NOSes, any more than they stopped creating their own Linux distros, because both are critical to the performance and security of their vast platforms. But we do think portable operating systems matter, and there is a good chance that ArcOS from Arrcus will emerge as a cross-platform NOS that the market can stand behind, and we also think that Nvidia will do all in its power to extend the work that Cumulus Networks has done with its eponymous platform and mash it up with the several platforms that Mellanox has created. It remains to be seen what Arista will cook up with the combination of its EOS and the SDN stack from Big Switch Networks, and Cisco is the default in a lot of datacenters with IOS and NX-OS. The software that is most portable, offers the most performance, and covers the most scenarios will win. Just like Broadcom has won in the merchant silicon market by offering ASICs tailored precisely to specific use cases.
That background provides the context for what Broadcom is announcing now. As we said above, the Tomahawk 4 ASIC, which was announced nearly a year ago, started sampling in early 2020 and is now shipping in volume less than a year after being announced. That is pretty fast for a switch ASIC. The Tomahawk 4 ASIC, to refresh your memory, uses the same “Blackhawk” SerDes that were deployed on the Tomahawk 3, which ran at 25.8 GHz but, thanks to PAM-4 modulation (which encodes two bits per signal), provided an effective 50 Gb/sec of bandwidth per SerDes lane. The big bad Tomahawk 4-50G chip has 512 of these Blackhawk SerDes etched around its edges, for a combined switching bandwidth of 25.6 Tb/sec. The densest switch configuration supported on this device is 64 ports running at 400 Gb/sec.
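As a quick sanity check, the aggregate bandwidth and the port count fall straight out of the lane count and lane rate. This is back-of-the-envelope arithmetic using the rounded 50 Gb/sec per lane figure, ignoring encoding overheads:

```python
# Tomahawk 4-50G back-of-the-envelope math (rounded marketing figures).
serdes_lanes = 512
lane_gbps = 50            # PAM-4 (2 bits per symbol) yields an effective 50 Gb/sec per lane

aggregate_tbps = serdes_lanes * lane_gbps / 1000
print(aggregate_tbps)     # 25.6 Tb/sec of aggregate switching bandwidth

port_gbps = 400
lanes_per_port = port_gbps // lane_gbps        # 8 lanes per 400 Gb/sec port
ports = serdes_lanes // lanes_per_port
print(ports)              # 64 ports running at 400 Gb/sec
```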
But now, the Tomahawk 4 line is being extended with two new members:
The first is the Tomahawk 4-100G, which implements the same Blackhawk SerDes but cranks their clocks up to 51.6 GHz and uses the same PAM-4 modulation to drive 100 Gb/sec per lane. This new chip has only 256 of these SerDes on the die, however, due to thermal issues, and therefore stays at the same 25.6 Tb/sec aggregate switching bandwidth. And because power climbs sharply with clock speed, even with half the SerDes of the Tomahawk 4-50G, the Tomahawk 4-100G runs at a slightly higher wattage of around 400 watts, compared to around 350 watts.
The second new chip in the Tomahawk 4 line is the Tomahawk 4-12.8T, which has half the number of SerDes (128) running at 100 Gb/sec per lane with PAM-4 modulation, and it burns 200 watts. The power did not drop all the way to 175 watts, or half of the Tomahawk 4-50G’s draw, because the SerDes represent only about a third of the power on a given switch ASIC. The packet processing engines consume a lot of juice, even as you shrink them.
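That power argument can be sketched with some rough arithmetic. The one-third SerDes share comes from Del Vecchio’s comment above; the bracketing model below is ours, not Broadcom data:

```python
# Illustrative arithmetic only: why halving the bandwidth does not halve the power.
full_chip_watts = 350      # Tomahawk 4-50G at 25.6 Tb/sec
serdes_share = 1 / 3       # rough fraction of ASIC power burned in the SerDes

# Lower bound: pretend everything scales linearly with bandwidth (it does not).
naive_half = full_chip_watts / 2                                   # 175 W

# Upper bound: only the SerDes power halves and the rest of the chip stays put.
only_serdes_scale = (full_chip_watts * serdes_share) / 2 \
                    + full_chip_watts * (1 - serdes_share)         # ~292 W

print(round(naive_half), round(only_serdes_scale))                 # 175 292
# The shipping Tomahawk 4-12.8T lands between those bounds, at about 200 W,
# because the packet processing engines do shrink, just not linearly.
```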
Both of these new Tomahawk 4 switch ASICs are etched in 7 nanometer processes from TSMC, as you might expect. Transistor counts and die sizes were not given, and it would be a rare day indeed if we saw a block diagram or a die shot of a switch ASIC. (It happens, but not very often.) These new Tomahawk 4 ASICs have been sampling for around a quarter and will be shipping in volume next year, which means another fast ramp.
With 100 Gb/sec per lane in the two new Tomahawk 4 chips, it obviously takes half as many lanes to build up a port running at any given speed, and there is value in that. It all comes down to what optics the hyperscaler and cloud builder customers want to deploy. The 100 Gb/sec PAM-4 optics are more power efficient, so there is a net gain in power efficiency in moving to the faster switch ASIC, which is slightly hotter per port and has a lower switch radix, because the optics power draw is a lot lower.
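The lane math, and a rough sense of the system-level trade-off, looks like this. The ASIC wattages are the figures quoted above, but the per-port optics power numbers are placeholders we invented purely to show the shape of the argument, not measured or vendor-supplied values:

```python
# Lanes needed per 400 Gb/sec port at the two lane rates.
print(400 // 50)    # 8 lanes per port with 50 Gb/sec SerDes
print(400 // 100)   # 4 lanes per port with 100 Gb/sec SerDes

# Hypothetical system power for a 64-port 400 Gb/sec switch.
ports = 64
asic_50g_watts, optics_50g_per_port = 350, 12    # 8 x 50G PAM-4 optics (made-up wattage)
asic_100g_watts, optics_100g_per_port = 400, 9   # 4 x 100G PAM-4 optics (made-up wattage)

system_50g = asic_50g_watts + ports * optics_50g_per_port     # 1,118 W
system_100g = asic_100g_watts + ports * optics_100g_per_port  # 976 W
print(system_50g, system_100g)  # the hotter ASIC can still win at the system level
```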
Because Cisco is the new kid in the merchant silicon market, and perhaps the biggest threat that Broadcom faces in datacenter switching and routing, the company appeared to pick on Silicon One in the presentation it put out concurrent with the launch of the new Tomahawk 4 ASICs. The example below illustrates how many 12.8 Tb/sec ASICs it would take to create a switch with 25.6 Tb/sec of aggregate switching bandwidth that can be carved up into ports, compared to just running a single 25.6 Tb/sec device:
We think this comparison is targeting Silicon One, but the lesson applies conceptually to other 12.8 Tb/sec chips, including the Tomahawk 3 chip launched in January 2018. Even if the new generation of ASICs costs a lot more per chip, you need six times as many chips, and many more extra hops across them, to create 25.6 Tb/sec of aggregate bandwidth with the Tomahawk 3. Another way of saying that is that the Tomahawk 4 could cost six times as much as the Tomahawk 3 and still have space, thermal, resiliency, and huge latency advantages when delivering the same raw bandwidth carved up into ports.
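The six chip figure is not arbitrary; it is roughly what a non-blocking two-tier topology built from 12.8 Tb/sec parts works out to. Here is a generic Clos sketch of that arithmetic, which is our reconstruction rather than Broadcom’s exact diagram:

```python
import math

def clos_chip_count(target_tbps, chip_tbps):
    """Chips needed for a non-blocking two-tier fabric exposing target_tbps of front-panel bandwidth."""
    if chip_tbps >= target_tbps:
        return 1                       # a single ASIC covers it, no internal fabric needed
    # Each leaf ASIC splits its bandwidth: half to front-panel ports, half to uplinks.
    leaf_front_panel = chip_tbps / 2
    leaves = math.ceil(target_tbps / leaf_front_panel)
    # The spine layer must absorb all of the leaf uplink bandwidth.
    spines = math.ceil((leaves * chip_tbps / 2) / chip_tbps)
    return leaves + spines

print(clos_chip_count(25.6, 25.6))   # 1 chip with a Tomahawk 4
print(clos_chip_count(25.6, 12.8))   # 6 chips (4 leaves + 2 spines) with 12.8 Tb/sec ASICs
```

Every packet that crosses between leaf chips also takes an extra hop through a spine chip, which is where the latency, power, and resiliency arguments come from.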
Broadcom is also picking on the sliced architecture in a competitive merchant chip, which leads us again to believe it is talking about Cisco’s Silicon One:
While Broadcom is not against adding programmability to its switch ASICs, this chart seems to take aim at the performance differences between the Tomahawk 4’s pipelined packet processing and the network processing unit (NPU) that is part of the Silicon One package that uses the P4 programming language to change its pipeline:
The comparisons tell you what Broadcom is thinking about, and who it is worried about.
Taking A Page Out Of The InfiniBand Playbook
If we have SmartNICs to offload network processing from servers, perhaps we need SmartTORs as well, which pull some of those functions off the SmartNICs or servers and consolidate them where they belong: on the switch. Broadcom clearly believes this might be the future of networking, and it is testing the idea with a variant of the Trident 4 switch ASIC called the Trident SmartTOR, the latter abbreviation obviously standing for top-of-rack switch.
This network onload idea is not new. For decades, Mellanox (now part of Nvidia) has been offloading network jobs from servers onto its ConnectX network interface cards, and in the past several generations of its InfiniBand and then Ethernet switch ASICs, it has been consolidating certain functions, such as the acceleration of collective operations, onto the switch itself, where they naturally belong.
This SmartTOR variant of the Trident 4 chip has a mere 8 Tb/sec of aggregate switching bandwidth, with 160 SerDes that run at 25.6 GHz and deliver 50 Gb/sec per lane thanks to PAM-4 modulation. (That is the same speed and modulation as the other Trident 4 chips and the original Tomahawk 4 chip.) The Trident SmartTOR is also etched in 7 nanometer processes from TSMC, as you might expect. (Broadcom is already at 5 nanometers with a next generation of devices, which it has talked about conceptually in the past week without naming the exact ASICs it plans to create at 5 nanometers for datacenter, cloud, and 5G use cases.) The SmartTOR device has a programmable pipeline for Layer 2 through Layer 7 services on the network, and it offers “massive scale,” as Del Vecchio put it: 3 million flows, 3 million ACL entries, 1 million tunnels, and 1 million counters. The chip also does MACsec and IPsec encryption of data at line rate.
Here is how the scale maps to the prior generation of Trident 3-X5 chips:
The use case for the Trident SmartTOR is interesting. Del Vecchio says that enterprise customers are interested in deploying their applications on bare metal, rather than virtualized, cloud infrastructure, and in that case, a lot of the virtual networking that would be done by server virtualization hypervisors or SmartNICs (more rarely) needs to be done centrally in some fashion. Putting this into the switch makes sense.
This is especially true if you look at running these network services on X86 iron or even FPGAs, which often happens in network appliances scattered around the datacenter:
The question is this, and we cannot answer it yet: What will consolidating all of these functions back onto the switch cost versus running them on X86 servers or on FPGA appliances? We won’t know until switch makers create devices using the Trident SmartTOR, which will probably be about a year from now, or less, if history is any guide.