If you want the rapid pace of innovation in datacenter networking to continue, then you had better hope that the hyperscalers and major public cloud builders don’t run out of money.
That is because it is their collective appetite for bandwidth that is paying for the network ASIC, switch, and transceiver makers to push the envelope on technology, and it is their extreme stinginess that is forcing those suppliers to push down the cost of successive generations of wares – two things that the rest of the IT sector eventually benefits from in the trickle down that is one of the core founding principles of The Next Platform.
Without those aggressive decreases in the cost of networking, the hyperscalers and cloud builders won’t buy technology, and we don’t want to think about what happens when they stop buying. That means very bad things for compute, storage, and networking because Moore’s Law never did exist in a vacuum. It works not because transistors and storage media keep getting cheaper, as if there were some force of physics driving it, but because someone buys the new technology when it gets cheaper.
No buyer, no Moore’s Law. And the funny thing about hyperscalers and their cloud building peers is that they will not pay a premium for a premium technology. They expect technologies to improve at that Moore’s Law pace and they will not upgrade to a new technology until the cost per unit of capacity falls well below that of a current technology.
That is the lesson we synthesized from a whirlwind presentation of market data and extremely well informed opinion that Andy Bechtolsheim, one of the luminaries of the networking and computing businesses, shared with the attendees of the Hot Interconnects 26 conference hosted at Intel’s headquarters in Santa Clara this week.
Bechtolsheim was, of course, one of the founders of Sun Microsystems in 1982 and the creator of the first commercially successful Unix workstation, which eventually gave him the funding to start Gigabit Ethernet switch maker Granite Systems in 1995, nearly a decade after Sun went public and kissed $6 billion in annual sales. Only a year later, router maker Cisco Systems bought Granite for $220 million as its entry into the switch market that defines it today. David Cheriton, one of the co-founders of Granite and, like Bechtolsheim, an early angel investor in search engine giant Google, teamed back up with Bechtolsheim to start Kealia, which created very high performance converged supercomputers, code-named “Constellation,” based on Opteron servers, dense SATA storage, and massive InfiniBand director switches. When Sun went up on the rocks in the early 2000s, Sun bought Kealia and Bechtolsheim found himself back at Sun. Shortly after that, Bechtolsheim and Cheriton formed the company that would become Arista Networks, which made its big splash in 2009 with 10 Gb/sec Ethernet switches based on merchant silicon from Broadcom and Fulcrum Microsystems, and a homegrown, Linux-derived Extensible Operating System. Arista is now the fastest growing datacenter switch maker and the only real rival to Cisco outside of the handful of whitebox switch makers also peddling wares based on merchant silicon, which now includes Barefoot Networks (just acquired by Intel) and Innovium.
Thanks in large part to the efforts of hyperscalers Microsoft and Google, merchant silicon makers Broadcom and Mellanox Technologies, and Arista Networks way back in 2014, the IEEE had to get with the program and start pushing 25 Gb/sec signaling harder to make 100 Gb/sec Ethernet cheap enough that the hyperscalers and cloud builders could even think about adopting it. The IEEE was content to use ten lanes of 10 Gb/sec to make 100 Gb/sec switches, but this was too hot, too fat, and too costly. So the hyperscalers and cloud builders held their ground and the IEEE blinked and adopted their standard. A few years later, we added PAM-4 modulation to communication SERDES, which are now humming at 50 Gb/sec after encoding overhead is taken out, and now we can get 100 Gb/sec per lane. And thus, we are at the beginning of the 400 Gb/sec wave with Ethernet in the datacenter.
Sales of 40 Gb/sec Ethernet switches peaked in 2016 according to data presented by Bechtolsheim, and 100 Gb/sec port shipments crossed over 40 Gb/sec shipments in late 2017 and are still climbing, with a peak on a much taller curve expected sometime in 2020 and riding out for several years. 400 Gb/sec Ethernet is starting this year, and 800 Gb/sec Ethernet is expected to begin its rise in 2021 – thanks in large part to merchant switch ASIC makers adopting advanced chip making processes and not lagging behind CPU and GPU makers as they had done for many years. The progressive shrinking of chips has allowed more functions to be added to switch ASICs, such as larger buffers and routing tables, as well as faster SERDES to push bandwidth up.
Starting next year, with the 400 Gb/sec rollout, network ASICs will be on par with CPUs in terms of process, and this makes it seem like network innovation is happening faster than on CPUs. The reality, we would say, is that networking was lagging far behind – the 40 Gb/sec half step from 10 Gb/sec to 100 Gb/sec took far too long and hurt distributed systems design to a certain degree – and is finally back on track. It just feels like great progress because networking was far behind compute for so long. Both are going to hit the Moore’s Law wall at some point in the next few years.
The shrinks with each process jump are significant. Take a switch ASIC implemented in a 28 nanometer process in 2015 or 2016, which was top of the line at the time. Relative to that baseline, moving to 16 nanometers yields 3X the transistor density, moving to 7 nanometers (which will happen in 2020 through 2021) yields 15X the density, and getting to 5 nanometers (in 2022 through 2023) yields 30X. This is the big driver jumping Ethernet from 100 Gb/sec to 400 Gb/sec to 800 Gb/sec.
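As a rough sanity check on those multiples, here is a minimal Python sketch that compares the cited density gains against idealized area scaling, where density rises with the square of the ratio of node names. That scaling rule is our assumption for illustration only, since process node names are marketing labels rather than literal feature sizes, and the function names are hypothetical.

```python
# Density multiples cited in the text, relative to a 28 nanometer baseline.
CITED_DENSITY_GAIN = {16: 3.0, 7: 15.0, 5: 30.0}


def ideal_density_gain(old_nm: float, new_nm: float) -> float:
    """Idealized gain if density scaled with the square of the node ratio.

    This is an illustrative assumption, not how foundries define nodes.
    """
    return (old_nm / new_nm) ** 2


def step_gains(cited: dict) -> list:
    """Density gain of each node over the previous one in the cited list."""
    gains = list(cited.values())
    return [later / earlier for earlier, later in zip(gains, gains[1:])]


if __name__ == "__main__":
    for node, cited in CITED_DENSITY_GAIN.items():
        ideal = ideal_density_gain(28, node)
        print(f"28nm -> {node}nm: cited {cited:.0f}X, idealized {ideal:.1f}X")
    print("step-to-step gains:", step_gains(CITED_DENSITY_GAIN))
```

Under that idealized rule, the cited figures land close to the square-law values, and the biggest single step in the list is the jump from 16 nanometers to 7 nanometers.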
Here is a chart that shows the Ethernet lane speed changes over time:
And this one maps the aggregate bandwidth of the merchant silicon, generally speaking, to the lane speed transitions.
As you can see, the current 50 Gb/sec era is going to be a short one as the industry moves quickly to 100 Gb/sec signaling on 7 nanometer devices, with aggregate bandwidth hitting 25.6 Tb/sec per ASIC in 2021 and doubling up to 51.2 Tb/sec in 2023.
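To put those aggregate numbers in terms of lanes and ports, here is a small Python sketch. It assumes that aggregate bandwidth is simply the lane count times the lane rate, with encoding overhead ignored, and that the 51.2 Tb/sec generation also runs 100 Gb/sec lanes, which the figures above do not state explicitly. The function names are hypothetical.

```python
def lanes_needed(aggregate_gbps: int, lane_gbps: int) -> int:
    """SERDES lanes required to reach the aggregate bandwidth.

    Assumes aggregate = lanes * lane rate, ignoring encoding overhead.
    """
    if aggregate_gbps % lane_gbps != 0:
        raise ValueError("aggregate bandwidth is not a whole number of lanes")
    return aggregate_gbps // lane_gbps


def ports_available(aggregate_gbps: int, port_gbps: int) -> int:
    """Full-speed ports that the aggregate bandwidth can feed."""
    return aggregate_gbps // port_gbps


if __name__ == "__main__":
    # 25.6 Tb/sec and 51.2 Tb/sec, expressed in Gb/sec to keep integer math.
    for aggregate in (25_600, 51_200):
        lanes = lanes_needed(aggregate, 100)
        ports = ports_available(aggregate, 400)
        print(f"{aggregate / 1000} Tb/sec: {lanes} lanes, {ports} ports at 400 Gb/sec")
```

On those assumptions, a 25.6 Tb/sec ASIC works out to 256 lanes at 100 Gb/sec, or 64 ports running at 400 Gb/sec.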
But here’s the problem. The design costs for chips are growing exponentially from generation to generation and the number of foundries that can etch chips in advanced processes is shrinking, so risk is rising. And hence the rise of merchant ASIC makers who can spread some of that risk around, even if they remove a lot of differentiation from the market as fewer and fewer switch makers etch their own ASICs. (Some would say unneeded differences that are not truly differentiated.)
What this means is that it is successively more difficult to reduce costs with each new Ethernet generation, but unfortunately this cost reduction is an absolute requirement for each generation to be adopted because, as Bechtolsheim points out, the hyperscalers and cloud builders are seeing 50 percent growth in their bandwidth needs per year. They can’t wait, and they can’t pay more per bit shipped over the wire. They won’t have a business if they do – or at least not the wickedly profitable ones they currently have.
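The arithmetic behind that squeeze can be sketched in a few lines of Python. The 50 percent growth rate comes from Bechtolsheim's figure above. The idea that network spending stays flat is our simplifying assumption, used only to show how fast cost per bit would have to fall, and the function names are hypothetical.

```python
import math


def years_to_double(annual_growth: float) -> float:
    """Years for bandwidth demand to double at a compound annual growth rate."""
    return math.log(2) / math.log(1 + annual_growth)


def growth_multiple(annual_growth: float, years: int) -> float:
    """Total growth in bandwidth demand after the given number of years."""
    return (1 + annual_growth) ** years


def required_cost_decline(annual_growth: float) -> float:
    """Yearly fractional drop in cost per bit that keeps spending flat.

    Assumes a flat budget, which is an illustration and not a reported figure.
    """
    return 1 - 1 / (1 + annual_growth)


if __name__ == "__main__":
    growth = 0.50  # 50 percent per year, as cited in the text
    print(f"doubling time: {years_to_double(growth):.2f} years")
    print(f"demand after 5 years: {growth_multiple(growth, 5):.2f}X")
    print(f"cost per bit must fall {required_cost_decline(growth):.1%} per year")
```

At 50 percent annual growth, demand doubles in well under two years, and a flat budget would require cost per bit to drop by about a third every year.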
“People always talk about this as if there is some magic to it, but there really is just substitution from a cost/performance and technology standpoint from the previous generation,” Bechtolsheim explained. “And the speed of adoption is largely driven by the relative price/performance. In the dark ages between 2000 and 2010, there was a 10 Gb/sec standard in 2000, but the equipment was so expensive that very few people could justify deploying it. It took almost ten years for the cost to come down before there was some adoption. In the cloud, this pace doesn’t work at all because they will never adopt a technology unless it is cheaper on Day One.”
This is exactly what happened with the transition from 40 Gb/sec to 100 Gb/sec, as we have talked about here at The Next Platform as it was happening. When proper 100 Gb/sec switches came out, the ports had 2.5X the bandwidth and the incremental cost per port was around 25 percent, so the cost per bit transferred went way down. And boom, the hyperscalers and cloud builders started moving to 100 Gb/sec networks, and they are expected to have a big appetite despite the lull in server and switch spending in the past few quarters. Take a look:
Based on projections by The 650 Group made last September, cited by Bechtolsheim in his presentation, switches using 50 Gb/sec SERDES are going to push a huge amount of bandwidth capacity into datacenters next year and for several years to come. But in 2022, the expectation is that machinery based on 100 Gb/sec SERDES will have an aggregate bandwidth that exceeds all of the bandwidth sold, regardless of SERDES speed, into the datacenter in 2020. Aggregate bandwidth shipped will be up by more than an order of magnitude between 2017 and 2022.
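The cost per bit logic behind the 40 Gb/sec to 100 Gb/sec switchover described above can be made concrete with a short Python sketch. It reads "incremental cost per port of around 25 percent" as a port that costs 1.25X as much as the one it replaces, which is our interpretation, and the function name is hypothetical.

```python
def relative_cost_per_bit(bandwidth_multiple: float, port_cost_multiple: float) -> float:
    """Cost per bit of the new generation relative to the old one.

    A result below 1.0 means the new generation is cheaper per bit.
    """
    if bandwidth_multiple <= 0:
        raise ValueError("bandwidth multiple must be positive")
    return port_cost_multiple / bandwidth_multiple


if __name__ == "__main__":
    # 100 Gb/sec versus 40 Gb/sec: 2.5X the bandwidth at roughly 1.25X the port cost.
    ratio = relative_cost_per_bit(bandwidth_multiple=2.5, port_cost_multiple=1.25)
    print(f"new cost per bit is {ratio:.0%} of the old generation")
```

On those numbers, a bit moved over a 100 Gb/sec port cost about half of what it did over a 40 Gb/sec port, which is the kind of Day One saving that Bechtolsheim says the clouds require.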
But this will only happen if the cost per bit transported keeps coming down because the few remaining fabs can keep shrinking transistors and get payback from the $10 billion or more they spend on a fab. Everything in the 400 Gb/sec generation of switches, based on those 100 Gb/sec SERDES, hinges on the hyperscalers and cloud builders, as this chart so aptly demonstrates:
The top four cloud service providers, according to this Dell’Oro estimate, will account for nearly all of the 400 Gb/sec port shipments in 2019, and about half of the port shipments from there on out. The rest of the clouds will account for the lion’s share of the other half. Large enterprises, which still largely run on 10 Gb/sec switching, barely show up in the data.
We would like to see that change, of course, and for enterprises to not fall so far behind. But they tend to lag far behind the hyperscalers and cloud builders, and the gap is getting wider. Before too long, moving to the cloud may be the only way for large enterprises to get on current technology without spending a fortune on capex. Either way, if they rip and replace their networks or move applications to the cloud, they are looking at a big disruption. The question is, which one has a higher probability of success?
One last thing: Bechtolsheim will be sitting down with us at The Next IO Platform event in San Jose on September 24 to talk about the past, present, and future of datacenter networking. It should be fast moving and fun. You can register to attend here.