It has been a long time since the dominant architecture in the datacenter was monolithic systems running siloed applications fed by a three-tier network. These days, any organization operating at scale has hundreds to thousands of servers, and what used to be an application running on one machine is now distributed across a portion of that iron, chatting back and forth continuously. Networks have gotten flatter and faster because of this shift, and the craving for higher bandwidth and lower latency on network fabrics is accelerating, not abating.
The ramp for 100 Gb/sec on Ethernet and InfiniBand alike is just starting in earnest in the datacenter, and all eyes are already looking ahead to 200 Gb/sec speeds. This is ironic given that plenty of enterprises can get by with 10 Gb/sec Ethernet for their workloads, and it will take years, as The Next Platform has discussed before, for them to move up from slower networks. Or, at least, they and the analysts who count ports think so. (Look at how long Gigabit Ethernet has persisted for workloads where bandwidth is just not an issue.) But we are always interested in the applications, and the systems that support them, where bandwidth and latency are the issue, because this is where things get interesting.
Transitions from one network speed to another can be very disruptive and costly. The move from 1 Gb/sec to 10 Gb/sec was exactly that when it started out on routers in 2002, and it trickled down to aggregation and top of rack switches in the following years as switch ASIC, adapter card, and transceiver costs all fell.
Switch maker Mellanox Technologies has sold ConnectX server adapter cards and LinkX network cables as part of its stack, and in July 2013 the company invested nearly $130 million to buy IPtronics and Kotura, two companies with expertise in silicon photonics and optical transceivers that are employed in the QSFP modules that are used with current 40 Gb/sec and 100 Gb/sec optical fiber links for networks.
Arlon Martin, senior director of marketing for the silicon photonics unit at Mellanox, tells The Next Platform that the company has figured out how to tweak the photonics and transceivers used for 100 Gb/sec QSFP links so they can have their lanes run at twice the speed and therefore be able to support 200 Gb/sec speeds.
This speed bump on the transceivers is important for a number of reasons, but the main thing is to maintain a balance between the servers and the switches, says Martin. “Datacenters have typically run 10 Gb/sec down to the servers and 40 Gb/sec between the switches,” he explains. “Where we are headed is keeping the same symmetric architecture, but having 25 Gb/sec at the server and 100 Gb/sec between the switches, and we will have 200 Gb/sec switches that mesh very nicely with 50 Gb/sec on the servers.” (In these scenarios, the uplinks out of the top of rack switches run at the higher speeds, of course. And you can use copper cabling at 200 Gb/sec for in-rack linking of up to 3 meters, just as was the case with 40 Gb/sec and 100 Gb/sec Ethernet.)
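The balance Martin describes can be checked with a little arithmetic. The sketch below is illustrative only: it takes the server and switch speeds quoted above, and it assumes a four-lane QSFP module (the common layout, though the lane count is not stated here). The function and constant names are hypothetical.

```python
# (server port, switch port) speeds in Gb/sec for each generation quoted above.
GENERATIONS = [(10, 40), (25, 100), (50, 200)]

# Assumption: a QSFP module carries four lanes; the article does not state this.
QSFP_LANES = 4


def describe(server_gbps, switch_gbps, lanes=QSFP_LANES):
    """Return the switch-to-server speed ratio and the implied per-lane speed."""
    return {
        "ratio": switch_gbps / server_gbps,
        "per_lane_gbps": switch_gbps / lanes,
    }


for server_gbps, switch_gbps in GENERATIONS:
    info = describe(server_gbps, switch_gbps)
    print(
        f"{server_gbps} Gb/sec server, {switch_gbps} Gb/sec switch: "
        f"ratio {info['ratio']:.0f}:1, {info['per_lane_gbps']:.0f} Gb/sec per lane"
    )
```

Under that four-lane assumption, every generation keeps the same 4:1 ratio, and doubling the lane speed from 25 Gb/sec to 50 Gb/sec is what takes a 100 Gb/sec port to 200 Gb/sec.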
Some customers, says Martin, may wait it out and skip directly from 40 Gb/sec to 200 Gb/sec on their network fabrics – if they can wait and the economics make sense. But those that use fat tree network topologies (as many HPC centers do) or Clos networks (as all of the hyperscalers and some HPC centers do) will have to scale their networks along with their compute or they will create bottlenecks. So there will be no skipping the 100 Gb/sec stop if they are upgrading their systems.
Fiber Optics A La Mode
The QSFP transceivers that Mellanox and others make, which are used on both sides of a network link on a switch fabric, are connected by single mode fiber optic cables, which is the new standard in the datacenter as far as the hyperscalers are concerned.
Microsoft has already cabled its datacenters with single mode fiber – the kind used for long haul telephone lines and cable services – and Facebook is shifting from multi mode fiber to single mode fiber in its datacenters concurrent with its shift from 40 Gb/sec to 100 Gb/sec network fabrics. Facebook plans to have all of its datacenters shifted over by January 2017, which is a herculean task but one that will allow this “fiber plant,” as the cabling is called, to be used for the next 20 years and depreciated over that long time horizon. We presume that Amazon Web Services and Google are on the front end of the single mode fiber movement, too, and are itching to ramp from 40 Gb/sec Ethernet network fabrics up to 200 Gb/sec, 400 Gb/sec, and higher as soon as possible. (Google Fellow Amin Vahdat, who was showing off the search engine giant’s homemade switching gear last summer, explained separately that a datacenter with 100 Gb/sec ports coming out of 50,000 servers would need a 5 Pb/sec – yes, that’s five petabits per second – network.)
Facebook, Microsoft, and others have been working with the industry to come up with a cheaper version of single mode fiber that is not rated for the 10 kilometer distances required by the telecom and cable industries. It is designed with less aggressive packaging and for shorter distances of between 500 meters and 2 kilometers, which are the spans needed for linking datacenter rooms together and datacenter campuses together, respectively.
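The 5 Pb/sec figure is simple multiplication, and a short sketch makes the scale easy to replay for other port speeds. It assumes decimal units (1 Pb = 1,000,000 Gb) and every server port running at line rate; the function name is hypothetical.

```python
GBITS_PER_PBIT = 1_000_000  # decimal units: 1 Pb = 10^6 Gb


def aggregate_bandwidth_pbps(servers, port_gbps):
    """Total bandwidth in Pb/sec if every server port runs at line rate."""
    return servers * port_gbps / GBITS_PER_PBIT


# The scenario quoted above: 50,000 servers, each with a 100 Gb/sec port.
print(aggregate_bandwidth_pbps(50_000, 100))  # 5.0
```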
According to Martin, some in the industry have been trying to push the CFP2, CFP4, and CFP8 transceivers that are commonly used in routers as the way to get to 200 Gb/sec and even 400 Gb/sec speeds on Ethernet switches in the future. But these transceivers are too expensive, at around $5,000 a pop, and also too hot and too big, which would radically cut down on switch port density. So that is why Mellanox has been working to double pump the speed on the QSFP transceivers, which it is demonstrating at the Optical Fiber Communication conference in Anaheim, California this week.
Moving from 100 Gb/sec to 200 Gb/sec speeds is natural. We know that Mellanox hopes to deliver 200 Gb/sec High Data Rate (HDR) InfiniBand towards the end of 2017, concurrent with the launch of the Power9 processors from IBM and the “Volta” Tesla GPU accelerators from Nvidia that will be used in the “Summit” and “Sierra” supercomputers being built by the US Department of Energy. As was the case with 100 Gb/sec switches and adapters, Mellanox expects InfiniBand to get the higher 200 Gb/sec speeds first, but Martin says that the lag between InfiniBand and Ethernet should be smaller now that Mellanox has a few generations of Ethernet development under its belt. The optical cables and transceivers that Mellanox is developing for the 200 Gb/sec speeds and QSFP ports can be used for either InfiniBand or Ethernet networks.
Perhaps the main thing is getting the cost of the fiber down by switching to single mode (allowing it to be amortized over more switch generations and more time) and getting the cost of transceivers down, too. Those optical transceivers designed for routers and using CFP-style ports are ten times as expensive as what Martin hints Mellanox can do using its transceivers for QSFP ports – and at volume, the price can come down even further. This is important for a number of reasons, not the least of which is that cabling can be more expensive per port than the switch itself, so lowering that price helps lower the overall cost of networking. Moreover, unlike the fiber plant, which has a 20 year lifespan, transceivers tend to be replaced every three to five years along with switches, so they cannot be exorbitant in price. In general, the idea would be that a 100 Gb/sec transceiver will cost a little bit more than a 40 Gb/sec transceiver, and similarly a 200 Gb/sec transceiver will probably cost a little bit more than a 100 Gb/sec transceiver.
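To see why the replacement cycle makes transceiver pricing matter so much, here is a rough back-of-the-envelope sketch. The $5,000 CFP price and the tenfold gap come from the figures above; the roughly $500 QSFP price is only what that ratio implies, not a quoted Mellanox price, and the four-year lifetime is an assumed midpoint of the three-to-five-year range. The function name is hypothetical.

```python
CFP_PRICE = 5_000            # dollars, the approximate figure cited above
QSFP_PRICE = CFP_PRICE / 10  # implied by "ten times as expensive"; not a quoted price
LIFETIME_YEARS = 4           # assumption: midpoint of the three-to-five-year cycle


def annualized_link_cost(transceiver_price, lifetime_years, ends=2):
    """Yearly transceiver cost for one link, with a transceiver on each end."""
    return ends * transceiver_price / lifetime_years


for name, price in (("CFP", CFP_PRICE), ("QSFP", QSFP_PRICE)):
    cost = annualized_link_cost(price, LIFETIME_YEARS)
    print(f"{name}: about ${cost:,.0f} per link per year")
```

Under these assumptions the gap is about $2,500 versus $250 per link per year, before multiplying by the number of links in the fabric.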