Oracle Engineers Its Own InfiniBand Interconnects
February 22, 2016 Timothy Prickett Morgan
Larry Ellison, co-founder and chief technology officer of database, middleware, and application software giant Oracle, caught the hardware bug pretty bad when he decided to buy Sun Microsystems for $7.4 billion in early 2009. Many did not believe it at first.
But a year later, when the deal closed, Oracle cranked out a five-year Sparc chip roadmap and fulfilled it faithfully and more or less on time, with Sparc M7s for high-end, scale-up systems and “Sonoma” T7 processors for scale-out clusters finishing it out as 2015 came to a close. The interesting bit about the Sonoma chips, which the company unveiled last summer, presumably for shipment in machines that will come to market this year, is that they include two InfiniBand ports running at 56 Gb/sec right there on the die. Oracle did the engineering work on these embedded InfiniBand controllers.
But that was apparently only the beginning.
Now Oracle has gone one step further and has created its own 100 Gb/sec EDR InfiniBand switch and server adapter chips, which it intends to deploy in its own “engineered systems” products such as the Exadata database and Exalogic middleware clusters based on Xeon processors and the Sparc SuperCluster machines based on its own Sparc chips. And, interestingly, if you want to buy Oracle’s InfiniBand switches and server adapters to create your own clusters, you will be able to do that, too.
That Oracle has done its own implementation of InfiniBand switching for server and storage cluster fabrics is perhaps something of a surprise. Sun and then Oracle had experience building InfiniBand switches for sure, but had not done their own chips before.
Sun co-founder Andy Bechtolsheim, whose startup Granite Systems was acquired by Cisco Systems to give it a Gigabit Ethernet switching business back in 1996, was a key force behind Arista Networks, which is giving Cisco a hard time with 10 Gb/sec and soon 25 Gb/sec switching two decades later. He was also the founder of Kealia, which Sun acquired in 2004 and which created a media streaming and supercomputing cluster system called “Constellation” that comprised Opteron-based servers dubbed “Galaxy” and a monster 3,456 port InfiniBand switch called “Magnum.” That machinery was first installed at the Texas Advanced Computing Center (TACC) at the University of Texas to build a 500 teraflops cluster, which was nicknamed “Ranger” and which is still running in a datacenter in Africa today.
Because InfiniBand was so crucial to Oracle’s future engineered systems plans, after the Sun acquisition in 2010, the company took a 10.2 percent stake in Mellanox Technologies, which was the biggest supplier of InfiniBand switch and server adapter chips at the time and still is. Intel, which bought the InfiniBand portion of QLogic in 2012, is the other supplier. The chip maker’s Omni Path switches, which are an evolution of InfiniBand, are ramping this year.
Oracle will continue to sell Mellanox chips inside of its switches and adapters, much as it still sells Xeon processors from Intel in some systems and its own Sparc chips in others, both capable of running the same workloads. (A Sparc SuperCluster is a hybrid cluster running Oracle’s database and middleware software that just so happens to use Xeon-based Exadata flash-accelerated storage nodes like the real Exadata machines do.) But the analogy is a bit different here, since Oracle will no doubt prefer its own silicon over that from its partner, even when it comes to Xeon machines.
The obvious question is why Oracle is making its own InfiniBand chips and wrapping circuit boards and metal around them to make switches and adapters. Marshall Choy, vice president of systems solutions at Oracle, gave The Next Platform an explanation:
“It is a few things. Having control of the intellectual property is one of those development methodologies that we have at Oracle we learned a long time ago when we first started building engineered systems. When you have control of the IP stack, you can be the master of your own destiny going forward. We are doing specific things in this chip around hardware acceleration on the network and not only for our own software and what we are doing at the low level for middleware and database to accelerate messaging but also in general for other applications. There are obvious cost efficiencies here as well, but with our association with Mellanox in the InfiniBand Trade Association, it is a validation of an open technology where we have multiple players working on the same standard. We are not going off the reservation and forking anything here. We think it is a good thing for the industry as a whole.”
The savings Oracle can reap by doing its own InfiniBand chips have to be pretty substantial, clearly enough to fund the development costs and perhaps a bit more. While Choy is not able to elaborate much, the company will be launching its 100 Gb/sec InfiniBand switches and adapters in the second quarter of this year and has a roadmap that moves on up to 200 Gb/sec InfiniBand in the future and out beyond that. This is not a one-off product, and at some point, Oracle might put out a roadmap for both Sparc chips and InfiniBand networking that looks out another five years. We would guess somewhere around its OpenWorld conference this fall.
All of the switch chip makers are hesitant to talk about the specifics of their chips, and you very rarely see die shots or block diagrams of the chips because they reveal too much about how the chips work. Choy did confirm that Taiwan Semiconductor Manufacturing Corp is etching the EDR InfiniBand chips (which Oracle did not have a catchy code name for while they were under development) but did not know what process they are made with. Oracle has been developing its EDR InfiniBand ASIC for about a year, and the fact that it brought it to market relatively quickly is impressive. It now seems obvious that the InfiniBand ports on the Sparc Sonoma chips were something of a dry run.
Incidentally, the Oracle chips will support the Message Passing Interface (MPI) protocol that is commonly used to handle messaging in supercomputer clusters running simulations and models. And while Oracle essentially shut down Sun’s HPC business six years ago, if it wants to go after HPC deals with its own switches, it will be able to do so.
The Feeds And Speeds
The base EDR InfiniBand switch in the Oracle lineup is the IS2-46, which is technically known as a leaf switch and which is a fairly complex hybrid device that reflects the many networking jobs Oracle expects the chip to handle. The Oracle EDR InfiniBand ASIC has 8 Tb/sec of bisection bandwidth across the switch and can deliver 150 nanosecond latencies on a port-to-port hop. The IS2-46 has up to 38 InfiniBand 4X ports: 24 ports running at the 4X speed (97 Gb/sec effective), four ports running at 12X speeds (290.9 Gb/sec effective) that can each be broken down into three 4X ports or run in aggregation mode for higher throughput, and two ports that can be configured either as InfiniBand 4X ports or as 40 Gb/sec Ethernet gateways to link the nodes on the InfiniBand side out to the Ethernet LAN and WAN that is common in the enterprise. The Oracle leaf switch has two Ethernet gateways, each with four 10 Gb/sec ports, and includes an eight-core Intel Xeon D processor with 32 GB of memory and a copy of Oracle Linux and Oracle VM Server to run virtualized network services in the switch as well as Oracle’s Fabric Manager InfiniBand software stack.
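The port arithmetic for the leaf switch can be checked with a short Python sketch. It uses only the figures quoted above; the constant and function names are made up for illustration, and the per-4X effective rate is derived here by dividing the quoted 12X figure by three rather than taken from any Oracle spec sheet.

```python
# Port tally for the IS2-46 leaf switch, using only figures quoted in the article.
# All names below are illustrative, not Oracle terminology.

FIXED_4X_PORTS = 24          # ports that always run at 4X
PORTS_12X = 4                # ports that run at 12X
SPLIT_PER_12X = 3            # each 12X port can break out into three 4X ports
CONFIGURABLE_PORTS = 2       # InfiniBand 4X or 40 Gb/sec Ethernet gateway
EFFECTIVE_12X_GBPS = 290.9   # quoted effective rate of one 12X port


def total_4x_ports() -> int:
    """Maximum 4X port count with every 12X port split and both
    configurable ports set to InfiniBand mode."""
    return FIXED_4X_PORTS + PORTS_12X * SPLIT_PER_12X + CONFIGURABLE_PORTS


def effective_4x_gbps() -> float:
    """Effective 4X rate implied by the quoted 12X rate (an inference)."""
    return EFFECTIVE_12X_GBPS / SPLIT_PER_12X


if __name__ == "__main__":
    print("4X ports:", total_4x_ports())
    print("Implied 4X effective Gb/sec:", round(effective_4x_gbps(), 2))
```

The tally lands on the 38 ports quoted for the switch, and the implied 4X rate rounds to the 97 Gb/sec figure, so the numbers are at least internally consistent.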
The leaf switches are designed to be lashed together to create larger networks using spine switches, and in this case, that would be the IS2-254, which has up to 254 ports running at the InfiniBand 4X speeds and has an aggregate of 24 Tb/sec of bisection bandwidth and 450 nanosecond latency on port-to-port hops. The IS2-254 is a modular spine switch and comes with a dozen modules. With a base configuration, the spine switch comes with 7.6 Tb/sec of bisection bandwidth and supports a star topology with two InfiniBand 4X ports per module slot and two Ethernet gateways. The extended fabric has 240 of the InfiniBand 4X ports and has an optical backplane to link them all together with a total of 19.2 Tb/sec of bisection bandwidth; this version supports fat tree network topologies. This spine switch has the same Xeon D processor for running network services and Oracle’s Fabric Manager but also has SSDs for capturing log data.
The Oracle InfiniBand fabric can be quite large. In a two-tiered network with 64 of the IS2-46 leaf switches and two IS2-254 switches, up to 2,048 server nodes can be lashed together. If you add more spines and cross couple them to a lot more leaves, you can scale the fabric up to around 500,000 nodes and still have microsecond latency across the fabric, says Choy.
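For a rough feel of what that two-tier configuration implies, here is a back-of-the-envelope Python sketch. It rests on assumptions Oracle did not spell out: that server nodes are spread evenly across the leaves, that each leaf exposes all 38 of its 4X ports, and that a worst-case path crosses one leaf, one spine, and one leaf. It counts switch hops only and ignores adapter, cable, and software latency. The names are hypothetical.

```python
# Back-of-the-envelope sizing for the two-tier fabric described in the article.
# Assumptions (not from Oracle): even node spread, 38 usable 4X ports per leaf,
# and a leaf -> spine -> leaf worst-case path. Names are illustrative only.

LEAF_SWITCHES = 64
MAX_NODES = 2048
LEAF_4X_PORTS = 38
LEAF_HOP_NS = 150    # quoted port-to-port latency of the leaf switch
SPINE_HOP_NS = 450   # quoted port-to-port latency of the spine switch


def nodes_per_leaf() -> int:
    """Server nodes attached to each leaf if they are spread evenly."""
    return MAX_NODES // LEAF_SWITCHES


def remaining_ports_per_leaf() -> int:
    """Ports left on each leaf for spine uplinks or gateways."""
    return LEAF_4X_PORTS - nodes_per_leaf()


def switch_latency_ns() -> int:
    """Switch-only latency across a leaf -> spine -> leaf path."""
    return LEAF_HOP_NS + SPINE_HOP_NS + LEAF_HOP_NS


if __name__ == "__main__":
    print("Nodes per leaf:", nodes_per_leaf())
    print("Remaining ports per leaf:", remaining_ports_per_leaf())
    print("Switch latency (ns):", switch_latency_ns())
```

Under these assumptions each leaf hosts 32 nodes with six ports to spare, and the switch hops add up to well under a microsecond, which squares with Choy's latency claim at least for the two-tier case.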
On the server side, Oracle has its own two-port 100 Gb/sec InfiniBand adapter card, which has functions to accelerate the live migration of traditional applications as well as those that have been tweaked to be aware of the Remote Direct Memory Access (RDMA) protocol that gives InfiniBand its low latency. The adapters can support up to 33 virtual InfiniBand host channel adapters on each server and one virtual switch per port. Oracle’s own Linux and Solaris variants can drive this InfiniBand adapter card and so can the Linux variants from Red Hat and SUSE Linux. Oracle’s own VM Server implementation of the Xen hypervisor as well as the VMware ESXi and Microsoft Hyper-V hypervisors can also chit chat with the Oracle adapter.
All of this 100 Gb/sec InfiniBand gear will be available in the second quarter; pricing has not been set as yet but will be revealed then. It will be very interesting indeed to see how the Oracle products stack up against Mellanox Switch-IB and ConnectX products and Intel Omni Path wares. Oracle will no doubt be using these products to build out its own public cloud infrastructure, and Choy said straight up that was one of the reasons why Oracle was building it.