Drawing the lines where the server ends and the network begins is getting more and more difficult. Companies want to add more intelligence to their networks and are distributing processing in their switches to manipulate data before it even gets to the servers to be run through applications. The need for higher bandwidth and lower latency is in effect turning switches (and sometimes network adapters) into servers in their own right.
This is not a new idea, but the server-switch hybrid seems to be taking off in physical switches just as the use of virtual switches is taking off on servers equipped with hypervisors and supporting virtualized workloads. In the financial trading arena, applications generally run on bare metal for performance reasons because nanoseconds count, and so instead of a virtual server-switch hybrid, trading companies want a physical box that has low latency on the switching and enough brains to do useful work that might otherwise be done by servers much further down the line from an exchange.
Because of the competitive advantages that specific technologies yield for big banks, hedge funds, and high frequency traders that invest our money and make a zillion quick pennies on it every second, it is rare indeed to find a financial services customer that will admit to using a particular technology in their infrastructure. We almost have to infer the utility of hardware or software from the fact that they exist and that companies are selling and supporting them, barring the occasional customer who will talk and explain how and why they chose to implement a particular machine.
So it is with hybrid server-switches, which typically marry high-bandwidth, low-latency Ethernet switches with compute and storage that is sufficient to do a reasonable amount of processing all in the same box. Such a hybrid machine was not inevitable just because X86 processors have been embedded inside of switches over the past few years, but that has certainly made it easier to move Linux applications that once ran on separate servers or appliances onto the switch itself. Switches have always included field programmable gate arrays, or FPGAs, to help goose the performance of certain functions that are ancillary to the central switching ASIC but which are of too low a volume to merit being etched directly in silicon. In some cases, switch makers are putting beefier FPGAs in the switches, giving them multiple kinds of compute to chew on data as it moves back and forth across the switch.
This is precisely the tactic that Juniper Networks is taking with its new server-switch, which has a mix of X86 and FPGA compute embedded in it to accelerate financial applications.
Andy Bach, chief architect for Juniper’s financial services group, tells The Next Platform that the increased message rates required by trading companies, which are growing at somewhere around 30 percent to 50 percent per year, put enormous pressure on networks. On the CPU front, processor clock speeds are more or less flatlined at around 3.5 GHz, depending on the architecture, and that means companies can’t push performance on single-threaded applications much harder. Aspects of trading applications do not easily parallelize and are dominated by the single-threaded performance of CPUs. And while high frequency trading fostered a market for very low latency Ethernet switches, switching speed is pretty much “bottoming out” at a couple of hundred nanoseconds, says Bach, and so switch makers and financial services firms alike are looking for new ways to boost performance and decrease response times. (Bach ran the networks at NYSE-Euronext for more than two decades, so he has sat on the customer side of the table for a lot longer than the vendor side and knows these issues well.)
But there is, Bach explains, another factor that is pushing companies to rethink their switching infrastructure.
“We are starting to see another step function in processing demand,” says Bach. “We went through the era of high frequency traders, and their main advantage was speed and everyone has pretty much caught up. With social media and live news feeds, that really changes the game in terms of what kind of processing capacity you are going to need. Just think about sorting through a live Twitter feed and figuring out what is good news and what is not, what is relevant.”
The stock ticker coming off the exchanges–the crawl coming off Wall Street–is barely any data at all by comparison. But the consolidated trade/consolidated quote ticker, or CTCQ, averages around 1 million messages per second, the options feed can burst to 20 million messages per second, and the equities feed is on the same order of magnitude; message sizes in the financial industry average somewhere between 140 bytes and 200 bytes. A Twitter feed is on the order of millions of messages per second, and other social media sites like LinkedIn and Facebook also have very high message rates that are comparable to those of the financial feeds. (It depends on whether you take the whole feed or just parts of it.)
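To put those message rates in network terms, here is a back-of-the-envelope sketch that multiplies the quoted rates by the quoted message sizes. The function name and the dictionary labels are ours, and the estimate covers payload only; it assumes no Ethernet, IP, or transport framing overhead, so real wire rates would be higher.

```python
def feed_bandwidth_gbps(messages_per_sec, message_bytes):
    """Payload bandwidth in Gb/sec for a feed (framing overhead ignored)."""
    return messages_per_sec * message_bytes * 8 / 1e9


# Rates and sizes quoted in the article; the labels are illustrative.
FEED_RATES = {
    "consolidated ticker (average)": 1_000_000,
    "options feed (burst)": 20_000_000,
}
MESSAGE_SIZES = (140, 200)  # bytes, low and high end of the quoted average

if __name__ == "__main__":
    for name, rate in FEED_RATES.items():
        low = feed_bandwidth_gbps(rate, MESSAGE_SIZES[0])
        high = feed_bandwidth_gbps(rate, MESSAGE_SIZES[1])
        print(f"{name}: {low:.2f} to {high:.2f} Gb/sec of payload")
```

By this estimate the average consolidated ticker needs 1.12 Gb/sec to 1.6 Gb/sec, while a 20 million message per second burst works out to 22.4 Gb/sec to 32 Gb/sec, or somewhere between roughly half and four-fifths of a 40 Gb/sec port before any protocol overhead is counted.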
Live news feeds are also increasingly part of the trading algorithms, and it is funny to think about the news last week that Intel might be acquiring Altera hitting the street actually coursing its way through financial applications, with that very news possibly being chewed on by a mix of Xeon CPUs and Stratix FPGAs and turned into money. CNBC reported that a fast-acting trader (more likely an application, not a person) reacted within seconds of the Altera rumor hitting Twitter and turned options worth $110,350 into a $2.5 million profit in 28 minutes as the market went crazy and drove Altera’s stock up 28 percent. That fast reaction on the Altera news could have been a bot waiting for such news to come along, plain old luck, or insider trading.
Suffice it to say, there is still an arms race in the financial services industry, and this time it is about using data analytics to pick the right time to trade, and then executing quickly – not just moving quickly for its own sake.
Not Quite A God Box, But Close
Juniper would not bother to create a God box switch if there were not customers asking for such a machine, and the wonder is that Juniper has not done it to date, with rivals Cisco Systems and Arista Networks already doing it and upstart Pluribus Networks making a lot of noise last year about its own server-switch half-blood. Bach can’t reveal the early customers who helped Juniper develop and test its QFX application acceleration switch, but he did tell The Next Platform that the Chicago Mercantile Exchange, which runs the world’s largest derivatives trading operations, will be using the new switch to route data selectively to its matching engines.
The new compute-enhanced Juniper switch comes in two flavors and uses the QFX5100 switch as its baseline. This switch has a Broadcom Trident-II ASIC at its heart that has 2.56 Tb/sec of switching bandwidth and that can process 1.44 billion packets per second. The QFX5100 comes in a bunch of flavors, which differ in the number and speed of ports, but the one that the QFX5100-AA is based on has 24 ports running at 40 Gb/sec and a port-to-port hop latency of around 550 nanoseconds.
In prior generations of QFX top-of-rack switches, the Junos network operating system ran in bare-metal mode on X86 processors embedded in the switch. With the QFX5100 launched last year, Juniper put in a two-core “Sandy Bridge” Xeon E3 processor running at 1.5 GHz and added a hypervisor, allowing it to run two copies of Junos for redundancy. This switch had 8 GB of memory for the Xeon E3 chip and a 32 GB SSD for local storage.
On the QFX5100-AA, Juniper upgraded the X86 processor to a four-core Xeon E3-1125C v2 (which is an “Ivy Bridge” generation embedded chip) running at 2.5 GHz, and found that Junos only needed a half core of processing to run atop its hypervisor. That left three cores of raw compute that it can expose through its hypervisor to run Linux applications right in the network stream. The QFX5100-AA has 32 GB of memory allocated to that Xeon E3 chip plus 128 GB of SSD storage, and importantly, has two 10 Gb/sec Ethernet ports with drivers that allow for kernel bypass, yielding very low latency from the network into the Xeon chip and back out again.
If those three Xeon cores are not enough compute to accelerate workloads, then customers can buy an FPGA module, called the QFX-PFA, which is an Altera Stratix V that has been mounted onto a card that fits in a slot normally reserved for adding modular uplink cards to the switch. Specifically, Juniper is using the Stratix V 5SGXAB variant of the FPGA, which has 952,000 logic elements and 1.44 million registers, and this is one of the most powerful FPGAs that Altera makes. The module is known as the packet flow accelerator, and it has 48 GB of its own DDR3 memory and provides 160 Gb/sec of connectivity to the switch.
Programming FPGAs is not an easy task, so Juniper is partnering with Maxeler Technologies, which has tools that can convert dataflows constructed in the Java language to the Verilog language that encodes the FPGA’s functions. The QFX-PFA includes a complete development environment and technical support. Maxeler has expertise not only in financial services, but also in the oil and gas industry and in the life sciences arena, and that means this switch could see some action in these markets, too.
The FPGA-enhanced switch from Juniper will have a number of different use cases, according to Bach. Putting the compute in the switch to ingest hundreds of global data sources, combine them, and parse them out to customers with specific feed requirements can be done inside the switches, reducing latency and replacing hundreds of servers. The exchanges have to maintain multiple matching engines for their order books, and this can also be partially implemented on the switches instead of solely on servers. In one test, a customer with a trading plant comprised of 1,000 servers and 1,500 network ports with a 150 microsecond latency was able to use the FPGA to implement part of the work, cutting the server count down to 60 machines and the switch port count down to 1,000 ports while at the same time reducing the latency to under 100 microseconds for the matching engine application.
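The scale of that consolidation is easier to see as percentages. The sketch below simply does the arithmetic on the figures quoted for the customer test; the helper name is ours, and because the article says latency fell to "under 100 microseconds," the latency figure is a lower bound on the improvement rather than an exact value.

```python
def percent_reduction(before, after):
    """Percentage drop from `before` to `after`."""
    return (before - after) / before * 100


if __name__ == "__main__":
    # Figures from the customer test described in the article.
    print(f"servers:      {percent_reduction(1000, 60):.1f}% fewer")
    print(f"switch ports: {percent_reduction(1500, 1000):.1f}% fewer")
    # Latency went from 150 microseconds to under 100, so this is a floor.
    print(f"latency:      at least {percent_reduction(150, 100):.1f}% lower")
```

That works out to 94 percent fewer servers, a third fewer switch ports, and at least a one-third cut in latency.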
The exchanges can also use the compute inside of the enhanced QFX5100 switch to do pre-trade risk analysis on the fly, as transactions are coming across the network, with a lot lower latency, assessing if trades are bogus or not. The other kind of risk analysis associated with trading – checking to see if a trade is smart or not – can also be done on the fly inside the switch, says Juniper.
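For readers who have not seen one, a pre-trade risk check of the "is this order bogus" variety is conceptually a handful of limit comparisons applied to each order before it reaches the matching engine. The sketch below is purely illustrative: it is ordinary Python rather than FPGA logic, it is not Juniper's or any exchange's implementation, and the `Order` and `RiskLimits` types, the field names, and the threshold values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Order:
    symbol: str
    side: str        # "buy" or "sell"
    quantity: int
    price: float


@dataclass
class RiskLimits:    # hypothetical limits, for illustration only
    max_quantity: int
    max_notional: float
    price_band: float  # allowed fractional deviation from a reference price


def pre_trade_check(order, reference_price, limits):
    """Return a list of rejection reasons; an empty list means the order passes."""
    reasons = []
    if order.side not in ("buy", "sell"):
        reasons.append("unknown side")
    if order.quantity <= 0 or order.quantity > limits.max_quantity:
        reasons.append("quantity outside limits")
    if order.price <= 0:
        reasons.append("non-positive price")
    elif abs(order.price - reference_price) / reference_price > limits.price_band:
        reasons.append("price outside band")
    if order.quantity * order.price > limits.max_notional:
        reasons.append("notional exceeds limit")
    return reasons


if __name__ == "__main__":
    limits = RiskLimits(max_quantity=10_000, max_notional=1_000_000.0, price_band=0.10)
    fat_finger = Order("XYZ", "buy", 100, 350.0)  # ten times the reference price
    print(pre_trade_check(fat_finger, reference_price=34.5, limits=limits))
```

The appeal of doing this in the switch is that every comparison is independent and stateless per order, which is the kind of work that maps naturally onto hardware sitting in the packet path.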
Taking On Arista, Cisco, And Pluribus
At the moment, the Juniper QFX5100-AA has three devices as its main server-switch hybrid competition. Arista launched the 7124FX switch back in March 2012, which has 24 ports running at 10 Gb/sec and which is based on the Bali ASIC from Intel, rated at a mere 480 Gb/sec of aggregate switching bandwidth. The 7124FX switch also has a dual-core Turion Neo X2 processor from AMD with 4 GB of memory, which links to an Altera Stratix V FPGA through a PCI-Express 2.0 port. The 7124FX also has an atomic clock option, and fully loaded, with a C-to-Verilog development kit, it costs $59,995.
Cisco peddles the Nexus 3548 as an accelerated switch, which came out in September 2012 and which is based on the company’s own “Monticello” ASIC. This switch was aimed at high frequency traders and had 48 ports running at 10 Gb/sec. The Monticello ASIC has 960 Gb/sec of switching bandwidth and can forward at a rate of 720 million packets per second. The Algo Boost functions of this Monticello ASIC allow for certain algorithms commonly used to handle data feeds and transactions to be accelerated, but they are not programmable by end users like the FPGAs used in the Arista and now Juniper switches. Cisco charges $41,000 for the Nexus 3548, but as we point out, it does not have the same level of programmability even if it does have low latency and low jitter.
The third competitor to the Juniper QFX5100-AA is the Freedom Server-Switch from Pluribus Networks, which came out in February 2014. Pluribus has two flavors of server-switches. One uses Broadcom’s Trident-II ASIC married to a server with one Xeon E3 processor, while the other uses an Intel Alta FM6400 ASIC matched to a two-socket Xeon E5 server. Depending on the configurations, the Pluribus Freedom machines cost between $25,000 and $80,000, not including a KVM hypervisor, which costs another $10,000 on top of that.
Juniper doesn’t expect to ship the QFX5100-AA and its FPGA add-on until the third quarter, so pricing is not yet set in stone. But Bach says the existing QFX5100 costs $40,000 and to expect a slight premium for the beefier and programmable QFX5100-AA. (Perhaps in the range of $50,000 is our guess.) Adding the FPGA and programming tools will pump up the price, but Juniper says the full-tilt-boogie QFX5100-AA with the FPGA stack will be priced competitively against other options. Whatever the cost for the QFX-PFA module will be, you can bet that most of it will be due to the Maxeler tools, not the Stratix FPGA.
Here’s a funny question: How long before Facebook and its manufacturing partners in the Open Compute Project add FPGAs to the Wedge and 6-pack open switches?