IBM did not just stake the future of its Power chip, and the systems business that depends on it, on the OpenPower Foundation, a consortium now with 160 members after more than two years of cultivation by Big Blue and its key early partners – Google, Nvidia, Mellanox Technologies, and Tyan. It has also staked its future in the systems business on the idea of accelerated computing, which means using a mix of processors and accelerators to maximize performance and minimize costs and thermals for specific workloads.
It is hard to argue against the idea that the future of computing in the datacenter is moving away from general purpose processors and systems (perhaps with the exception of hyperscalers, who need homogeneity to keep acquisition and operation costs low so they can grow their customer bases and services) to more specialized gear. Just look at how many different kinds of processors Intel sells to see the diversity, which is about to be expanded with Altera FPGAs once Intel completes its $16.7 billion acquisition of that chip maker.
In the OpenPower camp, IBM and Nvidia have been tag teaming with hybrid computing for a while, with the big wins being the future “Summit” and “Sierra” supercomputers that the U.S. Department of Energy is spending $325 million to build for Oak Ridge National Laboratory and Lawrence Livermore National Lab, respectively. These are pre-exascale systems based on the future Power9 chips from IBM lashed to the future “Volta” Tesla GPU coprocessors from Nvidia, with the CPUs and GPUs linked by NVLink high-speed interconnects and the resulting hybrid nodes linked by EDR InfiniBand (or perhaps HDR if it gets completed in time).
Naturally and predictably, in the wake of the Intel-Altera deal, Xilinx is becoming an important member of the OpenPower compute platform and has announced a strategic collaboration with IBM under the auspices of the OpenPower Foundation to more tightly couple its FPGAs with Power processors and, ultimately, Nvidia GPUs where that is appropriate. This multi-faceted hybrid computing is something that we expected, and we said as much back in March when we attended the first OpenPower Summit in Silicon Valley. In fact, we used the high frequency trading applications from Algo-Logic, which bring all three technologies together, as an example of how this could work.
John Lockwood, CEO at Algo-Logic, summed it up this way: FPGAs are deployed where you need low latency on transactions, GPUs are used where you need high throughput on the parts of the application that can be parallelized, and CPUs are used for those portions of the code that need fast execution on single threads. The trick is making it all work together to accelerate the entire application. This is not as simple as buying a single type of processor, but the OpenPower partners think the arguments for this approach are compelling. Compelling enough for IBM to sell off its System x business and essentially stake its systems future on the idea.
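Lockwood's division of labor amounts to a placement rule, which can be sketched as a toy dispatcher. This is purely illustrative Python from us, not Algo-Logic's code; the task attributes and thresholds are hypothetical stand-ins for the real profiling a trading shop would do.

```python
# Hypothetical sketch of the CPU/GPU/FPGA division of labor described above:
# route each piece of an application to the device whose strengths match it.
# Task attributes and the 0.5 threshold are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Task:
    name: str
    latency_sensitive: bool   # e.g. order matching on a live trading feed
    parallel_fraction: float  # share of the work that parallelizes well

def place(task: Task) -> str:
    """Pick a device following the rule of thumb in the article."""
    if task.latency_sensitive:
        return "FPGA"   # lowest, most deterministic latency on transactions
    if task.parallel_fraction > 0.5:
        return "GPU"    # high-throughput calculations on parallel code
    return "CPU"        # fast single-thread execution for serial portions

jobs = [
    Task("order-matching", True, 0.1),
    Task("risk-simulation", False, 0.9),
    Task("strategy-logic", False, 0.2),
]
placements = {t.name: place(t) for t in jobs}
```

The hard part, as Lockwood notes, is not this routing decision but making the three devices cooperate on one application – which is exactly what the interconnect work described below the fold is about.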
The question we had when being briefed about the OpenPower announcements at SC15 this week is this: When does accelerated computing become normal? And then we thought about a future where all of this technology – CPU, GPU, and FPGA – might end up in a single package or on a single die anyway as Moore’s Law progresses for the next decade or so.
Brad McCredie, who is vice president of Power Systems development at IBM and president of the OpenPower Foundation, offers his ideas on these issues and on when everything that can be accelerated will be accelerated.
“I think what that comes down to is predicting a rate of change in software,” McCredie explains. “That is going to be the gate to the timeline that you are thinking about. We see that these transitions do take five to ten years for software to migrate, but I do think that is an end state and people may choose to debate that with me. I think we will see that accelerated computing will be the norm and that software will be developed that way. Of course we are going to build lots of tools and aids to make it easier and easier to use these hybrid architectures. But this is going to be the new normal and we are going there.”
With Intel already having integrated GPUs on selected Xeon processors (both on-package and on-die variations), and talking about how it will have in-package and eventually on-die FPGA accelerators on selected members of the Xeon family, it is natural enough to ask if the OpenPower partners will ever work together to create various integrated CPU-FPGA or CPU-GPU or even CPU-GPU-FPGA hybrid chips. There may be some technical hurdles to clear, of course, but it is not a ridiculous thing to contemplate – particularly as Moore’s Law starts running out of gas. Here is what McCredie had to say about that:
“With every generation – and it is becoming more and more interesting math – we move functions and accelerators onto the chips. With each of our chips, we go through a longer and longer list of special purpose accelerators. When is it going to be that the math will be right to pull in a general purpose programmable accelerator onto the chip? Is it at 7 nanometers or 3 nanometers when that is going to happen? I don’t know just yet. The one thing that I would point out is that as we are getting more and more out of the accelerators, the truth is the amount of silicon in a system that is being devoted to accelerators is outstripping CPUs in many cases. So maybe these things are going to stay separate for quite a while, only because we are just going to add more and more silicon into the system to get the job done as Moore’s Law slows down. When that happens, the corollary is that you need more square millimeters of silicon to get the job done.”
As we have pointed out a number of times here at The Next Platform, one could make an argument for a central processor complex composed of CPU cores with fast single-thread performance that have had their vector math units ripped out, coupled to an on-package GPU and FPGA. But McCredie pushed back on that idea.
“The idea is absolutely not stupid, and these are trajectories that we could go on,” says McCredie. “But for several generations, right now where we are sitting in the industry, as far as scaling and performance go, the key investment is in the bus and the communication between the processor and the accelerators. This is where the differentiation is going to take place. We need a lot of silicon to do the processor and the accelerators, and no one is saying that they need less CPU or less acceleration. We see demand for more of both, and so we need to get better and much more efficient communication between these components.”
For OpenPower, that means a few different things. First, it means embracing and enhancing the Coherent Accelerator Processor Interface (CAPI) that is part of the Power8 chip and that allows for coherent memory access between the Power8 processor and accelerators that link over the PCI-Express bus in the system. As part of the multi-year agreement between IBM and Xilinx, the two will be working on CAPI integration for Xilinx FPGAs, the SDAccel programming stack for FPGAs will be ported to Power processors and optimized for the combination, and Xilinx roadmaps will be aligned with the combined roadmap from the OpenPower partners so compute and networking in their various forms in these hybrid machines move together in unison and at a predictable pace. The Xilinx-IBM agreement includes joint marketing and sales efforts, too.
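The practical difference coherence makes can be modeled abstractly. The toy Python below is entirely our own illustration (real CAPI programming goes through IBM's libcxl library and an FPGA-side service layer, none of which is shown here): a conventional PCI-Express accelerator must copy buffers across the bus in each direction, while a coherent CAPI-attached device shares the processor's address space and works on host memory in place.

```python
# Toy model of why CAPI-style coherence matters. A conventional PCIe
# accelerator copies data host -> device, computes, then copies it back;
# a coherent accelerator reads and writes host memory directly, so the
# explicit copies (and the latency they add) disappear.
# Purely illustrative -- this models data movement, not real hardware.

def offload_with_copies(data):
    copies = 0
    device_buf = list(data); copies += 1        # host -> device copy over PCIe
    device_buf = [x * 2 for x in device_buf]    # accelerator computes
    result = list(device_buf); copies += 1      # device -> host copy back
    return result, copies

def offload_coherent(data):
    copies = 0                                  # no staging copies needed
    data = [x * 2 for x in data]                # accelerator operates on host memory
    return data, copies

result_pcie, copies_pcie = offload_with_copies([1, 2, 3])
result_capi, copies_capi = offload_coherent([1, 2, 3])
```

Both paths produce the same result; the coherent path simply skips the two bus crossings, which is the whole argument for tightening the link between the Power chip and its accelerators.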
The important thing to note is that the Power8+ chip due in 2016 will have both NVLink ports for boosting the bandwidth and lowering the latency between Power chips and “Pascal” GP100 Tesla coprocessors as well as the existing CAPI links for talking to other kinds of accelerators such as FPGAs.
In 2017, IBM will move to the Power9 processors and both CAPI and NVLink will be enhanced to create the foundational technology for the Summit and Sierra systems for the Department of Energy; the enhanced NVLink will be used to hook the Power9 chips to the “Volta” GV100 Tesla coprocessors. The roadmap calls for HDR InfiniBand and matching adapters running at 200 Gb/sec linking the hybrid nodes to each other.

The precise generations of Xilinx chips are not clear yet – they just inked the deal, after all. At the moment, IBM and FPGA partners (including Altera and Xilinx) have been able to create a CAPI-enabled PCI-Express port on the FPGA out of logic gates on the FPGA itself, which is something you cannot do with other chips because they are not malleable like FPGAs. In a future generation, the CAPI bus will be made more robust, says McCredie, and will be “pulled away from being 100 percent tied to PCI-Express,” as he put it. Xilinx will align with this enhanced CAPI bus, but don’t expect Nvidia Tesla GPU coprocessors to use it. The rule is NVLink for GPUs, CAPI for everything else. (We suspect that Enhanced NVLink will have a maximum of eight ports per device instead of four and will run at a higher clock speed than the 20 GB/sec of the original NVLink. Oh, and the Power9 chip will have a new microarchitecture and use a new chip process (14 nanometers) at the same time, by the way.)
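It is easy to put rough numbers to that NVLink speculation. The figures for Enhanced NVLink below are our guesses, not anything Nvidia has disclosed – the 25 GB/sec per-port speed in particular is an assumed value chosen only to show how the aggregate bandwidth math works out.

```python
# Back-of-the-envelope aggregate NVLink bandwidth per device.
# Original NVLink: four ports at 20 GB/sec each (per the article).
# Enhanced NVLink: eight ports at a higher clock is our speculation;
# 25 GB/sec per port is an assumed illustrative value, not a spec.

def aggregate_bandwidth(ports: int, gb_per_sec_per_port: float) -> float:
    """Total per-device bandwidth across all NVLink ports, in GB/sec."""
    return ports * gb_per_sec_per_port

nvlink_original = aggregate_bandwidth(4, 20.0)   # 80 GB/sec per device
nvlink_enhanced = aggregate_bandwidth(8, 25.0)   # 200 GB/sec if the guesses hold
```

Even under these conservative assumptions, doubling the port count while nudging the clock more than doubles the pipe between a Power9 chip and its Volta coprocessors, which is the kind of jump the Summit and Sierra node designs would seem to require.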
This is a lot of change, but that is precisely what the OpenPower partners have signed up for to chase exascale computing.
Putting Power Acceleration Into Practice
As part of the announcement extravaganza at SC15 this week, IBM is also talking about its own use of Tesla K80 accelerators on two-socket “Firestone” systems that underpin its Watson cognitive computing stack. (We told you about the Firestone machines, which are sold under the Power Systems LC brand by IBM, back in October and also went through some benchmark comparisons to Xeon-based machines for unaccelerated workloads.) McCredie tells The Next Platform that IBM is accelerating various deep learning algorithms in the Watson stack using the Tesla GPUs, and as proof points IBM says that the Retrieve and Rank APIs in the Watson stack have been accelerated by a factor of 1.7X and on natural language processing jobs the performance has been goosed by a factor of 10X. McCredie said that IBM had not yet deployed either the Tesla M4 or Tesla M40 GPU accelerators, announced last week by Nvidia, underneath Watson, but that given their aim at machine learning, he expected that IBM would give them consideration.
To help more customers make the move, the SuperVessel Power cloud that IBM set up in China earlier this year has expanded GPU and FPGA acceleration, and IBM’s own centers in Poughkeepsie, New York and Montpellier, France have been beefed up, too. And of course, Oak Ridge and Lawrence Livermore will be doing development work on hybrid Power-GPU setups, too.
The wins for Power-Tesla hybrid computing and the test beds for Power-FPGA computing have been documented here in The Next Platform over the past eight months, and what we wanted to know is how the uptake is going outside of these HPC labs and the oil and gas industry where this idea was originally rejected and then took off. Rice University, Baylor University, Oregon State, and the University of Texas all have hybrid clusters for doing research, and Louisiana State University is doing so with FPGA-accelerated Power clusters. McCredie says that there are examples of companies doing network function virtualization and running other workloads underway at telecommunication firms and service providers, and that eight big proofs of concept are underway in various large enterprise accounts.
The reason is simple: Like Google, they have to beat Moore’s Law, any way that they can. That is why Google was a founding member of the OpenPower Foundation, after all.