American HPC Vendors Get Government Boost for Exascale R&D

Nicole Hemsoth Prickett

7 years ago

The US Department of Energy – and the hardware vendors it partners with – are set to enliven the exascale effort with nearly a half billion dollars in research, development, and deployment investments. The push is led by the DoE’s Exascale Computing Project and its extended PathForward program, which was announced today.

The future of exascale computing in the United States has been subjected to several changes, some public and some still in question (although we received a bit more clarification, which we will get to in a moment). The timeline for delivering an exascale capability system has also shifted, with most recent projections landing us in the 2021-2022 timeframe with “at least one” system. This roadmap was confirmed today with a DoE announcement that backs six HPC companies as they create the elements for next-generation systems.

Despite lingering questions about the US exascale effort, one thing is for sure: it is now clear which companies are set to receive some exascale research and development funding, ideally culminating in the system components needed to reach sustained exaflop performance on real-world applications. The vendors on this list include Intel, Nvidia, Cray, IBM, AMD, and Hewlett Packard Enterprise, a list we will pick apart with deeper dives into each of the vendors and their current exascale-oriented technologies.

Broadly, there will be $258 million in funding allocated over a three-year contract period for this PathForward project spread across all of these six companies. There is no indication that the funds are evenly split six ways across those winning the awards. The R&D funding is set along “work package” guidelines wherein the companies deliver reports to the DoE exascale groups (including software and application developers) to check the pace along the way. According to Argonne National Lab’s Paul Messina, who heads the Exascale Computing Project that will provide guidance to the PathForward groups, “It’s not like we spend the money and then wait three years to get an answer.”

“This $258 million in funding over three years will be supplemented with the companies providing additional funding amounting to at least 40 percent of their total project cost, bringing the total investment to at least $430 million.”

As the technical lead for HPE’s “The Machine” architecture, Paolo Faraboschi clarifies for The Next Platform, “This is an R&D acceleration contract. At the end of PathForward, the DoE is not expecting a product. They’re hoping for the acceleration of these projects so they become product ready. Typically the partners in these kind of projects assume a 40% cost share, so for every ten dollars the DoE puts in, the company is expected to put in four dollars so the total amount of work is fourteen dollars. The procurement of the systems will come in the second stage and this is not funded by DoE centralized but the national labs—they have the budget, site, they send out a tender for a specific installation. This will happen toward the end of PathForward, so the expectation is the companies that have been awarded PathForward will continue the productization phase so when the facilities are ready to issue an RFP they can respond with a technology that matches. The second phase is a different, far larger funding pool. The individual machines will be size of the entire grant at least at that scale.”
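The cost-share arithmetic is worth a quick check, because “40 percent” can be read two ways. The short Python sketch below is our own illustration, not anything published by the DoE or the vendors; the function names are hypothetical. It compares a vendor share measured against the total project cost (the wording of the DoE statement) with a vendor match measured against the government contribution (the ten-dollars-to-four-dollars example).

```python
def total_from_share_of_total(gov_funding, vendor_share_of_total):
    """Total cost when the vendor covers a fraction of the TOTAL project cost."""
    return gov_funding / (1.0 - vendor_share_of_total)


def total_from_match_on_gov(gov_funding, vendor_match_ratio):
    """Total cost when the vendor adds a fraction of the GOVERNMENT contribution."""
    return gov_funding * (1.0 + vendor_match_ratio)


# DoE PathForward funding in millions of dollars, from the announcement.
doe_funding_musd = 258.0

# Reading 1: vendors pay 40 percent of the total project cost.
share_of_total = total_from_share_of_total(doe_funding_musd, 0.40)

# Reading 2: vendors add four dollars for every ten DoE dollars.
match_on_gov = total_from_match_on_gov(doe_funding_musd, 0.40)

print(f"Vendor pays 40% of total cost: ${share_of_total:.0f}M total")
print(f"Vendor adds $4 per $10 of DoE: ${match_on_gov:.0f}M total")
```

Only the first reading lands on the “at least $430 million” total in the DoE statement; the ten-to-four example would put the total closer to $361 million, which suggests it was meant as a rough illustration rather than the contract formula.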

“The Department of Energy plans to deliver at least two systems – at least one in 2021 and perhaps two or others later,” Messina explained. “The systems will be purchased not by the exascale project but the facilities that house the leading computing resources at our labs.”

We have been listening closely to any word from Messina because Argonne was supposed to be the site of an Intel-based supercomputer, called Aurora and based on the chip maker’s future “Knights Hill” Xeon Phi processor. A few weeks ago, we started hearing some high-credential rumors that there might be some architectural or timeline changes with Aurora. To this, Messina confirms, albeit with no real detail: “The Aurora system contract is being reviewed for potential changes that would result in a subsequent system in a different timeframe from the original system. But since these are early negotiations we can’t be more specific.”

We have also been tuned into what Messina says because, as lead of the Exascale Computing Project, it was his push to have one of the 2021-2022 target machines fall into the category of a novel architecture. As we noted a few weeks ago, the language at ECP has very recently shifted away from “novel” to the less exotic-sounding “advanced,” and as we can see from the R&D funding listed here, there is no company other than HPE with The Machine that has something that could fit into the “novel” category. When the novel architecture emphasis first arose, we projected that the most conservative fit (meaning manufacturable at scale as well as broadly programmable) would be The Machine, and we might be on the right track considering the only truly novel, producible architecture – quantum computing via the D-Wave systems – is not on the list of R&D investments.

Having this research and development funding on the front-end is good news for the exascale effort, but the question is really whether this is all too little and, more importantly, too late. We are in the middle of 2017 and while this funding is generous, if it takes three years to research and develop and another two years to productize, well, you see the problem. That’s 2022 to get it installed and up to another year for full production.

Meanwhile, in China and Japan, for instance, there is a very clear extreme scale computing roadmap in terms of both applications and architectures. These countries set their minds to particular sets of applications and built the architecture around those with what appears to be extraordinary focus. The fact that roadmaps for exascale exist that are not altered frequently speaks volumes about a sense of direction. The DoE has produced its own exascale roadmaps, but the timelines for big supercomputer projects shift, as do the architectures tied to them. For instance, IBM pulled the plug on its Power7-based Blue Waters system and Cray won a deal to build a hybrid CPU-GPU system at the National Center for Supercomputing Applications several years back.

What this R&D funding means, coming as it does in mid-2017, is that there is still a lack of clarity about architectures for systems with definitive timelines attached. This is not a negative statement; workloads are changing with the introduction of machine learning (which the Japanese are building around with their next-gen AI supercomputer designed for both deep learning and traditional simulations) and it is important to get it right.

The PathForward funding will give the six vendors an opportunity to dedicate more resources to their next generation architectures with sound guidance from the HPC application and systems teams that will use them. The risk, as it would be anyway, is that these architectures fail to be a fit, and it is on the vendors to bear more of the burden for that system development than they might have in the past (read this for historical perspective of funding trickle-down).

Here’s another way to think about this R&D investment. That is $258 million in funding across six companies to develop something that might eventually be able to stand in as the correct architecture. We were told this year it took $3 billion for Nvidia to develop its “Volta” GPU. Other architectures require similar billion-dollar scale investments. Unlike government investments in days gone by (those made with DARPA and DoE funds), the actual manufacture and rollout of the products generated from this R&D front-end provided by PathForward will fall to the vendors.

“The work funded by PathForward will include development of innovative memory architectures, higher-speed interconnects, improved reliability systems, and approaches for increasing computing power without prohibitive increases in energy demand,” Messina says. Of the vendors, he adds that it is essential they “play a role in this work going forward.”