The battle for HPC centers and national labs is underway among the leading AI chip startups in the high-end datacenter space (Graphcore, Cerebras, and SambaNova in particular).
We talked last week about how Graphcore is stacking up in traditional supercomputing at the University of Bristol. SambaNova has found inroads at the Argonne and Lawrence Livermore (LLNL) national labs in the U.S., and Cerebras has also become established at LLNL. In addition to those recent wins, Cerebras has crossed the Atlantic for its first European win at EPCC in Edinburgh.
Despite this tit-for-tat in HPC site wins among the AI chip triumvirate, the jury is still out on which architecture will be favored to meet the needs of deep learning in supercomputing. There are enough systems distributed around the globe now, however, that within a year we might start to have a sense, especially since EPCC will be able to compare its Cerebras experiments with its neighbors' work on Graphcore. The grand point of comparison is how each stacks up against the GPU in terms of price, performance, power, and usability, but that battle royale will take much longer to shake out.
EPCC will be pairing its forthcoming water-cooled Cerebras CS-1 system with a small fleet of HPE SuperDome Flex servers, which are themselves just one part of a ten-year, £100m contract the center signed with HPE. Since much of EPCC’s mission is focused on coupling data-intensive computing and large-scale analytics with traditional supercomputing, these big memory systems are already a staple. Now, however, they’ll be connected to the Cerebras waferscale system over 12 Ethernet links, each running at 100Gb/s (1.2 terabits/s on the network).
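To make the network figure concrete, here is a minimal back-of-the-envelope sketch of the arithmetic behind the 1.2 terabits/s number. It assumes peak line rate on every link and ignores protocol overhead; the 1,500 GB dataset size and the function names are purely illustrative and do not come from EPCC or Cerebras.

```python
# Back-of-the-envelope check of the aggregate link bandwidth quoted above.
# Assumption: every link runs at its full 100 Gb/s line rate with no
# protocol overhead, so these are upper bounds rather than measured numbers.

LINKS = 12             # Ethernet links between the servers and the CS-1
GBITS_PER_LINK = 100   # per-link line rate in gigabits per second


def aggregate_bandwidth_gbits(links: int, gbits_per_link: int) -> int:
    """Total peak bandwidth across all links, in gigabits per second."""
    return links * gbits_per_link


total_gbits = aggregate_bandwidth_gbits(LINKS, GBITS_PER_LINK)
total_tbits = total_gbits / 1000   # gigabits/s -> terabits/s
total_gbytes = total_gbits / 8     # gigabits/s -> gigabytes/s


def transfer_seconds(dataset_gbytes: float, gbytes_per_s: float = total_gbytes) -> float:
    """Idealized time to stream a dataset of the given size over the links."""
    return dataset_gbytes / gbytes_per_s


if __name__ == "__main__":
    print(f"Aggregate: {total_gbits} Gb/s = {total_tbits} Tb/s = {total_gbytes} GB/s")
    # Hypothetical 1,500 GB in-memory dataset, chosen only for illustration.
    print(f"Streaming 1,500 GB takes about {transfer_seconds(1500)} s at peak rate")
```

In other words, the twelve links together could move at most roughly 150 gigabytes per second from server memory to the wafer, which is why pairing the CS-1 with big memory systems matters for data-intensive work.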
Mark Parsons, EPCC director, tells us that in their evaluations of the CS-1, the capabilities they would get for the price put Cerebras on par with Nvidia GPUs, which the center has a nice share of on other systems. In fact, he says that when it comes to the complexities of dealing with vast numbers of GPUs on the network, having a single system that can do both AI and data analysis in a lower power envelope speaks volumes for where EPCC might look down the road.
“If I look at the complexity of running large clusters of GPUs and their power requirements, that’s going to come under more scrutiny in the next five or so years,” Parsons says. “That’s why I think the large waferscale-type providers are going to succeed; around the lower energy costs especially. I don’t see us buying hundreds of GPUs but rather a set of these types of AI accelerator platforms.”
Other than GPUs, EPCC has not taken the leap, even experimentally, with any of the other AI chip startups that have gained some traction in HPC-oriented labs and research centers. “There’s activity in the UK around Graphcore already, but parts of the funding I got for this was to do something different, to let us compare and contrast,” Parsons says. He adds that availability in a final form, one that could simply be bought, delivered, and plugged in ready to run, was another part of the decision process. And he’s right; it has been quick. He placed the order in December, and by March the machine will be delivered and installed.
Of course, a machine on the floor is still a way off from broad usability, at least for some groups. It’s no simple task to get HPC codes to speak TensorFlow or to be retrofitted into other AI frameworks, which the Cerebras system needs from a software perspective. While Parsons was careful to point out that EPCC’s overriding goal is to bring the worlds of data science and traditional HPC together, the first applications on the Cerebras machine are likely to be in areas where AI frameworks, Spark, and traditional data science are already up and running. He points to Cerebras work on BERT as attractive for the NLP groups at EPCC (something the center has been known for over the last decade) and for some humanities work in large-scale text analysis.
Parsons says he noticed Cerebras following its integration with another HPE SuperDome Flex system at the Pittsburgh Supercomputing Center. Interestingly, EPCC and Pittsburgh have similar interests infrastructure-wise. Both have been at the leading edge of marrying “big data” and analytics with traditional supercomputing.
Parsons and his team talked with HPE about building something similar for EPCC. He says he was excited about the opportunity to put a large number of network cards in the SuperDome Flex system, so that large volumes of data could sit in memory with fast disk attached and be moved quickly, an important element for data-intensive HPC and large-scale analytics. He adds that the attractive points of Cerebras were its scalability and its data movement capabilities when matched with SuperDome. “The fact that as a company, they seemed to understand that large-scale deep learning is about pushing data quickly to the processors and acting on it. I liked that and it’s also a much simpler way of doing that for large-scale problems rather than dealing with multiple GPUs on a network.”
We’ve published several pieces about the Cerebras architecture and approach here. As the company’s CEO and co-founder Andrew Feldman claims, the CS-1, which is built around the world’s largest processor (the Wafer Scale Engine), “is 56 times larger, has 54 times more cores, 450 times more on-chip memory, 5,788 times more memory bandwidth and 20,833 times more fabric bandwidth than the leading graphics processing unit (GPU) competitor.”
“We are excited to bring our industry-leading CS-1 AI supercomputer, coupled with HPE’s advanced memory server, to EPCC and the European market to help solve some of today’s most urgent problems,” Feldman says. “Our vision with the CS-1 was to reduce the cost of curiosity, and we look forward to the myriad experiments and world-changing solutions that will emerge from EPCC’s regional datacenter.”