The San Diego Supercomputer Center (SDSC) is preparing to put nearly ten racks of Habana AI hardware on the floor, marking the first time we have seen the AI chip startup (acquired by Intel in 2019) in force at any major supercomputing site.
This means we can cautiously add Habana to the list of AI chip startups that have found footing at national labs and large research institutions. The most prominent startups at these locations include Cerebras, SambaNova, and to a lesser extent, Graphcore.
The “cautiously” qualifier comes from the nature of the procurement. As SDSC’s deputy director, Shawn Strande, tells us, the NSF funding for the system is not standard: it is not about putting a production machine on the floor to validate and use immediately. Rather, it’s about experimentation and building a scientific computing community around an emerging architecture. Since the other labs and universities have already picked their AI accelerators, that left Habana. And for a cool $5 million total, including all networking, there is no reason not to take a chance.
This is especially interesting for a supercomputing facility like SDSC, which serves a large and broad class of users across many scientific domains. SDSC’s approach to procuring and using HPC systems is all about “computing without borders.” In other words, rather than relying on a single monolithic architecture to serve all users, the center pairs systems that handle general purpose workloads with more targeted machines, meshing them together using containers and clouds.
The forthcoming “Voyager” supercomputer will have some unique features that go beyond its training hardware (based on the Gaudi architecture) and inference hardware (Goya chips). The system is being integrated by Supermicro for summer delivery and will be based on their X12 platform with the Gaudi AI training system, which has eight of the Gaudi HL-205 cards matched with dual-socket “Ice Lake” CPUs. The separate but connected Goya inference system will pair HL-100 PCIe cards with “Cascade Lake” CPUs. The software stack is Habana’s SynapseAI platform.
In total there will be 336 Gaudi processors for scaling training, but for Strande, what is most interesting about Habana’s architecture is the ten 100GbE ports of RoCE v2 RDMA integrated directly on the device, something that promises to get around some of the scalability bottlenecks users have hit with large training workloads. “The converged Ethernet in the Gaudi chip lets us experiment with scaling and we are interested in doing our work on a more open networking architecture like Ethernet.”
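For a rough sense of scale, the two figures quoted so far (336 Gaudi processors in total, eight HL-205 cards per training node) imply a training node count. The short Python sketch below only does that division; the function and constant names are ours for illustration, and it assumes every training node is fully populated with eight cards, which the article does not state explicitly.

```python
# Back-of-the-envelope sizing from the figures quoted in the article.
# Assumption (ours): every training node carries a full set of eight cards.

GAUDI_TOTAL = 336      # total Gaudi processors, as quoted
GAUDI_PER_NODE = 8     # HL-205 cards per training node, as quoted


def training_nodes(total_cards: int, cards_per_node: int) -> int:
    """Return the number of fully populated nodes implied by the card counts."""
    if cards_per_node <= 0:
        raise ValueError("cards_per_node must be positive")
    nodes, leftover = divmod(total_cards, cards_per_node)
    if leftover:
        # A remainder would mean the assumption of full nodes does not hold.
        raise ValueError(f"{leftover} cards do not fill a whole node")
    return nodes


nodes = training_nodes(GAUDI_TOTAL, GAUDI_PER_NODE)
print(f"Implied Gaudi training nodes: {nodes}")
```

Under that assumption the quoted numbers work out to 42 training nodes, which is in line with the "nearly ten racks" description.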
Storage and networking also take a rethink given the system capabilities. On the networking side, SDSC has stuck with Arista, its long-term partner on other HPC systems, for Voyager’s networking backbone. “We looked at options for networking but the bandwidth and low latency running Arista’s infrastructure means the networking out of an individual Gaudi node will be 400Gb using the large Arista core. We think the bandwidth and latency aspects will be important for this machine. We also look forward to exploring a more open interconnect platform and a high performance one as well with the 400Gb.”
Despite the networking certainty, storage is up in the air.
The team will deploy initially with Ceph, with the understanding that they will explore different options during the machine’s experimental phase, a long three years before they have to open it up for full production. This gives them a lot of wiggle room to try out new storage concepts that the mission-critical demands on large workhorses like the Expanse supercomputer would not allow. The Voyager architecture has NVMe on every node: training, inference, and compute. The Arista switch is hooked into their wider datacenter fabric and other storage systems, which bodes well for integrating Voyager’s capabilities into other systems and workflows in the future, something they plan to do if all goes well. In other words, it’s open season at SDSC for all of you storage startups promising cool capabilities with NVMe-oF and unique file systems that can handle the metadata messes that Lustre and other parallel file systems weren’t designed for.
This is an experiment we’ll continue to follow, especially since it’s the only Habana system in the wild at a large HPC center. Strande says that while the system isn’t designed for the double-precision requirements of HPC, the workloads they’ll tackle will be strictly AI/ML, and there are growing numbers of such applications in areas including astronomy, climate, chemistry, physics, and beyond.
And what could be better for Intel than having a fleet of HPC engineers test drive the Habana architecture and software stack? For $5 million total (imagine what the networking piece alone would cost) we can assume here at TNP that there was some serious generosity with this machine on Intel’s part. This happens in HPC; supercomputing sites don’t run like the enterprise and hyperscale businesses we also cover. But it would appear to us that Intel is eager not to be left behind as every other AI architecture is chosen as part of NSF, DoE, and other procurements, with all the PR, peer-reviewed benchmarks, and interest the others are generating.
This is not to say the architecture won’t be high performance for scientific computing AI workloads. We just haven’t seen a system like this emerge in this area before. Strande is confident. “From what we are seeing initially the performance of the architecture shows the system will do well for some applications that we currently have running on GPUs,” he says.