Businesses and government agencies are deploying AI at breakneck speeds. But in the rush to stay ahead, or at least not fall too far behind, organizations are having to confront some hard truths.
It is far too easy to get hung up on the pure compute side of AI – the various accelerators for training and inference. But the systems-level view of how those choices ripple through storage, networking, systems software, and the end applications often doesn't come up until performance problems rear their ugly heads.
In the spirit of this topic, we are hosting a jam-packed day on May 9 in San Jose to talk about these system-wide impacts of building AI infrastructure. The interview and panel lineup for The Next AI Platform event includes people who have deployed at scale (or who will share their evaluation process) from companies like Facebook, Baidu, and Google, as well as large organizations like NASA, NOAA, and the national labs, which have big scientific computing problems that could benefit from tighter integration of deep learning with HPC systems.
Among the live interviews we will be hosting around this topic is one with Cray's Per Nyberg, who tells us how difficult it has been for companies and research institutions to grapple with the dizzying array of technologies that make up artificial intelligence. And it's not just the number and diversity of technologies; it's how they interact with one another in a real, live system.
This dilemma even pertains to tech-savvy users in supercomputing labs, who are increasingly applying these technologies to their traditional scientific computing workflows. The fact that some of this AI componentry – especially GPU accelerators – has its origins in HPC doesn't mean these research labs have it all figured out. The challenge in this environment is that traditional HPC simulations and AI training and inference place different demands on the hardware. “These future systems need to cater to the unique requirements of each type of workload,” Nyberg told us.
Users who have dipped their toes into artificial intelligence by buying a few GPU accelerator cards and installing some standard libraries soon discover that AI is a systems problem that requires a balance of components up and down the hardware stack. That includes not just the processors and coprocessors, but also the memory, the system interconnect, and the storage subsystem. And that doesn’t even address the complexity of the software stack, nor the fact that HPC and AI stacks have bifurcated into separate silos.
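To make that balance concrete, consider a rough back-of-envelope check of whether storage can keep the accelerators fed. The sketch below is purely illustrative: the GPU count, per-GPU sample rate, sample size, and storage bandwidth figure are assumed numbers for the sake of the example, not measurements from any system Nyberg described.

# Rough back-of-envelope check (illustrative numbers only): can the storage
# subsystem keep a small GPU training cluster fed, or does it become the
# bottleneck before the accelerators do?

def required_ingest_gbps(num_gpus: int, samples_per_sec_per_gpu: float,
                         sample_size_mb: float) -> float:
    """Aggregate read bandwidth (GB/s) needed to keep all GPUs busy."""
    return num_gpus * samples_per_sec_per_gpu * sample_size_mb / 1000.0

# Hypothetical workload: 8 GPUs, each consuming 2,000 preprocessed
# samples per second at roughly 0.15 MB per sample.
needed = required_ingest_gbps(num_gpus=8, samples_per_sec_per_gpu=2000,
                              sample_size_mb=0.15)
storage_can_deliver = 1.5  # GB/s, an assumed figure for an existing filer

print(f"Required ingest bandwidth: {needed:.1f} GB/s")
print(f"Storage headroom: {storage_can_deliver - needed:+.1f} GB/s")

With these assumed numbers the cluster needs 2.4 GB/s of sustained reads against 1.5 GB/s of available bandwidth, which is exactly the kind of imbalance that leaves expensive accelerators sitting idle.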
If anything, the challenges are even more daunting in the commercial realm, where workloads are being refocused toward deep learning and machine learning. Part of the problem is that businesses require predictability, and the AI space is too young and fast-moving to provide it. Chipmakers and system vendors are rapidly expanding their product portfolios to support this application set, even as the software stacks evolve and diversify.
One of Nyberg’s customers, the head of the HPC group at a large investment bank, lamented all this uncertainty. “He said if you asked him three years ago what the next three years are going to be like, he was pretty certain he could predict that,” said Nyberg. “Today, asking himself the same question, he just doesn’t know.”
A lot of customers begin by trying to figure out what type of processor would be optimal for their AI work. It could be x86 CPUs, GPUs, the next custom AI ASIC, or maybe all of the above. But then they realize storage technology has to be factored in as well: do they need mostly large block I/O, or do they need to concentrate on IOPS? Likewise, they have to consider the kind of data movement the system needs, from the I/O buses all the way to the system interconnect.
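For readers who want to see what that block-I/O-versus-IOPS question looks like in practice, here is a minimal sketch contrasting the two access patterns against a throwaway local file. It is not how Cray or its customers evaluate storage; the file name and sizes are placeholders, and a real evaluation would use a purpose-built tool against the actual storage target.

# Crude illustration of bandwidth-bound (large sequential reads) versus
# IOPS-bound (many small random reads) access patterns.
import os, random, time

PATH = "training_shard.bin"    # hypothetical dataset shard
FILE_SIZE = 256 * 1024 * 1024  # 256 MB throwaway test file

with open(PATH, "wb") as f:
    f.write(os.urandom(FILE_SIZE))

def sequential_read(block_size=4 * 1024 * 1024):
    """Stream the file front to back in large blocks (bandwidth-style)."""
    start = time.perf_counter()
    with open(PATH, "rb") as f:
        while f.read(block_size):
            pass
    secs = time.perf_counter() - start
    return FILE_SIZE / secs / 1e6  # MB/s

def random_read(io_size=4096, count=20000):
    """Hit random 4 KB offsets (IOPS-style, like small-record lookups)."""
    start = time.perf_counter()
    with open(PATH, "rb") as f:
        for _ in range(count):
            f.seek(random.randrange(0, FILE_SIZE - io_size))
            f.read(io_size)
    secs = time.perf_counter() - start
    return count / secs  # operations per second

print(f"Sequential: {sequential_read():.0f} MB/s")
print(f"Random 4K:  {random_read():.0f} IOPS")
os.remove(PATH)

A training pipeline that streams large, preprocessed shards cares mostly about the first number; one that fetches many small records at random lives or dies by the second, and the same storage system can look very different under each.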
All of these challenges are becoming more conspicuous now, not because they didn’t exist before, but because a lot of organizations have moved from the proof-of-concept stage into production deployments. According to Nyberg, that process began in earnest last year and is now even more pronounced. That is when they realize their enterprise storage is not up to the task, or that their memory or system interconnect has become a bottleneck.
How these businesses react to these challenges depends a lot on whether they have in-house expertise in the relevant technologies (which is one reason some companies are poaching talent from supercomputing labs). If that expertise is missing, these users will sometimes try to run their AI work on existing IT gear, which, Nyberg noted, leads to varying degrees of success and disappointment.
Users who are new to HPC are often told that AI is defined by the processor, and they try to get a handle on that first. That in itself is a daunting task, given the dueling claims Intel, Nvidia, AMD, and Xilinx make for their silicon, as well as the promise of custom-built solutions from startups such as Graphcore, Wave Computing, and Habana. And even once they have settled on a processor, some other technology like 3D XPoint memory, NVMe over Fabrics, or CCIX comes along that promises AI nirvana.
According to Nyberg, when that happens, users don’t know how to evaluate these other technologies in context. Even with newer benchmarks like MLPerf and Deep500, we still don’t have good ways of measuring AI capability for whole systems. As a result, says Nyberg, customers don’t have the tools to make the needed comparisons.
We will delve far deeper into this and other topics at The Next AI Platform on May 9 in San Jose. This event will sell out, so register now to get a front row seat for live interviews, expert panels, great conversation, and more.