At The Next AI Platform event on May 9 in San Jose, we brought together a wide range of people leading the charge throughout the AI infrastructure stack.
We talked chips and accelerators, which tend to garner all the funding and headlines, but we took a systems-level view of how adding compute creates bottlenecks at scale. This meant devoting a solid chunk of the day to storage and I/O, which is talked about less frequently but matters more than ever.
As it turns out, this is also where some serious funding is being directed as the I/O starvation problem finds its way into the broader AI hardware conversation. For example, one startup in this area, WekaIO, just secured an additional $31.7 million to keep toiling in the file system, metadata, and storage trenches.
After all, we can have the fastest accelerators matched with the most optimized host processors, but if those cannot be fed, the ROI on those expensive investments in hardware and expertise diminishes quickly.
File system pioneer and current CTO at WekaIO, Andy Watson, pulled together points made throughout the day around performance and echoed this ROI theme. The interview below covers quite a bit of territory: GPU and accelerator starvation, the underlying file system and metadata constraints, the typical deployment patterns and installation sizes for AI infrastructure, and of course, implementing novel approaches to increase I/O capabilities.
“It’s not a matter of feeds and speeds. That’s not the end of the story, it’s the beginning. Because if a team is unable to complete their work because it has stalled, it’s a serious problem. It’s about being faster, giving fast, unlocked access to a large shared pool of data that allows training runs to complete much faster than they would otherwise.”
“It doesn’t matter how large the infrastructure investment is, it will be hard to host all data on sufficiently fast hardware up close and local to those compute devices,” Watson explains.
“At some point there has to be a mechanism to manage moving data around, which is something we built into our architecture. We have our fast flash layer and we tier into object storage. That’s a deliberate choice because object is proven over twenty years. We have heuristics in our software that make it transparent so the performance is the same. You’re not going to know whether the data is coming locally right off the flash or is being returned from the object store into the flash layer and presented as a file.”
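To make the tiering idea concrete, here is a minimal sketch of a two-tier store in which callers read by path and never learn which tier served the data. Everything in it is an assumption for illustration: the `TieredStore` class, its method names, and the least-recently-used eviction rule are hypothetical, and they are not WekaIO's software or its actual heuristics, which the interview does not detail.

```python
from collections import OrderedDict


class TieredStore:
    """Toy two-tier store: a small fast tier in front of a large object tier.

    Hypothetical illustration only; LRU eviction is an assumed policy.
    """

    def __init__(self, flash_capacity: int):
        self.flash_capacity = flash_capacity  # max number of files kept in the fast tier
        self.flash = OrderedDict()            # path -> bytes, least recently used first
        self.object_store = {}                # path -> bytes, stands in for object storage

    def write(self, path: str, data: bytes) -> None:
        # New data lands in the fast tier; drop any stale copy in the object tier.
        self.object_store.pop(path, None)
        self._admit(path, data)

    def read(self, path: str) -> bytes:
        # The caller gets bytes back either way and cannot tell which tier served them.
        if path in self.flash:
            self.flash.move_to_end(path)      # mark as recently used
            return self.flash[path]
        data = self.object_store.pop(path)    # KeyError if the file does not exist
        self._admit(path, data)               # promote back into the fast tier
        return data

    def _admit(self, path: str, data: bytes) -> None:
        self.flash[path] = data
        self.flash.move_to_end(path)
        # Demote the coldest files to the object tier when the fast tier is full.
        while len(self.flash) > self.flash_capacity:
            cold_path, cold_data = self.flash.popitem(last=False)
            self.object_store[cold_path] = cold_data


store = TieredStore(flash_capacity=2)
for name in ("a", "b", "c"):
    store.write(name, name.upper().encode())
payload = store.read("a")  # "a" was demoted earlier; the read promotes it transparently
```

The point of the sketch is the single `read` interface: the demotion and promotion happen behind it, which is the transparency Watson describes. A real system would also have to handle latency hiding, concurrency, and durability, none of which is modeled here.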
Watson provided a sense of scale and growth for AI deployments in research and enterprise (since many of the hyperscale companies implement their own similar strategies in-house). “We are installing at the largest genomics site in Europe where they have tens of petabytes but they are planning to grow to the hundreds. This will be typical of genomics sites in research,” Watson says.
In enterprise and broader AI/ML, he says there will be growth, but there are some limiting factors from a storage scalability standpoint (which, incidentally, have nothing to do with a scalability wall in infrastructure). “The scale there will be up to but probably not beyond 100 petabytes, and that is only because at a certain point, the examples become redundant and move into different categories.”
He says that one of the automotive companies WekaIO has had conversations with is looking to deploy AI-driven datacenters at the 100 petabyte scale but, like others, it will probably not extend beyond this point, even if the combination of their own file system and object stores allows exabyte potential.
WekaIO’s perspectives on growth and scalability trends are worth noting given their emphasis on large deployments that cut across a vast swath of big systems in research and enterprise. The interview is worth a watch/listen. Thanks to all who were there to see it live and help us piece together the big picture for AI infrastructure on May 9.