It is easy to focus on the specific accelerators and processors for deep learning training and inference and herald these as the enablers of what is next in AI. Doing so misses the point, however. Without a systems-level view that integrates the storage, network, and systems software pieces of the puzzle, servers stuffed with GPUs deliver relatively slim performance.
This systems-level view is the focus of The Next AI Platform event on May 9th in San Jose. Here we will focus on the various elements of emerging hardware stacks for AI, starting with those accelerators and processors but with an emphasis in the middle of the day on storage and I/O. As many who have put AI training into production quickly realized, I/O bottlenecks put a major strain on the expected performance of GPU-dense systems.
We recently discussed oversights in AI systems implementation with Andy Watson, who will join us during the storage and networking segment of the day. Many will recognize Watson’s name from some of his pioneering work in NFS as well as from his role as CTO at NetApp for over a decade. He is now CTO of WekaIO, which is carving out a niche in the enterprise as production AI workloads start to become the norm.
“Looking back over my career, NFS was more than sufficient to meet most requirements, but that’s all changing with GPU-accelerated workloads. The GPUs themselves have a much greater I/O appetite, especially in training,” Watson says. “The fundamental change is that the introduction of GPUs has intensified I/O requirements and the nature of AI research has increased dependency on I/O in the infrastructure.”
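To make that appetite concrete, here is a minimal, hypothetical sketch (ours, not Watson’s or WekaIO’s) that times how long a training loop spends waiting on data versus computing. If the waiting term dominates, the GPUs are starved no matter how fast they are. The dataset, batch size, and worker count are placeholder assumptions.

```python
# Hypothetical sketch (ours, not from the article): check whether the input
# pipeline, rather than the GPU, is the limiting factor during training.
import time

import torch
from torch.utils.data import DataLoader, Dataset


class SyntheticImages(Dataset):
    """Stand-in for a real training set read from shared or local storage."""

    def __init__(self, n=2048):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        # In a real workload this would be a file read plus a decode step.
        return torch.randn(3, 224, 224), idx % 1000


loader = DataLoader(SyntheticImages(), batch_size=64, num_workers=4)

load_time, step_time = 0.0, 0.0
t0 = time.perf_counter()
for images, labels in loader:
    t1 = time.perf_counter()
    load_time += t1 - t0            # time spent waiting on data
    _ = (images * 2).sum()          # placeholder for the forward/backward pass
    t0 = time.perf_counter()
    step_time += t0 - t1            # time spent computing

print(f"waiting on data: {load_time:.2f}s, computing: {step_time:.2f}s")
```

Swapping the synthetic dataset for one that actually reads files from shared or local storage is what exposes the I/O behavior Watson describes.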
All of the focus on compute and accelerators for AI training has led to a lack of systems-level thinking when it comes to building infrastructure for deep learning. We asked Watson what he sees lacking in terms of planning for AI infrastructure as it relates to I/O in particular. His answer is a nuanced one: the oversights are ultimately workflow-driven.
“Workflow is just as important as the application software that’s running; it’s the process by which you organize the data and feed it along, and all those steps along the way. Some organizations have tried NFS or parallel file systems like GPFS or Lustre and built into their workflow the notion that they have to package the data at various stages and copy it to local NVMe-based flash on the GPU servers.”
Watson says this may seem fast, since teams get local flash performance while making their runs, but copying data locally is an extra task that takes wall clock time. Even if the computational event itself runs fast because it is getting low-latency I/O to that local copy of the data, the job has to wait for that data to be copied first. WekaIO’s assertion is that there is no need to copy in the first place, which makes this a question of reconsidering workflow for peak performance.
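As a rough illustration of that workflow cost, the sketch below (our own, with placeholder paths rather than anything from WekaIO) times the staging copy that has to finish before training can even begin; it is pure wall-clock overhead relative to reading the data in place from a sufficiently fast shared file system.

```python
# Hypothetical illustration of the staging step described above: copying a
# dataset from shared storage to local NVMe before a run adds wall-clock time
# even if the run itself is faster afterward. Both paths are placeholders.
import shutil
import time
from pathlib import Path

SHARED = Path("/mnt/shared/dataset")   # e.g. an NFS, GPFS, or Lustre mount
LOCAL = Path("/nvme/scratch/dataset")  # local flash on the GPU server


def stage_to_local_flash():
    """The 'extra task' in the workflow: copy everything before training."""
    start = time.perf_counter()
    shutil.copytree(SHARED, LOCAL, dirs_exist_ok=True)
    return time.perf_counter() - start


copy_seconds = stage_to_local_flash()
print(f"staging copy took {copy_seconds:.1f}s before training even started")
# Total job time is copy_seconds plus the training time on local flash,
# versus just the training time if the shared file system can be read in place.
```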
Another important aspect of workflow is that all the copying leaves researchers and AI experts focused on “housekeeping” of the data: constantly reorganizing it in pursuit of good performance. “It is not uncommon for several applications to accumulate data on an ongoing, hour-by-hour basis, always ingesting more. Think of autonomous cars, for instance. There is a constant flow of data from fleets of cars; it comes in non-stop and accumulates. This leaves the researchers or some IT group with the task of reorganizing the data to make sense for their infrastructure and goals,” Watson explains.
“One of their goals is not to break the underlying file system they’re pouring all this data into, since most file systems can’t handle more than 100,000 files in the same directory without severe consequences, whether reduced performance or instability. So when we look at that aspect of workflow, there’s another big consideration: our approach does not come with any restrictions. With our software, we can put all the files in the same directory with no performance hit, scalability limits, or management challenges because of how we handle the metadata. We can put trillions of files in the same directory because of this fundamentally different storage architecture and file system.”
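Claims like that are easy to test against your own storage. A minimal, hypothetical probe along these lines (ours, not part of any vendor’s tooling) creates a large directory and times a listing, so you can see where a given file system starts to strain; adjust the file count and path with care before running it.

```python
# Hypothetical probe (not WekaIO code): create many small files in a single
# directory and time a full listing to see how a given file system behaves
# as the directory grows. Choose N and the path carefully before running.
import os
import time
from pathlib import Path

target = Path("/tmp/dir_scaling_test")  # placeholder location
target.mkdir(exist_ok=True)

N = 100_000  # try 10_000, 100_000, 1_000_000 and compare
for i in range(N):
    (target / f"sample_{i:07d}.dat").touch()

start = time.perf_counter()
entries = os.listdir(target)
elapsed = time.perf_counter() - start
print(f"listing {len(entries)} files took {elapsed:.2f}s")
```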
For a much more technical explanation of how WekaIO’s architecture functions, one that appears to support Watson’s claims about scalability within a single directory, there is a detailed piece here.
There are other considerations on the AI I/O side as well, including how much hardware to buy, how much of the data is kept, and where the rest is shuttled off to, especially with data sets ranging into the petabytes. None of that is cheap with all-flash, which explains why so many startups in the AI I/O arena are looking at a more object storage-oriented approach that resides on hard disk, with a lower cost per gigabyte and lower deployment demands.
The argument for a unified namespace, a single file system that handles moving data around so it is always at the ready and remains transparent to the application, is a good one. But what pain points will early adopters of AI infrastructure at any scale have to hit before they seek new ways to tackle I/O and get full performance from their doubtlessly expensive AI hardware investments?
We will be talking through the unexpected roadblocks in building AI infrastructure from a systems-level point of view with Watson and many others during our jam-packed day in San Jose on May 9th.
We are almost sold out, so register now to get an in-depth view into what the next AI platforms will look like, and find time to chat with Andy at our after-hours event, kindly sponsored by Lambda Labs.