Given our focus on the systems-level of AI machine building, storage was a big topic of discussion at the sold-out Next AI Platform event we hosted in May.
We could hardly leave out how NVMe over Fabrics and other trends are fitting into AI training systems in particular, so we asked distributed NVM-Express flash storage upstart Excelero, a pioneer in creating pools of flash storage that look and behave as if they were directly attached storage for a server’s applications, what it is about AI workloads that makes storage such a challenge.
The answer, according to Josh Goldenhar, vice president of products at Excelero, is revealed in some basic feeds and speeds that show the imbalance in many machine learning systems, either for training or inference.
“The answer is pretty straightforward,” explains Goldenhar. “If you look at the specs of the latest Nvidia DGX-2 and take the aggregate performance across the cards, the cards themselves can directly process data from their memory at 14 TB/sec, which is an amazing number. But even though that is an amazingly huge number, we have to account for how the cards are hooked into NVLink and PCI-Express, and all of that is x16 to the server. When you add all of that up, it is actually around 256 GB/sec, which is a considerably lower number. But that is still so much more than the storage that has been put into the box can deliver.”
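The arithmetic behind those two numbers is worth checking. A rough sketch, using the publicly listed DGX-2 figures of sixteen Tesla V100 GPUs with roughly 900 GB/sec of HBM2 bandwidth each, and a PCI-Express 3.0 x16 link (about 16 GB/sec) per GPU into the server:

```python
# Back-of-the-envelope check of the DGX-2 numbers Goldenhar cites.
# Figures are approximate, taken from public DGX-2 specs.

NUM_GPUS = 16        # Tesla V100 GPUs in a DGX-2
HBM2_BW_GBS = 900    # ~900 GB/sec of HBM2 bandwidth per V100

# Aggregate on-card memory bandwidth: the "amazing" ~14.4 TB/sec number.
aggregate_hbm_tbs = NUM_GPUS * HBM2_BW_GBS / 1000
print(f"Aggregate HBM bandwidth: {aggregate_hbm_tbs:.1f} TB/sec")

# Each GPU reaches the host over a PCI-Express 3.0 x16 link (~16 GB/sec),
# so the whole GPU complex can pull roughly 256 GB/sec from the server.
PCIE3_X16_GBS = 16
aggregate_pcie_gbs = NUM_GPUS * PCIE3_X16_GBS
print(f"Aggregate PCI-Express ingest: {aggregate_pcie_gbs} GB/sec")
```

That 256 GB/sec figure is the ceiling that matters for storage: no matter how fast the HBM2 is, data from outside the GPU complex has to squeeze through the PCI-Express links.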
This is a funny thing, considering that the convolutional neural network flavor of machine learning was only fully realized, and started actually working, when high memory bandwidth, the massive amount of parallel compute on a GPU, and the availability of large datasets on which to train combined to make training converge to answers that offer 95 percent or higher accuracy for many different types of identification work. Compute has kept up, memory bandwidth has sort of kept up, but storage and networking have lagged behind, says Goldenhar. Now, though, storage, thanks to NVM-Express, and networking, thanks to many breakthroughs in signaling and encoding, have finally caught up.
But the early AI hardware designs making use of NVM-Express were not quite right, says Goldenhar. The first instinct with NVM-Express was to directly attach flash to the compute complex to get the most bandwidth and the lowest latency, and to really start to be able to use all of the IOPS available from flash.
“You get the most bandwidth for training and you get the lowest latency for inference,” as Goldenhar puts it. “But the problem with this approach, if you take the DGX-2, is that Nvidia only put eight flash drives in it, and for reads that is at most 25 GB/sec. We just said that the GPUs would love to pull 256 GB/sec in burst mode, but the funny thing is that there is enough bandwidth in that box to do 96 GB/sec over the network. This forces you to want to go for disaggregation with NVM-Express over Fabrics.”
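The mismatch Goldenhar describes can be tallied the same way. A sketch, assuming a per-drive sequential read rate of about 3.1 GB/sec (chosen to match the ~25 GB/sec aggregate he cites) and eight 100 Gb/sec network adapters in the box:

```python
# The local-flash bottleneck in a DGX-2, in rough numbers.
NUM_NVME_DRIVES = 8      # NVM-Express drives Nvidia put in the box
DRIVE_READ_GBS = 3.125   # assumed per-drive read rate, to match ~25 GB/sec total

local_flash_gbs = NUM_NVME_DRIVES * DRIVE_READ_GBS
print(f"Local NVMe reads: ~{local_flash_gbs:.0f} GB/sec")

# Meanwhile the box has eight 100 Gb/sec adapters, for a ~96 GB/sec
# network path (assuming ~12 GB/sec of usable bandwidth per adapter).
NUM_NICS = 8
NIC_GBS = 12.0
network_gbs = NUM_NICS * NIC_GBS
print(f"Network path:     ~{network_gbs:.0f} GB/sec")
```

In other words, the network out of the box can carry nearly four times what the local drives can read, which is the argument for reaching over the fabric to a disaggregated flash pool instead of relying on the drives inside the chassis.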
There are other reasons to want to do NVM-Express over Fabrics, too, such as to avoid copying data or to work with datasets larger than a typical machine learning node – tens of terabytes at best – can hold. For instance, a medical image can weigh in at 1 TB, so a single box does not hold very many images to train against. And with a GPU farm where training takes place, you don’t want to have to copy images to each of the machines for them to do their training runs; you just want to give the machines access to a shared pool where the images reside.
But what about hotspots in the data? The trick is to have an architecture, as the NVMesh from Excelero does, that tries to keep hotspots from forming a bottleneck in the first place.
“We believe that you have to go for distribution not just at the backend, but at the front end,” Goldenhar explains.
“In traditional NVM-Express over Fabrics today, it is deployed just like all-flash arrays. There are NVM-Express drives and they sit in a particular target, and with the NVM-Express over Fabrics standard, you can have one namespace that can be shared. But that namespace is physically limited to one physical box. If you go ahead and distribute that further, you start doing what a proxy does: you take data in, but you are limited by the processing power of that box, so you go ahead and make the requests through the boxes that are around it. That adds latency and it adds network traffic. The GPUs themselves don’t have the knowledge of where they could directly get the data.

“So we believe you have to go to a distributed client model, where the block driver or the file system driver – whatever is the closest access point from where the GPU is actually requesting the data – has the intelligence to ask for that data directly from the store that actually has it. You avoid all of the hotspots because you are not directing all of the traffic to any one server. Your backend is fully distributed, and the front-end intelligence knows how to directly fetch that data.”
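The routing difference Goldenhar is drawing can be sketched in a few lines. This is a conceptual illustration only, not Excelero’s NVMesh code; the target names, the modulo placement scheme, and the read functions are all invented for the example:

```python
# Conceptual sketch of the two access models Goldenhar contrasts.
# Everything here (target names, modulo placement) is hypothetical;
# it illustrates the routing difference, not NVMesh itself.

TARGETS = ["storage-node-0", "storage-node-1", "storage-node-2", "storage-node-3"]

def owner_of(block_addr: int) -> str:
    """Deterministic placement: every client can compute which
    target holds a given block without asking an intermediary."""
    return TARGETS[block_addr % len(TARGETS)]

def proxied_read(entry_node: str, block_addr: int) -> list:
    """Proxy model: every request lands on one entry node, which
    forwards it to the owner -- an extra hop, and a hotspot on the
    entry node, since all traffic funnels through it."""
    hops = [entry_node]
    if entry_node != owner_of(block_addr):
        hops.append(owner_of(block_addr))
    return hops

def distributed_client_read(block_addr: int) -> list:
    """Distributed client model: the driver on the GPU server
    computes the owner itself and fetches directly -- one hop,
    with traffic spread across all targets."""
    return [owner_of(block_addr)]

# Reads for blocks 0..7: the proxy model funnels every request through
# storage-node-0, while the distributed client spreads the load evenly.
for addr in range(8):
    print(addr, proxied_read("storage-node-0", addr), distributed_client_read(addr))
```

The point of the sketch is that the placement function lives in the client-side driver, so no single box has to mediate, which is why hotspots do not form at the front end.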
So now the question is, will flash – or rather, non-volatile storage in general – eat most of Tier 0 and Tier 1 storage, and will NVM-Express over Fabrics be the dominant way that this storage is deployed? You will have to watch the full interview to find out what Goldenhar has to say about that.