Nvidia has unveiled GPUDirect Storage, a new capability that enables its GPUs to talk directly with NVM-Express storage. The technology uses GPUDirect’s RDMA facility to transfer data from flash storage into the GPU’s local memory without needing to involve the host CPU and system memory. The move is part of the company’s strategy to expand its reach in data science and machine learning applications.
If successful, Nvidia could largely edge out CPUs from yet another fast-growing application area. The data science and machine learning market for servers is thought to be a $20 billion to $25 billion per year opportunity, which is about the same size as the combined HPC and deep learning server market. Essentially, Nvidia is looking to double its application footprint in the datacenter.
That expansion strategy began in earnest last October, when it introduced RAPIDS, a suite of open source tools and libraries for supporting GPU-powered analytics and machine learning. In a nutshell, RAPIDS added support for GPU acceleration in Apache Arrow, Spark, and other elements of the data science toolchain. It was designed to bring GPUs into the more traditional world of big data enterprise applications, which up until now has been dominated by CPU-based clusters using things like Hadoop and MapReduce.
According to Josh Patterson, Nvidia’s new general manager of data science, RAPIDS encompassed all of machine learning, both supervised and unsupervised, as well as data processing. That was met with some skepticism from the traditional enterprise crowd. “I think the data processing part was what caught people off guard,” Patterson tells The Next Platform.
But the fact is that GPUs are getting bigger, better connected, and, from an application standpoint, more versatile. And at the same time, data analytics is getting more complex, with machine learning pipelines often integrated into the workflow. Applications using terabytes of data and needing petaflops of compute are becoming ever-more common.
All of which requires scalable infrastructure. With NVLink and NVSwitch, an array of connected GPUs can behave as one giant computational accelerator, sharing all the local memory between them and providing memory atomics for applications. The technology was originally conceived for the DGX architecture, which was primarily aimed at tackling training on bigger, more complex neural networks. Patterson says the idea was that these giant virtual GPUs could also work for big data. But one of the missing pieces was a fast data path to storage.
Typically, on GPU-accelerated systems, all I/O goes through the host – the CPU and system memory – before it gets shunted to the GPU’s smaller local memory. The CPU usually accomplishes this via a “bounce buffer,” an area in system memory where copies of the data are held before it’s transferred to the GPU. That kind of indirection is undesirable since it introduces extra latency and memory copies, reducing the performance of application code running on the GPU and using up CPU cycles on the host. That’s the problem GPUDirect Storage is intended to solve.
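To make the difference between the two data paths concrete, here is a minimal toy model in Python. It is only a sketch under loud assumptions: ordinary byte buffers stand in for flash storage, host system memory, and GPU memory, and the function names (`load_via_bounce_buffer`, `load_direct`) are hypothetical, not part of any Nvidia API. No real GPU, DMA engine, or NVM-Express device is involved. It simply counts how many memory copies each path needs.

```python
# Toy model only: plain Python byte buffers stand in for flash storage,
# host system memory, and GPU memory. No real GPU or NVMe device is used,
# and these function names are illustrative, not an Nvidia API.

def copy(src, dst):
    """Copy src into the start of dst and report one memory copy."""
    dst[:len(src)] = src
    return 1


def load_via_bounce_buffer(storage, host_memory, gpu_memory):
    """Conventional path: storage -> bounce buffer in host memory -> GPU memory."""
    copies = copy(storage, host_memory)  # data is staged in system memory first
    copies += copy(host_memory[:len(storage)], gpu_memory)  # second hop to the GPU
    return copies


def load_direct(storage, gpu_memory):
    """Direct path: storage -> GPU memory, with no staging in host memory."""
    return copy(storage, gpu_memory)


storage = b"rows,of,raw,data"
host_memory = bytearray(len(storage))
gpu_a = bytearray(len(storage))
gpu_b = bytearray(len(storage))

bounce_copies = load_via_bounce_buffer(storage, host_memory, gpu_a)
direct_copies = load_direct(storage, gpu_b)
print("copies via bounce buffer:", bounce_copies)
print("copies via direct path:", direct_copies)
```

Both paths leave the same bytes in the simulated GPU buffer. The model shows only the extra hop through host memory that the bounce buffer adds, not the latency or CPU cost of that hop.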
Nvidia is claiming a 50 percent boost in I/O bandwidth by using the technology, with latencies up to 3.8X lower. For remote NVM-Express over Fabrics storage, which can house petabytes of data in a shared pool of interconnected storage servers that are in turn connected to compute servers, the GPU-maker claims the technology provides even faster access than page caching from system memory. And if you have a DGX-2 system with 16 GPUs and 1.5 TB of main memory on the host, GPUDirect Storage will increase throughput by 8X, compared to an unoptimized version. That’s because all the I/O bandwidth potential of a DGX-2, which is in the neighborhood of 200 GB/second, can now be tapped. (The bandwidth from host memory to the GPUs on a DGX-2 tops out at 50 GB/second.) The extra speed will be relevant to all sorts of data analytics workloads that are I/O bound, as well as file-heavy applications like deep learning training and graph analytics. Traditional HPC could also see a benefit.
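As a back-of-the-envelope illustration of the two DGX-2 bandwidth figures quoted above, the short Python calculation below estimates how long it would take to move a dataset at each rate. The 1 TB dataset size is a hypothetical chosen for round numbers. Both bandwidths are the peak figures cited in this article, so real transfers would take longer. Note that the ratio of the two peaks is a separate figure from the 8X claim, which Nvidia measures against an unoptimized version.

```python
# Rough arithmetic only. The dataset size is an assumed, illustrative value;
# the two bandwidths are the peak DGX-2 figures quoted in the article.

def transfer_seconds(dataset_gb, bandwidth_gb_per_s):
    """Idealized transfer time: data size divided by sustained bandwidth."""
    return dataset_gb / bandwidth_gb_per_s


DATASET_GB = 1000.0            # hypothetical 1 TB dataset (decimal units)
HOST_PATH_GB_PER_S = 50.0      # host memory to GPUs, per the article
DIRECT_PATH_GB_PER_S = 200.0   # aggregate I/O potential, per the article

host_path_s = transfer_seconds(DATASET_GB, HOST_PATH_GB_PER_S)
direct_path_s = transfer_seconds(DATASET_GB, DIRECT_PATH_GB_PER_S)
peak_ratio = host_path_s / direct_path_s

print(f"through host memory: {host_path_s:.0f} s")
print(f"direct to GPU memory: {direct_path_s:.0f} s")
print(f"ratio of peak bandwidths: {peak_ratio:.0f}X")
```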
The fact that raw data can be directly loaded from storage into GPU memory means the graphics processor can also be used for decompressing and decoding the files, relieving the CPU of these mundane tasks. Patterson said that as of today the technology supports a number of file formats commonly used for data analytics, including CSV, Parquet, Avro, ORC, and JSON. Support for other formats like XML, HDF5, and Zarr is coming in future releases, he added.
The two basic technologies that make all this possible are Remote Direct Memory Access (RDMA) and NVM-Express (NVMe), specifically NVM-Express over Fabrics (NVMe-oF). RDMA is encapsulated in GPUDirect’s protocol and implemented in a variety of network adapters, including Mellanox NICs. As we alluded to previously, this can be used to scale up GPU systems, DGX-style, using NVLink and/or NVSwitch.
The technology works for both locally attached NVM-Express and remote storage, a la NVMe-oF. Patterson notes that at this point networking support is confined to InfiniBand, but support for RDMA over Ethernet (RoCE) is in the works, which presumably would open up support for networking gear from other vendors besides Mellanox Technologies.
GPUDirect Storage is not yet generally available. A closed alpha starts next week for select customers, with an open beta release planned for November. Patterson says general availability is slated for early 2020, at which point it will be just another component of the CUDA toolkit. In his first big decision as general manager, Patterson wanted to get the word out far enough in advance that customers considering deploying GPU infrastructure can factor this new capability into their plans.
“I sometimes think we wait too long to tell people where we’re going,” he explained. “I think enterprises don’t move as fast as we think they do, and we want to future-proof them now.”
In the meantime, Nvidia has to continue to convince the industry it’s serious about data analytics. That means engaging with more storage partners and analytics software-makers so they can connect the dots with GPUDirect Storage. Patterson said the company intends to follow the same strategy it used to bring HPC and deep learning users into the GPU fold, which means lots of user outreach and shrink-wrapped software offerings – open source where applicable but keeping the proprietary bits inside CUDA.
If all goes according to plan, 2020 could be another breakout year for Nvidia GPUs, even without the benefit of new hardware. “We really think this is going to fundamentally change how people do data science,” says Patterson.