Darrin P. Johnson, Director of Technical Marketing, NVIDIA
The resurgence of innovative techniques like artificial intelligence (AI) and deep learning in the enterprise space is enabling today’s businesses to use data to uncover deeper insights, derive actionable intelligence, and drive competitive advantage. However, the arrival of faster and more powerful hardware, specialized processors, and big data is causing traditional storage architectures to become a critical performance bottleneck in the process of training deep learning models. Next-generation storage solutions from Hewlett Packard Enterprise (HPE) and their network of partners are capable of supporting the storage requirements for these data-intensive workloads, enabling enterprises to train algorithms faster and arrive at data-driven insights more quickly than ever before.
Deep learning is rapidly being adopted among enterprises as they increasingly realize the economic and social benefits of teaching machines to perform a variety of tasks that used to be exclusively done by human beings. While the basic concepts of deep learning aren’t new to the industry, the powerful computing capabilities made possible by NVIDIA graphics processing units (GPUs) and the arrival of plentiful datasets have caused enterprises across virtually every industry to quickly build out their infrastructures to support this new wave of computing.
Before deep learning models can begin to actually learn on their own based on observing large datasets, the neural networks must first be trained, a process which is typically done using powerful GPU clusters. NVIDIA GPU technologies have become absolutely vital to this process, because GPU-based servers allow the model to tackle many tasks in parallel, instead of performing operations serially, or one after another. This allows the training process to be completed in much less time, which helps data scientists more quickly arrive at insights they need to drive breakthroughs and solve the grand challenges plaguing modern society.
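The serial-versus-parallel distinction above can be sketched with a short Python example. This is purely illustrative: CPU threads stand in for GPU parallelism, and the simulated batch is a hypothetical unit of training work, not any actual NVIDIA or HPE API.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def process_batch(batch_id):
    # Stand-in for one unit of training work (e.g., a mini-batch pass).
    time.sleep(0.1)
    return batch_id

batches = list(range(8))

# Serial: each batch waits for the previous one to finish.
start = time.perf_counter()
serial_results = [process_batch(b) for b in batches]
serial_time = time.perf_counter() - start

# Parallel: batches are processed concurrently, analogous to a GPU
# executing many operations at once instead of one after another.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    parallel_results = list(pool.map(process_batch, batches))
parallel_time = time.perf_counter() - start

print(f"serial:   {serial_time:.2f}s")
print(f"parallel: {parallel_time:.2f}s")
```

Both runs produce the same results; the parallel version simply completes in a fraction of the wall-clock time, which is exactly why GPU-based servers shorten training so dramatically.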
The quality of an AI training model depends on the size and accuracy of the training data: the more data processed, the more accurate the model will be. NVIDIA GPUs offer massive computational throughput and can ingest immense datasets before reaching their saturation point. The combination of these two factors places substantial demands on the storage architecture. Traditional storage systems feeding the GPU servers can be too slow, or have insufficient throughput, to keep pace with the GPUs, resulting in poor utilization of GPU resources. To avoid GPU data starvation during this stage of training, a high-performance storage system that supports multiple parallel data paths to the GPU node is required.
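A back-of-envelope calculation makes the starvation risk concrete. All figures below are hypothetical placeholders chosen for illustration, not measured numbers from any HPE, NVIDIA, or WekaIO system:

```python
# Can the storage layer keep a GPU node fed? (hypothetical figures)
gpus = 8
consume_rate_gb_s = 3.0        # assumed ingest rate per GPU, in GB/s
storage_bandwidth_gb_s = 6.5   # assumed storage bandwidth to the node

demand = gpus * consume_rate_gb_s
utilization_cap = min(1.0, storage_bandwidth_gb_s / demand)
print(f"demand: {demand:.1f} GB/s")
print(f"GPU utilization cap: {utilization_cap:.0%}")
```

Under these assumptions the node demands 24 GB/s but receives 6.5 GB/s, capping GPU utilization at roughly a quarter; the expensive accelerators spend most of their time idle, waiting on I/O.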
This video from leading software provider WekaIO illustrates the bottlenecks that can occur in computing systems attempting to train models with legacy serial storage access. The video showcases I/O bottlenecks with respect to CPUs; with GPUs, the problem is amplified dramatically by their parallel nature. Extending the video's analogy, picture thousands of trains arriving simultaneously.
HPE is already recognized within the industry for offering a variety of compute innovations and deep expertise to help enterprises succeed with deep learning. Through deep collaboration with NVIDIA and WekaIO, HPE is expanding their AI portfolio with new solutions that include AI-ready infrastructure driving supercharged performance for deep learning modeling. Beginning in May 2018, HPE will offer WekaIO Matrix™ software in conjunction with their industry-leading deep learning servers, delivering integrated flash-based parallel file system capabilities that will allow customers to scale storage capacity and performance to new levels and significantly accelerate their compute-intensive workloads.
WekaIO Matrix is an NVMe flash-optimized parallel file system that offers the performance and scalability to eliminate the I/O bottlenecks that can occur as customers dramatically increase the compute horsepower and fabric bandwidth within their HPC and AI environments. WekaIO software turns any server into a high-performance, scale-out pool of storage that can be shared across all applications and provide data as quickly as is required for large GPU clusters processing massive datasets. NVIDIA GPUs are known within the industry for being the driving force behind today’s most data-intensive AI workloads, and WekaIO’s software ensures these GPUs can be used to their maximum potential and never left idling.
The WekaIO Matrix software offers the following benefits to accelerate deep learning training:
- Proven performance – Over 6.5 GB/s of bandwidth to a single GPU node, more than six times the throughput of NFS and twice that of a local NVMe SSD.
- Shared data access – Shared access to data via POSIX, NFS, or SMB, allowing many worker nodes to access training data simultaneously without negatively impacting performance.
- Simple installation and management – WekaIO’s Trinity™ management software streamlines system installation and configuration, offers day-to-day system monitoring and management, and provides an analytics platform for long-term planning.
- Automatic data tiering – Training data is automatically placed onto a high-performance “hot tier” for maximum bandwidth to the GPUs, while the training catalog can be stored in lower-cost, disk-based object storage. The software automatically feeds in more data from the object store when needed.
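The tiering behavior described in the last bullet can be sketched as a small cache in front of a slower backing store. This is a minimal LRU-style illustration of the general idea, not WekaIO Matrix's actual tiering algorithm; the `HotTier` class and shard names are invented for this example.

```python
from collections import OrderedDict

class HotTier:
    """Minimal LRU-style hot tier in front of a slower object store.
    Illustrative sketch only, not WekaIO's actual implementation."""

    def __init__(self, capacity, object_store):
        self.capacity = capacity
        self.object_store = object_store  # slower, cheaper backing tier
        self.cache = OrderedDict()        # fast flash tier

    def read(self, key):
        if key in self.cache:             # hot: served from flash
            self.cache.move_to_end(key)
            return self.cache[key]
        data = self.object_store[key]     # cold: fetched from object store
        self.cache[key] = data            # promote into the hot tier
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least-recently-used shard
        return data

# Hypothetical training shards held in low-cost object storage.
store = {f"shard-{i}": f"data-{i}" for i in range(100)}
tier = HotTier(capacity=10, object_store=store)
first = tier.read("shard-3")   # pulled from the object store
second = tier.read("shard-3")  # now served from the hot tier
```

The design point is that the GPUs only ever see the fast tier, while capacity scales at object-storage cost; eviction keeps the flash footprint bounded as training walks through the dataset.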
This new partnership has tuned, optimized, and validated the WekaIO Matrix software-defined storage solution for deployment within HPE environments. The combination of WekaIO Matrix and powerful servers from HPE enables customers to easily adapt storage infrastructures to modern demands, maximize the performance of deep learning data, and achieve deeper insight faster than ever before.
HPE, NVIDIA, and WekaIO are enabling customers with supercharged AI solutions and capabilities: a new software-based, scale-out storage solution that lets businesses capitalize on the benefits of NVMe-based flash technology and improve the performance of their storage architectures. To learn more about how we can help your business discover better storage for faster deep learning, please follow me on Twitter at @darrinpjohnson. And for more general news and information you can also visit @HPE_HPC and @NVIDIADC.