The Next Platform

New Optimizations Improve Deep Learning Frameworks For CPUs

Today, most machine learning is done on processors. Some would say that acceleration of learning has to be done on GPUs, but for most users that is not good advice for several reasons. The biggest reason is now the Intel Xeon SP processor, formerly codenamed “Skylake.”

Until recently, software for machine learning has often been more optimized for GPUs than for anything else. A series of efforts by Intel has changed that – and when coupled with the Platinum version of the Intel Xeon SP family, the top performance gap is closer to 2X than it is to 100X. This may stun some, but it is well documented and not all that surprising once we understand the underlying architectures. With performance this close, a GPU accelerator is more of a luxury than a necessity – and there are better choices emerging for ‘luxury’ when we really need it.

Make no mistake, however: ‘accelerators’ can have an advantage in performance and/or power consumption when machine learning is all we need. I’ll come back to that with “What if we only do machine learning?” at the end of this article. Since most of us need more than a “machine learning only” server, I’ll focus on the reality of how Intel Xeon SP Platinum processors remain the best choice for servers, including servers that need to do machine learning as part of their workload.

Whine, Whine, Whine – Where Are The Benchmarks?

Intel engineers will tell you that deep learning frameworks have been heavily biased toward GPUs rather than CPUs. So Intel did something about it: its engineers have added CPU optimizations to frameworks that were already optimized for GPUs.

The results speak for themselves. TensorFlow benchmarks, with CPU optimizations added, show CPU performance gains of as much as 72X (see the Intel blog titled TensorFlow Optimizations on Modern Intel Architecture). Similarly, Caffe benchmarks, with CPU optimizations added, show CPU gains of as much as 82X (see the Intel blog titled Benefits of Intel Optimized Caffe in comparison with BVLC Caffe). That is just a start. The website for Torch (torch.ch) proclaims “Torch is a scientific computing framework with wide support for machine learning algorithms that puts GPUs first.” Intel offers an alternative branch, which lets us put CPUs first when we choose to use CPUs. I’ve personally used this repository for my own work, and I know it helps a lot.

Later in this article, I go through the frameworks and libraries one by one, with links to where to download them and details on the benchmark results so far.

The most important benchmarks, of course, are your own programs. So, I advise you to compare results when using frameworks and libraries that offer CPU optimizations and GPU optimizations. Thanks to Intel, you can do both now.
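A minimal sketch of such a comparison, assuming you can wrap one step of your own program in a Python callable. The names `stock_step` and `optimized_step` are hypothetical stand-ins, not part of any framework; in practice each would run the same training or inference step against a different build (stock versus CPU-optimized, or CPU versus GPU).

```python
import statistics
import time


def median_seconds(workload, repeats=5, warmup=1):
    """Return the median wall-clock time of calling workload() `repeats` times."""
    for _ in range(warmup):
        workload()  # warm caches and lazy initialization before timing
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        workload()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


# Hypothetical stand-ins: replace each with one step of your own model,
# run against the two builds or devices you want to compare.
def stock_step():
    sum(i * i for i in range(200_000))


def optimized_step():
    sum(i * i for i in range(50_000))


if __name__ == "__main__":
    base = median_seconds(stock_step)
    opt = median_seconds(optimized_step)
    print(f"stock: {base:.4f}s  optimized: {opt:.4f}s  "
          f"speedup: {speedup(base, opt):.1f}x")
```

The median is used rather than the mean so that a single slow run (a background job, a cold cache) does not skew the comparison. The warm-up call keeps one-time setup costs out of the measurement.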

This is not obvious unless you know that deep learning frameworks, tools, and libraries optimized for CPUs exist. In fact, the most popular frameworks have versions that are well optimized for CPUs, and for Intel Xeon SP processors in particular. Here is a partial rundown of key software that accelerates deep learning on Intel Xeon Platinum processors enough that the best performance advantage of GPUs is closer to 2X than to 100X.

Deep Learning Frameworks We Know And Love

All of these frameworks have been optimized for both Intel Math Kernel Library (Intel MKL) and Intel Advanced Vector Extensions (Intel AVX).
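Whether AVX-optimized code paths can help depends on what the CPU actually supports. As a rough sketch, assuming a Linux x86 system where the kernel lists CPU features on the `flags` line of `/proc/cpuinfo`, the following checks for the `avx`, `avx2`, and `avx512f` flags. The helper names are illustrative only, and other operating systems expose this information differently.

```python
def avx_flags(cpuinfo_text):
    """Return the AVX-family flags found in Linux /proc/cpuinfo-style text."""
    wanted = ("avx", "avx2", "avx512f")
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            present = set(value.split())
            # Only the first "flags" line is inspected; cores normally match.
            return [flag for flag in wanted if flag in present]
    return []


def read_cpuinfo(path="/proc/cpuinfo"):
    """Read the cpuinfo file; returns '' where it cannot be read (e.g. non-Linux)."""
    try:
        with open(path, encoding="utf-8") as handle:
            return handle.read()
    except OSError:
        return ""


if __name__ == "__main__":
    found = avx_flags(read_cpuinfo())
    print("AVX support reported by the kernel:", found or "none found")
```

An empty result does not prove the hardware lacks AVX; it only means this particular check found nothing, which is also what happens on systems without `/proc/cpuinfo`.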

Deep Learning Math Libraries

In addition to the frameworks and libraries noted above, the Intel Data Analytics Acceleration Library (DAAL) is an open source library of optimized algorithmic building blocks for the data analysis stages most commonly associated with solving Big Data problems. The library is designed for use with popular data platforms including Hadoop, Spark, R, and Matlab. It is available from https://software.intel.com/intel-daal. There is also a good article in Parallel Universe Magazine, Issue 28, starting on page 26, titled Solving Real-World Machine Learning Problems with Intel Data Analytics Acceleration Library.

What If We Only Do Machine Learning?

While Intel Xeon Scalable processors may be the best solution when we justify a server supporting a variety of workloads, what if we want to take a leap and buy a “machine learning only” server or supercomputer?

My best advice: be sure you really know what you need, and be aware that the field is changing quickly. I do not mean to dissuade anyone, but it is difficult to guess all the options we will have even a year from now. I have no doubt that accelerators for machine learning will shift from GPUs to FPGAs, ASICs, and products with ‘neural’ in their descriptions. Wherever these solutions have to support a variety of workloads, the CPU of choice will remain Intel Xeon processors.

Choices for accelerators are getting more diverse. High-core-count CPUs (the Intel Xeon Phi processors, in particular the upcoming “Knights Mill” version) and FPGAs (Intel Xeon processors coupled with Intel/Altera FPGAs) offer highly flexible options with excellent price/performance and power efficiency. An Intel Xeon Phi processor-based system can train an AlexNet image classification system up to 2.3 times faster than a similarly configured system using Nvidia GPUs (see Inside Intel: The Race for Faster Machine Learning). Intel has shown that the Intel Xeon Phi processor delivers up to nine times more performance per dollar versus a hosted GPU solution, and up to eight times more performance per watt. Coming soon are more products purpose-built for AI from Intel Nervana.

It’s an exciting time to be a computer geek, and machine learning is nothing if not fun. It is great to see all the options available to build super-fast machines for machine learning.

Foundation For Machine Learning

The Xeon SP processors, particularly the Platinum processors, offer outstanding performance for machine learning, while giving us more versatility than any other solution. If and when we are ready to add acceleration, Intel Xeon Scalable processors still serve as the core of a versatile system with accelerators – and the choice of what those accelerators can be is growing quickly. Either way, relying on Skylake processors and their excellent support for machine learning gives us the best combination of performance and versatility in one package.

Learn more:

James Reinders is an independent consultant in high performance computing and parallel programming. Reinders was most recently the parallel programming model architect for Intel’s HPC business, and was a key contributor to the design and implementation of the ASCI Red and Tianhe-2 massively parallel supercomputers.
