I have been at the forefront of machine learning since the 1980s, when I was a staff scientist in the Theoretical Division at Los Alamos performing basic research on machine learning (and later applying it in many areas, including co-founding a machine-learning-based drug discovery company). I was lucky enough to participate in the creation of the field, and subsequently to observe first-hand the process by which machine learning grew into a ‘bandwagon’ that eventually imploded due to misconceptions about the technology and what it could accomplish.
Fueled by across-the-board technology advances, including algorithmic developments, machine learning has again become a bandwagon that is rife with misconceptions coupled with misleading marketing.
That said, the extraordinary capabilities of machine learning technology can be realized by understanding what is marketing fluff and what is real. It is truly remarkable that machines, for the first time in human history, can deliver better than human accuracy on complex ‘human’ activities such as facial recognition, and further that better-than-human capability was realized solely by providing the machine with example data. Significant market applicability means that machine learning, and particularly the subset of the field called deep-learning, is now established and is here to stay.
Understanding key technology requirements will help technologists, management, and data scientists tasked with realizing the benefits of machine learning make intelligent decisions in their choice of hardware platforms. Benchmarking projects like Baidu’s DeepBench also provide valuable insight by associating performance numbers with various hardware platforms.
Understanding what is really meant by ‘deep learning’
Deep learning is a technical term that describes a particular configuration of an artificial neural network (ANN) architecture that has many ‘hidden’ or computational layers between the input neurons where data is presented for training or inference, and the output neuron layer where the numerical results of the neural network architecture can be read. The values of the output neurons provide the information that companies use to identify faces, recognize speech, read text aloud, and provide a plethora of new and exciting capabilities.
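To make the layered structure concrete, here is a minimal sketch of a forward pass through such a network. The layer sizes, the `tanh` activation, and the random weights are purely illustrative assumptions:

```python
import numpy as np

def forward(x, weights, biases):
    """Propagate an input vector through each hidden layer to the output layer."""
    a = x
    for W, b in zip(weights, biases):
        a = np.tanh(W @ a + b)  # one 'hidden' (computational) layer
    return a  # the output-neuron values that are read as the result

# Illustrative sizes: 4 input neurons, two hidden layers of 8, 3 output neurons
rng = np.random.default_rng(0)
sizes = [4, 8, 8, 3]
weights = [rng.standard_normal((n_out, n_in)) for n_in, n_out in zip(sizes, sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

y = forward(rng.standard_normal(4), weights, biases)
print(y.shape)  # (3,)
```

Training adjusts the `weights` and `biases`; inference is exactly this forward calculation with those values held fixed.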
Originally ‘deep learning’ was used to describe the many hidden layers that scientists used to mimic the many neuronal layers in the brain. While deep ANNs (DNNs) are useful, many in the data analytics world will not use more than one or two hidden layers due to the vanishing gradient problem. This means some claims about deep-learning capability will not apply to their work.
More recently, the phrase ‘deep learning’ has morphed into a catchphrase that describes the excellent work by many researchers who reinvigorated the field of machine learning. Their deep-learning ANNs have been trained to deliver deployable solutions for speech recognition, facial recognition, self-driving vehicles, agricultural machines that can recognize weeds from produce and much, much, more. Recent FDA approval of a deep-learning product has even opened the door to exciting medical applications.
Unfortunately, the deep-learning catchphrase is now morphing into the more general and ambiguous term of AI, or artificial intelligence. The problem is that terms like ‘learning’ and ‘AI’ are overloaded with human preconceptions and assumptions – and wildly so in the case of AI.
Let’s cut through the marketing to get to the hardware.
Training is not ‘learning’ in the human sense nor is it ‘AI’, it is the numerical optimization of a set of model parameters to minimize a cost function
People use the phrase ‘learn’ when discussing training because we all understand the concept of learning to do something. The danger is that people tend to lose sight of the fact that training is simply the process of fitting a set of model parameters for the ANN (regardless of number of layers) to produce a minimum error on a bunch of examples in a training set.
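The point can be made concrete with a toy example: the ‘training’ below is nothing more than iteratively adjusting a parameter vector to minimize a cost function over a training set. The linear model, learning rate, and synthetic data are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 3))         # training examples
y_true = X @ np.array([2.0, -1.0, 0.5])   # targets (toy data)

theta = np.zeros(3)  # the model parameters being 'trained'

def cost(theta):
    # mean squared error over the whole training set
    return np.mean((X @ theta - y_true) ** 2)

for _ in range(200):                       # training = iterative minimization
    grad = 2 * X.T @ (X @ theta - y_true) / len(X)
    theta -= 0.1 * grad                    # step toward lower cost

print(np.round(theta, 2))  # close to the true parameters [2, -1, 0.5]
```

Nothing in this loop ‘understands’ the task; it only drives the cost downward, which is exactly why mismatched training data (as in the tank example below) produces a model that optimizes the wrong thing.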
Unlike humans, ANNs have no concept of a goal or real-world constraints. For example, a project in the 1990s attempted to train an ANN to distinguish between images of a tank vs. a car. A low error was found after training, but in the field the real-world accuracy was abysmal. Further investigation found that most of the tank pictures were taken on a sunny day while the pictures of the cars were taken on cloudy days. Thus the network ‘solved’ the optimization problem by distinguishing cloudy vs. sunny days and not cars vs. tanks (which could have been bad news for people driving on a sunny day).
What is really exciting about machine learning is that once the training examples have been identified, the remainder of the ‘learning’ process becomes a computational problem that does not directly involve people. Thus, faster machines effectively ‘learn’ faster. Given the wide applicability and commercial viability of machine learning, companies such as Intel, NVIDIA, and IBM agree that machine learning will become a dominant workload in the data center in the very near future. Diane Bryant (formerly VP and GM of the Data Center Group, Intel) is well-known for having stated, “By 2020 servers will run data analytics more than any other workload.” In short, big money in the data center is at stake.
Inferencing is a sequential calculation
The payoff is achieved when the ANN is used for inferencing, a term that describes what happens when the ANN calculates the numerical result for a given input using the parameters from the completed training process to perform a task. Inferencing can happen quickly and nearly anywhere – even on low-power devices such as cell phones and IoT (Internet of Things) edge devices to name just two.
From a computer science point of view, inferencing is essentially a sequential calculation* that is also subject to memory bandwidth limitations. It only becomes parallel when many inferencing operations are presented in volume so they can be processed in a batch, say in a data center. In contrast, training is highly parallel as most of the work consists of evaluating a set of training parameters across all the examples.
This serial vs. parallel distinction is important because:
- Most data scientists will not need inferencing-optimized devices unless they plan to perform volume processing of data in a data center. Similarly, IoT edge devices and verticals such as real-time surveillance and autonomous driving will perform sequential rather than massively parallel inferencing.
- Inferencing of individual data items will be dominated by the sequential performance of the device. In this case, expect massively parallel devices like accelerators to have poor inferencing performance relative to devices such as CPUs. FPGAs are interesting as they may exhibit some of the lowest inference latencies plus they are field upgradable.
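The distinction can be seen directly in code: a single inference is a chain of dependent matrix-vector products (each layer must wait for the previous one), while a batch of independent inputs turns the same chain into parallel-friendly matrix-matrix products. The sizes and ReLU activation below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
layers = [rng.standard_normal((64, 64)) for _ in range(4)]

def infer_one(x):
    # A sequential chain of matrix-vector products: each layer depends on
    # the previous layer's output, so there is little parallelism to exploit.
    for W in layers:
        x = np.maximum(W @ x, 0.0)
    return x

def infer_batch(X):
    # Many independent inputs at once: the same chain becomes a series of
    # matrix-matrix products, exposing the data parallelism a GPU wants.
    for W in layers:
        X = np.maximum(W @ X, 0.0)
    return X

x = rng.standard_normal(64)
batch = np.stack([x] * 8, axis=1)  # 8 independent queries, one per column
assert np.allclose(infer_one(x), infer_batch(batch)[:, 0])
```

The batch form only helps when many requests arrive together, which is why batching pays off in a data center but not for a single query on an edge device.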
Parallelism speeds training
All hardware on the market uses parallelism to speed training. The challenge, then, is to determine which kinds of devices can help us speed training to achieve the shortest ‘time-to-model’.
Each step in the training process simply applies a candidate set of model parameters (as determined by a black box optimization algorithm) to inference all the examples in the training data. The values produced by this parallel operation are then used to calculate an error (or energy) that is used by the optimization algorithm to determine success or calculate the next set of candidate model parameters.
Figure 1: Black box plus steps to calculate each step in parallel during training (image courtesy TechEnablement)
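The loop in the figure can be illustrated as follows. The ‘black box’ here is deliberately a trivial random search, and the data and model are toy assumptions; the point is the structure: propose candidate parameters, evaluate them across all training examples in one vectorized operation, and feed the resulting error back to the optimizer:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((256, 4))
y = (X @ np.array([1.0, -2.0, 0.5, 3.0]) > 0).astype(float)  # toy labels

def error(params):
    # Inference over ALL training examples in one parallel (vectorized)
    # operation; only the scalar error goes back to the optimizer.
    preds = (X @ params > 0).astype(float)
    return np.mean(preds != y)

# The 'black box' optimizer: propose a candidate, evaluate, keep the best.
best, best_err = np.zeros(4), error(np.zeros(4))
for _ in range(500):
    cand = best + 0.1 * rng.standard_normal(4)
    e = error(cand)
    if e < best_err:
        best, best_err = cand, e
```

Every candidate evaluation sweeps the entire training set, which is where the parallel hardware earns its keep; the optimizer itself contributes almost none of the runtime.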
This evaluation can be performed very efficiently using a SIMD (Single Instruction Multiple Data) computational model because all the inference operations in the objective function can occur in lock-step.
- The SIMD computational model maps beautifully and efficiently to processors, vector processors, accelerators, FPGAs, and custom chips alike. For most data sets, it turns out that training performance is limited by cache and memory performance rather than floating-point capability.
- The ability of the hardware to perform all those parallel operations during training depends more on the performance of the cache and memory subsystems than on flop/s. Once the memory and cache systems are saturated, any additional floating-point capability is wasted. Customers risk shooting themselves in the foot should they base purchase decisions solely on device specifications that claim high peak floating-point performance.
- The training set must be large enough to make use of all the device parallelism, else performance is wasted. Contrary to popular belief, CPUs can deliver higher training performance than GPUs on many training problems. Accelerators achieve high floating-point performance only when they have large numbers of concurrent threads to execute, so training with data sets containing hundreds to tens of thousands of examples may utilize only a small fraction of the accelerator parallelism. In such situations, better performance may be achieved on a many-core processor with a fast cache and stacked memory subsystem like an Intel Xeon Phi processor. Thus, it is important to consider how much data will be available for training when selecting your hardware.
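A back-of-envelope arithmetic-intensity estimate shows why memory, not peak flop/s, tends to be the limiter. The layer and batch sizes below are illustrative assumptions:

```python
# Back-of-envelope arithmetic intensity for one dense layer (FP32),
# illustrating why memory bandwidth, not peak flop/s, often limits training.
n_in, n_out, batch = 1024, 1024, 32   # illustrative sizes

flops = 2 * n_in * n_out * batch                            # multiply-adds
bytes_moved = 4 * (n_in * n_out + batch * (n_in + n_out))   # weights + activations

intensity = flops / bytes_moved  # roughly 15 flop per byte here
print(round(intensity, 1))
```

Comparing that figure to a device’s machine balance (peak flop/s divided by memory bandwidth) indicates whether the calculation is compute-bound or bandwidth-bound; many accelerators need tens of flops per byte to saturate their floating-point units, so a layer like this leaves peak flop/s on the table.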
Reduced precision and specialized hardware
Vendors are also exploring the use of reduced precision for ANNs because half-precision (e.g. FP16) arithmetic can double the performance of the hardware memory and computational systems. Similarly, using 8-bit math can quadruple that performance.
Unfortunately, basing a purchase decision on reduced-precision floating-point performance is a bad idea because it does not necessarily equate to faster time-to-model performance. The reason is that numerical optimization requires repeated iterations of candidate parameter sets while the training process converges to a solution.
The key word is convergence. Reduced precision can slow convergence to the point where the number of training iterations required to find a solution exceeds the speedup accrued from the reduced-precision math. Even worse, the training process can fail to find a solution because it gets stuck in what is known as a false, or local, minimum due to the reduced-precision math.
Also consider the types of ANNs that will be trained. For example, special-purpose hardware that performs tensor operations at reduced precision benefits only a few, very specific types of neural architectures like convolutional neural networks. It is important to understand if your work requires the use of those specific types of neural architectures.
In general, avoid reduced precision for training as it will likely harm rather than help.** That said, reduced precision can help for many (but not all) inferencing tasks.
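A toy illustration (not a training benchmark) of one way reduced precision stalls convergence: once a parameter grows large enough, an FP16 update smaller than half a unit in the last place is rounded away entirely, so further ‘training’ steps accomplish nothing:

```python
import numpy as np

def accumulate(dtype, step=1e-3, iters=20000):
    # Repeatedly apply a small parameter update, as a training loop would.
    w, s = dtype(1.0), dtype(step)
    for _ in range(iters):
        w = dtype(w + s)
    return float(w)

print(accumulate(np.float32))  # close to 21.0: all updates are absorbed
print(accumulate(np.float16))  # stalls near 4.0: updates round away
```

Real mixed-precision schemes work around this (for example by keeping a higher-precision copy of the parameters), but the example shows why raw reduced-precision flop/s numbers say little about time-to-model.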
Memory capacity and bandwidth are key to calculating gradients for orders-of-magnitude faster time-to-model runtimes
Many of the most effective optimization algorithms, such as L-BFGS and conjugate gradient, require evaluation of a function that calculates the gradient of the objective function with respect to the ANN parameters.
Use of a gradient provides an algorithmic speedup that can achieve significant – even orders of magnitude – faster time-to-model as well as better solutions than gradient-free methods. Popular software packages such as Theano include the ability to symbolically calculate the gradient through the use of automatic differentiation so native code can be generated, thus getting the gradient function is pretty easy.
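The idea behind such automatic differentiation can be sketched with forward-mode dual numbers. This is an illustration of the principle – derivatives computed mechanically and exactly via the chain rule – not how Theano is implemented internally:

```python
import math

class Dual:
    """A value paired with its derivative; arithmetic applies the chain rule."""
    def __init__(self, val, grad=0.0):
        self.val, self.grad = val, grad
    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val + o.val, self.grad + o.grad)
    __radd__ = __add__
    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val * o.val,
                    self.grad * o.val + self.val * o.grad)  # product rule
    __rmul__ = __mul__

def tanh(x):
    t = math.tanh(x.val)
    return Dual(t, (1.0 - t * t) * x.grad)  # chain rule through tanh

# d/dw of tanh(w * x) at w=0.5, x=2.0; exact answer is x * sech^2(w * x)
w = Dual(0.5, 1.0)   # seed the derivative with respect to w
y = tanh(w * 2.0)
print(round(y.grad, 4))  # matches 2 * (1 - tanh(1)^2)
```

Packages like Theano apply the same bookkeeping symbolically across the whole network and then generate native code for it, which is why obtaining the gradient function is nearly free for the user.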
The challenge is that the size of the gradient calculation grows very large, very fast as the number of parameters in the ANN model increases. This means that memory capacity and bandwidth limitations (plus cache and potentially atomic instruction performance) dominate the runtime of the gradient calculation. ***
Further, it is important to know that the instruction memory capacity of the hardware is large enough to hold all the machine instructions needed to perform the gradient calculation. The code for the gradient calculation for even modest ANN models can be very, very large.
In both cases the adage from the early days of virtual memory applies, “real memory for real performance”.
Given the dependence of gradient calculations on memory, look for hardware and benchmark comparisons using the stacked memory that is now available on high-end devices and systems with large memory capacities. The performance payback can be significant.
Near-term product expectations
Recent product announcements show that the industry has recognized the need to provide faster memory for both processors and accelerators. Meanwhile, custom hardware announcements (namely from Google and Intel Nervana) are raising awareness that custom solutions might leapfrog the performance of both CPUs and GPUs for some ANNs. To utilize custom solutions (ASICs and FPGAs), on-package processor interfaces will be offered on some Intel processor SKUs. These interfaces should provide tight coupling to the custom device’s performance capabilities while acting as the front-end processor to bring custom devices to market. **** However, this is a performance conjecture at this point.
Even without specialized hardware, it is expected that the inclusion of the wider AVX-512 vector instructions (and an extra memory channel) will more than double both per-core training and inference performance on Intel Skylake processors without requiring an increase in data set size to exploit parallelism. (Using more cores should provide an additional performance increase.) Both Intel Xeon Phi and Intel Xeon (Skylake) product SKUs will offer on-package Intel Omni-Path interfaces, which should decrease system cost and network latency while increasing network bandwidth. This is good news for those who need to train (or perform volume inferencing) across a network or within the cloud. We look forward to validating all these points in practice.
When evaluating a new hardware platform consider:
- Many data scientists don’t need specialized inferencing hardware.
- What is the real (not peak) floating-point performance when the calculation is dominated by memory and cache bandwidth performance?
- How much parallelism do I really need to train on my data sets (i.e. many-core or massive parallelism)?
- Am I paying for specialized hardware I don’t need?
- Reduced-precision data types are currently a niche optimization that may never become mainstream for training – although it is an active research area.
Rob Farber is a global technology consultant and author with an extensive background in HPC and in developing machine learning technology that he applies at national labs and commercial organizations. Rob can be reached at email@example.com.
*Some acceleration through parallelism can be achieved during a single inferencing operation, but the degree of parallelism is limited by the data dependencies defined by the ANN architecture and is generally low.
** Current research indicates reduced-precision helps when the matrices are well-conditioned. Generally ANNs are not well-conditioned.
***Chunked calculations can help fit gradient calculations into limited memory devices such as GPUs. However, this introduces inter-device bandwidth limitations such as the PCIe bus which is famous for acting as a bottleneck in accelerated computing.
**** Google has announced their custom ASIC hardware will be exclusively available on Google Cloud. The Intel announcement regarding Nervana processor integration can be found here.