Current custom AI hardware devices are built around super-efficient, high performance matrix multiplication. This category of accelerators includes the host of AI chip startups and defines what more mainstream accelerators like GPUs bring to the table.
However, times might be changing as the role of matrix math tightens, leaving those devices weighted in the wrong direction, at least for areas in AI/ML that can trade a little accuracy for speed. And while approximation may sound familiar, there are some new ideas on the horizon that blend the best of that old world of optimization and quantization and add a new twist, one that could dramatically reduce what’s needed on a device.
Instead of just approximating the way to efficient, fast AI, there are emerging algorithmic approaches that can take a known matrix and remove the multiply-add step altogether. There’s some overhead from the averaging, hashing, and other trickery we’ll get to in a moment, but the takeaway is that there could be a future impact on the hardware ecosystem for AI, especially for matrix math workhorses like GPUs, not to mention the slew of ASICs.
According to Davis Blalock of MIT CSAIL and MosaicML, experiments using hundreds of matrices from diverse domains show that this approach can run 100X faster than exact matrix products and 10X faster than current approximate methods. Again, approximate approaches aren’t for everyone and it’s “only” a 10X improvement over them, but it’s the newness of this concept, both building on and recreating current methods (linear approximation, hashing to get around linear operations, quantization), that makes it worth paying attention to.
The algorithm they’ve tested (on CPU only so far and only at small scale) can be dropped in with no runtime overhead and with all the tools to quickly sum low-bitwidth integers, according to Blalock and MIT CSAIL colleague John Guttag. The algorithm, called MADDNESS (Multiply-ADDitioN-lESS), is built to favor speed over accuracy via an averaging mechanism.
And AI chip startup wannabes take note: the speedups the team achieved against other approximate approaches might be much higher with a custom bit of hardware optimized for this over straight dense matrix multiplies.
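To make the "averaging mechanism" for summing low-bitwidth integers a little more concrete, here is a minimal Python sketch. It is an assumption about how such a scheme could work, not the team's code: the function names are hypothetical, and the pairwise rounding average `(a + b + 1) >> 1` is simply the operation that byte-averaging SIMD instructions compute. The idea is that averages of 8-bit values never overflow 8 bits, so a tree of averages can stand in for a wider sum, at the cost of some rounding error.

```python
def avg_round_up(a, b):
    """Rounding average of two unsigned 8-bit values; the result stays in 0..255."""
    return (a + b + 1) >> 1


def approx_sum_u8(values):
    """Estimate sum(values) with a tree of pairwise averages.

    Assumes len(values) is a power of two (a simplification for this sketch).
    The final rescale is a bit shift, so no multiply is needed.
    """
    n = len(values)
    if n == 0 or n & (n - 1) != 0:
        raise ValueError("length must be a power of two")
    level = list(values)
    while len(level) > 1:
        # Average neighbours; every intermediate value still fits in 8 bits.
        level = [avg_round_up(level[i], level[i + 1])
                 for i in range(0, len(level), 2)]
    # The average of n values, scaled back up by n (a power of two).
    return level[0] << (n.bit_length() - 1)


print(approx_sum_u8([10, 20, 30, 40]))  # exact sum is 100
print(approx_sum_u8([1, 2, 3, 4]))      # exact sum is 10; rounding up adds bias
```

Because each average rounds up, the estimate drifts slightly high on some inputs. That is the kind of accuracy-for-speed trade the approach leans on.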
“If the hardware could lookup-accumulate as many bytes per cycle as it can multiply-accumulate, our method could be over 4X faster,” the MIT duo says. “Combined with the fact that multiplexers require many fewer transistors than multipliers, this suggests that a hardware implementation of our method might offer large efficiency gains compared to existing accelerators.”
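The "lookup-accumulate" idea can be sketched in a few lines of Python. What follows is a simplified illustration under stated assumptions, not the MADDNESS implementation: the function names, the two-level comparison tree, and the choice of split dimensions are all hypothetical. The shape of the computation is the point. With one matrix known ahead of time, the multiplies happen offline when the tables are built; at query time each input row is hashed with comparisons only, and the output is assembled from table lookups and additions.

```python
from statistics import median

LEVELS = 2  # two comparison levels -> 4 buckets per subspace (hypothetical choice)


def chunks(row, n_sub):
    """Split a row into n_sub contiguous, equal-width subvectors."""
    w = len(row) // n_sub
    return [row[c * w:(c + 1) * w] for c in range(n_sub)]


def fit_hash(vectors, width):
    """Learn median thresholds for a tiny comparison tree over training subvectors."""
    thresholds, groups = {}, {0: list(vectors)}
    for level in range(LEVELS):
        dim, next_groups = level % width, {}
        for node, vecs in groups.items():
            t = median(v[dim] for v in vecs) if vecs else 0.0
            thresholds[(level, node)] = t
            next_groups[2 * node] = [v for v in vecs if v[dim] <= t]
            next_groups[2 * node + 1] = [v for v in vecs if v[dim] > t]
        groups = next_groups
    return thresholds


def hash_vector(vec, thresholds):
    """Map a subvector to a bucket index using comparisons only."""
    node = 0
    for level in range(LEVELS):
        dim = level % len(vec)
        node = (node << 1) | int(vec[dim] > thresholds[(level, node)])
    return node


def fit(train_rows, B, n_sub):
    """Offline step: per subspace, learn a hash and precompute prototype-times-B tables."""
    width, n_out = len(B) // n_sub, len(B[0])
    hashes, tables = [], []
    for c in range(n_sub):
        vecs = [chunks(r, n_sub)[c] for r in train_rows]
        th = fit_hash(vecs, width)
        buckets = [[] for _ in range(1 << LEVELS)]
        for v in vecs:
            buckets[hash_vector(v, th)].append(v)
        B_c = B[c * width:(c + 1) * width]
        table = []
        for members in buckets:
            # Prototype = average of the training subvectors in this bucket.
            proto = ([sum(col) / len(members) for col in zip(*members)]
                     if members else [0.0] * width)
            table.append([sum(p * B_c[d][j] for d, p in enumerate(proto))
                          for j in range(n_out)])
        hashes.append(th)
        tables.append(table)
    return hashes, tables


def approx_matmul(rows, hashes, tables, n_sub):
    """Online step: hash each subvector, then look up and accumulate (no multiplies)."""
    out = []
    for r in rows:
        acc = [0.0] * len(tables[0][0])
        for c, chunk in enumerate(chunks(r, n_sub)):
            entry = tables[c][hash_vector(chunk, hashes[c])]
            acc = [a + e for a, e in zip(acc, entry)]
        out.append(acc)
    return out


train = [[0, 0, 1, 1], [0, 1, 1, 0], [1, 0, 0, 1], [1, 1, 0, 0]]
B = [[1, 2], [3, 4], [5, 6], [7, 8]]
hashes, tables = fit(train, B, n_sub=2)
print(approx_matmul([[0.1, 0.9, 0.8, 0.2]], hashes, tables, n_sub=2))
```

In this toy setup the unseen row `[0.1, 0.9, 0.8, 0.2]` lands in the same buckets as the training row `[0, 1, 1, 0]`, so it gets that row's product with `B` rather than its own. The error comes entirely from snapping inputs to averaged prototypes, and a real system would use far more training data and more careful hashing than this sketch does.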
This MIT approach might not be general purpose for a lot of what happens in the datacenter but it’s a big step forward from straight quantization/approximation/hashing, and so on.
So far, this has only been tested on low-end CPU. It only works when there is a training set available for one full matrix, and it is at its best only when one matrix is bigger than the other. “Our method also loses utility when the larger matrix is known ahead of time; this assumption is common in similarity search, and eliminates the need for a fast encoding function entirely. Our approximate integer summation and fused table lookups would likely be useful independent of any of these assumptions, but demonstrating this is future work,” they note.
In terms of the hardware, this is not a CPU-only approach; it’s just that it would take quite a bit of work to take the same MADDNESS and apply it to GPUs or some of the ASICs for these workloads. They have yet to apply the algorithmic approach in any scalable way (no multi-CPU/threading) since they’re focusing for now on creating a foundation for individual threads.
One more limitation to what we can see emerging from MADDNESS is that it’s still early days. They’re not using multiple convolutional layers or accelerating entire networks at this point. The sticking point is important and major: building this into existing frameworks is a serious endeavor, and not just due to deciding which approximation kernels to include.
Imagine, however, what happens if these changes are introduced into major frameworks. That would take a large enough base willing to make the accuracy/performance tradeoff, but if the growth of neural networks and their power/compute demands continues, that might be the only option. This would make the current set of devices, all of which are optimized for dense matrix multiplication, unevenly weighted for the real job. But of course, this is all conjecture; there is still much to be done before this approach can scale and be used widely.
Such an architecture might not be an accelerator at all (in the PCIe or offload sense we consider these now). It might be possible for the CPU to recapture share once again–and right at a time when models are outpacing what anyone can afford to do at scale in both training and inference.
“We believe that accelerating full networks with our ideas is a promising direction, particularly for inference,” the MIT’ers say. “This is especially true at the hardware level—our method requires only multiplexers, not multipliers, and can therefore be implemented easily and with far less power than current matrix product logic,” they explain. “Our results suggest that future methods similar to our own might hold promise for accelerating convolution, deep learning, and other workloads bottlenecked by linear transforms.”