All processor designs are the result of a delicate balancing act, perhaps most touchy in the case of a high performance CPU that needs to be all things to users, whether they’re running large HPC simulations, handling transaction processing, dispatching training runs or whipping out inference results.
Few understand the nature of the architectural tradeoff game better than Bob Valentine, who joined Intel as a microprocessor architect in 1991. Thirty years later, as he develops the core elements of Intel’s new AMX (Advanced Matrix Extensions) for AI/ML in Sapphire Rapids, his focus is still on building in balance. Specifically, that means compromising between what is possible for hardware and software designers, and pairing low-level capability with high-level usability.
Along the way, including with his work on the Centrino ISA in the 2000s, he says the goal has been to make the smartest tradeoffs possible to fit the needs of multiple groups. But one key difference between those days and now is that the targets keep moving. Instead of developing the next generation of an existing processor or enhancing features, he’s had to keep revising his view of what AI/ML needs built into the processor as those needs keep changing.
“We went through three different data types in designing AMX,” Valentine tells The Next Platform. When the team started, bfloat didn’t even exist, and on the path to Int8 and bfloat they threw out quite a bit of work on other precision and integer types.
What they finally settled on are extensions baked into forthcoming Sapphire Rapids devices that attempt to balance the most prominent use case for the CPU in AI/ML, inference, with more capabilities for training. All of this is balanced further against the chip’s general-purpose processing and HPC-oriented appeal.
AMX, which was just introduced at Intel’s Architecture Day (we have deeper dives on the other aspects separately), takes a tiled approach to providing 8X the operations per cycle per core with Int8 over the existing VNNI instruction set, which was optimized for what AI/ML looked like a generation back (CNNs). Valentine says that despite some of the work that was tossed aside during the transitions, AMX reflects what Intel is seeing in real workloads and from real users, which means recommendation, NLP, and other workloads.
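To put the 8X figure in rough numbers, here is a back-of-the-envelope sketch. It assumes a core can issue Int8 VNNI instructions on two 512-bit vector ports per cycle and counts a multiply-accumulate as two operations. Both are our assumptions for illustration, not figures from Valentine, and the function names are made up for this example.

```python
# Back-of-the-envelope throughput model. The port count and the
# "2 ops per multiply-accumulate" convention are assumptions.
VECTOR_BITS = 512   # AVX-512 register width
INT8_BITS = 8       # width of one Int8 element
OPS_PER_MAC = 2     # count one multiply plus one add
VNNI_PORTS = 2      # assumed VNNI-capable vector ports per core


def vnni_int8_ops_per_cycle(ports=VNNI_PORTS):
    """Int8 operations per cycle per core under the assumptions above."""
    lanes = VECTOR_BITS // INT8_BITS  # 64 Int8 lanes per register
    return lanes * OPS_PER_MAC * ports


def amx_int8_ops_per_cycle(speedup=8, ports=VNNI_PORTS):
    """Apply the article's 8X claim to the assumed VNNI baseline."""
    return speedup * vnni_int8_ops_per_cycle(ports)


if __name__ == "__main__":
    print("VNNI:", vnni_int8_ops_per_cycle(), "Int8 ops/cycle/core")
    print("AMX: ", amx_int8_ops_per_cycle(), "Int8 ops/cycle/core")
```

Under those assumptions the baseline works out to 256 Int8 operations per cycle per core, and the 8X claim puts AMX at 2,048.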
The best way for now to think of AMX is that it’s a matrix math overlay for the AVX-512 vector math units, as shown below. We can think of it like a “TensorCore” type unit for the CPU. The details about what this is were only a short snippet of the overall event, but it at least gives us an idea of how much space Intel is granting to training and inference specifically.
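For readers who want a feel for what a single tile operation computes, here is a minimal scalar model in plain Python. It assumes the tile shape in Intel’s public AMX documentation as we understand it (16 rows of 64 bytes, so 64 Int8 values or 16 Int32 values per row). It keeps B as an ordinary 64x16 matrix rather than the packed layout the hardware expects, and it does not model 32-bit overflow. The function names are ours, not Intel’s.

```python
# Scalar model of one Int8 tile multiply-accumulate: C += A @ B.
# Tile geometry is an assumption based on public AMX documentation.
TILE_ROWS = 16                     # rows per tile
TILE_ROW_BYTES = 64                # bytes per tile row
INT8_COLS = TILE_ROW_BYTES         # 64 Int8 values per row
INT32_COLS = TILE_ROW_BYTES // 4   # 16 Int32 values per row


def check_int8(tile):
    """Raise ValueError if any element falls outside the signed Int8 range."""
    for row in tile:
        for value in row:
            if not -128 <= value <= 127:
                raise ValueError(f"{value} does not fit in Int8")


def tile_dot_int8(c, a, b):
    """Accumulate a (16x64 Int8) times b (64x16 Int8) into c (16x16 Int32).

    b is a plain row-major matrix here; the hardware's packed layout
    and 32-bit wraparound are deliberately left out of this sketch.
    """
    check_int8(a)
    check_int8(b)
    for m in range(TILE_ROWS):
        for n in range(INT32_COLS):
            acc = c[m][n]
            for k in range(INT8_COLS):
                acc += a[m][k] * b[k][n]
            c[m][n] = acc
    return c


if __name__ == "__main__":
    a = [[1] * INT8_COLS for _ in range(TILE_ROWS)]
    b = [[2] * INT32_COLS for _ in range(INT8_COLS)]
    c = [[0] * INT32_COLS for _ in range(TILE_ROWS)]
    tile_dot_int8(c, a, b)
    print(c[0][0])  # 64 products of 1 * 2
```

One call performs 16 x 16 x 64 = 16,384 multiply-accumulates, which is why a single tile instruction covers so much more work than a single vector instruction.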
The slide below highlights the flow and role of tiles. Data comes directly into the tiles while, at the same time, the host hops ahead and dispatches the loads for the next tiles. TMUL operates on data the moment it’s ready. At the end of each multiplication round, the tiles move to cache for SIMD post-processing and storing. The goal on the software side is to make sure both the host and the AMX unit are running simultaneously.
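The load, multiply, post-process, store loop described above can be sketched in a few lines of plain Python. This is a sequential model only: it shows the order of the steps, not the overlap between host and AMX unit that the hardware is designed for. The 16x64 and 64x16 tile shapes, the ReLU standing in for “SIMD post-processing,” and all function names are assumptions for illustration.

```python
# Sequential sketch of the tile flow: load tiles, multiply-accumulate,
# post-process, store. Tile shape and the ReLU step are assumptions.
TILE_M, TILE_K, TILE_N = 16, 64, 16


def load_tile(mat, r0, c0, rows, cols):
    """Copy a rows x cols block out of a row-major list-of-lists matrix."""
    return [row[c0:c0 + cols] for row in mat[r0:r0 + rows]]


def tmul(acc, a_tile, b_tile):
    """acc += a_tile @ b_tile (scalar stand-in for the TMUL unit)."""
    for m in range(len(a_tile)):
        for n in range(len(b_tile[0])):
            acc[m][n] += sum(a_tile[m][k] * b_tile[k][n]
                             for k in range(len(b_tile)))


def relu(tile):
    """Stand-in for the SIMD post-processing step."""
    return [[max(0, v) for v in row] for row in tile]


def tiled_matmul_relu(a, b):
    """Compute relu(a @ b); all dimensions must be multiples of the tile shape."""
    m_dim, k_dim, n_dim = len(a), len(b), len(b[0])
    out = [[0] * n_dim for _ in range(m_dim)]
    for i in range(0, m_dim, TILE_M):
        for j in range(0, n_dim, TILE_N):
            acc = [[0] * TILE_N for _ in range(TILE_M)]      # zeroed accumulator tile
            for k in range(0, k_dim, TILE_K):
                a_tile = load_tile(a, i, k, TILE_M, TILE_K)  # loads for this round
                b_tile = load_tile(b, k, j, TILE_K, TILE_N)
                tmul(acc, a_tile, b_tile)                    # one multiplication round
            post = relu(acc)                                 # post-process the result
            for r in range(TILE_M):                          # store back to the output
                out[i + r][j:j + TILE_N] = post[r]
    return out


if __name__ == "__main__":
    a = [[(r + c) % 5 - 2 for c in range(64)] for r in range(16)]
    b = [[(r * c) % 7 - 3 for c in range(16)] for r in range(64)]
    print(tiled_matmul_relu(a, b)[0][:4])
```

In real software the loads for round k+1 would be issued while the multiply for round k is still in flight, which is the simultaneity the software side is aiming for.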
The prioritization for AMX toward real-world AI workloads also meant a reckoning for how users were considering training versus inference. While the latency and programmability benefits of having training stay local are critical, and could well be a selling point for scalable training workloads on the CPU, inference has been the sweet spot for Intel thus far and AMX caters to that realization.
“Intel’s role and AMX’s role in the industry [broader AI] isn’t quite known yet,” Valentine says.
“We can do good inference on Skylake, we added instructions in Cooper Lake, Ice Lake, and Cascade Lake. But AMX is a big leap, including for training. The inference side is well understood, but training needs certain operations that aren’t required in inference. The way you parallelize inference versus training is different. Many inference scenarios are latency critical. You might not parallelize it on as many cores because you want to decrease that synchronization overhead. Training, on the other hand, is more of an HPC-like throughput job, you’re sending out to more cores so the communication overhead is higher and the per core problem size is lower.”
Still, he says, all of these aspects aren’t as far apart from one another as one might think—at least not from a design standpoint. At the core of HPC and AI/ML is matrix math. Having a low-latency, high performance matrix multiplication engine with enough software abstraction to make it usable and enough low-level capability to make it flexible could make a real difference, even if Valentine isn’t sure how that might play out in the market.
While the specifics remain vague, the theme of ubiquity was clear throughout Architecture Day, in everything from the HPC and general-purpose nature of the chip to its role in future AI. At the same time, the market at the high end of AI is veering toward specificity, with custom systems for training in particular (although CPUs still carry much more of the inference work).