The CPU Still Matters In The AI Stack

The CPU does not rule the computing roost when it comes to machine learning training, but it still has a role in machine learning inference and in other kinds of data analytics related to machine learning. Moreover, no system that uses GPUs or FPGAs to accelerate machine learning applications exists without a central processor, or more commonly a pair of them. It is still the CPU that is in control of what is going on within an AI system or across clusters of them.

AMD has largely sat on the sidelines as machine learning took off at the hyperscalers, but with a revitalized Epyc CPU and Radeon Instinct GPU accelerator roadmap, the company is poised not only to regain a foothold in the traditional HPC simulation and modeling arena where its Opteron processors were popular a decade ago, but also to expand into the machine learning space that increasingly co-exists with HPC.

Ahead of The Next AI Platform event in San Jose this week, we had a chat with Ram Peddibhotla, corporate vice president of datacenter product management at AMD. Peddibhotla came to the chip maker after running software development for ARM server chips at Qualcomm during its brief foray into server processors, and before that spent two decades at Intel, where he was in charge of relationships with, and software development for, the Linux community and then the public cloud providers.

In general, Peddibhotla sees the AI compute landscape the way we would expect, given the relative newness of these applications.

“My perspective is that there is no one size fits all for machine learning compute, given the rapidly evolving landscape on both the training and the inference side,” says Peddibhotla. “Our focus with Epyc is really to ensure that we have an open ecosystem that gives our customers a choice when building training platforms. This includes work with Nvidia as well as our own Radeon Instinct GPUs. We think that when you pair our GPUs with our CPUs, especially given the impending support for PCI-Express 4.0, plus the truly open software ecosystem, there is value that customers can derive. But we are very clear-eyed in enabling an open ecosystem – and that translates to the inference side as well, where there would be more FPGAs and lower-end GPUs that customers would use. Everyone is thinking about CPU features and where the mix will land between CPUs and accelerators. There are a large number of startups in the Bay Area who are pursuing accelerators for inference. Again, there is a whole bunch of inference that can be done on CPUs, and there are instructions being added to CPUs to accelerate inference, and we are watching all of that. Our plan is to continue to support an open ecosystem from an inference point of view, and we will keep evaluating what gets integrated into our CPU roadmap.”

This is no different from what Intel, IBM, and the ARM collective are doing with their processors.
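To make the CPU inference point concrete, here is a minimal sketch of what serving a model entirely on the host processor can look like. It is an illustration only, not anything AMD described: the ONNX Runtime framework, the model file name, and the input tensor name are all assumptions, and how fast it runs on any given Epyc or Xeon part depends on which vector instructions the runtime's kernels can use.

```python
# A minimal sketch of CPU-only inference, assuming ONNX Runtime is installed.
# The model file and input name are hypothetical placeholders.
import numpy as np
import onnxruntime as ort

# Pin execution to the CPU provider; the same script runs on any x86 server,
# and the runtime picks up whatever SIMD support (AVX2, AVX-512, and so on)
# the processor exposes.
session = ort.InferenceSession("resnet50.onnx", providers=["CPUExecutionProvider"])

batch = np.random.rand(1, 3, 224, 224).astype(np.float32)  # dummy image batch
outputs = session.run(None, {"input": batch})
print(outputs[0].shape)
```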

Looking ahead, one of the big advantages that AMD might have is its emphasis on keeping memory, I/O, and compute in balance in such a way that single-socket servers make more sense as the CPU behind machine learning training workloads than the more traditional two-socket X86 servers that dominate the datacenters of the world. The Frontier system being built by Cray for Oak Ridge National Laboratory, using custom AMD CPUs and GPUs that need to support both AI and HPC workloads at exascale, is based on a single-socket Epyc node with four Radeon Instinct accelerators hanging off it, with coherent memory across the CPU and GPUs. That is much like what IBM did with the Power9 processor and its NVLink ports out to “Volta” Tesla GPU accelerators to create the Summit hybrid AI-HPC supercomputer. Summit had three GPUs per CPU, but Frontier will boost that ratio up to four GPUs per CPU, and do so using standard PCI-Express links – we are not sure if they are PCI-Express 4.0 or PCI-Express 5.0 yet – rather than a proprietary interface.

This could be the shape of things to come, and not just at Oak Ridge. But that still means pairing the right CPU to the right number of GPUs for the specific workload.
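As a rough illustration of that pairing, the sketch below spins up one worker process per accelerator on a hypothetical single-socket node with four GPUs, the shape of the Frontier-style node described above. This is not Cray's or Oak Ridge's software stack: it assumes a PyTorch build that can see the GPUs (the ROCm build reuses the torch.cuda namespace for AMD devices), and the matrix multiply is just a stand-in for real training work.

```python
# A minimal sketch, assuming a PyTorch build that exposes the node's GPUs.
# One worker process is spawned per accelerator hanging off the lone CPU socket.
import torch
import torch.multiprocessing as mp

def worker(rank: int, world_size: int):
    torch.cuda.set_device(rank)                      # bind this process to one GPU
    x = torch.randn(4096, 4096, device=f"cuda:{rank}")
    y = x @ x                                        # stand-in for a training kernel
    print(f"worker {rank}/{world_size} finished on device {torch.cuda.current_device()}")

if __name__ == "__main__":
    n_gpus = torch.cuda.device_count()               # expect four on a Frontier-style node
    mp.spawn(worker, args=(n_gpus,), nprocs=n_gpus)
```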

“It depends a lot on what the specific requirements are with regard to the latency in the serialized code,” Peddibhotla explains. “If you need a very fast turnaround time in a training workload, that might be the case where customers would take a higher bin CPU part. And I think the advantage for our roadmap here is the constancy of the connectivity up and down our product stack – we don’t artificially limit these features to just the high bin processors. So those customers who can tolerate some latency could go with a middle of the road product from our stack, but still get the handoff from PCI-Express and be able to attach multiple accelerators and get the job done. Those customers who are latency sensitive could take the top end of the stack and get the work finished faster, so as not to stall the PCI-Express links out to the accelerators, which are waiting for the serial code to finish before taking their next steps.”
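A back-of-the-envelope calculation shows why that serialized CPU time matters. The figures below are illustrative assumptions, not measurements from AMD or anyone else: if the accelerators do 80 milliseconds of work per training step, trimming the serial CPU portion from 20 milliseconds to 5 milliseconds cuts the share of each step the GPUs spend idle from 20 percent to about 6 percent.

```python
# An illustrative, assumed workload split, not measured data: how much of each
# training step the accelerators spend waiting on the serialized CPU code.
def gpu_idle_fraction(serial_cpu_s: float, gpu_kernel_s: float) -> float:
    """Fraction of a step during which the GPUs sit idle waiting on the CPU."""
    return serial_cpu_s / (serial_cpu_s + gpu_kernel_s)

gpu_kernel_s = 0.080                                  # assumed accelerator work per step
for serial_cpu_s in (0.020, 0.010, 0.005):            # mid-bin vs. higher-bin CPU scenarios
    idle = gpu_idle_fraction(serial_cpu_s, gpu_kernel_s)
    print(f"serial CPU time {serial_cpu_s * 1000:.0f} ms -> GPUs idle {idle:.0%} of each step")
```

That is the Amdahl's Law tradeoff behind choosing a higher bin CPU for latency-sensitive training jobs and a mid-range part for work that can tolerate the stalls.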
