The Future Of AI Training Demands Optical Interconnects

Artificial intelligence has taken the datacenter by storm, and it is forcing companies to rethink the balance between compute, storage, and networking. Or more precisely, it has thrown the balance between these three, as the datacenter has come to know it, completely out of whack. It is as if, all of a sudden, every demand curve has gone hyper-exponential.

We wanted to get a sense of how AI is driving network architectures, and had a chat about this with Noam Mizrahi, corporate chief technology officer at chip maker Marvell. Mizrahi got his start as a verification engineer at Marvell and, excepting a one-year stint at Intel in 2013 working on product definition and strategy for future CPUs, has spent his entire career as a chip designer at the company. He started with CPU interfaces on various PowerPC and MIPS controllers, eventually became an architect for the controller line, and then served as chief architect for its Armada XP Arm-based system-on-chip designs. Mizrahi was named a Technology Fellow in 2017 and a Senior Fellow and CTO for the entire company in 2020, literally as the coronavirus pandemic was shutting the world down.

To give a sense of the scale of what we are talking about, the GPT-4 generative AI platform was trained by Microsoft and OpenAI on a cluster of 10,000 Nvidia “Ampere” A100 GPUs and 2,500 CPUs, and the word on the street is that GPT-5 will be trained on a cluster of 25,000 “Hopper” H100 GPUs – probably with 3,125 CPUs in their hosts – with the GPUs offering on the order of 3X more compute at FP16 precision and 6X more if you cut the resolution of the data down to FP8 precision. That works out to a factor of 15X effective performance increase between GPT-4 and GPT-5.
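For what it is worth, that 15X figure is just simple arithmetic, assuming the rumored cluster sizes hold and the training run actually leans on the FP8 path end to end:

\[
\frac{25{,}000}{10{,}000} \times 6 = 2.5 \times 6 = 15
\]

Stick with FP16 and the same arithmetic only gets you 2.5 times 3, or 7.5X.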

This setup is absolutely on par with the largest exascale supercomputers being built in the United States, Europe, and China.

While Nvidia uses high speed NVLink ports on the GPUs and NVSwitch memory switch chips to tightly couple eight Ampere or Hopper GPUs together on HGX system boards, and has even created a leaf/spine NVSwitch network that can cross connect up to 256 GPUs into a single system image, scaling up that GPU memory interconnect by two orders of magnitude is not yet practical. And, we assume, the scale needs are going to be even larger as the GPT parameters and token counts all keep growing to better train the large language model.

The physical size of current and future GPU clusters and their low latency demands mean figuring out how to do optical interconnects. So, will Marvell try to create something like the “Apollo” optical switches that are at the heart of the TPUv4 clusters made by Google? Does it have other means to do something not quite so dramatic that still yields the kinds of results that will be needed for AI training? And how does the need for disaggregated and composable infrastructure fit into this as a possible side benefit of a shift to optical switching and interconnects? And where does the CXL protocol fit into all of this?

Find out by watching the interview above.


4 Comments

  1. Cool interview! Being in France, I think that it would be great if EuroHPC (say SiPearl) would get on board with (or engage in some world leadership in) this co-packaged silicon optics coherent networking tech, putting that together for their Exascale effort (with Marvell, or Ayar Labs, or whoever is ready to move this forward).

  2. Does anyone know anything about the CXL cache coherence protocol?
    Given that they are scaling to 256 GPUs, I’m going to assume it’s not a bus snooping type protocol. Are GPUs able to cache remote memory, or is the coherence only for the host processor? I’m assuming the coherence traffic is somehow reduced compared to a true CPU NUMA machine, but how?

    • I think that the answer to this key question may not be known that well at this time, but is possibly the subject of active research, for example in fenced rendering of pink ponies with multicore Coq, and DIY herding of armed cats in shared memory poetics — as seen in Dr. Jade Alglave’s (Ph.D. 2010, Paris 7, Diderot) award-winning research (Blavatnik Award, 2023; Royal Society Brian Mercer Award, 2015; Lead Concurrency Architect at Arm). Her focus may be more on consistency than coherency and snooping, but suggests (to me, maybe wrongly) that knowledge is still evolving around these issues (search terms: “jade alglave” or “jade alglave pony”; also: Coq is now called Cat it seems, due to cross-cultural challenges: https://www.theregister.com/2021/06/15/coq_programming_language_change/).

      • Nice plugarooni, HuMo, of a fellow Frenchy, one of that new generation of folks with most sizeable cerebral cortices (much better than GPT-7), born on the very same day that Yann LeCun was at a Paris cafe, with a baguette and accordion, thinking of the textual organization of chapters in his upcoming thesis on horror backpropagation. Somehow though, I doubt that published CXL standards rely on the future results of such active research … so, to Paul’s query (from CXL whitepapers), CXL 1.x and 2.0 (PCIe 5.0) consider a single host processor that uses snoop messages to manage coherency of data cached (if any) in attached device(s) (CXL.cache protocol). (A toy sketch of that single-host snoop flow is appended below this comment thread.)

        CXL 3.0 (PCIe 6.0) is more interesting as it introduces: 1) enhanced coherency with “active” attached devices (GPUs, FPGAs), and 2) memory sharing among multiple hosts. As far as I understand it, coherency in both cases is snoop-message oriented but acts over limited ranges of memory addresses (blocks, regions) to avoid traffic jams caused by snoop congestion. The trick (I think) will be to develop software that takes advantage of this restricted zonal coherency, to maintain the “psychologic” (a term borrowed from Gilbert Strang) illusion of a macroscopically NUMA system at the many-node level. A kind of Kung-Fu psychology for HPC gastronomy, I think, whereby the contents of the flying plates are not as important as their layouts, to maintain balance, and prevent indigestion in participating clients (opposite to fast-food cafeteria HPC).

        Then again, Siamak Tavallaei would probably offer a more technical, sedated, and accurate reply (if he/she is roaming around these parts … as in TNP’s August 9, ’22, “CXL Borgs IBM’s OpenCAPI”).
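
For readers who want that single-host, snoop-managed model in more concrete terms, here is a minimal toy sketch in Python. It is emphatically not the CXL message set – there are no D2H/H2D channels or real opcodes here – and the names HostHomeAgent, DeviceCache, and snoop_invalidate are invented for illustration. It only shows the basic idea: the host keeps a directory of the lines an attached device has cached, and invalidates the device’s copy before writing a line itself.

# Toy model of the single-host, snoop-managed coherency described above
# (in the spirit of CXL.cache in CXL 1.x / 2.0). The class and method names
# are invented for this illustration and are NOT the CXL spec's channels
# or opcodes.

class DeviceCache:
    """Accelerator-side cache holding copies of host memory lines."""
    def __init__(self):
        self.lines = {}                      # address -> cached value

    def fill(self, addr, value):
        self.lines[addr] = value             # device caches a line it read

    def snoop_invalidate(self, addr):
        self.lines.pop(addr, None)           # host told us to drop our copy


class HostHomeAgent:
    """Host side: owns memory and tracks which lines the device has cached."""
    def __init__(self, device_cache):
        self.memory = {}
        self.device = device_cache
        self.device_holds = set()            # directory of device-cached lines

    def device_read(self, addr):
        value = self.memory.get(addr, 0)
        self.device.fill(addr, value)
        self.device_holds.add(addr)          # remember the device has a copy
        return value

    def host_write(self, addr, value):
        if addr in self.device_holds:        # stale copy downstream?
            self.device.snoop_invalidate(addr)   # snoop it out first
            self.device_holds.discard(addr)
        self.memory[addr] = value            # now safe to update memory


if __name__ == "__main__":
    dev = DeviceCache()
    host = HostHomeAgent(dev)
    host.memory[0x1000] = 42
    print(host.device_read(0x1000))          # 42 -- device now caches the line
    host.host_write(0x1000, 99)              # snoop-invalidates the device copy
    print(0x1000 in dev.lines)               # False -- coherency maintained

In real CXL.cache hardware the same intent is carried by messages between the device and the host rather than method calls, and – as the comment above notes – CXL 3.0’s region-scoped approach narrows how much of the address space this kind of bookkeeping has to cover.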
