Meta Platforms Crafts Homegrown AI Inference Chip, AI Training Next

As we pointed out a year ago when some key silicon experts were hired from Intel and Broadcom to come work for Meta Platforms, the company formerly known as Facebook was always the most obvious place to do custom silicon. Of the eight largest Internet companies in the world, which are also the eight largest buyers of IT gear in the world, Meta Platforms is the only one that is a pure hyperscaler and that does not sell infrastructure capacity on a cloud.

And as such, Meta Platforms owns its software stacks, top to bottom, and can do whatever it wants to create the hardware that drives them. And it is rich enough to do so, and it spends so much money on silicon and on the people who tune software for that silicon that it could no doubt save a lot of money by taking control of more of its own chippery.

Meta Platforms is unveiling homegrown AI inference and video encoding chips at its AI Infra @ Scale event today, as well as talking about the deployment of its Research Super Computer, new datacenter designs to accommodate heavy AI workloads, and the evolution of its AI frameworks. We will be drilling down into all of this content over the next several days, as well as doing a Q&A with Alexis Black Bjorlin, vice president of infrastructure at Meta Platforms and one of the key executives added to the custom silicon team at the company a year ago. But for now, we are going to focus on the Meta Training and Inference Accelerator, or MTIA v1 for short.

For all of you companies who thought you were going to be selling a lot of GPUs or NNPs to Facebook, well, as we say in New York City, fuhgeddaboudit.

And over the long haul, CPUs, DPUs, and switch and routing ASICs might be added to that list of semiconductor components that are not going to be bought by Meta Platforms. And if Meta Platforms really wants to create a lot of havoc, it could sell its own silicon to others. . . . Stranger things have happened. Like the pet rock craze of 1975, just to give one example.

NNP And GPU Can’t Handle The Load With Good TCO

The MTIA AI inference engine got its start back in 2020, just as the coronavirus pandemic was causing everything to go haywire and AI had moved beyond image recognition and speech-to-text translation to the generative capabilities of large language models, which seem to know how to do many things they were not intended to do. Deep learning recommendation models, or DLRMs, are an even hairier computational and memory problem than LLMs because of their reliance on embeddings – learned, dense representations of the users, items, and context in a dataset – that have to be stored in the main memory of the computational devices running the neural networks. LLMs do not lean on enormous embedding tables the way DLRMs do, and that memory footprint is why host CPU memory capacity and fast, high bandwidth connections between the CPUs and the accelerators matter more to DLRMs than they do to LLMs.
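
To put some rough numbers on that, here is a minimal, hypothetical PyTorch sketch of a DLRM-style model. The table sizes and dimensions are made up for illustration and are tiny compared to production models, but even so the embedding tables account for essentially all of the parameters:

```python
# Illustrative only: a toy DLRM-style model showing that the embedding tables for
# sparse categorical features dwarf the dense MLPs. Table sizes are made up and are
# far smaller than anything Meta Platforms runs in production.
import torch
import torch.nn as nn

class TinyDLRM(nn.Module):
    def __init__(self, table_rows, embed_dim=64, dense_in=13):
        super().__init__()
        # One EmbeddingBag per categorical feature; rows = cardinality of that feature.
        self.tables = nn.ModuleList(
            nn.EmbeddingBag(rows, embed_dim, mode="sum") for rows in table_rows
        )
        # Dense ("bottom") MLP for continuous features, small "top" layer for the score.
        self.bottom = nn.Sequential(nn.Linear(dense_in, embed_dim), nn.ReLU())
        self.top = nn.Linear(embed_dim * (len(table_rows) + 1), 1)

    def forward(self, dense_x, sparse_ids, offsets):
        pooled = [t(ids, offs) for t, ids, offs in zip(self.tables, sparse_ids, offsets)]
        feats = torch.cat([self.bottom(dense_x)] + pooled, dim=1)
        return self.top(feats)

model = TinyDLRM(table_rows=[1_000_000, 500_000, 100_000])
embed_params = sum(p.numel() for t in model.tables for p in t.parameters())
total_params = sum(p.numel() for p in model.parameters())
print(f"embedding share of parameters: {embed_params / total_params:.3%}")  # ~100%
```

At FP16, those 1.6 million rows of 64-wide embeddings come to only about 200 MB; production tables run to billions of rows and hundreds of gigabytes, which is why they spill out of on-chip memory and into whatever DRAM the accelerator and its host can offer.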

Joel Coburn, a software engineer at Meta Platforms, showed a chart as part of the AI Infra @ Scale event that illustrates how the DLRM inference models at the company have grown in size and computational demand over the past three years and how the company expects them to grow over the next eighteen months:

Keep in mind, this is for inference, not training. These models absolutely blasted through the relatively small amount of low precision compute that the on-chip vector engines in most CPUs supply, and even the on-chip matrix math engines such as those now on Intel Xeon SP v4 and IBM Power10 chips, and coming to AMD Epycs soon, may not be enough.

Anyway, this is the kind of chart we see all the time, although we have not seen one for DLRMs before. This is a scary chart, but not terrifying like this one:

On the left side of the chart, Meta Platforms is replacing CPU-based inference with neural network processors for inference, or NNPIs, in the “Yosemite” microserver platforms we talked about way back in 2019. The CPU-based inferencing is still going on, you see, but as the DLRM models got fatter, they busted out beyond the confines of the NNPIs, and then Meta Platforms had to bring in GPUs to do inferencing. We presume that these were not the same GPUs that were used to do AI training, but rather PCI-Express cards like the T4 and A40 from Nvidia; Coburn was not specific. And then that started getting more and more expensive as more inferencing capacity was required.

“You can see the requirements for inference quickly outpaced the NNPI capabilities, and Meta pivoted to GPUs because they provide greater compute power to meet the growing demand,” Coburn explained in the MTIA launch presentation. “But it turns out that while GPUs provide a lot of memory bandwidth and compute throughput, they were not designed with inference in mind and their efficiency is low for real models despite significant software optimizations. And this makes them challenging and expensive to deploy in practice.”

We strongly suspect that Nvidia would argue that Meta Platforms was using the wrong device for DLRMs, or it might have explained how the “Grace” CPU and “Hopper” GPU hybrid would save the day. None of that appears to have mattered because Meta Platforms wants to control its own fate in silicon just as it did back in 2011 when it launched the Open Compute Project to open source server and datacenter designs.

Which begs the question: Will Facebook open source the RTL and design specs for the MTIA device?

Banking On RISC-V For MTIA

Facebook has historically been a strong proponent of open source software and hardware, and it would have been a big surprise if Meta Platforms had not embraced a RISC-V architecture for the MTIA accelerator. It is, of course, based on a dual-core RISC-V processing element, wrapped with a whole bunch of stuff but not so much that it won’t fit into a 25 watt chip and a 35 watt dual M.2 peripheral card.

Here are the basic specs of the MTIA v1 chip:

Because it runs at a low frequency, the MTIA v1 chip also burns a fairly low amount of power, and being implemented in a 7 nanometer process means it is small enough to run pretty cool without resorting to the most advanced processes from Taiwan Semiconductor Manufacturing Co, which range from 5 nanometers down to 3 nanometers with a 4 nanometer stepping stone in between. Those are more expensive processes, too, and perhaps something to be saved for a later day – and a later generation of device that does training and inference separately or together, as Google does with its TPUs – when those processes are more mature and consequently cheaper.

The MTIA v1 inference chip has a grid of 64 processing elements with 128 MB of SRAM wrapped around them, which can be used as primary storage or as a cache that front-ends sixteen low power DDR5 (LPDDR5) memory controllers. This LPDDR5 memory is the kind used in laptops and is also being used in Nvidia’s impending “Grace” Arm server CPU. Those sixteen channels of LPDDR5 memory can provide up to 64 GB of external memory, suitable for holding those big, fat embeddings that are necessary for DLRMs. (More on that in a moment.)

Those 64 processing elements are based on a pair of RISC-V cores, one plain vanilla and the other with vector math extensions. Each processing element has 128 KB of local memory and fixed function units to do FP16 and INT8 math, crank through non-linear functions, and move data.
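
Pulling the published figures together, a quick back-of-the-envelope sketch of the MTIA v1 memory hierarchy looks like this; nothing here is measured, it is just the numbers quoted above divided out:

```python
# Back-of-the-envelope view of the MTIA v1 memory hierarchy, using only the figures
# quoted above: 64 processing elements with 128 KB of local memory apiece, 128 MB of
# shared on-die SRAM, and sixteen LPDDR5 channels feeding up to 64 GB off chip.
PE_COUNT = 64
PE_LOCAL_KB = 128
SHARED_SRAM_MB = 128
LPDDR5_CHANNELS = 16
LPDDR5_MAX_GB = 64

local_total_mb = PE_COUNT * PE_LOCAL_KB / 1024
print(f"total PE-local memory : {local_total_mb:.0f} MB")                      # 8 MB
print(f"shared on-die SRAM    : {SHARED_SRAM_MB} MB")
print(f"off-chip LPDDR5       : up to {LPDDR5_MAX_GB} GB over {LPDDR5_CHANNELS} channels")
print(f"DRAM per channel      : {LPDDR5_MAX_GB / LPDDR5_CHANNELS:.0f} GB")     # 4 GB
```

The big embedding tables live mostly in that LPDDR5 (and, as described below, can spill into host DRAM), while the shared SRAM keeps the hot working set close to the processing elements.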

Here is what the MTIA v1 board looks like:

There is no way that fan can be put atop the chip and have a dozen of these fit inside a Yosemite V3 server. Maybe that is just being shown for scale?

Here is the neat bit in the MTIA server design. There is a leaf/spine network of PCI-Express switches in the Yosemite server that not only lets the MTIAs connect to the host but also to each other and to the 96 GB of host DRAM that can store even larger embeddings if necessary. (Just like Nvidia is going to do with Grace-Hopper.) The whole shebang weighs in at 780 watts per system – or a bit more than the 700 watts a single Hopper SXM5 GPU comes in at when it is running full tilt boogie.

That Nvidia H100 can handle 2,000 teraops at INT8 precision within 700 watts for the device, but the Meta Platforms Yosemite inference platform can handle 1,230 teraops within 780 watts for the system. A DGX H100 is 10,200 watts with eight GPUs, and that is 16,000 teraops, or 1.57 teraops per watt. The MTIA system comes in at 1.58 teraops per watt, and is tuned to Meta Platforms’ DLRMs and the PyTorch framework – and will be more highly tuned over time. We strongly suspect that a chassis of MTIAs costs a lot less per unit of work than a DGX H100 system – or else Meta Platforms would not bring it up.
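
Working those quoted peak ratings and wall power figures out into efficiency numbers looks like this; these are the peak figures divided out, not measured throughput on real models:

```python
# Reproducing the teraops-per-watt arithmetic from the peak figures quoted above.
# These are peak INT8 ratings and wall power, not measured throughput on real DLRMs.
systems = {
    "Nvidia H100 SXM5 (one GPU)":        (2_000, 700),      # teraops INT8, watts
    "Nvidia DGX H100 (eight GPUs)":      (16_000, 10_200),
    "Meta Yosemite MTIA (full system)":  (1_230, 780),
}
for name, (teraops, watts) in systems.items():
    print(f"{name:34s} {teraops / watts:5.2f} teraops per watt")
# H100 device:   2.86 teraops per watt
# DGX H100:      1.57 teraops per watt at the system level
# MTIA system:   1.58 teraops per watt at the system level
```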

Raw feeds and speeds are not the best way to compare systems, of course. DLRMs have varying levels of complexity and model size, and not everything is good at everything. Here is how the DLRMs break down inside of Meta Platforms:

“We can see that the majority of the time is actually spent on fully connected layers, followed by embedding bag layers, and then trailed by long-tail operations like concat, transpose, quantize, dequantize, and others,” explained Roman Levenstein, engineering director at Meta Platforms. “The breakdown also gives us insight into where and how MTIA is more efficient. MTIA can reach up to two times better performance per watt on fully connected layers compared to GPUs.”
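
For those who want to see that kind of per-operator breakdown on their own models, here is a minimal sketch using the stock PyTorch profiler on a made-up, recommendation-style toy; this is generic PyTorch tooling, not Meta’s internal instrumentation or its production DLRMs:

```python
# Minimal sketch: per-operator time breakdown for a toy recommendation-style model
# using the stock PyTorch profiler. The model and sizes are made up for illustration.
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

# One embedding bag for a sparse feature plus a couple of fully connected layers.
table = nn.EmbeddingBag(1_000_000, 64, mode="sum")
mlp = nn.Sequential(nn.Linear(64 + 13, 256), nn.ReLU(), nn.Linear(256, 1))

batch = 256
dense_x = torch.randn(batch, 13)
ids = torch.randint(0, 1_000_000, (batch,))
offsets = torch.arange(batch)  # one lookup per sample keeps the offsets trivial

with torch.no_grad(), profile(activities=[ProfilerActivity.CPU]) as prof:
    pooled = table(ids, offsets)
    out = mlp(torch.cat([dense_x, pooled], dim=1))

# Expect the fully connected layers (aten::addmm) and aten::embedding_bag near the
# top of the table, with concat and other small ops forming the long tail.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```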

Here is how the performance per watt stacks up on low complexity, medium, and high complexity models:

Levenstein cautioned that the MTIA device was not yet optimized for DLRM inference on models with high complexity.

We are going to try to find out which NNPIs and GPUs were tested here and do a little price/performance analysis. We will also contemplate how an AI training device could be crafted from this foundational chip. Stay tuned.

5 Comments

  1. I’ve been wondering about proper use cases for RISC-V for a while (aside from quadfurcated pet-rock droop engines), and this might be one, maybe. ARM and POWER seem much closer to the sweet spot between x86/64-CISCs and very simple RISCs, for generic workloads, in terms of the balance between performance and efficiency. Preemptive multitasking, virtual machines, and automated garbage collection (gc), as found in Java programs running on contemporary OSes (for example), require some solid CPU oomph, like that provided by ARM/POWER and x86/64. But, single-tasking machine code, or even C, running on bare metal, with manual memory allocation, is probably fine performance-wise on RISC-V. Patterson’s perspective (in letting go of SOAR and SPUR in favor of RISC-V) may have been “exactly” this, that a computer would have a sufficiently large number of simple cores (and spare ones) to run each (simultaneous) process on a separate physical core, without any need for context-switching, without virtualization, without “simultaneous multi-threading”, and possibly without gc too. To do this well (it seems to me) the OS should be adapted to remove most of its advanced modern bits that enable these features, and focus more (or only) on dispatch of ready-processes to free cores, or the system (or device) could run without an OS at all. The use of RISC-V in this Meta MTIA seems to fit this perspective, as I doubt that this particular chip is meant to ever run an OS (similarly to a GPU). Hopefully they find a way to fit that paunchy-portly rotund-yummy dragon-stout fan in there somewhere!

  2. >>NNP And GPU Can’t Handle The Load With Good TCO<< TCO = TOTAL Cost of Ownership. And to that I'd wonder if the green shaded Meta bean counters have factored in development, deployment, software, support and maintenance costs into their counting machines? Core business semiconductor guys spread these costs over thousands of customers while Meta has one. With late generation devices reaching into the $Bs to properly develop and deploy, I'm not sure the math pencils out.

  3. I think a lot of people are going to be rather surprised at where RISC-V pops up and how quickly it evolves to compete head to head with “top end” processors from the Arm and x64 incumbents. From a programmer’s perspective (outside the device driver and compiler writer realms), programmers don’t really care all that much about the details of the CPU instruction sets nowadays. Anyone who thinks they do is invited to compare code base sizes between the 80s and the 2020s – modern code is not written to the metal any more.

  4. Do you think the likes of a western Apple, or an oriental Baidu or Huawei are, for some time ago now, pioneering with ethereal champions expert and experienced in their respective novel fields leading AI Interference with AI Inference Training for Chipped Platforms in Linked Computerised Networks ……. in their own inimitable stealthy unilateral way, failsafe secure and proprietary intellectual property protected within their sealed gardens and almighty fortresses of virtual delights …… and making ready to lay immaculate waste to any and all wilfully negative and punitive competition or similarly disposed opposition with a vast and comprehensive series of controlled releases for subsequent universal deployment/employment/enjoyment?

    And you surely do realise if you think not, the likes of an Apple or a Baidu or a Huawei streak ever further away, and way out ahead in front, with nothing to hinder or stop progress leading the Internetworking of Things with their way of doing all things.

    Do you realise that is where you and they, and IT and AI, are presently at?

  5. @Timothy,
    “Which begs the question: …”
    No it doesn’t, it RAISES the question. “Begging the question” is a rare logical fallacy. The phrase seems to give your question more rhetorical heft, but it’s wrong.

    It’s crazy that keeping us scrolling through an endless feed of divisive inflammatory posts can pay for tens (hundreds?) of millions in custom chip design and fabrication to optimize Deep Learning Recommendation Models. Just show me what my friends have been doing!
