If you are looking for an alternative to Nvidia GPUs for AI inference – and who isn’t these days with generative AI being the hottest thing since a volcanic eruption – then you might want to give Groq a call. It is ramping up production on its Language Processing Units, or LPUs, alternatively known as the GroqChip, and it expects to be able to ship a huge number of them in support of inference for large language models.
As we are fond of saying these days, if you have a matrix math engine that can support a generative AI model, then you can sell it to someone who is desperate to not be left behind in the early days of the generative AI boom. The CS-2 wafer-scale processor from Cerebras Systems, the SN40L Reconfigurable Dataflow Unit from SambaNova Systems, and the Gaudi 2 and its follow-on Gaudi 3 engines from Intel are but a few examples of compute engines that are not Nvidia GPUs or AMD GPUs that are getting traction because demand for the HBM memory and advanced packaging that they employ is crimping supply. The GroqChip LPUs are unique in that they do not depend on HBM from Samsung or SK Hynix and the CoWoS packaging from Taiwan Semiconductor Manufacturing Co that welds external HBM to compute chips.
As the “OceanLight” supercomputer in China, which is based on homegrown SW26010-Pro processors etched with 14 nanometer processes, proves full well, you do not have to use advanced processes and packaging to build a compute engine that can get real HPC and AI work done at tremendous scale. The OceanLight architecture ranks among the most computationally efficient machines ever built, and is probably a tiny bit more powerful at running real workloads than the “Frontier” supercomputer at Oak Ridge National Laboratory, if the anecdotes and the Gordon Bell Prize submissions and awards are any measure.
And so, Groq co-founder and chief executive officer Jonathan Ross is perfectly in step with the times when he contends that clusters of Groq LPUs will provide higher throughput, lower latency, and lower cost for LLM inference than using Nvidia GPUs. Admittedly, that is a tall order for a 14 nanometer chip that debuted a few years ago and that was inspired by Google’s homegrown Tensor Processing Unit, or TPU. But as anyone can plainly see, the high demand and relatively low supply of GPU motors from Nvidia and AMD give Groq and the other matrix math engine suppliers the opportunity that they have been waiting for to get more of their iron into the field.
The question we had was what has been gating Groq so far. Was it that their software stack wasn’t ready? Was it that AI models were created for GPUs and needed to be tuned? Nope.
“I’ll be direct and explain what we were gated by,” Ross tells The Next Platform, and we like that sort of thing as you might imagine. “There were a hundred startups in this space, all claiming that they were going to be 10X better than Nvidia, and people would dig in and that didn’t tend to materialize. We had a very complicated story because we were doing things very differently. No one buys something because something is better, but because they have problems that are unsolved. You have to solve an unsolved problem. And until recently, we were going to people and giving them a problem saying we could lower your cost if you switch to our chips, or we could speed things up and they were telling us it’s fast enough, it’s cheap enough, you’re just giving me a problem. But now, people have these models that they cannot run fast enough. And so we’re solving their problem, and that’s a very different sales motion. Up until we got a demo for large language model inference about two months ago, we had zero interest. Now, we are beating people away with a stick and we are having fights internally over how we allocate the hardware to customers. Our first 40 racks are already allocated, and we are on track to deploy what we believe is the equivalent of all of OpenAI’s tokens per second in the next twelve months as our plan of record, and may deploy more than that. We have an unencumbered supply chain – we don’t have HBM, we don’t have CoWoS, so we’re not competing with all of them for those technologies.”
Let the technical and economic substitution begin!
Here is what Groq is proposing for commercial-grade inference that must have sub-second response time on LLM replies. A pod of its current generation of GroqChips has an optical interconnect that can scale across 264 chips, and if you put a switch between pods, you can scale further but you get the extra hop across the switch jumping between pods, which adds latency. In the next generation of the GroqRack clusters, the system will scale across 4,128 GroqChips on a single fabric, but that is not ready for market just yet, according to Ross. Groq’s next-generation GroqChip, due in 2025, etched in Samsung’s 4 nanometer processes, will scale even further than this thanks to the process shrink, the architectural enhancements, and the advancements in the fabric on the chips.
For its benchmarking, Groq linked together 576 of its GroqChips and ran inference against the LLaMA 2 model from Meta Platforms, scaling to 70 billion parameters. The GroqRack has nine nodes, and normally these are eight nodes for compute and one as a redundant spare, but for the benchmarks all nine nodes were used for compute to get 576 of these chips linked in three switched pods across eight racks. (Each node has eight GroqCard adapters.)
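The chip count can be checked from the node and rack figures above. Here is a minimal sketch, assuming one GroqChip per GroqCard adapter; the function name is ours, for illustration only:

```python
# Figures from the text; one chip per GroqCard is an assumption.
CARDS_PER_NODE = 8
NODES_PER_RACK = 9   # all nine nodes ran compute in the benchmark
RACKS = 8

def chips_in_cluster(racks, nodes_per_rack=NODES_PER_RACK,
                     cards_per_node=CARDS_PER_NODE):
    """Total chips across a cluster of identical racks."""
    return racks * nodes_per_rack * cards_per_node

benchmark_chips = chips_in_cluster(RACKS)
# Normal configuration: eight compute nodes plus one spare per rack.
normal_chips = chips_in_cluster(RACKS, nodes_per_rack=8)
print(benchmark_chips, normal_chips)
```

With all nine nodes in each of the eight racks doing compute, the count comes to 576 chips; holding one node per rack back as a spare would leave 512.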
The LLaMA 2 prompts had 512 token inputs and 1,024 token outputs at INT8 processing. Groq compared this setup to Nvidia H100 GPUs – specifically, an eight-GPU HGX system board, which is becoming the unit of compute for generative AI training and sometimes inference – and says those 576 GroqChips can do an inference in one-tenth the time at one tenth the cost of generating the tokens. It takes the Nvidia GPU somewhere on the order of 10 joules to 30 joules to generate each token in a response, while the Groq setup takes about 1 joule to 3 joules per token. So that is 10X the speed of inference at one tenth the cost, or 100X better price/performance.
Read it again: Groq says it can deliver 100X better bang for the buck at 10X the speed for LLaMA 2 inference.
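The 100X figure is simply the two claimed factors multiplied together, and the joules-per-token ranges quoted above give the same 10X ratio at either end. Here is a quick sketch of that arithmetic, using only the numbers in the text; the function names are illustrative:

```python
def price_performance_gain(speedup, cost_reduction):
    """Performance per dollar improves by speedup times the cost reduction."""
    return speedup * cost_reduction

def energy_ratio(gpu_joules_per_token, lpu_joules_per_token):
    """How many times more energy per token the GPU uses than the LPU."""
    return gpu_joules_per_token / lpu_joules_per_token

gain = price_performance_gain(speedup=10, cost_reduction=10)
low_end = energy_ratio(10, 1)    # 10 joules versus 1 joule
high_end = energy_ratio(30, 3)   # 30 joules versus 3 joules
print(gain, low_end, high_end)
```

These are Groq's claimed factors, not independent measurements, so the result is only as good as the inputs.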
Now, there are arguably a lot more Groq devices to make this happen – one fat Nvidia server versus eight racks of Groq gear – but it is hard to argue with 1/10th the overall cost at 10X the speed. The more space you burn, the less money you burn.
You can of course scale Nvidia’s SuperPODs to 256 GPUs in a single memory space, and that does allow for larger models and more parallel processing to speed up the tokens per second. But that comes at the cost of paying for an NVSwitch fabric across those nodes, which ain’t free.
On the demo that Ross did for us, the Groq setup with 576 chips was able to push above 300 tokens per second on the prompts we did, and he says the typical Nvidia GPU setup is lucky to push 10 tokens to 30 tokens per second.
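To put those rates in terms of waiting time, here is a rough sketch of how long a 1,024-token reply (the output length used in the benchmark) would take to generate at each rate. It assumes a constant generation rate and ignores prompt processing and time to first token, neither of which the text quantifies:

```python
OUTPUT_TOKENS = 1024  # benchmark output length from the text

def generation_seconds(tokens, tokens_per_second):
    """Time to emit a reply at a constant token rate (an assumption)."""
    return tokens / tokens_per_second

rates = {
    "Groq, 576 chips": 300,
    "GPU setup, high end": 30,
    "GPU setup, low end": 10,
}
for label, rate in rates.items():
    print(f"{label}: {generation_seconds(OUTPUT_TOKENS, rate):.1f} s")
```

At 300 tokens per second the reply arrives in a little over three seconds, against roughly half a minute to well over a minute at 10 to 30 tokens per second.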
The Groq approach is wide, slow, and low power and makes it up in parallel across many units with lots of local SRAM memory next to the compute, while the Nvidia approach is faster on the matrix math and much faster on the main memory that is stacked up and running in parallel.
“In twelve months, we can deploy 100,000 LPUs and in 24 months we can deploy 1 million,” Ross declares, and it is not at all hard to believe that not only will this be possible, but it will be salable because of the dearth and high cost of GPUs and the fact that a lot of organizations want to move away from proprietary models like GPT-3.5 and GPT-4 from OpenAI and towards the very open LLaMA 2 from Meta Platforms.
If you have a compute engine that can run PyTorch and LLaMA 2, and if it doesn’t cost too much, you can sell it.
Thinking ahead to that next-generation GroqChip, Ross says it will have a 15X to 20X improvement in power efficiency that comes from moving from 14 nanometer GlobalFoundries to 4 nanometer Samsung manufacturing processes. This will allow for a lot more matrix compute and SRAM memory to be added to the device in the same power envelope – how much remains to be seen. At constant power, that would be a 3.5X reduction in the number of chips to do the same work on the exact same design, and architectural improvements could put that to 5X or even higher, we think. So what takes 576 GroqChips in eight racks to accomplish for LLaMA 2 70B inference today might only take around 100 chips in two racks in 2025.
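The chip counts implied by those reduction factors can be worked out directly. This is a back-of-the-envelope sketch, assuming the 72 chips per rack of the benchmark configuration (nine nodes of eight cards) carries over, which the text does not confirm for the next generation:

```python
import math

CURRENT_CHIPS = 576
CHIPS_PER_RACK = 72  # assumption: same rack density as the benchmark setup

def chips_needed(current_chips, reduction_factor):
    """Chips required after dividing the current count by a reduction factor."""
    return math.ceil(current_chips / reduction_factor)

def racks_needed(chips, chips_per_rack=CHIPS_PER_RACK):
    """Whole racks required to house the given number of chips."""
    return math.ceil(chips / chips_per_rack)

for factor in (3.5, 5):
    chips = chips_needed(CURRENT_CHIPS, factor)
    print(f"{factor}X reduction: {chips} chips in {racks_needed(chips)} racks")
```

A 5X reduction lands at 116 chips in two racks, in line with the rough estimate of around 100 chips, and anything beyond 5X pushes the count lower still.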
In the interim, Groq has a new node coming that boosts the number of chips it has in the node by 4X, from the eight LPUs implemented on PCI-Express cards in the current GroqNode to what we presume will be directly mounted LPU chips on four motherboards (with eight LPUs per board) interlinked to get to 32 of them in a chassis. By moving to 32 LPUs in a node, the cost, power, and latency of the overall cluster will go down, says Ross.
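The effect of the denser node on a cluster the size of the benchmark setup is easy to work out. Here is a small sketch using the figures above; the four-board, 32-LPU chassis layout is a presumption, not a confirmed specification:

```python
BOARDS_PER_NODE = 4        # presumed layout of the new node
LPUS_PER_BOARD = 8
CURRENT_LPUS_PER_NODE = 8  # eight PCI-Express cards in today's GroqNode

def nodes_for(total_lpus, lpus_per_node):
    """Nodes needed to hold total_lpus, rounding up (integer ceiling division)."""
    return -(-total_lpus // lpus_per_node)

new_lpus_per_node = BOARDS_PER_NODE * LPUS_PER_BOARD
density_gain = new_lpus_per_node // CURRENT_LPUS_PER_NODE
nodes_today = nodes_for(576, CURRENT_LPUS_PER_NODE)
nodes_new = nodes_for(576, new_lpus_per_node)
print(density_gain, nodes_today, nodes_new)
```

On these numbers, a 576-LPU cluster would shrink from 72 nodes to 18.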
“That will hold us over until we get that next chip,” he adds.
Now, Ross might have said Groq can deploy 1 million LPUs in 24 months, but that doesn’t mean customers will buy that many over that time. But even at $1,000 a pop, that would be $1 billion.
Also, don’t confuse the idea of generating a token at one-tenth of the cost with the overall system costing one-tenth as much. The Groq cluster that was tested has very high throughput and very high capacity and that is how it is getting very low latency. But we are pretty sure that a Groq system with 576 LPUs does not cost one tenth that of a DGX H100, which runs somewhere north of $400,000 these days. If you can get 576 LPUs for $40,000, by all means place your orders with Groq right now. You will be hard pressed to find a better deal than $69 each for a datacenter-class AI inference engine and the chassis and networking wrapped around it.
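The dollar figures in the last two paragraphs are straightforward to reproduce. This sketch uses only numbers from the text; the $1,000 price per LPU is a hypothetical used for scale, not a quoted price:

```python
def revenue(units, price_per_unit):
    """Total revenue for a number of units at a flat price."""
    return units * price_per_unit

def cost_per_unit(system_cost, units):
    """Average cost per unit when a system cost is spread across its units."""
    return system_cost / units

total = revenue(1_000_000, 1_000)   # 1 million LPUs at a hypothetical $1,000
tenth_of_dgx = 400_000 // 10        # one tenth of a roughly $400,000 DGX H100
per_lpu = cost_per_unit(tenth_of_dgx, 576)
print(f"${total:,} total; ${per_lpu:.2f} per LPU")
```

One tenth of the DGX H100 price spread over 576 LPUs works out to a little over $69 apiece, which is why the per-token cost claim should not be read as a system price.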
We strongly suspect that Groq was talking about joules per token per second in the data that we saw, and perhaps the latency as you scale Nvidia infrastructure beyond the limits of the NVSwitch coherent interconnect for GPUs, which is a real barrier when it comes to inference latency.
Quite insightful! I’m not entirely clear though if the LPU is the same GroqChip that was described at Hot Chips 34 (in 2022) as: “The Groq Software-defined Scale-out Tensor Streaming Multiprocessor” (TSP)? TNP had a nice article on this TSP, with diagrams, in 2020 ( https://www.nextplatform.com/2020/09/29/groq-shares-recipe-for-tsp-nodes-systems/ ) and Argonne seems to have some GroqRacks installed at its AI Leadership Testbed, with some chip in them (LPU? TSP? GroqChip?) ( https://www.alcf.anl.gov/events/groq-ai-workshop ).
So, what I wonder is if the LPU is a particular software-defined configuration of the “flexible” TSP GroqChip(?), or different (more specialized) silicon altogether?
Irrespective, the LPU’s performance is impressive (to me) and I’m glad Groq is getting positive attention for it, along with world record LLaMA-2 racing performance! (but, “Inquisition Minds” …).
My sense of it was that it was the same 14 nanometer silicon, but with some software-defined tweaks to how it is used.