A Look Inside the Groq Approach to AI Inference

If the only thing you really know to date about machine learning chip startup Groq is that it is led by one of the creators of Google’s TPU and that it will target inference, don’t worry, you didn’t miss anything.

Details about the company’s architecture and approach have been scant and opaque, which has been purposeful, and not because the team is uncertain about what they’ve developed. Following tapeout and some internal benchmarking with an eye on what the next big models will be for big datacenter deployments, the team appears confident they are at the edge of the next wave of AI devices with bare-bones hardware that took its style cues from a compiler first. Then again, all the AI chipmakers feel this way. This, however, might be the first truly different thing we’ll see, and it comes from a team that can actually reach into hyperscale datacenters as easily as into the big automakers.

Following a lengthy, cagey conversation with Groq co-founder and CEO Jonathan Ross, we were able to walk away with a sense of what this chip is and, more important, what it is not. With some of these elements we can piece together a picture of this differentiation. The goal of the conversation (aside from prying) was to set the stage for the architecture deep dive we will receive in the next few months or so. With that said, we did gather some interesting architectural and conceptual data about how the forthcoming chips were conceived and designed, and how Groq will intersect the market with the right inference device to fill the gaps in large-scale datacenter AI deployments.

And yes, to repeat, we are talking about inference specifically. Some of the early rumors about Groq were focused on its future role in the training market, but Ross insists the team’s starting point was always inference, especially after seeing large-scale model deployments at Google and what it would take to run those. “From the time of the first TPU deployments it became clear inference was the much bigger problem. Training is pretty much a solved problem. They can always chip away at accuracy and precision but the time it takes to train is not as big of a problem any longer. Costs are down and it’s a one-time cost, not a revolving one,” Ross says.

“Inference is an inherently larger market. Training scales with the number of machine learning researchers you have, inference scales with the number of queries or users. Training is compiling, inference is running,” he adds.

“Inference is much more difficult too, for that matter. Training can be solved by throwing a bunch of money at the problem; it can be solved at a system level by taking existing architectures, stitching a bunch of chips together, and getting a sufficient gain. With inference, it’s about deploying that across a large fleet of devices, perhaps millions of servers, each with their own inference device.” Further, each chip in that case needs to be efficient and robust. “In training you can periodically checkpoint and restart but that’s no luxury in inference. It’s real time, latency sensitive, and needs ultra-performance and efficiency and that compounds in large-scale datacenter deployments.”

We are at a point now in the datacenter where sizable, complex models can be trained but ultimately, not deployed because it’s just too expensive.

Ross remembers when Google’s Jeff Dean did the math before the TPU rollout and shared that while they could train production-level models, they could not afford to deploy them with existing architectures because it would be too expensive. “If they were going to deploy speech recognition for everyone, Google would have to build 2-3X the number of datacenters, an extra 20-40. These costs are measured in the billions. If you do the math the other way in terms of operational costs just for speech recognition they would have to double their computational power.” And now, Ross tells us, in terms of operations per second, about half of Google’s compute is running on the TPU. That’s an interesting set of metrics (particularly the last) but it highlights some of the key problems in inference at scale Groq wants to tackle. And no, it is not a knockoff TPU. Not even close, he says.

What Groq is not doing is iterating on Google’s AI processor, or GPUs, or FPGAs, or even CPUs as we know them. We will know much more in the coming months (we’ve been assured an architecture deep dive as soon as it’s allowed) but there are some notable hints that do seem to highlight Groq’s potential uniqueness in the market for datacenter inference in particular.

Here are a few themes about the architecture Ross described for us that offer some insight.

It’s More Minimalist Than We Might Expect

“When our new engineering hires come on board they are all shocked at how simple the architecture is,” Ross tells us. This isn’t necessarily a surprise because the tensor-based hardware concept is novel but not necessarily difficult.

“A lot of what we are seeing in the AI space now is a clever variation on existing themes: FPGA, CPU, GPU, and TPU. Some are going the direction of 4,000 core CPUs, others with FPGAs that can be reconfigured in microseconds, and others are trying to make GPUs with more capable cores or cores that are independently programmable,” Ross explains. “Each of these approaches has its own custom software stack, even when they’ve just tweaked existing designs. There’s not a lot of reuse potential in these options or advantage. So the question becomes, is it worth using one of those when it doesn’t get you to where you need to be, which is well over a 10% improvement? ML is expensive, there’s a thirst for compute and there are a lot of applications that aren’t feasible to deploy. 10% improvements aren’t going to get you there, you have to do something radically different.”

“As engineers we needed to come up with what core axioms to build around and much of this went against a lot of the common wisdom in semiconductors where the belief is that specialization gives improved performance.”

Again, we are limited by what they are allowed to say but it was obvious this is not a datacenter accelerator (meaning it doesn’t snap into PCIe).  With that said, this is a CPU.

We should not think of this as a traditional processor, however, Ross says. The thing that falls down there is that while general purpose CPUs are great at serial, there’s a lot of overhead in coordinating hundreds or thousands of them and that eats most of your gains. ML is not about serial processing, it’s focused on parallel processing and while that sounds like something a GPU should do well there’s so much extraneous hardware in those devices that the gains are also lost.

Ross is cagey but says they can get a significant advantage out of getting rid of a lot of that extraneous hardware. The goal is an enormous amount of parallel throughput, and existing devices have too much jammed inside to make them efficient, scalable, and high performance.

Batch Size 1, Batch Sparsity Key Advantages

If you’re not a practitioner this whole conversation about batch sizes for inference devices might not seem like the big deal it has been for training. But it definitely matters. In training one might have large batch sizes to train across 2 million images, taking 64 images at a time, for instance, with no real hardship. Inference is a different story.

With larger batch sizes you get the same performance whether there are 64 inputs or 1. The latency and processing time will be the same. If you only have one input you’re wasting most of the hardware. So if you’re doing batch size 64 and your inputs are much lower (and in many use cases they will be), there’s a lot left on the table.

Consider an autonomous driving example. If you’re trying to infer what different roadway signs are but there are only three in an image, you’re getting 3/64 of the compute. But if you can run batch size 1 you’re getting 100% efficiency. So why isn’t this already a thing? Ross says the problem until now is that people have been designing for the hardware they have, not for what they can have.
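The arithmetic behind the road sign example can be written out as a small sketch. It assumes a simplified model in which a fixed-size batch costs the same to run no matter how many slots hold real inputs; the function name `batch_utilization` is ours, not Groq's.

```python
def batch_utilization(real_inputs: int, batch_size: int) -> float:
    """Fraction of a fixed-size batch that is occupied by real inputs.

    Simplifying assumption: the device spends the same time and energy on
    a batch regardless of how many of its slots carry useful work.
    """
    if batch_size < 1 or not 0 <= real_inputs <= batch_size:
        raise ValueError("need batch_size >= 1 and 0 <= real_inputs <= batch_size")
    return real_inputs / batch_size


# Three road signs in one image, hardware tuned for batch size 64.
print(batch_utilization(3, 64))  # 0.046875, i.e. 3/64 of the compute is useful

# The same work issued one input at a time on a batch-size-1 design.
print(batch_utilization(1, 1))   # 1.0
```

Under this model the unused 61 slots are pure overhead, which is the gap a batch-size-1 design is trying to close.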

“Most of the people we’re talking to will not even consider a larger batch size for inference, they will only deploy batch size 1,” Ross says. “Batch size 2 or 4 are not relevant in the inference market we are focused on, although they might be a fit for some workloads.” What this means is an enormous memory bandwidth problem, which helps explain why few AI hardware makers have put extensive emphasis here.

Here’s another important addition. Historically, people have just used large batch sizes and not thought about the potential that opens up with batch size 1. The concept of “batch sparsity,” which means you can change the model you’re running from one input to the next, makes it possible to run a highly customized model that can change on the fly. With this, imagine speech recognition inference at a call center where 64 people call in at the same time. Before, there would be one model to serve all, but with this approach the system can detect which variation of the model suits each caller and run that one (for instance, “the drunk model” or the “sober and clear speaking model” or one to suit different accents). This leads to a dramatically different user experience (although at what price point, speed, programmability, or other cost only time can tell).
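To make the call center idea concrete, here is a toy sketch of per-input model selection. Everything in it is hypothetical: the two stand-in "models", the `VARIANTS` table, and the `pick_variant` selector are illustrations of the concept as Ross described it, not anything Groq has disclosed.

```python
from typing import Callable, Dict, List, Tuple


# Hypothetical model variants; each stands in for a separately tuned model.
def clear_speech_model(audio: str) -> str:
    return f"clear:{audio}"


def slurred_speech_model(audio: str) -> str:
    return f"slurred:{audio}"


VARIANTS: Dict[str, Callable[[str], str]] = {
    "clear": clear_speech_model,
    "slurred": slurred_speech_model,
}


def pick_variant(caller_profile: str) -> str:
    """Hypothetical selector: unknown profiles fall back to the clear model."""
    return caller_profile if caller_profile in VARIANTS else "clear"


def transcribe(calls: List[Tuple[str, str]]) -> List[str]:
    """Handle each call on its own (batch size 1), so the model can change
    from one input to the next instead of one model serving the whole batch."""
    return [VARIANTS[pick_variant(profile)](audio) for profile, audio in calls]


print(transcribe([("clear", "call-1"), ("slurred", "call-2"), ("unknown", "call-3")]))
```

With a batch of 64 sharing one model, the selection step would have nowhere to go; it only becomes useful once each input is scheduled independently.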

The Compiler Came Before the Hardware

All of the above is possible because the team started with the compiler.

Ross was not, by the way, a hardware engineer when the TPU came about. He was on the software and compilers side of the house. He says this starting point is the reason Groq’s design is so radically unlike anything on the market.

The first six months of Groq were spent on just the compiler, before any hardware. Only after that did the team look into what the right architecture should be.

“So how that works is a starting point with a four dimensional tensor, then you realize you’re not building a 4D chip, so you move to one or two dimensions with complicated operations and then realize you won’t put dedicated hardware into the chip until you break all of that into smaller operations to run those instructions and you’re left with the design that could only happen if you started with the hardware.”

Ross tells us that the compiler has managed to pare down compile time to seconds.

Keepin’ It Real with FP16

We don’t need to tell you that all the innovative architecture in the world can’t compare to ease of use. The vision Ross consistently gave of Groq is that they are led by what the people they’ve had conversations with want (batch size 1 being just one example). “Many of the models we see are difficult to quantize to INT-8. Things like LSTM and RNNs really prefer floating point.”

“If you’re building for just one type of numeric, you’ll find you can’t support the majority of customers because not everyone is willing to quantize,” he adds.
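One commonly cited reason some tensors resist INT8 quantization is wide dynamic range: a single large value sets the scale and small values collapse to zero. The sketch below illustrates that general effect with simple symmetric per-tensor quantization and made-up values; it is not a claim about why the specific models Ross mentioned are hard to quantize, nor about Groq's numerics.

```python
import struct
from typing import List


def int8_roundtrip(values: List[float]) -> List[float]:
    """Symmetric per-tensor INT8 quantization followed by dequantization."""
    scale = max(abs(v) for v in values) / 127.0
    if scale == 0.0:
        return [0.0 for _ in values]
    out = []
    for v in values:
        q = max(-127, min(127, round(v / scale)))  # clamp to the int8 range
        out.append(q * scale)
    return out


def fp16_roundtrip(values: List[float]) -> List[float]:
    """Round each value to IEEE 754 half precision and back."""
    return [struct.unpack("e", struct.pack("e", v))[0] for v in values]


# Illustrative tensor with one large outlier (values are made up).
values = [0.001, 0.02, 0.5, 100.0]

print(int8_roundtrip(values))  # the two smallest entries come back as 0.0
print(fp16_roundtrip(values))  # every entry survives with small relative error
```

Finer-grained schemes (per-channel scales, calibration, quantization-aware training) can reduce this loss, but they are extra work, which fits Ross's point that not every customer is willing to quantize.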

It Can Handle Training But That’s Not Its Intended Purpose

As a casual aside, like it was no big deal, Ross mentioned that one of the first test customer models they compiled was an RNN and they beat the V100 significantly.

We didn’t get more details on this or the specifics of how and why this was the case, but he just put that there, and so will we. If one were suspicious at all, one might be led to think that Groq is purposefully keeping all details about training capabilities mum so they are not perceived as carrying extraneous hardware to serve training, or as trying to create the mythical (energy hogging) chip to serve both sides of the ML workload. But that is just conjecture from the TNP side. It’s a smart strategy if it is the case, but the real test will come if we see Groq produce MLPerf results for inference and training in 2020 (Ross says they’ll have MLPerf results in 2020 but gave no other specifics).

It’s Deterministic and Not Reliant on Locality

But here’s the catch. It’s not what we think of in compsci when we hear “deterministic” or think about “locality”.  And here’s the other catch, even though those terms aren’t what we think, they are two major differentiators.

By deterministic, Ross says he means that at compile time, the user will know exactly (to the many decimals kind of exactly) how long a model will take to run. This is also one of those things that might not sound like a big deal on paper, but with increased scale (and especially for inference) it is a game-changer.

As we add more components and get into the billions of transistors on existing devices, then push multiple chips together, it gets even harder to scale in the datacenter. There’s tail latency (where, if you scale to thousands of chips and just one is slow, there’s a 60% chance every single query will be slow if you’re waiting for all results to come back) and this will become even more of a problem over time. “This deterministic design means you could link 30 of our chips together, end to end, and know to the nanosecond how long it will take to run a computation and that allows you to scale,” Ross explains.
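The tail latency effect can be approximated with a simple probability sketch. It assumes each chip is independently slow with the same small probability and that a query must wait for every chip; the 0.1% per-chip figure and the 1,000 chip fan-out are illustrative assumptions, not numbers from Ross, and the article's 60% figure is not derived here.

```python
def prob_query_is_slow(num_chips: int, p_slow: float) -> float:
    """Probability that at least one of num_chips is slow for a given query.

    Assumes chips are slow independently, each with probability p_slow,
    and that the query waits for all of them to respond.
    """
    if num_chips < 0 or not 0.0 <= p_slow <= 1.0:
        raise ValueError("need num_chips >= 0 and 0 <= p_slow <= 1")
    return 1.0 - (1.0 - p_slow) ** num_chips


# One chip that is slow 0.1% of the time is rarely a problem...
print(prob_query_is_slow(1, 0.001))     # about 0.001

# ...but fan a query out to 1,000 such chips and most queries hit a slow one.
print(prob_query_is_slow(1000, 0.001))  # about 0.63
```

If per-chip run time is known exactly at compile time, `p_slow` from compute jitter effectively drops out of this model, which is one way to read the scaling claim.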

Ross says that when he looks at ML accelerators and what they need to achieve processing-wise for the future, the first order engineering concerns are different than with anything that exists. Even staid concepts like locality simply do not matter, he tells us. “We don’t care at all about locality”. That one doesn’t make sense but perhaps over time it will.

Single Implementation, Multiple Design

Ross jokes that Groq’s approach to architecture is SIMD, but in this case, single implementation, multiple design.

Ross wouldn’t go into much more detail than that about how they’ve split their design thinking across multiple form factors and use cases. We tried to keep the conversation focused on Groq’s position in datacenter inference, but he did say that they are looking for those areas where users cannot afford to deploy their models and to unlock applications that weren’t available before due to cost.

What he did say is “a lot of the classical divides over architectures like edge and non-edge do not apply here. People might consider automotive to be edge, for instance, but a lot of the players need an enormous amount of compute to solve their problems, so they’re looking to deploy on compute that looks more like a classic datacenter.”

A Combination of Other Factors

Ross says he’s not sure about the analyst numbers that predict a $60 billion market for inference by 2025, but seeing the thought process from the Google point of view about whether inference is affordable at all has been a big motivator.

“One thing I’ve seen over the years in general is that as compute gets cheaper, the total spend increases. This will be the case in inference as well because there are a lot of things in ML that simply aren’t feasible because right now they can train the models but they cannot afford to deploy them.” He says with new affordable inference options new workloads entirely will hit the market and in far wider areas.

The question will be optimization, and not from a programmer’s perspective. It is a question of what datacenter inference customers will be optimizing for. What will drive things the most? Cost, of course, is central, as Ross argues, and he mentions energy consumption only in the context of automakers (although of course that matters).

If inference becomes about who drives the lowest cost device but is still only hitting that 10% performance increase point, what will datacenter folks do? Will they go for the slightly hotter, more expensive thing for 25% performance increase? 50%? Where is the threshold? And if something emerges that can hit even better numbers across the board in the datacenter for a workload category that will take over whole swathes of the total amount of processing on the planet, what does it mean for general purpose CPUs or accelerators?

We’ve watched much in this AI chip space over the last several years. The open question is what we will have to say when Groq is finally ready to surprise us with that minimal datacenter inference killer. And by the way, it’s worth considering that they’ve held off on the big architecture reveals because they want the story to get out there that we should not expect much in the way of exotic or specialized chips that don’t look like anything we’ve seen. Perhaps what we’ll be looking at will be rather pedestrian with the compiler as the star of the show.
