Few of the AI hardware startups that have made it through the first round of reality (roughly 2016 until the present) have managed to navigate the choppy waters without shifting course, sometimes wildly.
Nearly all have a story that begins with training, which washed into a rabid focus on inference (with some brave companies touting equal excellence in both, which makes for a convoluted architectural tradeoff story), and now, as the race begins in earnest with real silicon delivered and at stake, the winds have blown another direction. And it’s one that none of those chip startup survivors truly architected for in those early days of design when the specialized co-processor was the name of the game. We now have a shift toward valuing much broader general-purpose workloads, namely a broad class of linear algebra-driven applications.
Every one of the startups will tell you that they were architecting with linear algebra and HPC offload in mind all along, as something of a secondary consideration. That is questionable, because another unexpected thing happened during that 2016-present shift for datacenter-focused AI chip startups. They all thought they were fighting for the unseen swarms of enterprises eager to implement large-scale AI. What they found was that those shops might have been willing to test and explore, but the ones most likely to place a testbed order were the national labs. And those labs could actually turn out to be real customers with high-value workloads who could consider a shift away from expensive, power-hungry GPUs for something that can handle fast-flowing research data much more efficiently, crunch it like a supercomputer, train on it like an A100, and inference like nobody’s business.
The best example of this playing out was Cerebras with its funky wafer-scale approach that surprised us with its role across applications (training, inference, HPC-esque number crunching, and surprise, surprise, programmability following some footwork). They got to the labs first (well, rather, they got the labs to talk about it before their competitors could tell similar tales from the testbed) and now it seems the startups are realizing labs are more than just good test/dev PR: they might actually buy this stuff. Who would have guessed it?
Outside the academic/government side is the financial services sector, which is even quicker than the labs to snatch up something new, sniff out a new story for systems, and build swiftly on the basis of solid results. This might be the second, equally important market where all the AI chips can really duke it out now that they’ve found themselves after years of switching from strong training stories to inference and now to fat math.
We know that the handful of datacenter-focused AI chip startups that meet that narrative arc are all installed at national labs in the U.S. and Europe already. From folks we’ve talked to it’s less about hardware performance and much more about who has the most workable software stack. Even with legions of grad students to toil away in conjunction with (engineer) resource-strapped startups, none of these stories we’ve heard are compelling. No one has ever said, “well, it just worked. Wasn’t much worse than porting to CUDA,” but that’s to be expected. Other than the software issue, the truly compelling praise has been, “we were surprised to find that this thing could do the heavy math we needed for data coming off (insert detector, sensors, etc.) and train and infer better than the CPU and GPU boosted infrastructure we had, all far more efficiently.”
This is something that startup Groq, which you’ll recall was founded by one of the architects of the TPU way back in the dawn of AI chip startup times, is getting a handle on in its early work with its two biggest segments (at least according to Bill Leszinske, who heads products). Those segments are, you guessed it, national labs and fintech. In between this shift from training to datacenter inference to now more general purpose compute was another branch, one that took some of the datacenter-focused startups right off the path and into the autonomous world, leaving several without even bothering to talk about their datacenter ambitions at all, and all within the span of a year.
But how to stand out if you’re still toughing it out for those high-value datacenter-level use cases? Talking about software and compiler magic only goes so far; getting the world excited about the low latency, rapid-results value of batch size 1 is difficult outside of fintech for the most part; and competitors might not have landed the national lab glory much in advance but they at least got them to talk publicly. So here’s what Groq is doing, starting today: they’re getting serious about systems, signed, sealed, delivered. And it’s worth paying attention to, not because they’re first (they aren’t) but because they are doing things differently. Remember, it’s apparently still the wild west (if funding is any indication) so that’s still a good thing, right?
There’s more on the Groq announcement from today here but what started off as a preface to that story ended up, well, this. A discussion about rapid pivots. Such fast turnarounds on story are nothing new, of course. The very company that sits at the top of the AI pile, Nvidia, has managed to ride these shifts in strategic story perfectly, and has always been in the right place at the right time with something just flexible enough for what was next, given the tradeoffs it had to make to match architecture to tale.
It wasn’t so long ago (in relative terms) that Nvidia was building its gaming graphics empire. Some clever work as part of a research aside led to the GPU’s eventual dominance as the accelerator of choice in some of the largest supercomputers on the planet and subsequent success in non-HPC acceleration (databases, general purpose enterprise workloads, etc.). By the time the world was abuzz with deep learning, Nvidia had already seen the tide rolling in and prepared the beach by beefing up its software and developer resources to make sure they were hooked into whatever happened next, even if they had a big, heavy, expensive tool for the job. Masterful, really. And damned lucky.
But for the dwindling list of pure datacenter all-things-to-all-AI-workloads chip startups, the luck has always been fleeting (change is too quick in ML/AI), the resources to quickly adapt in short supply (engineering), and if anyone happens to get it all right at the exact correct moment they’re snapped up and smothered out by one behemoth or another (Nervana, for instance).
This is not a grim, hopeless view of things by any means. But it is a necessary reset in article form reminding us that the real action is happening in two experimental areas for now and that will drive the new story for AI chip startups, one that is more tuned for multi-purpose computing with the added benefit of being good at training and inference (yes, like Nvidia has been doing for about four years now, albeit with the sideswipe of separate form factors for training/inference even though a V100 alone could suffice for both with a nice linear algebra boost to boot).
This is a note that we should watch what happens at the labs with these startups, to see what kind of HPC magic they can do because after all, that same supercomputing-fed stuff is right up the alley of the world’s largest financial services firms too. And there is a market in both of these areas. The big enterprise AI crush hasn’t come yet. It might not. And if it does it might look a lot like a regular old CPU with some GPU for the exciting stuff, but hardly a meaningful portion of nodes and probably a completely separate cluster for training. There are exceptions here, publicized use cases, but by and large, the next wave of waiting for that big promised market to arrive is in Monte Carlo simulations, in Cholesky decompositions, and in nitty gritty places where it’s going to be tough to carve out enough to satisfy the hopeful investors and hungry analysts.
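For a sense of what that “fat math” looks like in a risk shop, here is a minimal sketch of one common pairing of the two techniques named above: a Cholesky factor of a covariance matrix is used to turn independent random draws into correlated asset returns, and a Monte Carlo loop over those draws estimates a loss quantile. Everything here is an illustrative assumption rather than anything a vendor or bank has shown us: the function names, the two-asset covariance matrix, the weights, and the use of value-at-risk as the target metric are all hypothetical, and it is plain Python rather than anything tuned for an accelerator.

```python
import math
import random


def cholesky(a):
    """Return lower-triangular L with L * L^T == a.

    Assumes `a` is a symmetric positive definite matrix given as a list of lists.
    """
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(a[i][i] - s)
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]
    return L


def simulate_portfolio_losses(cov, weights, n_paths, seed=0):
    """Monte Carlo: draw correlated returns r = L z and record portfolio losses."""
    rng = random.Random(seed)
    L = cholesky(cov)
    n = len(weights)
    losses = []
    for _ in range(n_paths):
        z = [rng.gauss(0.0, 1.0) for _ in range(n)]  # independent standard normals
        r = [sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(n)]
        losses.append(-sum(w * x for w, x in zip(weights, r)))
    return losses


def value_at_risk(losses, level=0.99):
    """Empirical loss quantile at the given confidence level."""
    ordered = sorted(losses)
    idx = min(len(ordered) - 1, int(level * len(ordered)))
    return ordered[idx]


if __name__ == "__main__":
    # Hypothetical two-asset daily covariance matrix and portfolio weights.
    cov = [[0.0004, 0.0001], [0.0001, 0.0009]]
    weights = [0.6, 0.4]
    losses = simulate_portfolio_losses(cov, weights, n_paths=10000, seed=42)
    print("99% one-day VaR estimate:", value_at_risk(losses, 0.99))
```

The factorization is done once and the inner loop is nothing but small matrix-vector products repeated many thousands of times, which is the shape of work that is dense, regular, and latency-sensitive enough to be pitched at these chips.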
Large-scale IT is a story of these weird dual-purpose uses that lead to primary markets. This isn’t new, it’s just different in the quickened pace of what’s required from silicon that needs to be set in stone long before the shift has happened, only to miss the boat and find a second, unlikely home somewhere else. That home could be a castle still in the fog or one made of sand. We’ll see.
Either way, watch for startups that started with the ambition of owning training, then shuffled to inference when Nvidia exhibited a clear lead and the analysts promised a massive, booming market for inference without clarifying if it was datacenter or at the edge, and now will vie for supremacy in the upper echelons crunching numbers for general purpose HPC and risk analysis and so on with architectures that were designed from the ground up to do training, rerouted in their software for inference in some cases, and now have to do triple duty—all against an incumbent that has been doing all three since the beginning.
Great article. I was looking into the “relationship” between AI startups and Nvidia. Your timely article gave me a first-order view of that topic. Great writing/style.
A great perspective on the deft pivot, multiple pivots in fact, required to remain relevant in AI hardware. As I read, I thought about the progression of leading areas of the moment: first training, then inference, then the fat math, and presently the shift to a systems approach you describe about Cerebras, Groq, (and unmentioned) Graphcore.
But this idea that there has been an incumbent doing it all from the beginning is the one that resonates. My sense is Nvidia isn’t adjusting to, but rather setting the bar: forcing the industry to react to them rather than the other way around. The systems approach was initiated some years ago, at least for a merchant offering, with DGX. And well before that, fintech has been making use of GPGPUs since the mid aughts.
If these initiatives inform where the industry is going it shouldn’t be too hard to see new battlegrounds have opened in both a) extending flexibility with new floating point and integer formats and b) vertically integrated solutions (with Drive, Clara, Aerial, Isaac, etc).
The question, I think, is whether any scrappy startup can get in front with a compelling enough solution to make Nvidia adjust to them.