Compute

Taalas Etches AI Models Onto Transistors To Rocket Boost Inference

Adding big blocks of SRAM to collections of AI tensor engines, or better still, a waferscale collection of such engines, turbocharges AI inference, as has been shown time and again by AI upstarts Cerebras Systems, SambaNova Systems (which Intel is rumored to have taken a run at late last year), Groq (just eaten by Nvidia for $20 billion), and Graphcore (eaten by SoftBank for $600 million a year and a half ago) as they compare against GPUs from Nvidia and AMD.

But if you really want to push the envelope of AI inference, says startup Taalas, which dropped out of stealth mode today, then the thing to do is stop screwing around and encode the weights of a finished AI model right into the transistors of a chip, getting rid of all of the software cruft that comes with trying to make compute engines malleable so companies can constantly tweak and tune their models.

By doing so, you can also radically simplify the architecture of the AI device and, the way that Taalas has done it, you can eliminate the wall between compute and memory that plagues all serial and parallel compute engines – and especially GPUs and AI XPUs that have had to resort to HBM stacked DRAM to get bandwidth commensurate with their floating point and integer performance.

Taalas is two and a half years old and has raised over $200 million in three rounds of venture funding. The company is located in Toronto, one of the hotbeds of AI research and also where there is plenty of chip expertise, including Tenstorrent, where the company’s three founders all worked. Ljubisa Bajic is the co-founder who has the chief executive officer job at Taalas, and is well known as the founder of Tenstorrent.

What might be less known is that Bajic spent a few years after the Dot Com Boom designing video encoders for Teralogic and Oak Technology before moving over to AMD and rising through the engineering ranks to be the architect and senior manager of the company’s hybrid CPU-GPU chip designs for PCs and servers. Bajic did a one-year stint at Nvidia as a senior architect, bounced back to AMD as a director of integrated circuit design for two years, and then started Tenstorrent. When chip luminary Jim Keller was brought in in the fall of 2022, Bajic decided to leave, and after a six-month break he got to work on a completely different idea for AI inference computing and started Taalas in Toronto.

Lejla Bajic, who is Ljubisa’s wife, is the chief operating officer at Taalas, and she was a software engineer at FPGA maker Altera in the wake of the Dot Com Boom, and then became a senior engineer at ATI, the Canadian GPU maker that AMD bought in July 2006 for $5.4 billion. Lejla Bajic also rose through the AMD engineering ranks over the years, eventually becoming senior manager of systems engineering. She joined Tenstorrent in October 2017 to do the same job and left when her husband did.

The third co-founder at Taalas is Drago Ignjatovic, who was a senior design engineer working on AMD APUs and GPUs and took over for Ljubisa Bajic as director of ASIC design when the latter left to start Tenstorrent. Nine months later, Ignjatovic joined Tenstorrent as its vice president of hardware engineering, and he started Taalas with the Bajices as the startup’s chief technology officer.

Significantly, Paresh Kharya, who was senior director of product management and marketing for Nvidia’s datacenter business for three years and then director of AI infrastructure product management at Google Cloud (managing its GPU and TPU hardware and their software stacks), has joined Taalas as vice president of products. The company has 25 employees at the moment, most of them engineers who worked at AMD, Apple, Google, Nvidia, and Tenstorrent, and they have plenty of experience bringing chips from idea to systems. The company has only spent $30 million on research and development to get to the launch today, and has more than $170 million still in the bank.

Mashing Up ROM And SRAM, Ditching HBM And Crazy I/O

Most good ideas seem obvious in hindsight, and creating a dataflow engine that can embody the weights and algorithms of an AI model and then pouring context and queries through it is also not a new idea. To a certain extent, that is what FPGAs and the first generation of AI accelerators do, and it is what GPUs and special accelerators like TPUs and Trainiums also do.

For now, Taalas is keeping the precise workings of its Hard Coded Inference architecture secret, but Bajic and Kharya gave me a high level overview of how it works. But before we get into that: Kharya, a history buff like we all are, showed this funny picture that is very much “plus ça change, plus c’est la même chose.” Take a look:

On the upper left is the massive copper cabling that interconnected the transistorized compute frames of the IBM 7030 Stretch supercomputer from 1961, and on the bottom right are the racks and racks of the vacuum-tube powered ENIAC supercomputer from 1946, which eventually spawned the Sperry Rand computer business (now part of Unisys).

The joke is we had massive copper cables and 150 kilowatts per rack back then, and the way GPUs and XPUs have evolved, we are back to the future. (Don’t overanalyze that – it is meant to be funny.)

So what, precisely, is the Hard Coded Inference chip, and how does it work?

Kharya explains it this way:

“We basically have an architecture where we are embedding the models – we are hard coding the models and the weights into what we call the mask ROM recall fabric, which is paired with an SRAM recall fabric. Together, they are able to store the model as well as do all the computations of the KV cache. We have adapters and customizations – we support all of that. This design allows us to be super-dense in terms of compute and in terms of storage, and we can do compute on that storage incredibly fast, which is what drives density up and cost down.”

“In the current generation, our density is 8 billion parameters on the hard wired part of the chip, plus the SRAM that allows us to do KV caches, adaptations like fine tuning, and so on. In our next generation, we would have the ability to go up to 20 billion parameters in a chip. Even with trillions of parameters, we’re talking about a few tens of chips, which is very, very small compared to anything else out there on the market today.”
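To put those density figures in perspective, here is a back-of-the-envelope sketch in Python of how many HC chips various model sizes would imply, assuming parameters are spread evenly across chips (our assumption, not something Taalas has specified):

```python
# Chip counts implied by the per-chip parameter densities quoted above.
# Assumes parameters are divided evenly across chips -- our assumption.
def chips_needed(model_params: int, params_per_chip: int) -> int:
    """Round up: a partially filled chip is still a whole chip."""
    return -(-model_params // params_per_chip)  # ceiling division

# Current generation: 8 billion parameters per chip fits Llama 3.1 8B on one chip.
print(chips_needed(8_000_000_000, 8_000_000_000))       # 1
# Next generation: 20 billion parameters per chip.
print(chips_needed(671_000_000_000, 20_000_000_000))    # 34 for DeepSeek R1 671B
print(chips_needed(2_000_000_000_000, 20_000_000_000))  # 100 for a 2 trillion parameter model
```

Even at the trillion-parameter scale, the chip counts stay in the tens to low hundreds, which is the point Kharya is making.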

Without being specific about the architecture – Taalas wants it to be a bit of a black box for now – Bajic added this:

“We have got this scheme for the mask ROM recall fabric – the hard-wired part – where we can store four bits away and do the multiply related to it – everything – with a single transistor. So the density is basically insane. And this is not nuclear physics – it is fully digital. It is just a clever trick that we don’t want to broadcast. But once you hardwire everything, you get this opportunity to pack things very differently than if you have to deal with changing things. The important thing is that we can put a weight and do the multiply associated with it all in one transistor. And you know the multipliers are kind of the big boy piece of the computer.”

“What we invented is not particularly difficult, either. It’s just a clever thing that nobody saw because nobody went down this path. We showed up more than two years ago, and we wanted to remove the barrier between memory and compute altogether. That was the genesis of this whole thing. Now, the first way we came up with to do it – and basically the only way we could see at the time that would produce a product on a predictable timeline, because we didn’t want to be research profs and three years down the line have something that doesn’t work – was to veer off quickly into this ROM-based approach. We started studying it in detail and then we realized that actually this was even better than we thought.”

“We actually designed all this stuff from scratch internally. We didn’t use off the shelf anything, we did lots of transistor level design, hand layout – basically our whole effort ended up being a throwback to the 1970s.”
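As a conceptual illustration only – not the actual circuit trick, which Taalas is keeping secret – here is a toy functional model in Python of what “hard coding” a weight means: the weight is baked into the operator itself rather than fetched from memory. The 4-bit signed range is our reading of the “store four bits” remark, not a confirmed spec:

```python
# Toy functional model of a hard-wired weight: the weight is a constant baked
# into the multiplier, not a value fetched from memory at run time. In silicon
# the weight never moves; here we model that by closing over it. The 4-bit
# signed range is an assumption based on the quote above.
def make_hardwired_multiplier(weight_4bit: int):
    assert -8 <= weight_4bit <= 7, "expected a 4-bit signed weight"
    def multiply(activation: int) -> int:
        # No weight load happens here -- the weight is part of the "circuit."
        return weight_4bit * activation
    return multiply

mul = make_hardwired_multiplier(5)
print(mul(3))  # 15
```

The tradeoff is exactly the one the article describes: the multiplier is tiny and fast because it can do only one thing, so changing the model means changing the hardware.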

What is immediately obvious is that every change in model, from, say, Llama 3.1 to Llama 4, will require a new spin of the HC chip. For now, Taalas is focusing on etching the weights for open source models onto its HC chips, but it is not hard to imagine Anthropic and OpenAI picking up the phone and ordering custom accelerators for their models. Even Google might want to give it a try. By the way, Taalas has filed 14 patents under Bajic to cover its technology as far as we can see; there could be more because patent search tools are very bad – even Google Patents.

Etching a new model onto an HC inference engine involves changing two layers of metal in the HC chip design, not a complete scrapping of it. And with the cost of training models running into the billions of dollars, paying a relatively nominal fee to adapt an HC inference engine to a new release of a model or to an entirely different model is not a big deal. Kharya says it costs 100X as much to train a model as it does to get a customized HC chip in reasonable volumes from Taalas.

Perhaps equally importantly, the time between major model releases is lengthening and people are getting attached to their models – there was plenty of gnashing of teeth when OpenAI moved customers from GPT 4.5 to GPT 5, for instance, because the latest release is a bit sycophantic. Given this, it may make sense to order a few hundred thousand to a few million of the HC inference engines.

With the “foundry optimal workflow” that Taalas has created in conjunction with Taiwan Semiconductor Manufacturing Co, customers can go from model weights to deployable PCI-Express cards, actually doing inference, in two months.

The first generation HC1 chip is implemented in the 6 nanometer N6 process from TSMC. At 815 mm2, it is pushing up against the reticle limit of chips these days (before we move to High NA processes that will cut the reticle size in half, which is not at all desirable). Each HC1 chip has 53 billion transistors on the package, most of them very likely for ROM and SRAM memory. The HC1 card burns about 200 watts, says Bajic, and a two-socket X86 server with ten HC1 cards in it runs 2,500 watts.

By the way, because the HC1 card is so fast, getting low latency inference does not require batching up queries, and that means the bandwidth pressure on the Taalas devices is low. So low that the PCI-Express bus is fine if you want to gang cards up to run a larger model, which Taalas will allow customers to do later this year using pipeline parallelism to spread work across the HC cards. By summer, in fact, it will have a Llama 3.1 model with 20 billion parameters hard coded into an HC chip, and by the end of the year it will have a frontier-class large language model – maybe Llama, maybe DeepSeek, maybe both – running inference across a collection of HC cards. This architecture will be called HC2.
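Here is a minimal sketch in Python of how pipeline parallelism across HC cards works in principle: each card holds a contiguous slice of the model’s layers, and activations flow card to card over the interconnect. The layer and card counts below are purely illustrative, not Taalas specifications:

```python
# Minimal sketch of pipeline parallelism: each "card" holds a contiguous
# slice of the model's layers, and the activations produced by one card
# are handed to the next. Layer and card counts are illustrative only.
def run_pipeline(cards, activations):
    """Run activations through each card's layer slice in order."""
    for card in cards:
        for layer in card:
            activations = layer(activations)
    return activations

# Example: a six-layer "model" split evenly across three cards.
# Each toy layer just adds 1 so the flow is easy to trace.
layers = [lambda x: x + 1 for _ in range(6)]
cards = [layers[0:2], layers[2:4], layers[4:6]]
print(run_pipeline(cards, 0))  # 6
```

Because only the (small) activations cross the card boundary, not the weights, the inter-card bandwidth requirement stays modest – which is why PCI-Express is enough.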

So, just how fast and just how cheap is this Taalas HC1 card? Let’s take a look, starting with the latest throughput for Llama 3.1 8B models as assessed by Artificial Analysis:

These initial performance results for the HC1 were run by Taalas itself, not Artificial Analysis, but you can play around with the chatbot demo at this link and request developer API access at this other link and run your own tests.

That’s a pretty big gap with a “Blackwell” B200 GPU (which Taalas itself ran), and even a substantial gap with what Groq, SambaNova, and Cerebras can deliver with their SRAM-heavy AI compute engines.

For fun, Taalas took the Llama 3.1 8B and DeepSeek R1 671B models and compared the Nvidia B200 against its HC card. (Our guess is that it took around 35 cards to load up the memory to run DeepSeek R1 671B on the Taalas boxes.) Here is how they stack up:

Now, what you want to know is throughput, latency, and cost per token, and this chart brings it all together:

On GPU systems, the interactivity – how many users you can support concurrently asking queries and getting fed answers – depends on the latency you want. If you want low latency, you can’t have a lot of users, and if you want lower cost, you pay for it with increased latency on the tokens processed as input or output.
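That tradeoff can be sketched with a toy model in Python; all of the numbers below are made up to show the shape of the curve, not measurements of any real GPU:

```python
# Toy model of the GPU batching tradeoff: bigger batches amortize cost
# (more aggregate tokens per second) but each user waits longer per token.
# Every number here is illustrative, not a measurement.
def tokens_per_second(batch: int, step_ms: float = 20.0) -> float:
    # One decode step emits one token per request in the batch.
    return batch * 1000.0 / step_ms

def latency_per_token_ms(batch: int, step_ms: float = 20.0,
                         queue_ms_per_req: float = 1.0) -> float:
    # Bigger batches add waiting time while the batch is assembled.
    return step_ms + queue_ms_per_req * batch

for b in (1, 8, 64):
    print(b, tokens_per_second(b), latency_per_token_ms(b))
```

Throughput climbs with batch size while per-user latency degrades, which is exactly the lever Taalas says it does not have to pull.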

As you can see, Taalas is showing much lower costs and dramatically lower latencies on these two models tested.

We look forward to independent testing as the HC cards ramp into production, and to see what Taalas will charge for these AI inference engines. This sure looks like a game changer for AI inference.