Standing out in the crowded server inference space is getting more difficult, especially at this late stage of the startup game. Not that this is anything new, but it takes an orders-of-magnitude advantage to find a spot in the datacenter, especially with a new software stack. Untether AI, which came out of stealth today, says that its at-memory approach to inference can deliver far lower power with an architecture that isn’t exotic, although it is elegant.
Thus far the company, founded in 2018, has raised $27 million in venture funding from Intel Capital, among others. It is sampling its runAI200 devices now, with commercial availability expected in early 2021. These chips will be at the heart of the TsunAImi PCIe card, which contains four such processors. The approach is what Untether calls “at-memory” computing (as opposed to in-memory or near-memory), something we’ll unpack in a moment.
The startup is focused solely on Int-8, low-latency, server-based inference with small batch sizes in mind (batch 1 was at the heart of the design process). The company’s CEO, Arun Iyengar (you might recognize his name from leadership roles at Xilinx, AMD, and Altera), says they are going after NLP, recommendation engines, and vision systems on the applications side, with fintech at the top of their list of markets. He was quick to point out that this is less about high-frequency trading and more about broader portfolio balancing (asset management, risk allocation, etc.), where AI has real traction. He also says their low-power approach would be a good fit for on-prem datacenters doing large-scale video aggregation (smart cities and retail operations, for example). He willingly admits that they’re starting with these use cases instead of coming out bold with ambitions to find a place among the hallowed hyperscalers, but says there’s enough market out there for low-power, high-performance devices that they’ll find their niches.
Most of the Toronto-based startup’s team is made up of ex-FPGA engineers, but they say this expertise has served them far better on the software side than with the hardware architecture. That architecture does appear to meet all the right conditions for low power consumption and big memory bandwidth. On the surface, this is because each of the processing elements has its own SRAM to cut down on data movement, aided by an interesting east/west communication feature called a “rotator cuff” that passes activations between processing elements while coefficients feed in directly from the SRAM. We’ll dive deeper into that in a moment.
The path to early wins is through removing a lot of the excess energy spent on data movement in traditional architectures. It’s well known that far more energy goes toward getting data through a device than toward the computation itself, so their approach maximizes smart arrangement of SRAM and compute elements: eliminating long, narrow buses and putting the processing right into the SRAM array (without the analog techniques we see with Mythic, for instance). With a 16nm device, they’ve had to build in some redundancy, but they can get massive parallelism with big memory bandwidth because each processing element has its own dedicated SRAM. All coefficients are stored on the chip, and the entire network can be loaded onto the chip as well (with multi-chip partitioning if needed). The basics of the compute units are below:
The company says that meshing four of these chips into a PCIe card yields 2 PetaOps of Int-8 performance at batch size 1. The card has a total of 800MB of on-chip SRAM with roughly a petabyte per second of memory bandwidth overall, fed by x16 (4×4) PCIe Gen4. They can also do some clever scalable voltage/frequency tricks, likely a carryover from the FPGA-centric team’s engineering expertise.
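For a back-of-envelope sense of the per-chip figures implied by those card-level specs, dividing evenly across the four chips (our assumption; Untether did not publish this breakdown to us in per-chip form) looks like this:

```python
# Per-chip figures implied by the TsunAImi card-level specs, assuming
# the four runAI200 chips contribute equally (our assumption, not a
# published per-chip breakdown).
CARD_PETAOPS = 2       # Int-8, batch size 1, per card
CARD_SRAM_MB = 800     # total on-chip SRAM across the card
CHIPS_PER_CARD = 4

per_chip_tops = CARD_PETAOPS * 1000 // CHIPS_PER_CARD   # TeraOps per chip
per_chip_sram_mb = CARD_SRAM_MB // CHIPS_PER_CARD       # MB of SRAM per chip

print(per_chip_tops, per_chip_sram_mb)  # → 500 200
```

That 500 TOPS / 200MB-of-SRAM-per-chip split is the simple arithmetic behind the card numbers, nothing more.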
The details of this architecture are worth noting. Let’s break them down in more detail based on what’s below:
SRAM is organized into 511 memory banks distributed in rows throughout the chip, with three different ways to move data in and out. First, there’s the pipeline bus that runs on a per-row basis from the PCIe complex to the different memory banks, which means those banks can DMA from server DRAM to bring data into the top of the network and stream it out at the end, with flexible start and end points. They can also put multiple graphs on a chip simultaneously (since graphs run asynchronously) and let each do its own direct memory access to server memory to keep the graphs fed and send data out.
And here’s where things get interesting. The third way is an east/west communication channel called the “rotator cuff” that lets Untether move activations between processing elements inside a memory bank, as well as between banks, with 16GB/s of bank-to-bank transfer capability. On the north/south route is something they devised called a direct row transfer, which moves complete rows of SRAM data between the rows within a bank and between banks. All of this, as you can imagine, is a nifty way to carve down data movement.
This “rotator cuff” idea is worth delving into because it’s something that does indeed differentiate, and could pay off big, in terms of power consumption. More specifically, each processing element takes in activations and coefficients directly from the SRAM. You’ll see that the SRAM itself is bifurcated to ensure the shortest distance to the processing, which is why you see the SRAM array above and below the PEs. The activations are fed by the rotator cuff, which allows a direct connection to the three nearest neighbors on either side of the PE (all controlled by the F register). All this means they can move activations sequentially from one PE to another as they go through the operations, shortening the distance those activations have to travel.
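A toy model helps make the idea concrete: picture a row of PEs, each holding one activation, where a single hop can only reach the three nearest neighbors on either side. The function name and the wrap-around shift are our illustrative assumptions, not Untether’s actual microarchitecture.

```python
# Toy model of the "rotator cuff": a row of processing elements, each
# able to hand activations to neighbors up to three positions away.
# The shift-by-one pattern below is our illustration, not the real RTL.
NEIGHBOR_REACH = 3  # three nearest neighbors reachable on each side of a PE

def rotate_activations(activations, offset):
    """Move every activation `offset` PEs along the row (with wrap-around),
    mimicking the sequential PE-to-PE handoff the article describes."""
    if abs(offset) > NEIGHBOR_REACH:
        raise ValueError("a single hop only reaches 3 neighbors on either side")
    n = len(activations)
    return [activations[(i - offset) % n] for i in range(n)]

row = [10, 20, 30, 40, 50, 60]      # one activation resident per PE
print(rotate_activations(row, 1))   # each PE now holds its left neighbor's value
```

The point of the sketch is simply that each handoff is a short, fixed-distance move, rather than a trip down a long shared bus.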
Another little feature of note is that the PEs have zero-detect wiring, so if one of the registers shows a zero value, the multiplier/accumulator doesn’t handle it, which Beachler says cuts power in half. “You’re still paying for the fetch, but we don’t do the operation if there are just zeroes. The accumulator itself is 32-bit, which lets us do the 8×8 multiply followed by a quantization step to bring it back to 8-bit activations for the next calculation.”
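In software terms, that flow looks roughly like the sketch below: an 8×8 multiply accumulated at 32 bits, with the multiply skipped when either operand is zero. The skip counter and the simple scale-and-clamp requantization scheme are our illustrative assumptions; the real requantization details were not disclosed.

```python
def mac_with_zero_detect(activations, coefficients):
    """Int-8 multiply-accumulate into a 32-bit accumulator, skipping the
    multiply when either operand is zero (the zero-detect trick above).
    The skip counter is our illustrative addition."""
    acc = 0        # 32-bit accumulator in hardware; a plain int here
    skipped = 0
    for a, c in zip(activations, coefficients):
        if a == 0 or c == 0:
            skipped += 1     # the fetch still happened; the multiply didn't
            continue
        acc += a * c         # 8x8 multiply, accumulated at 32-bit width
    return acc, skipped

def quantize_to_int8(acc, scale):
    """Bring the 32-bit accumulator back to an 8-bit activation. The
    scale-and-clamp scheme here is an assumption for illustration."""
    q = round(acc / scale)
    return max(-128, min(127, q))

acc, skipped = mac_with_zero_detect([3, 0, -2, 5], [4, 9, 0, 2])
print(acc, skipped, quantize_to_int8(acc, scale=1))  # → 22 2 22
```

With sparse activations, a large fraction of multiplies fall into the skipped branch, which is where the claimed power savings come from.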
Benchmarks, being what they are, are subjective. Without saying anything at all, we’ll just leave these here for you to fight about.
The software stack was discussed in less detail (in part due to time limitations, not a comment on its stability), but Untether’s CEO says they have made big investments in the programmatic side of this device, making key hires including Alex Grbic (well known at Altera before becoming CTO of Deloitte’s AI business in Canada).
Quantization and layer optimization will all be handled in existing frameworks, including TensorFlow and PyTorch, with the imAIgine compiler and toolkits handling graph optimization, multi-chip partitioning, communication optimization, and debugging tools. They also have the imAIgine runtime with the requisite API, drivers, health monitoring, and multi-card communication built in.
Even with all of this, elegant as it is and with apparent power consumption benefits, we have to wonder if they’re not too late to market, at least by 2020 standards. There is some AI inference startup fatigue to be felt in our conversations with potential early adopters and testers of such hardware, and it has nothing to do with architecture or performance: it’s all about software usability.
With its launch today, Untether has told one of the most detailed hardware architecture stories we’ve seen to date. Most of the time we’re presented with scant information coupled with dramatic benchmark results. This time was different, and for that we’re grateful, because we can start to pick out what’s working and what isn’t. By all the analysis we’ve been able to do, this is a promising architecture, but the world might not be ready for something new. If it is, we’ll be the first to follow what they do with these devices, why, and how.
It is also worth noting, for those interested in a tale of remote semiconductor development, that the company brought this device to bear over Zoom and other collaborative software platforms. If you’re ever curious about what works and what doesn’t, they’ll likely be glad to share, but it sounds like doing this is not only possible, it removes some of the barriers of in-person development.