Baking Specialization into Hardware Cools CPU Concerns

As Moore’s Law winds down, ultra-high bandwidth memory matched with custom accelerators for specialized workloads might be the only saving grace for the pace of innovation we are accustomed to.

With advancements on both the memory and ASIC sides driven by machine learning and other demanding workloads, this could be great news for big datacenters that keep inefficient legions of machines dedicated to ordinary processing tasks, jobs that could be handled far more efficiently with more tailored approaches.

We have recently described this trend in the context of architectures built on stacked memory with FPGAs and other custom accelerators inside, and we expect to see more workloads targeted by specialized accelerators or processors. The problem with that approach is expense, or rather ROI. Designing and fabbing a custom ASIC is no cheap proposition, but by tucking capabilities inside memory, stacked and otherwise, some areas are finding a viable alternative to general purpose CPUs and accelerators.

Last week we described how the hybrid memory cube (and, for that matter, high bandwidth memory from AMD) is key to new architectures in deep learning, but the same concept is being applied in other areas: applications that are already in use in volume in hyperscale and enterprise datacenters.

Although not HMC or HBM based yet, a research effort out of the University of Michigan is letting memory trump processing for large-scale text searches, and the concept can eventually stretch to meet HBM and HMC, as we expect a growing number of workloads will. With the reliability of price and performance jumps at regular intervals dropping off, this is just one of many new efforts from academia that point to a different path: specialization of hardware via accelerators that take up little space and take big advantage of bandwidth, which is the real bottleneck for a number of applications.

“What is different than existing tools for searching high velocity text is that memory has become so cheap that it is now possible to place huge datasets inside very high bandwidth memories,” one of the lead researchers at the University of Michigan, Dr. Thomas Wenisch, tells The Next Platform. “Instead of designing search systems that assume everything resides on disk and you only have to keep up with that, we are arguing that since these fit in memory now cheaply, it makes sense to redesign search capabilities to keep up with this.” Of course, if this can boost text search, the same techniques can be applied for other purposes, including combing through server and application logs and speeding queries for search engines and other text-based data volumes.
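To make that argument concrete, here is a minimal sketch in C of the software-only, memory-resident baseline being described: the corpus is loaded into RAM once, and repeated regular expression queries then scan it at memory speed rather than disk speed. The file name and query patterns below are illustrative assumptions on our part, not details from the Michigan work.

```c
/* Minimal sketch (illustrative, not from the HARE work): load a text corpus into
 * RAM once, then answer repeated regular expression queries against it with POSIX
 * regex. This is the software-only, memory-resident baseline that a hardware
 * matcher would accelerate. Build with: cc -O2 -o memgrep memgrep.c */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <regex.h>

/* Read the whole file into one heap buffer; the premise is that it fits in memory. */
static char *slurp(const char *path, size_t *len)
{
    FILE *f = fopen(path, "rb");
    if (!f) { perror(path); exit(1); }
    fseek(f, 0, SEEK_END);
    long n = ftell(f);
    fseek(f, 0, SEEK_SET);
    char *buf = malloc((size_t)n + 1);
    if (!buf || fread(buf, 1, (size_t)n, f) != (size_t)n) { perror("read"); exit(1); }
    buf[n] = '\0';
    fclose(f);
    *len = (size_t)n;
    return buf;
}

/* Count lines in the in-memory corpus that match one regular expression. */
static long count_matches(char *corpus, const char *pattern)
{
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0) {
        fprintf(stderr, "bad regex: %s\n", pattern);
        exit(1);
    }
    long hits = 0;
    char *line = corpus;
    while (line && *line) {
        char *nl = strchr(line, '\n');
        if (nl) *nl = '\0';                      /* terminate this line in place    */
        if (regexec(&re, line, 0, NULL, 0) == 0)
            hits++;
        if (nl) { *nl = '\n'; line = nl + 1; }   /* restore the newline and move on */
        else line = NULL;
    }
    regfree(&re);
    return hits;
}

int main(void)
{
    size_t len;
    /* "server.log" and the patterns are placeholders for whatever corpus is held in RAM. */
    char *corpus = slurp("server.log", &len);
    printf("error lines:   %ld\n", count_matches(corpus, "ERROR [0-9]{3}"));
    printf("timeout lines: %ld\n", count_matches(corpus, "timed? ?out"));
    free(corpus);
    return 0;
}
```

The point of the sketch is only that once the corpus is resident, every additional query is bound by memory bandwidth rather than disk, which is exactly the regime the Michigan team argues search tools should now be designed for.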

Baking the search algorithms into the memory hardware rather than into software is far faster, Wenisch argues, pointing to his team’s hardware accelerator called HARE (hardware accelerator for regular expressions). It specializes in regular expression matching, a capability UNIX has long offered, but only in software and with various hops, because it is built on the assumption that the data is coming off disk.

“The key innovation of HARE relative to previous work is that you can search for more than specific words, you can also search for patterns called regular expressions; little programs built in to describe the words you’re looking for.” Many search capabilities rely on the UNIX command line “grep” tool for this, but that software was designed only to keep up with disk. By encoding this matching inside memory and taking advantage of its speed, the team says its FPGA prototype accelerator can outperform the software-only tool by between 1.5X and 20X.
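The microarchitecture details are beyond our scope here, but the general idea behind hardware regular expression engines is that a pattern compiles down to a finite state machine that advances exactly one state per input byte, a step regular enough to lay down as a small block of logic beside a memory interface. The sketch below, again an illustration of the technique rather than HARE’s actual design, hand-builds such a table-driven state machine for the toy pattern ab*c and scans a buffer one byte at a time.

```c
/* Illustrative sketch of the state-machine view of regular expression matching
 * (not HARE's actual design). The pattern "ab*c" is compiled by hand into a
 * transition table; the scan loop then does one table lookup per input byte,
 * the same fixed per-byte step a hardware matcher performs at memory speed. */
#include <stdio.h>
#include <string.h>

enum { START, SAW_A, MATCHED, NSTATES };

/* next_state[state][byte] -> state, built once for the pattern "ab*c". */
static unsigned char next_state[NSTATES][256];

static void build_dfa(void)
{
    /* START is 0, so the whole table defaults to "restart the search". */
    memset(next_state, START, sizeof next_state);
    next_state[START]['a'] = SAW_A;    /* 'a' begins a candidate match        */
    next_state[SAW_A]['a'] = SAW_A;    /* a fresh 'a' begins a new candidate  */
    next_state[SAW_A]['b'] = SAW_A;    /* any number of 'b's                  */
    next_state[SAW_A]['c'] = MATCHED;  /* 'c' completes "ab*c"                */
}

/* Return 1 if the pattern occurs anywhere in the buffer. */
static int scan(const char *buf, size_t len)
{
    int state = START;
    for (size_t i = 0; i < len; i++) {
        state = next_state[state][(unsigned char)buf[i]];
        if (state == MATCHED)
            return 1;
    }
    return 0;
}

int main(void)
{
    build_dfa();
    const char *samples[] = { "xxabbbcxx", "no hit on this line" };
    for (int i = 0; i < 2; i++)
        printf("%-20s -> %s\n", samples[i],
               scan(samples[i], strlen(samples[i])) ? "match" : "no match");
    return 0;
}
```

Because the restart behavior is baked into the transition table rather than handled by branching code, the work per byte is constant, which is the property that makes this kind of matcher attractive to implement directly in silicon next to a high bandwidth memory interface.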

“We see the impending end of Moore’s Law,” Wenisch says. “People are excited about ASIC accelerators as a potential avenue to keep getting innovation when there’s no more gas there. Right now it is expensive to do an ASIC; that has to come down or the whole electronics industry starts to fall apart so we need mechanisms to make things better. People are placing bets on accelerators like this to make things faster, but it seems clear if you can design hardware to store data structures on the chip and use efficiency and other tricks to fit bits that can’t be done in software to get between 1000X and beyond moving from software and CPU only to a specialized ASIC, then it seems like a clear argument.”

For those interested in high-speed text searches, this is noteworthy news, but the bigger picture, as seen by Wenisch and other developers of custom ASICs, accelerators, and baked-in intelligence inside memory, is what is most engaging here. We are at the beginning of a trend, one that may not find its way into all areas but is important to watch. HARE, just like other technology we have been following on the novel architectures front, is anticipating the demise of Moore’s Law, no matter how far off on the horizon it seems to be.

This is another example of work happening to circumvent the CPU and outshine the software innovations that sprang up to take advantage of predictable performance bumps. From the many novel architectures being designed for deep learning to several efforts like this one targeting specific applications, there is little doubt that for some areas, the CPU will be responsible only for the “housekeeping” duties. It will be a slave to the compute (albeit limited) inside newer technologies, including high bandwidth memories with logic layers that could one day sport an FPGA to further fine-tune performance and capabilities.

Of course, this is all from a research chip perspective, something emulated via an FPGA to show prospective results. Getting a custom bit of silicon to serve such a need would be a risky and expensive proposition. “The ROI argument boils down to yes, it is possible to do this in software in a distributed, parallel way, but that’s a lot of computers just to have this search application running if you’re Google, for instance. But if you could just do this in the corner of a chip versus running hundreds of servers, we are talking about a much more complicated ROI argument.”

The story is becoming less about whether it is practical and possible to start seeing a decline of the CPU as the hinge upon which innovation swings and more about how willing both makers and users of custom silicon will be to invest in and adopt new accelerators. But with the advances coming in high bandwidth memories, which can handle some compute, and a shift in the tides from software being the point of acceleration back to hardware, we can see how this could be a very interesting couple of years ahead indeed.

1 Comment

  1. There is just a small flaw in the idea: it relies on ASIC hardware costs coming down, but they will not. Currently the trend is the reverse; costs are going up, simply because each new, smaller shrink node is becoming more and more complex and expensive.

    So foundries either need customers that manufacture in huge volumes (and by huge I really mean huge, more than 10 million units per year) or they have to charge massively to get their ROI back.

    This trend will keep going until we find a good replacement for silicon.
