The idea of physically bringing compute and memory closer together within a system to accelerate data processing is not a new one.
Some two decades ago, vendors and researchers began to explore processing-in-memory (PIM), the concept of placing compute units such as CPUs and GPUs closer to memory to reduce the latency and cost inherent in moving data, and built prototypes with names like EXECUBE, IRAM, DIVA and FlexRAM. For HPC environments that ran data-intensive applications, the idea made a lot of sense: reduce the distance between where data is processed and where it is stored, and you can run more workloads more quickly and get results back faster.
However, PIM efforts in the 1990s faced challenges that made it difficult for the concept to gain widespread commercial acceptance. In particular, manufacturing PIM technologies and implementing them in systems at the time proved impractical and costly, so while the idea was sound, there was little momentum behind it. That has changed. With organizations now buried under massive amounts of data, and facing growing demand to collect, store and, most importantly, analyze that data to make faster, smarter business decisions, better serve customers, drive revenue and increase efficiencies, PIM has made a strong comeback, in the form of both near-memory processing (NMP) and in-memory processing (IMP). New memory designs – such as high-bandwidth memory (HBM) and the hybrid memory cube (HMC) – and the widening use of an array of processing units – not only CPUs and GPUs, but also field-programmable gate arrays (FPGAs) and custom ASICs – are addressing the practicality and cost issues that held back earlier PIM efforts.
“As such, processing in memory (PIM), a decades-old concept, has reignited interest among industry and academic communities, largely driven by the recent advances in technology (e.g., die stacking, emerging nonvolatile memory) and the ever-growing demand for large-scale data analytics,” Soroosh Khoram, Yue Zha, Jialiang Zhang and Jing Li, researchers in the Department of Electrical and Computer Engineering at the University of Wisconsin-Madison, wrote in a recent paper outlining the benefits and hurdles inherent in PIM, which they further define as comprising both NMP and IMP.
In the paper, titled “Challenges and Opportunities: From Near-memory Computing to In-memory Computing,” the researchers also outline projects their group at the university undertook to better understand how modern FPGAs could be combined with emerging memory technologies to better run large-scale, memory-intensive workloads – including deep learning applications – in NMP environments. The development of new memory technologies by established memory vendors has been crucial to addressing this demand from organizations. That includes not only HBM and HMC, but also the Bandwidth Engine (BE2). For example, HMC removes earlier limitations on NMP by stacking multiple DRAM dies atop a CMOS logic layer and connecting them with through-silicon-via (TSV) technology, the researchers noted. Doing so not only delivers better random access performance through higher memory-level parallelism when compared with DDR DRAM, “but also supports near-memory operations, such as read-modify-write, locking, etc., on the base logic layer, making it possible for accelerating these operations near memory,” they wrote.
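The benefit of those near-memory read-modify-write operations can be illustrated with a toy model (our sketch, not code from the paper): a host-side increment costs two traversals of the memory link, one read and one write-back, while an HMC-style RMW command executed on the base logic layer crosses the link only once.

```python
# Toy model of the link-traffic savings from near-memory read-modify-write.
# The dict-based memory and transfer counter, and the one-transfer-per-command
# accounting, are illustrative assumptions, not a real HMC protocol model.

def host_side_increment(mem, addr, link):
    link["transfers"] += 1   # read crosses the memory link to the host
    val = mem[addr] + 1      # modify happens on the host CPU
    link["transfers"] += 1   # updated value crosses the link back
    mem[addr] = val

def near_memory_increment(mem, addr, link):
    link["transfers"] += 1   # a single RMW command crosses the link
    mem[addr] += 1           # modify happens on the HMC base logic layer

mem, link = {0x10: 41}, {"transfers": 0}
host_side_increment(mem, 0x10, link)    # costs 2 link transfers
near_memory_increment(mem, 0x10, link)  # costs 1 link transfer
print(mem[0x10], link["transfers"])     # 43 3
```

For workloads dominated by small atomic updates, such as graph algorithms, halving the link traffic per operation is exactly the kind of win the researchers describe.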
The team at the University of Wisconsin looked at how FPGAs could be combined with HMC modules to address the needs of applications using the emerging memory technology, and at developing collaborative hardware and software tools to map algorithms more efficiently. In one effort, the research group built a near-memory graph processing system on such an FPGA-HMC platform that relied on specially designed and optimized software and hardware. Large-scale graphs are increasingly used in applications such as machine learning and the social sciences, but they can be difficult to process efficiently due to their large memory footprints and the fact that “most graph algorithms entail memory access patterns with poor locality and a low compute-to-memory access ratio.” By combining the flexibility of FPGAs with the high random access performance of HMC, and adding innovations such as a new data structure and algorithm and a platform-aware graph processing architecture, the researchers developed a platform that outperformed others that used CPUs and FPGAs. Running the GRAPH500 benchmark on a random graph with a scale of 25 and an edge factor of 16, the group’s platform hit 166 million traversed edges per second (MTEPS), they said.
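To put that benchmark result in context, a quick back-of-the-envelope calculation (ours, not the researchers’) shows what a scale-25, edge-factor-16 GRAPH500 graph looks like and what 166 MTEPS implies for a single traversal:

```python
# GRAPH500 sizing: "scale" is log2 of the vertex count, and the edge factor
# is the ratio of edges to vertices.
scale, edge_factor = 25, 16
num_vertices = 2 ** scale               # 33,554,432 vertices
num_edges = edge_factor * num_vertices  # 536,870,912 edges

# At the reported 166 million traversed edges per second (MTEPS),
# visiting every edge once takes a few seconds.
mteps = 166
seconds_per_traversal = num_edges / (mteps * 1_000_000)
print(round(seconds_per_traversal, 2))  # ~3.23 seconds
```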
In another project, they used a high-performance near-memory OpenCL-based FPGA design for deep learning applications and discovered that the primary performance bottleneck was on-chip memory bandwidth, due in large part to what they said are the scarce memory resources of modern FPGAs and a memory duplication policy in the OpenCL execution model. The researchers developed a new kernel design that eliminated these limitations and improved memory utilization, leading to a balanced data flow between the compute units and both on- and off-chip memory. The design, implemented on an Altera Arria 10 GX1150 board, delivered 866 Gop/s of floating-point performance at 370MHz and 1.79 Top/s of 16-bit fixed-point performance at 385MHz.
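Those throughput figures can be sanity-checked against the clock rates with simple arithmetic (our calculation, not from the paper): dividing sustained operations per second by the clock frequency gives the number of operations the design completes every cycle, a measure of how much parallel hardware is kept busy.

```python
# Operations sustained per clock cycle, derived from the reported figures.
fp_ops_per_cycle = 866e9 / 370e6       # ~2,341 floating-point ops per cycle
fixed_ops_per_cycle = 1.79e12 / 385e6  # ~4,649 16-bit fixed-point ops per cycle
print(round(fp_ops_per_cycle), round(fixed_ops_per_cycle))
```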
IMP is an outgrowth of NMP, with data stored and processed in RAM to speed up database performance by allowing data to be accessed over the chip’s memory bus rather than from traditional storage units, making the process faster and more energy efficient. Fueling the use of IMP will be the rise of non-volatile memory (NVM) technologies – including spin-transfer torque RAM (STT-RAM), phase-change memory (PCM) and resistive RAM (RRAM) – developed by such established vendors as Micron, Samsung, Intel, Toshiba, SanDisk and Sony. However, there have been challenges to getting these NVMs into current systems to replace DRAM or flash, the researchers wrote. It’s not easy to get NVMs to match main memory or persistent storage on cost, latency or retention; it’s costly for vendors to revamp their manufacturing facilities to make NVMs; and users tend to stick with DRAM and flash rather than make the switch. Driving adoption of NVMs will mean creating new usage models, and IMP is a key one. “We believe that the emerging NVMs will become an enabling technology for IMP,” they wrote.
The new architectures include using the RRAM crossbar design to speed up matrix multiplication, which is used in applications such as machine learning and optimization, as well as in neuromorphic systems, which aim to build computers that mimic the human brain. Another type of architecture is the “associative processor (AP), also known as nonvolatile content addressable memory (nv-CAM) or ternary content addressable memory (nv-TCAM), which supports associative search to locate data records by content rather than address.” The researchers also developed a reconfigurable in-memory architecture that is similar to FPGAs but provides flexible on-chip storage and routing resources as well as enhanced hardware security. It can be used for pure data storage, pure compute or anything in between, blurring “the boundary between computation and storage. We believe it may open up rich research opportunities in driving new reconfigurable architecture, design tools, and developing new data-intensive applications, which were not generally considered to be suitable for FPGA-like accelerations.”
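The crossbar approach works because the analog physics performs the multiply-accumulate: with matrix weights stored as cell conductances and inputs applied as voltages, Ohm’s law does the multiplications and Kirchhoff’s current law sums them along each row. A minimal numerical sketch of that idealized behavior (illustrative values, not a device model):

```python
# Idealized RRAM crossbar matrix-vector multiply: I[i] = sum_j G[i][j] * V[j].
# G holds per-cell conductances (the stored matrix), V the input voltages,
# and each entry of I is the current summed on one row wire.
def crossbar_mvm(G, V):
    return [sum(g * v for g, v in zip(row, V)) for row in G]

G = [[0.5, 1.0],
     [2.0, 0.0]]        # conductances, arbitrary units
V = [1.0, 2.0]          # input voltages
I = crossbar_mvm(G, V)  # output row currents
print(I)                # [2.5, 2.0]
```

In a real crossbar all rows compute in parallel in a single analog step, which is why the design is attractive for the dense matrix-vector products at the heart of machine learning workloads.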
While adoption of NMP will be easier than that of IMP, both face challenges in areas such as virtual memory support and compatibility with modern programming models, which will call for greater collaboration among systems engineers, IC designers and others, they said.