High performance computing workloads simulating all manner of things can produce a veritable mountain of data that has to be sifted through. In fact, that is what makes HPC the opposite of AI: You take a small amount of data and explode it into a massive simulation. AI takes a massive amount of information and boils it down to some parameters and weights embodied in a neural network.
However, of the trillions of particles generated in an HPC simulation, researchers may be interested in the behavior of just a few hundred or thousand.
“You are not necessarily looking for a needle in a haystack, but you’re looking for something and it is usually a small subset of the data,” Gary Grider, deputy division leader at Los Alamos National Laboratory, tells The Next Platform. And while this might not be a big problem on smaller datasets, it can be particularly challenging at the scales LANL is accustomed to. “We might run a job that might be a petabyte of DRAM and it might write out a petabyte every few minutes,” Grider emphasizes. And do that for six months.
To sift through this data, scientists employ a suite of analytical tools to help pinpoint the specific data they are looking for, which itself can take a while. So, for the last few years, Los Alamos has been investigating ways to speed this up by moving the analytics workload close to the data – specifically onto the controllers of flash or disk drives. In a sense, what Los Alamos and its partners are trying to do is create a large cluster of disk controllers, and use their spare clock cycles to offload analytic functions onto them.
Los Alamos researchers have already had some success in this regard. Working with SK Hynix, they were able to prove the concept by shifting the reduction function onto the drive’s controller, achieving multiple orders of magnitude improvement in performance in the process.
“We’ve shown that we can actually do analytics – simple analytics like reductions – at the full rate that the disk drive can pull the data off the disk, and what that means is there’s no cost to it from a bandwidth point of view,” Grider says.
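The win Grider describes – pushing a reduction down to the device so that only the result crosses the bus, rather than every block of raw data – can be sketched in a few lines of Python. The `DriveController` class and its methods here are purely illustrative stand-ins, not LANL's or SK Hynix's actual interface:

```python
# Illustrative sketch of a reduction pushdown. Instead of shipping every
# block to the host and reducing there, the "controller" reduces blocks
# as it streams them off the media and returns only the final value.
# DriveController and its method names are hypothetical.

class DriveController:
    def __init__(self, blocks):
        # blocks: lists of numbers standing in for on-disk data blocks
        self.blocks = blocks

    def read_all(self):
        """Conventional path: every block crosses the bus to the host."""
        for block in self.blocks:
            yield block

    def reduce_on_device(self, op, init):
        """Offloaded path: reduce at media rate, return one value."""
        acc = init
        for block in self.blocks:        # streamed off the platter
            for value in block:
                acc = op(acc, value)
        return acc                       # only this crosses the bus


drive = DriveController([[3, 1, 4], [1, 5, 9], [2, 6, 5]])

# Host-side reduction: O(data) bytes transferred.
host_total = sum(v for block in drive.read_all() for v in block)

# Device-side reduction: O(1) bytes transferred, same answer.
device_total = drive.reduce_on_device(lambda a, b: a + b, 0)

assert host_total == device_total == 36
```

Because the answer is identical either way, the only difference is how many bytes move – which is why Grider can say the offloaded reduction is effectively free from a bandwidth point of view.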
Los Alamos, like many other Department of Energy HPC labs, employs a tiered storage architecture, and so the lab began investigating ways to achieve similar results on their larger disk pool. And to do that, Los Alamos has entered into a collaborative research and development agreement with Seagate.
“It turns out Seagate had already been working on some offloads to devices,” Grider explains. “They have this prototype device that has a processor that is right next to the disk drive.”
The Resiliency Problem
People sometimes have the perception that flash drives are more reliable than disk drives, but this is not true. It all depends on how you use either device, and if you write to either flash or disk beyond its rated endurance, it will fail earlier than you might otherwise expect. Data protection, therefore, is more about the longevity of the data on the device than it is about wearing it out.
“If you keep data on a storage device for a long time, you should protect that data with something like RAID protection or erasure coding. That complicates how you have to do the analytics,” says Grider. “To do analytics, you have to understand what the data is. Normally, disk drives don’t have to know that, they just see blocks of data.”
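Grider's point is easier to see with a toy example: once records are striped across drives with parity, no single drive holds a whole record, so a controller that only sees its own blocks cannot interpret the data on its own. This RAID-4-style XOR sketch in Python is purely illustrative; real erasure coding schemes are considerably more sophisticated:

```python
# Toy RAID-4-style striping: data is split across two data drives plus
# one XOR parity drive. Each drive sees only fragments and parity bytes,
# not whole records -- which is why on-drive analytics needs the file
# system's help to know what the blocks actually mean.

def stripe(record: bytes):
    d0 = record[0::2]                     # even bytes -> drive 0
    d1 = record[1::2]                     # odd bytes  -> drive 1
    if len(d0) > len(d1):                 # pad so lengths match for XOR
        d1 += b"\x00"
    parity = bytes(a ^ b for a, b in zip(d0, d1))
    return d0, d1, parity

def rebuild_drive1(d0, parity):
    """Lose drive 1? XOR of drive 0 and parity recovers it."""
    return bytes(a ^ p for a, p in zip(d0, parity))

d0, d1, parity = stripe(b"temperature=451")
# No single drive holds the intact record...
assert b"temperature" not in d0 and b"temperature" not in d1
# ...but the stripe survives the loss of a drive.
assert rebuild_drive1(d0, parity) == d1
```

The resiliency and the analytics problem are two sides of the same layout: the striping that lets the pool survive a failed drive is exactly what leaves each controller holding uninterpretable fragments.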
This means that unless Grider and his team wanted to build a file system for storage accelerated analytics from scratch – something that would be no small task – they were going to have to get creative.
“We don’t want a LANL-only solution here because we need to buy the stuff and have it be supported,” he emphasizes.
Instead, LANL opted to modify an existing file system – ZFS, the Zettabyte File System created by the former Sun Microsystems years ago specifically for large, resilient pools of the disk drives we all lovingly call spinning rust these days. ZFS can also be scaled across multiple nodes using Gluster, a clustered file system that was acquired by Red Hat a long time ago. For the analytics side, engineers worked to adapt the file system for use with Apache’s analytics stack.
“We tried to stick with standard tools anyone can use,” Grider says.
But while Los Alamos and its partners were able to cobble together a working file system capable of on-drive analytics processing, it’s not exactly something that’s ready for just anyone to deploy.
Instead, Grider hopes that by using well-established file systems and analytics tools, they stand a better chance of convincing standards bodies and software developers to add the necessary features to bring this functionality to a more mainstream audience.
A Long Road Ahead
In that regard, Grider notes there is a lot of work to be done. “It’s going to be a fairly long road to get to the point where this is consumable,” he says. “The next thing that we’ll be doing is to turn this into some sort of object model instead of blocks under files.”
For now, Los Alamos is content with offloading a handful of analytics functions to the disk controllers.
“We’re not moving the entire analytics workload to disk drives – largely it’s just a reduction, and some joining and things like that,” Grider says. “But from a simplified point of view, reductions happen at the device level, and then joins and sorts and things like that typically happen somewhere in flash or memory.”
The limiting factor, however, isn’t processing power so much as the tiny amount of memory built into each disk. “It doesn’t have enough memory to do a sort, it can only do a select,” Grider says of the individual drives. “The question is more how do you put enough memory on there to do something more than fairly simple things.”
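The constraint Grider describes maps onto the classic distinction between streaming and blocking operators: a select only ever needs the current block in memory, while a sort must materialize (or repeatedly re-read) the entire dataset. A hedged Python sketch of that distinction, with made-up block sizes and data:

```python
# A select can run in constant memory: examine one block at a time, emit
# matching rows, forget the block. A sort cannot -- it has to buffer
# everything, which is why it gets pushed up to flash or host memory.
# The block layout and data here are made up for illustration.

def blocks():
    """Stand-in for blocks streaming off the drive."""
    data = [5, 3, 8, 1, 9, 2, 7, 4, 6, 0]
    for i in range(0, len(data), 2):      # tiny 2-value "blocks"
        yield data[i:i + 2]

def select_on_device(predicate):
    """Constant-memory filter: fits in a controller's small RAM."""
    for block in blocks():                # one block resident at a time
        for row in block:
            if predicate(row):
                yield row

def sort_on_host():
    """A sort must hold the full dataset -- too big for the controller."""
    everything = [row for block in blocks() for row in block]
    return sorted(everything)

assert list(select_on_device(lambda r: r > 6)) == [8, 9, 7]
assert sort_on_host() == list(range(10))
```

Note that the select preserves the on-disk order of the survivors – it never sees more than one block – whereas the sort needs a global view, which is exactly the memory Grider says the drives lack.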