“Ingest it all, keep everything, find that needle in the haystack,” they exclaimed.
“Storage is cheap, analytics are powerful, there’s nothing you can’t do if you just keep all that data. Someday you might use it because business insights!”
Remember that big data party? The one that spawned its own industry just a decade ago? Well, it’s over and now there is quite a mess to clean up.
This is not to say that organizations of all sizes are no longer keeping everything or trying to get more sophisticated with analytics. The smartest and best-funded can feed at least some of it into AI/ML training algorithms these days. But most are just struggling with storehouses of data, laden as that data is with the fading promise of golden info-nuggets if one just looks carefully enough.
This is not necessarily a problem of retention. Many industries require it. It is, however, a problem of intention.
This creates a new set of problems and an emerging opportunity for startups to help intelligently pick through the heap and automatically decide what gets trashed and what still holds value. What is interesting about this is that the same technology wave that stopped big data analytics in its tracks is the one that might be best suited to clean house. That is AI/ML, but there are some technical catches.
For instance, how does an algorithm know what to pick? Who sets those parameters? And with data that can be broadly similar in format but not content, how can training be guided without massive overfitting? Then there are the insane costs of training on data that is, by definition, too large to deal with in the first place. Such software or services would have to work on a customer-by-customer basis, and there is no way to scale a startup around that in a way more meaningful than shoving data more elegantly into larger boxes.
And here’s another contemporary point to consider. For large enterprises, what’s to be done when all that data is no longer valid? As in, take all that historical data we have about user/customer behavior up until 2020 and throw it out. Pitch it. Because for some organizations — particularly those driven by patterns in consumption of everything from travel to food to clothing — that old data no longer fits the present. For many businesses that could forecast with clockwork certainty, that clarity is no longer possible. What good is all the data in the world if patterns no longer exist? They can toss it. They can cheaply store it all indefinitely. They can keep trying to mine it for old times’ sake. But 2020 rendered exabyte upon exabyte useless.
For those still recovering from the big data party who, even after the Great Data Shakeup of 2020, believe in those gold nuggets in their data, the path forward is just as complex as it is for those who need to start from scratch.
The easiest path is compression. It isn’t new, though there are AI-driven techniques that update it. But are new compression techniques enough for data at this scale, and is compression too blunt a tool for the job?
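To see why compression is the easy path, and also why it can be a blunt one: error-bounded lossy compressors, the family most relevant to scientific data, quantize values to a guaranteed accuracy bound and then entropy-code the result. Below is a minimal sketch of that quantize-then-encode idea in plain Python, with zlib standing in for a real encoder; the function names and `tolerance` parameter are illustrative, not any shipping tool’s API:

```python
import math
import struct
import zlib

def compress_lossy(values, tolerance):
    # Quantize each float to a grid of width 2*tolerance, then let a
    # generic byte-level compressor (zlib) exploit the repetition that
    # quantization exposes. Reconstruction error is bounded by `tolerance`.
    quantized = [round(v / (2 * tolerance)) for v in values]
    raw = struct.pack(f"{len(quantized)}q", *quantized)
    return zlib.compress(raw)

def decompress_lossy(blob, count, tolerance):
    # Invert the encoding: inflate, unpack, and snap back to grid centers.
    quantized = struct.unpack(f"{count}q", zlib.decompress(blob))
    return [q * 2 * tolerance for q in quantized]

# Smooth, simulation-like data shrinks far below its raw 8-bytes-per-value size.
data = [math.sin(i / 100) for i in range(10_000)]
blob = compress_lossy(data, tolerance=1e-3)
restored = decompress_lossy(blob, len(data), tolerance=1e-3)
```

The catch is visible even in this toy: someone who knows how much error the science can absorb has to choose the tolerance, per dataset, which is exactly the kind of hands-on judgment that resists being packaged as a generic service.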
These are questions the U.S. Department of Energy is exploring. And here’s why that agency matters most in this conversation: it controls the purveyors, collectors, warehouses, and waypoints for many exabytes across many sites. And it’s not just about volume; it’s about variety too (simulation data, instrument collections, etc.). So how is it going to sort the garbage? And how much of that is on tape? On disk? In situ? It pains the brain.
The irony is that after years of supporting efforts around big data collection, storage, and analysis, the DOE has an expensive new problem to tackle: trimming down and selectively ditching all that data.
The agency announced this week close to $14 million to support research to address these problems under the banner of “data reduction”.
The nine funded approaches under this investment range from workflow-specific workarounds that limit data in the moment, to system co-design concepts, to summarization and good old compression. But a survey of those nine selected projects puts the brunt of the task on compression, or on projects with wordy titles that never mention compression but, on closer reading, are about compression at different layers of the stack.
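Summarization, one of the non-compression approaches in the mix, can be as simple as deciding in the moment which records to keep. A classic sketch of that idea is reservoir sampling, which maintains a fixed-size uniform sample of a stream using memory proportional only to the sample size; this is a generic illustration, not code from any of the nine funded projects:

```python
import random

def reservoir_sample(stream, k, seed=0):
    # Algorithm R: after seeing i items, every item so far has probability
    # k/i of sitting in the reservoir, so the final sample is uniform.
    # Memory stays O(k) no matter how long the stream runs.
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = rng.randrange(i + 1)
            if j < k:
                sample[j] = item
    return sample

# A million-point stream is reduced in flight; only k points are ever stored.
kept = reservoir_sample(range(1_000_000), k=100)
```

The appeal for in-the-moment reduction is that the full stream never touches disk; the obvious limitation is that a uniform sample can discard exactly the rare events, the needles in the haystack, that an experiment cares about.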
Oak Ridge National Laboratory’s award is around compression methods for streaming data; Lawrence Livermore’s work expands a framework called “ComPRESS,” a compression and retrieval mechanism for exascale simulations and sensors; and work funded at Texas State covers the automatic generation of algorithms for fast lossy compression.
All of this raises the question: is “data reduction” the replacement term for compression? Mostly.
Plenty of storage companies push it this way too, with so-called compress-and-cram techniques, while other approaches simply limit the number of calculations within the algorithm from the get-go; that is not a broad technique but one left to each application owner. This is an oversimplification for the sake of brevity, but the point is that data reduction is difficult to generalize as a service beyond compression alone.
And how can data reduction be a new frontier when neither compression nor the problem of overwhelming data is fresh?
Some of the projects funded point to new directions. For instance, researchers at Fermi National Accelerator Laboratory will develop techniques for encoding advanced compression and filtering, including those based on machine learning methods, as custom hardware accelerators for use in a wide array of experimental settings — from particle physics experiments to electron microscopes.
Blending co-design and AI/ML might be the next opportunity for data reduction. Some of these approaches require site-specific, hands-on work, but we might see a new generation of startups focused on hardware (compute and storage) that does this on the fly, paired with AI techniques for deciding what to keep and what to throw away. Watching what the DOE does on this front is critical, as we will get a bird’s-eye view of what works at scale and what takes the most manual intervention, something that is nearly impossible at exascale.
“Scientific user facilities across the nation, including the DOE Office of Science, are producing data that could lead to exciting and important scientific discoveries, but the size of that data is creating new challenges,” said Barb Helland, Associate Director for Advanced Scientific Computing Research, DOE Office of Science. “Those discoveries can only be uncovered if the data is made manageable, and the techniques employed to do that are trusted by the scientists.”