Layering on Machine Learning to Speed Data Transformation
February 15, 2016 Nicole Hemsoth
There are few more widely recognized names in modern database research than Dr. Joseph Hellerstein. The Berkeley professor and Trifacta co-founder has spawned new approaches to relatively old problems on the programmatic and database design and implementation fronts.
Well before the tech world was awash in tales of “big data” woes, Hellerstein and his teams were looking ahead at the future problems of data manipulation, transformation, and visualization. That work culminated in the Wrangler project, which matched data manipulation and visualization tools with several new layers of automation and flexibility. At the time, around 2011, these extended what databases could do and, just as important, the focus on performance meant the work could be done faster and more efficiently.
For anyone who has followed news about the open source Wrangler project that Hellerstein and collaborators from Stanford developed, and how that fed into the startup Trifacta, it will be clear that the work had value outside of research contexts. And chances are, if you’ve been following Trifacta beyond research, it’s because the company has scored an incredible amount of funding since its launch ($76 million, including last week’s most recent influx of $35 million) and has notable use cases across a large swath of the Fortune 500, with companies like Time Warner, Intel, Thomson Reuters, Dow, Capital One, and many others climbing on board with its approach to data transformation, preparation, and exploration. What is notable here is that in an ecosystem so crowded with analytics, visualization, and data staging vendors, Trifacta has managed not only to stand out, but to stand alone. And in a relatively short amount of time to boot.
With a start in 2012, the company relied on strategic partnerships with all three of the leading Hadoop distribution vendors, as well as a number of other data source providers and SaaS providers. The research basis, as noted previously, was a look at the different ways people were interacting with data, and the finding that almost 80% of their time was spent simply getting data into shape before it could be analyzed. As Hellerstein tells The Next Platform, “as we talked to people in the field, a large percentage of their day was being spent manipulating data so running an analysis, whether it was clustering or machine learning or something simpler, was often prefaced by many hours of transforming the data to get it into shape so it could be plugged into these algorithms. So even then before we started, we were finding the thing that presented the most interesting research problem was also the problem that represented the lion’s share of the workload.”
At the time Trifacta got its start, the main way to tackle the problem of transforming and preparing data was to write programs. That approach was technical and laborious by nature, and it was also unintuitive. “This approach doesn’t provide the right intuitive feedback while you’re doing it. You’re not seeing what’s happening to the data; you’re manipulating programming statements that don’t align to the real structure and content of it.” The other way is a graphical programming approach in which icons are connected on a canvas to build a data flow graph. This moves things up to a slightly higher level, which is valuable, but it is still abstract and disconnected from the data and the problems one is trying to solve, Hellerstein says. “With these approaches, you’re not seeing actual data, but a description of what you’ll do with the data. It’s all programming in the abstract. That was the state of the art when we started working on these problems.”
Even with the emerging approaches, existing practices, including spreadsheets, offered some early inspiration. With spreadsheets, the data is right there all the time and can be directly manipulated. This works well at small scale, but it does not translate to larger datasets, which presents yet another challenge. Hellerstein and team decided to take a best-of-all-worlds approach and blend that direct manipulation benefit with more specific features and interfacing options that were cropping up in data flow graphs and other areas. The ultimate result of this work was something Trifacta calls “predictive interaction,” which blends these capabilities and moves along with the user, learning from the process under the covers.
The concept is not so unlike Google’s search function, which predicts what you might be looking for in the text entry window. “When you enter anything on screen, whether you’re looking at a table, a bar chart, we’re translating that to cards at the bottom of the screen that contain a visualization of what the outcome might be, along with a rank-ordered list of the operations that might be needed so users can refine them.” While Hellerstein didn’t give much of a look at the machine learning algorithms that underlie this, it is worth mentioning that the problem being targeted, the very lengthy time required for data transformation, can be significantly reduced in this manner, just as the time spent on Google searches is (a saving that would be noticeable if searches were done in great volume throughout the day).
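To make the idea of a ranked list of suggested operations more concrete, here is a minimal sketch in Python. It is not Trifacta's algorithm, which the article does not describe. It assumes a user has highlighted a piece of text in one row, generates two hypothetical candidate extraction patterns (the exact text, and a generalized "shape" of digits and letters), and ranks them by the share of rows each one matches. The function names `shape_pattern` and `suggest` and the coverage-based ranking are illustrative assumptions.

```python
import re


def shape_pattern(selection: str) -> str:
    """Generalize highlighted text into a regex describing its 'shape'.

    Runs of digits become \\d+, runs of ASCII letters become [A-Za-z]+,
    and every other character is kept as an escaped literal.
    """
    parts = []
    for tok in re.findall(r"\d+|[A-Za-z]+|.", selection, flags=re.S):
        if re.fullmatch(r"\d+", tok):
            parts.append(r"\d+")
        elif re.fullmatch(r"[A-Za-z]+", tok):
            parts.append(r"[A-Za-z]+")
        else:
            parts.append(re.escape(tok))
    return "".join(parts)


def suggest(selection: str, rows: list[str]) -> list[tuple[str, str, float]]:
    """Return (label, regex, coverage) candidates, highest coverage first.

    Coverage is the fraction of rows in which the pattern occurs. This is
    a stand-in scoring rule for illustration, not a learned model.
    """
    candidates = [
        ("extract this exact text", re.escape(selection)),
        ("extract text with the same shape", shape_pattern(selection)),
    ]
    scored = []
    for label, pattern in candidates:
        hits = sum(1 for row in rows if re.search(pattern, row))
        coverage = hits / len(rows) if rows else 0.0
        scored.append((label, pattern, coverage))
    return sorted(scored, key=lambda s: s[2], reverse=True)
```

Calling `suggest("2016-02-15", rows)` on a handful of log lines would return both candidates with their coverage, leaving the user to pick or refine one. Ranking on coverage alone always favors the more general pattern, so a real system would presumably weigh other signals as well.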
These first functions have since been honed in Trifacta’s offerings, as users have been able to highlight features in their data, get suggestions, and move forward faster. There is no automatic algorithm to magically transform and clean data, but this approach can be considered as “having the algorithms participate” in the process by speeding it along in a visually interactive mode.
In practice, one might receive a massive file of data that looks, in the raw, like a bunch of garbled text. The manual challenge would have been to find out where this fits into rows and columns—what the structure of that data might be. However, with the predictive weight of a pattern recognition algorithm offered as a suggestion, the system can layer on the patterns, refining that data into where the natural rows and columns are, then how further patterns in even one row or column should fit together. It is difficult to express how much time this might save, particularly at the beginning stages of being handed a raw file of nonsense and making it all fit together.
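As a rough illustration of how a system might propose where the columns are in a raw file, the Python sketch below scores a few candidate delimiters by how consistently each one splits the non-empty lines into the same number of fields, then returns them as ranked suggestions. This is a simple heuristic under stated assumptions, not the pattern recognition Trifacta uses. The candidate list and the name `rank_delimiters` are hypothetical.

```python
from collections import Counter

# Assumed set of delimiters to try; a real tool would consider many more patterns.
CANDIDATE_DELIMITERS = [",", "\t", "|", ";"]


def rank_delimiters(raw: str, candidates=CANDIDATE_DELIMITERS):
    """Return (delimiter, columns, score) tuples, best guess first.

    For each delimiter, find the most common number of occurrences per line.
    The score is the fraction of lines that agree with that count, or 0.0
    if the most common count is zero (the delimiter mostly does not appear).
    """
    lines = [ln for ln in raw.splitlines() if ln.strip()]
    if not lines:
        return []
    ranked = []
    for delim in candidates:
        per_line = Counter(ln.count(delim) for ln in lines)
        count, freq = per_line.most_common(1)[0]
        score = freq / len(lines) if count > 0 else 0.0
        ranked.append((delim, count + 1, score))
    return sorted(ranked, key=lambda r: r[2], reverse=True)
```

The result is a list of suggestions, not a decision: the user can accept the top-ranked delimiter or pick another, which matches the article's point that the algorithm participates while the human stays in control.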
The result is not always correct the first time, Hellerstein says, pointing again to the algorithm as a “participant” rather than an automatic tool for data transformation, but there are many suggestions and algorithmic techniques that can draw on machine learning to help users get to where they need to be. With the human and computer interacting more closely, and with the human having ultimate control and understanding of the patterns being found and their relationship to the problem, the time-consuming task of data transformation can be cut down dramatically.
The ease of use is another feature that Trifacta is trumpeting, especially as its user base has grown. Putting that kind of analytical power into the hands of an ever-growing set of non-specialists and programmers means the ability to discover connections is more democratized. And after all, isn’t data democratization part of what this whole “big data” craze has been about?
For a company like Lockheed Martin, which has been lending its computer science skills to the Centers for Medicare and Medicaid Services to combat fraud, sifting through claims data meant analyst teams needed to standardize on a platform and then work through several disparate data types. The transformation process, according to Trifacta, took six weeks initially, a time they were able to reduce to a day. Storage and technology vendor EMC was stuck building complex scripts to prepare performance, maintenance, and other data from its products at customer sites for analysis. Teams there were burdened by this challenge, both in terms of people and time. PepsiCo, LinkedIn, RBS, and a number of others that had been doing transformation using custom scripts or other manual approaches have moved to Trifacta, and those success stories are apparently paying off. The company is adding more people around the world and, armed with the additional funding, is set to be one of the great data-driven success stories of the year, if it wasn’t already for 2015.