Layering on Machine Learning to Speed Data Transformation

Few names in modern database research are more widely recognized than that of Dr. Joseph Hellerstein. The Berkeley professor and Trifacta co-founder has spawned new approaches to relatively old problems in programming and in database design and implementation.

Well before the tech world was awash in tales of “big data” woes, Hellerstein and his teams were looking ahead at the future problems of data manipulation, transformation, and visualization. That work culminated in the Wrangler project, which matched data manipulation and visualization tools with several new layers of automation and flexibility. At the time, around 2011, these extended what databases could do and, just as important, the focus on performance meant they could do it all faster and more efficiently.

For anyone who has followed news about the open source Wrangler project that Hellerstein and collaborators from Stanford developed, and how it fed into the startup Trifacta, it will be clear that the work had value outside of research contexts. And chances are, if you’ve been following Trifacta beyond research, it’s because the company has scored a considerable amount of funding since its launch ($76 million, including last week’s influx of $35 million) and has notable use cases across a large swath of the Fortune 500, with companies like Time Warner, Intel, Thomson Reuters, Dow, Capital One, and many others climbing on board with its approach to data transformation, preparation, and exploration. What is notable here is that in an ecosystem so crowded with analytics, visualization, and data staging vendors, Trifacta has managed not only to stand out but to stand alone, and in a relatively short amount of time to boot.

With a start in 2012, the company relied on strategic partnerships with all three of the leading Hadoop distribution vendors, as well as a number of other data source providers and SaaS providers. The research basis, as noted previously, was in considering the different ways people were interacting with data, and in the observation that almost 80% of their time was spent simply getting data in shape before it could be analyzed. As Hellerstein tells The Next Platform, “As we talked to people in the field, a large percentage of their day was being spent manipulating data, so running an analysis, whether it was clustering or machine learning or something simpler, was often prefaced by many hours of transforming the data to get it into shape so it could be plugged into these algorithms. So even then, before we started, we were finding the thing that presented the most interesting research problem was also the problem that represented the lion’s share of the workload.”

At the time Trifacta got its start, the main way to tackle the problem of transforming and preparing data was to write programs. Technical and laborious by nature, that approach was also unintuitive. “This approach doesn’t provide the right intuitive feedback while you’re doing it: you’re not seeing what’s happening to the data; you’re manipulating programming statements that don’t align to the real structure and content of it.” The other way is a graphical programming approach where icons can be connected on a canvas to build a data flow graph. This moves things up to a slightly higher level, which is valuable, but it is still abstract and disconnected from the data and the problems one is trying to solve, Hellerstein says. “With these approaches, you’re not seeing actual data, but a description of what you’ll do with the data. It’s all programming in the abstract. That was the state of the art when we started working on these problems.”

Even with the emerging approaches, existing practices, including spreadsheets, offered some early inspiration. With spreadsheets, the data is right there all the time and can be directly manipulated. This works well at small scale, but it does not translate to larger datasets. Hellerstein and team decided to take a best-of-all-worlds approach and blend that direct manipulation benefit with more specific features and interfacing options that were cropping up in data flow graphs and other areas. The ultimate result of this work was something Trifacta calls “predictive interaction,” which blends these capabilities and moves along with the user, learning from the process under the covers.

The concept is not so unlike Google’s search function, which predicts what you might be looking for as you type in the text entry window. “When you enter anything on screen, whether you’re looking at a table, a bar chart, we’re translating that to cards at the bottom of the screen that contain a visualization of what the outcome might be, along with a rank order list of the operations that might be needed so users can refine them.” While Hellerstein didn’t give much of a look at the machine learning algorithms that underlie this, it is worth mentioning that the problem being targeted, the very lengthy time required for data transformation, can be significantly reduced in this manner, just as the time spent on Google searches is (noticeably so when searches are done in great volume throughout the day).
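To make the idea of a ranked list of suggested operations concrete, here is a minimal Python sketch. It is not Trifacta’s algorithm, which Hellerstein did not describe; it assumes a user has highlighted one example value in a column, generalizes that example into a few candidate extraction patterns, and ranks them by how many rows each would cover. The function names (`generalize`, `suggest_extractions`), the three candidate types, and the coverage-based ranking are all assumptions made for illustration.

```python
import re


def generalize(example, exact_lengths):
    """Turn a highlighted example into a regex describing its 'shape'."""
    parts = []
    # Tokenize into digit runs, ASCII letter runs, and single other characters.
    for token in re.findall(r"[0-9]+|[A-Za-z]+|.", example, flags=re.S):
        if re.fullmatch(r"[0-9]+", token):
            parts.append("[0-9]{%d}" % len(token) if exact_lengths else "[0-9]+")
        elif re.fullmatch(r"[A-Za-z]+", token):
            parts.append("[A-Za-z]{%d}" % len(token) if exact_lengths else "[A-Za-z]+")
        else:
            parts.append(re.escape(token))  # punctuation and spaces stay literal
    return "".join(parts)


def suggest_extractions(column, example):
    """Return candidate extract operations, best coverage first (assumed heuristic)."""
    if not column:
        return []
    candidates = [
        ("literal text", re.escape(example)),
        ("same shape, same lengths", generalize(example, True)),
        ("same shape, any lengths", generalize(example, False)),
    ]
    suggestions = []
    for label, pattern in candidates:
        regex = re.compile(pattern)
        matches = [regex.search(value) for value in column]
        preview = [m.group() if m else None for m in matches]
        coverage = sum(p is not None for p in preview) / len(column)
        suggestions.append(
            {"label": label, "pattern": pattern, "coverage": coverage, "preview": preview}
        )
    # Stable sort: ties keep the more specific candidate first.
    suggestions.sort(key=lambda s: s["coverage"], reverse=True)
    return suggestions


column = ["order 2015-06-01 ok", "order 2014-12-31 late", "order 7-1-15 ?"]
suggestions = suggest_extractions(column, "2015-06-01")
for s in suggestions:
    print(f'{s["coverage"]:.2f}  {s["label"]:<26} {s["preview"]}')
```

Each suggestion carries a preview of what it would extract, which plays the role of the “card” in the quote: the user sees the likely outcome and picks or refines a candidate rather than writing the pattern by hand.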

These first functions have since been honed in Trifacta’s offerings, and users are able to highlight features in their data, get suggestions, and move forward faster. There is no automatic algorithm to magically transform and clean data, but this approach can be considered as “having the algorithms participate” in the process by speeding it along in a visually interactive mode.

In practice, one might receive a massive file of data that looks, in the raw, like a bunch of garbled text. The manual challenge would have been to work out how this fits into rows and columns, that is, what the structure of the data might be. However, with the predictive weight of a pattern recognition algorithm offered as a suggestion, the system can layer on the patterns, first finding where the natural rows and columns are, then how further patterns within even one row or column should fit together. The time saved is hard to quantify, but it is likely to be greatest at the beginning stages, when one is handed a raw file of apparent nonsense and has to make it all fit together.
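As a rough illustration of how a system might suggest where the rows and columns are, the Python sketch below scores a handful of candidate delimiters by how consistently they split each line into the same number of fields. This is a simple heuristic chosen for illustration, not the pattern recognition Trifacta uses; the candidate list and the function names `rank_delimiters` and `split_rows` are assumptions.

```python
from collections import Counter

# Hypothetical candidate set; a real tool would consider many more patterns.
CANDIDATE_DELIMITERS = [",", "\t", "|", ";"]


def rank_delimiters(raw_text):
    """Rank candidate delimiters by how consistently they split the lines."""
    lines = [line for line in raw_text.splitlines() if line.strip()]
    if not lines:
        return []
    ranked = []
    for delim in CANDIDATE_DELIMITERS:
        # Number of fields each line would have if split on this delimiter.
        field_counts = [line.count(delim) + 1 for line in lines]
        modal_count, hits = Counter(field_counts).most_common(1)[0]
        if modal_count < 2:
            continue  # most lines would not be split at all
        consistency = hits / len(lines)
        ranked.append((consistency, modal_count, delim))
    # Most consistent first; ties go to the delimiter yielding more columns.
    ranked.sort(key=lambda t: (t[0], t[1]), reverse=True)
    return [(delim, cols, score) for score, cols, delim in ranked]


def split_rows(raw_text, delim):
    """Apply the chosen delimiter to produce rows of fields."""
    return [line.split(delim) for line in raw_text.splitlines() if line.strip()]


raw = "id|name|city\n1|Ann, Jr.|Oslo\n2|Bob|Rome\n"
ranking = rank_delimiters(raw)
best_delim = ranking[0][0]
rows = split_rows(raw, best_delim)
print(ranking)
print(rows)
```

The ranking is returned as a list rather than a single answer so that, in the spirit of the approach described above, the user can see the alternatives and override the top suggestion when it is wrong.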

It is not always correct the first time, Hellerstein says, pointing again to the algorithm as a “participant” rather than an automatic tool for data transformation, but there are many suggestions and algorithmic techniques that can draw on machine learning to help users get to where they need to be. With the human and computer interacting more closely, and with the human retaining ultimate control and understanding of the patterns being found and their relationship to the problem, the time-consuming task of data transformation can be cut down dramatically.

Ease of use is another feature that Trifacta is trumpeting, especially as its user base has grown. Putting that kind of analytical power into the hands of an ever-growing set of non-specialists and programmers means the ability to discover connections is more democratized. And after all, isn’t data democratization part of what this whole “big data” craze has been about?

For a company like Lockheed Martin, which has been lending its computer science skills to the Centers for Medicare and Medicaid Services to combat fraud, sifting through claims data meant analyst teams needed to standardize on a platform and then work through several disparate data types. The transformation process, according to Trifacta, initially took six weeks, a time they were able to reduce to a day. Storage and technology vendor EMC was stuck building complex scripts to prepare performance, maintenance, and other data from its products at customer sites for analysis. Teams there were burdened by this challenge, both in terms of people and time. PepsiCo, LinkedIn, RBS, and a number of others that had been doing transformation using custom scripts or other manual approaches have moved to Trifacta, and apparently those moves are paying off. The company is adding more people around the world and, armed with the additional funding, is set to be one of the great data-driven success stories of the year, if it wasn’t already for 2015.



  1. They need to understand that not all the businesses in this world are HADOOP/SPARK ready. I am just wondering why on earth they do not sell a PAID desktop edition able to prepare, clean & manipulate data stored in old EDW like Oracle, MS-SQL, Teradata etc.

