Arrow Hits the Mark for Petabyte-Class Analytics Problems

When we first talked to Voltron Data following their launch in early 2022, we had to take care to explain why Apache Arrow was worth paying attention to and why it might warrant the level of enterprise support the startup promised.

Even more explanation was needed to justify the $110 million in funding backing an open source-driven company, especially one built around Apache Arrow, which lacked the status of Spark and similar platforms. It turns out that Voltron Data was on the front end of a new wave of interest in Arrow, which provides a standard, high-performance, language-agnostic in-memory columnar format for passing data between languages and frameworks.

The real value, beyond that, was that it let users seamlessly share and chew on data across diverse analytics tools without hits to performance. This was being proven out at petabyte scale among hedge funds and at Meta, as well as by vendors Snowflake and DataStax, according to the startup’s CEO, Josh Patterson, who spoke with The Next Platform this week following some news at HPE Discover.
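To make the idea concrete, here is a minimal sketch of that interchange using pyarrow; the tickers, prices, and file name are invented for illustration, and pandas and Parquet stand in for whatever tools sit on either side of the pipeline.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Build a table once in Arrow's standard columnar in-memory format.
table = pa.table({
    "ticker": ["AAPL", "MSFT", "NVDA"],
    "price": [189.5, 402.1, 878.4],
})

# The same columnar buffers can be handed to other tools with little or no
# reserialization, for example as a pandas DataFrame for analysis...
df = table.to_pandas()

# ...or persisted as Parquet for any Arrow-aware engine to pick up later.
pq.write_table(table, "prices.parquet")
```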

Since that timely launch, the company has picked up a number of customers in financial services and government, Patterson says. Voltron Data’s Arrow-driven “Theseus” engine will be available across HPE systems as part of the HPE Ezmeral Unified Analytics Software platform and on HPE GreenLake.

If Apache Arrow has flown farther than anyone expected since emerging in 2016, it is in no small part because of Patterson himself. He led the team at NVIDIA that created RAPIDS, a suite of GPU-native software libraries and APIs that speed up data science and machine learning tasks. Apache Arrow served as a cross-language development platform for in-memory data, which made data interchange between RAPIDS and other data processing frameworks possible while letting the GPU accelerate data preprocessing, modeling, and analysis.
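As a rough sketch of what that interchange looks like in practice (assuming a machine with an NVIDIA GPU and the RAPIDS cudf package installed; the toy data is made up), an Arrow table can move into GPU memory, be processed there, and come back out as Arrow:

```python
import pyarrow as pa
import cudf  # RAPIDS; requires an NVIDIA GPU

arrow_table = pa.table({"user": [1, 2, 3], "clicks": [10, 42, 7]})

# Move the Arrow table into GPU memory as a cuDF DataFrame...
gdf = cudf.DataFrame.from_arrow(arrow_table)

# ...run GPU-accelerated preprocessing on it...
top = gdf.sort_values("clicks", ascending=False).head(2)

# ...and hand the result back to any Arrow-speaking framework.
result = top.to_arrow()
```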

The next iteration of the Arrow-driven work is Theseus, which will soon be part of HPE Ezmeral. It takes many of the same lessons from RAPIDS, serving as a distributed execution engine for data processing at a scale that goes beyond the capabilities of CPU-based analytics systems like Apache Spark, which Patterson describes as bringing a “GPU native compute engine directly to their data.” It has hooks into common data platforms and projects including Apache Arrow, Ibis, RAPIDS, Substrait, Velox, and others, and can run on all common hardware platforms.
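Ibis is a useful illustration of what those hooks buy, even without access to Theseus itself: the same Ibis expression can be pointed at different execution engines. The sketch below runs against DuckDB purely as a stand-in backend, and the file name and column names are hypothetical.

```python
import ibis

con = ibis.duckdb.connect()                 # any other Ibis backend could slot in here
trades = con.read_parquet("trades.parquet")

expr = (
    trades.filter(trades.amount > 0)
          .group_by("ticker")
          .aggregate(total=trades.amount.sum())
)

# Results come back as an Arrow table, keeping the pipeline in Arrow end to end.
arrow_result = expr.to_pyarrow()
```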

With Theseus, Patterson says, “Arrow is at the core of everything we do. It lets us break down silos so if you look at someone like Vast Data, their new data compute engine/computational storage layer is built on Apache Parquet and Arrow with Arrow Flight for serving data. Snowflake needed a lighter weight version of Arrow and has become a popular connector for several frameworks and applications across languages.” He adds that pushing enterprise support for Arrow, which is the root of Voltron Data’s business, has helped build trust in the platform, and now, with Theseus, the company can connect many data sources and languages to accelerate different data pipelines.
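Arrow Flight, the data-serving layer Patterson mentions, is worth a quick sketch of its own. This is a generic pyarrow Flight client, not Vast Data’s actual API; the server URI and dataset path are placeholders, and it assumes a Flight server is already running.

```python
import pyarrow.flight as flight

client = flight.connect("grpc://localhost:8815")

# Ask the server how to fetch a named dataset...
descriptor = flight.FlightDescriptor.for_path("warehouse", "trades")
info = client.get_flight_info(descriptor)

# ...then stream it back as Arrow record batches, with no row-by-row serialization.
reader = client.do_get(info.endpoints[0].ticket)
table = reader.read_all()
print(table.num_rows)
```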

The company’s co-founder, Rodrigo Aramburu, adds that many of their customers have built much of their data infrastructure in house from scratch. Because Theseus has no user interface of its own (it relies on standard open source interfaces), has no storage layer, and works with the common open source distributed file systems and file formats, Voltron Data can simply show up with a query execution engine and go.

Patterson says that the appeal of Theseus will be keenly felt in areas like financial services, cybersecurity, media and entertainment, recommendation systems, telcos, and defense, among others. “Financial services has always been on the forefront of building state of the art models, retraining rapidly, and deploying these models to do massive amounts of back testing. It’s that constant re-evaluating of massive amounts of data with a short time to analyze it. That’s a wall, and this wall of performance is where CPU systems can’t keep up.”

He adds that the entirety of the Fortune 500 will hit this wall in the next five years.

“We’re working with those massive data footprint customers and helping them integrate a lot of their data silos together to solve some of these software problems at petabyte scale and enterprise scale. From GreenLake to Ezmeral, it’s possible to consolidate a bunch of data silos through the HPE data fabric and make it simple to deploy to a hybrid cloud platform. This can bring accelerated GPU-native compute directly to where the data lives,” Patterson tells us.

“We believe data has gravity and the more data that people have and start to use, the harder it gets to move around. Because of that, we didn’t want to build a database or a data lakehouse, we wanted to focus on building a high performance accelerated native query engine and bring that to the data,” Patterson concludes.

1 Comment

  1. Exactly! Moving those portly gravitational data silos around, through multiple load-bearing walls, in a minotaur-infested labyrinth, and over to centralized computational engines — just to query them — is a Herculean task typical of brawn-over-brains approaches to data processing. I’m with Theseus on this, following Ariadne’s French rocket threads through Apache’s flying Arrow of GPU-accelerated time, and sending those queries themselves direct into the minotaur’s data pipeline, for massive back testing (or somesuch!)! 8^p

    If it’s good enough for the Fortune 500, it’s good enough for me!
