The ever-growing open source data stack ushered in yet another piece this morning with Apache Arrow, a new top-level project providing a columnar in-memory layer that focuses on higher performance data processing. The leads behind the project made the bold prediction that over the next few years, “the majority of the world’s data will be processed through Arrow” but sit tight developers, this is not necessarily for you to implement directly.
As it stands, Arrow is being cobbled together from the foundation point of Apache Drill, which the Arrow heads helped get off the ground and integrated into the broader Hadoop, SQL, and NoSQL stack. Key aspects of its integration are already coming from lead developers behind other notable projects, including Cassandra, Hadoop, HBase, Parquet, Spark, Storm and newer efforts, including Pandas, Ibis, and Kudu. Which could also be rattled off as a very chaotic day at the zoo. And maybe that’s not so far off…
That above list of participants makes one wonder how long the parade of new additions will continue and when that highly sought-after “big data” stack will get too top-heavy. The image of a mighty elephant, covered in so many pieces of contrasting fabric so that it is blindly bounding forward as a mass of screaming colors and undiscernible shape is a strong one these days. But according to Jacques Nadeau, head of the Apache Arrow effort, those tools were setting the stage and now, projects like Arrow can come along, play well with the other critical functional components in that emerging stack, and start focusing on performance and efficiency of compute—something that has played a distant second outside of projects like Cloudera’s Impala as a most prescient example.
Nadeau has a long history in open source data projects, including some of those mentioned above. As one of the founding members of the Hadoop query engine, Apache Drill and data management framework, Apache Calcite, as well as over two years leading a team of architects at Hadoop distribution vendor, MapR, he has been in the trenches understanding problems at scale—and where the performance bottlenecks for big data workloads tend to lie. He tells The Next Platform, that it is time to turn attention to performance and the role of data in relation to the hardware—and at that turn of the conversation, our ears perked up. It is uncommon in data stack conversations to talk about the role of systems and performance, but beyond the same old in-memory processing approaches we’ve seen for years now, what could be added to keep pace with the CPU capabilities on even the most vanilla big data setups?
In many workloads, 70-80% of CPU cycles are spent serializing and deserializing data. Arrow solves this problem by enabling data to be shared between systems and processes with no serialization, deserialization, or memory copies.”
This brings us back to the statement that most of the data in the world will pass through Arrow. At its core, and at a high level, Arrow is focused on a couple of elements. First and foremost, it’s task is lining up the data so CPUs can consume it efficiently and take advantage of CPU features, most notably cache locality, pipelining, superscalar operation, and super-word operation. Storage too has to line up and be efficient. By providing an in-memory representation that can be written to, this means that ultimately, there do not need to be multiple integrations for different elements of the stack to work together. So Parquet can work better with Drill and Impala, for example, as a single layer versus multiple integration efforts. This is a time saver, allows all the data to run through Arrow as that single layer, and thus backs up the idea that a lot of data will float through it—just as a foundational part of many other projects and software elements (namely, the zoo of names mentioned above).
According to Nadeau, “Cache locality, pipelining and superword operations frequently provide 10-100x faster execution performance. Since many analytical workloads are CPU bound, these benefits translate into performance gains, or more plainly, the potential for faster answers and higher levels of user concurrency.”
Columnar in-memory is efficient CPU-wise but Nadeau says the reason people don’t immediately do it is because it’s quite complicated. “There are three levels of complexity; in-memory, in-memory columnar, then in-memory columnar complex data (hierarchical nested data, multi-schema data, etc). So each level of work is more complicated and more difficult to do. Arrow does all three of those levels as a shared approach (versus each project implementing their own).”
As mentioned earlier, the data stack is evolving to focus on what were “luxury” problems five years ago—performance and efficiency. “There are communities that want to get the next level of data processing performance. For example, the Impala research community cited in a paper that to get to that next stage, they needed to move to a canonical in-memory, columnar representation of the data. And if you look at the discussions that Spark developers have had around things like the Tungsten project, it was all about can you improve the in-memory representation to take advantage of cache locality and make sure CPUs are used as efficiently as possible. If you look at Drill and the concept of value vectors, they were trying to improve in-memory columnar processing through specialized columnar representations. The next generation meant moving to an in-memory columnar representation,” Nadeau notes.
It should be clear by now as well that this isn’t necessarily something set to be consumed by data developers, but could nonetheless become a critical part of the open source big data stack as we know it. There is no lockdown, so someone working on Drill, for instance, can layer their own tailoring on top of the custom pile—and the same goes with other projects. Nadeau agrees that there is a fair bit of complexity left, but hopes the community developing around this will work to make development more of a breeze.
In addition to developing the future of Arrow, Jacques heads up a company called Dremio, which is still in stealth. Although a natural assumption would be that they will emerge offering support for Arrow, this is not the case since the tooling is not designed for end user development. Rather, Nadeau expects that vendors will adopt it and integrate it into existing frameworks. As a hypothetical (for now) instance, for Hadoop users that are using Cloudera’s Impala, it might be possible to use Arrow-enabled Impala via Cloudera.