Graph Processing Inches Beyond Pregel
October 27, 2015 Nicole Hemsoth
Although traditional relational databases will not be disappearing anytime soon, for a growing set of data-intensive problems, graph-based approaches are still finding a fit, even well after the boom from web-scale workloads forced companies like Google, Facebook, and others that direction.
This is partly due to the increased functionality, scalability, and programmatic ease of graph processing approaches, but also because if it wasn’t clear before, the days of the traditional relational database are numbered—at least for companies chewing on more than run of the mill back office workloads.
What became apparent in the wake of these graph processing use cases down the scale chain is that for vast amounts of data, pushing a graph query along traditional relational lines is cumbersome and slow. This is certainly the case at the billions of vertices and trillions of edges point, but it is also true far down the chain as well. Accordingly, the goal of several newer approaches that have sprung up over the last decade, including the Google-developed Pregel framework, is to pick up the performance pieces and provide a faster way to query large data volumes without the overhead of relational database queries.
Google designed Pregel as a graph processing engine that would be able to deliver programmability and efficiency for graph problems at scale and has been used in production at the search giant since at least 2009 to power services that require deep relationship discovery. As one might imagine, the PageRank algorithm is the best way to gauge performance, and as a host of new Pregel-like approaches, including GraphLab and Giraph (among others) have emerged, new waves of PageRank benchmarks keep coming. While Pregel and its spin-offs have found their sweet spots in a few select niches (Pregel use cases at scale in enterprise appear to be limited), there is still work to be done. According to a group of researchers from Argonne National Laboratory and Hortonworks, all of these have suffered from a few key limitations, including being limited to working only with data in memory and the inability to accept new data as soon as loading completes. Furthermore, “the master node coordinates both synchronization barriers and checkpointing for fault tolerance, which creates a significant bottleneck.”
To get around these issues, the Argonne and Hortonworks team developed another Pregel-like approach to graph processing called Graph/Z. At its core is ZHT, which is a zero-hop distributed key value store that emphasizes scalability and fault tolerance. For these purposes, ZHT is foundational in its support of random access, making changes at vertices and edges, and as the key value entry source to wrangle communications between nodes. It is also the key for fault tolerance since the entire list of messages, active vertex list and other data needed is stored in the ZHT server for both result storage and for checkpointing. “If one node fails, we can simply read these data from ZHT and restart this superstep without having to restart over—this is useful for big data sets” and is an important difference between GraphZ and how other approaches, in particular GraphLab and Giraph handle fault tolerance and checkpointing.
As the Argonne and Hortonworks researchers describe, “Graph/Z follows Pregel’s computing paradigm, to ‘think like a vertex’ wherein graph computation jobs are divided in terms of what each vertex needs to compute; edges are communication paths for transmitting results from one vertex to another and the computation is split into ‘supersteps’ where at each one, a vertex executes a computation task, sends or receives messages to its neighbors with super-steps ending with synchronization barriers.”
At the end of the day, however, performance is one of the main targets—something the Argonne and Hortonworks team was able to demonstrate in their first-run testing of Graph/Z. They found their framework ran best on eight nodes (based on AWS virtual machine runs—more details in the full paper), but that performance dropped off dramatically at 16 nodes. This is, as they explain, because the average workload on each node is too small and relatively more cross-node communication is involved at scale. To help put these performance numbers in some context, take a look at some benchmarks run from the PregelIX folks as the comparisons highlight similar benchmarks from other Pregel-like approaches that use the bulk synchronous parallel computing model, including their own, plus GraphLab, GraphX, Giraph, and Hama.
While the project is an ongoing research effort to improve the scalability, performance, and reliability of large-scale graphs, it can be added to the still-expanding list of companies and open source efforts that are seeking to break into the evolving graph processing market. Insurance companies and banks have been long-time adopters of graphs for fraud detection (not to mention government uses to determine complex, dispersed relationships) and of course, there is still a requirement for large web companies to provide the same mapping for a range of services, including, for example, matching one user with another based on a host of related possible likenesses.