While graph analytics is not a likely replacement for the standard relational databases that many companies will stick with for years to come, the value of graphs for a particular set of knowledge discovery applications has become clearer with a widening set of use cases in security, fraud detection, medical research, financial services, and a number of other segments.
Other approaches, including Hadoop and its related cadre of analytics add-ons, promise the same “needle in a haystack” problem-solving prowess. But for extremely large datasets where hidden connections and associations are the problem, graph discovery and analytics are the mode of choice. It is quite possible to stitch together platforms based on standard iron with a range of open source tools, but some, including Oak Ridge National Laboratory, have taken the appliance route, pursuing new work to make graph analytics and discovery more robust (and integrated) on hardware tuned for such workloads.
Oak Ridge is familiar territory for supercomputer maker Cray, which built the lab’s superstar Titan machine, a system that still sits near the top of the twice-yearly Top 500 list of the most powerful supercomputers. However, in an effort to move farther into the enterprise and beyond government supercomputing, Cray spun out a data-focused division, which, among other things, built SPARQL-based graph discovery appliances with the deep memory and other elements tuned for such workloads. One of the first products Cray rolled out to support its data-intensive computing focus was the Urika Graph-GD appliance, which was aimed at graph discovery workloads or, in simpler terms, applications that seek to discover connections, patterns, and associations in large volumes of data. Among the more noteworthy uses of the machine has been one (still unnamed) Major League Baseball team’s adoption of the Urika Graph-GD to mine player, game, and other statistics.
Oak Ridge bought a Urika Graph-GD machine back in 2011 specifically to tackle healthcare fraud using HPC techniques. The appliance was put to the task of discovering connections between potential fraud perpetrators by looking at vast swaths of data that otherwise would not have been correlated or, even if someone had tried, would have been too data-intensive to comb through and connect with standard methods. Following the end of that program, the machine sat mostly unused until Oak Ridge computational division leads sought to put it to the test on other problems in science.
One might expect that across that broad range of scientific applications there would be many domains where the “hidden connections” problem is pressing. While this is true, lining up that data and, more importantly, using an appliance like the Urika machine for analysis as well as discovery was no simple task. The system uses the relatively easy but more obscure SPARQL query approach, which many scientific problems aren’t a good fit for, at least at the outset. Through the efforts of an Oak Ridge team that includes Sreenivas Sukumar, a new software stack for graph analytics on the Urika GD platform lets the machine address both the discovery and analytics sides of graph problems, opening the door to new use cases.
“The Urika machine is great for graph pattern matching, so it’s useful if the structure is known and it’s known what you want to find for the graph. However, in these domains, researchers need to be able to do some automatic analysis, which was the missing piece we targeted,” Sukumar tells The Next Platform.
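The distinction Sukumar draws can be illustrated with a small sketch. This is not the Oak Ridge software stack or anything that runs on the Urika; it is a plain Python toy under stated assumptions, with hypothetical triples and hypothetical function names (`match_shared_object`, `connected_components`). The first function is pattern matching, where the shape being searched for is known in advance. The second is a simple form of automatic analysis, where the structure is computed from the graph without telling it what to look for.

```python
from collections import defaultdict

# Hypothetical toy triples (subject, predicate, object); not data from the article.
TRIPLES = [
    ("provider_a", "bills_from", "address_1"),
    ("provider_b", "bills_from", "address_1"),
    ("provider_c", "bills_from", "address_2"),
    ("provider_a", "refers_to", "provider_c"),
    ("provider_d", "bills_from", "address_3"),
]


def match_shared_object(triples, predicate):
    """Pattern matching: find pairs of subjects that share an object via `predicate`."""
    by_object = defaultdict(list)
    for s, p, o in triples:
        if p == predicate:
            by_object[o].append(s)
    pairs = []
    for o, subjects in by_object.items():
        subjects = sorted(subjects)
        for i in range(len(subjects)):
            for j in range(i + 1, len(subjects)):
                pairs.append((subjects[i], subjects[j], o))
    return sorted(pairs)


def connected_components(triples):
    """Automatic analysis: group nodes into components, ignoring edge direction and labels."""
    adjacency = defaultdict(set)
    for s, _, o in triples:
        adjacency[s].add(o)
        adjacency[o].add(s)
    seen = set()
    components = []
    for start in sorted(adjacency):
        if start in seen:
            continue
        stack = [start]
        seen.add(start)
        component = set()
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(sorted(component))
    return components


if __name__ == "__main__":
    print(match_shared_object(TRIPLES, "bills_from"))
    print(connected_components(TRIPLES))
```

In the first call the question is fixed ahead of time (which providers bill from the same address), while the second surfaces groupings nobody asked about explicitly, which is roughly the gap the quote describes.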
Since Oak Ridge bought the early machine, Cray has rolled out another graph platform, called the Urika-XA, which focuses more on the analytics side versus the discovery and associations-oriented Graph-GD system. Still, for researchers at Oak Ridge, being able to do both on the same machine from a single project-focused investment has proven useful. In the last six months, applications ranging from network simulations to genome-wide association studies have been carried out using the software stack the Oak Ridge developers put together, which they call “computer assisted serendipity.” Other uses of the machine include a partnership with healthcare group Humana to determine why particular doctors are better than others based on a large volume of disparate data sources. The goal, according to Sukumar, is to treat this new capability as a potential simulation engine for graph problems rather than simply a discovery or analytics platform.
Although it’s difficult to get a sense of how widely adopted Cray’s graph appliances are, the company’s Barry Bolding told us at the most recent Supercomputing Conference that the data division is still working hard to build more robustness into its Urika systems, platforms it expects to continue with in coming years. With the additional compute and memory in the latest generation of processors hitting the market and extending across Cray’s existing HPC systems line, it is not unreasonable to expect new Urika machines to hit datacenter floors in the future, possibly combining the discovery and analytics capabilities. For the Oak Ridge developers, those two capabilities were too far removed from one another until they created the software glue to do both seamlessly, including one of the weightiest parts, the movement of data between the two functions.
Sukumar and his team have been looking extensively at how graph applications are evolving in both research and industry. He says graph approaches aren’t going away anytime soon. They will never replace traditional relational databases, and boxes like those from Cray will remain a niche, but it is an important niche: for some truly intractable analytics and discovery problems, they are the most efficient way to arrive at associations and ultimate findings.
“One reason why graph is emerging yet again is because companies want flexibility with their data across many sources, they want to integrate it all, and they want to see the ROI quickly. All of these new problems are also coming into the picture, which means that for some companies, including one [unnamed] working with us at Oak Ridge, there needs to be a way to make big associations, even in the face of missing exact data points.” In this case, he says, the company that came to Oak Ridge for help has a problem in which triangulations and connections need to be made between geographical regions. Even though it’s the same company, the full data from one country can’t be shared with another. The requirement is to use a large graph to at least see whether a connection is possible, even if the graph cannot supply an exact answer about who, what, or when.
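A minimal sketch of what “seeing whether a connection is possible” can mean in graph terms: a reachability check that answers yes or no without returning the path or the records along it. This is an illustration only, not the method used at Oak Ridge, and it does not model the cross-border data-sharing restriction itself. The function name `connection_possible` and the node labels are hypothetical.

```python
from collections import deque


def connection_possible(edges, source, target):
    """Return True if any undirected path links source and target.

    Only a yes/no answer is produced; the path itself is never returned.
    """
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    if source not in adjacency or target not in adjacency:
        return False
    seen = {source}
    queue = deque([source])
    while queue:  # breadth-first search outward from the source
        node = queue.popleft()
        if node == target:
            return True
        for neighbor in adjacency[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return False


# Hypothetical edges: two regional entities linked through a shared key.
EDGES = [
    ("region_x_entity", "shared_key"),
    ("shared_key", "region_y_entity"),
    ("region_z_entity", "unrelated_key"),
]

if __name__ == "__main__":
    print(connection_possible(EDGES, "region_x_entity", "region_y_entity"))
    print(connection_possible(EDGES, "region_x_entity", "region_z_entity"))
```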
For graph appliances like the Urika to push into new areas, Sukumar says that most discovery-oriented business models will expect to have, in the same platform, discovery through interrogation, discovery through association, and the ability to build a model, recommender system, or set of workable suggestions. The Graph-GD appliance has been a successful platform for this, particularly with the additional software layer the team built to capture more scientific problems (versus the enterprise domains Cray was targeting with this appliance at the outset). As Sukumar notes, more performance in future systems will provide a linear path to systems capable of even more demanding problems.
The GD series uses a descendant of the Threadstorm (née XMT / MTA) processor, spun as a ‘graph accelerator’. Is Cray still investing in improving this processor? As far as I know, it would be the only custom processor they have left, since the demise of the homespun vector processor.
It remains to be seen how the XA does on graph processing where the lack of locality starts to bite. No doubt significantly newer and more aggressive Intel processors can brute force a higher benchmark score than the very aging Threadstorm, but that’s not to say that the efficiency will be there.
Funny you should ask. In a separate story, Cray CTO Steve Scott said that the future graph analytics machines will be based on Xeon, not ThreadStorm.
Great interview with Steve, thanks!
“But it does take some software technology that comes out of the ThreadStorm legacy.” is a bit vague: it could be anything from porting the highly optimized graph code to (more likely) doing something like SoftXMT (http://cass-mt.pnnl.gov/docs/SoftXMT.pdf).