Seismic processing and analysis at scale require not only scalable HPC resources but also an analytics backend that can keep up with massive datasets. And as if that's not enough, support for emerging AI/ML, including the libraries and data formats it depends on, is becoming increasingly important.
It is no surprise that Apache Spark has attracted interest in oil and gas, and even less surprising that companies, including ConocoPhillips, are building on the open source framework to take advantage of that scalability and of its libraries for analysis and machine learning.
ConocoPhillips software engineer Tim Osborne described the evolution of his team’s journey into Spark, beginning with Spark Sort, the first algorithm they implemented. Sorting is a data movement bottleneck in all seismic processing workflows, and data movement is where Spark shines, so it was a natural baseline. Their Spark efforts grew from this first experiment into an internal project called SparkSeis, a complete seismic processing, data analytics, and machine learning platform that lets Spark do the heavy lifting of interprocess communication and data movement.
While the ConocoPhillips team is still chewing on whether or not they’ll open source the effort, if they do, it is sure to light a fire under proprietary vendors. The Sort algorithm was an early success: Apache Spark’s shuffle operations made it easy to parameterize the sort and to determine the output framework size, and the result outperformed the Inline Merge Sort tool found in the proprietary SeisSpace tooling. It was 3X faster on a small dataset, 5X on a medium-size set, and 6X faster on a large one. “Definitely worth the effort,” Osborne added during his presentation of the work at this week’s Rice Oil and Gas event.
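To make the sort-by-shuffle idea concrete, here is a minimal sketch in plain Python of a range-partitioned sort, the general pattern a distributed sort follows: choose key boundaries, route each record to the partition that owns its key range, then sort each partition locally. This is not the SparkSeis code, which has not been published. The `TraceHeader` fields, the sort order, and every function name are assumptions made up for illustration.

```python
from bisect import bisect_right
from typing import List, NamedTuple, Tuple


class TraceHeader(NamedTuple):
    """Hypothetical trace header; real seismic headers carry many more fields."""
    inline: int
    crossline: int
    offset: int


def sort_key(header: TraceHeader) -> Tuple[int, int, int]:
    # Assumed target order: crossline, then inline, then offset.
    return (header.crossline, header.inline, header.offset)


def choose_boundaries(keys: List[Tuple[int, int, int]], num_partitions: int):
    """Pick num_partitions - 1 split points from the keys.

    A distributed engine would estimate these from a sample; this sketch
    sorts all keys to keep the logic short.
    """
    ordered = sorted(keys)
    if not ordered:
        return []
    step = len(ordered) / num_partitions
    return [ordered[int(i * step)] for i in range(1, num_partitions)]


def partitioned_sort(headers: List[TraceHeader], num_partitions: int):
    """Return a list of partitions that are sorted locally and ordered globally."""
    boundaries = choose_boundaries([sort_key(h) for h in headers], num_partitions)
    partitions: List[List[TraceHeader]] = [[] for _ in range(num_partitions)]
    # "Shuffle" step: send each record to the partition owning its key range.
    for header in headers:
        partitions[bisect_right(boundaries, sort_key(header))].append(header)
    # Local step: each partition sorts independently of the others.
    return [sorted(part, key=sort_key) for part in partitions]


# Small synthetic survey: 3 inlines x 4 crosslines x 2 offsets.
headers = [
    TraceHeader(inline=i, crossline=x, offset=o)
    for i in range(3)
    for x in range(4)
    for o in (100, 200)
]
parts = partitioned_sort(headers, num_partitions=4)
```

In PySpark the same pattern would typically be expressed with `RDD.sortBy` or `RDD.sortByKey`, where the `numPartitions` argument plays the role of `num_partitions` here. Reading the partitions in order yields the fully sorted dataset.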
Osborne says the ConocoPhillips engineering team knew that Spark and HPC could go hand-in-hand for the scalability and performance they needed, but they also wanted to be able to integrate new capabilities in machine learning. The key initially was that Spark provided “in-memory processing using resilient distributed datasets with workflow optimization at runtime and could reduce expensive shuffles.” Further, he adds, many of the machine learning efforts they want to pursue use Spark as well.
Interestingly, Osborne’s team began their journey by trying to evaluate cloud for seismic imaging and processing. “We were going to develop software designs using cloud APIs and initially worked with Databricks on Azure for Spark. In the end, we settled on using our on-prem HPC cluster due to performance and cost,” he explains.
“Spark provides access to a wide variety of data analytics and machine learning libraries (SparkML for example) and is cloud-capable, so we can run on Databricks,” Osborne says. “The goal was to have a lightweight framework for adding in tools—an ‘insert geophysics here’ interface with support for functional programming and for both Spark RDDs and Spark Dataframes.” Over time the team has added its own host of tools for seismic-specific I/O, sorting, fault prediction, 3D smoothing, and 3D FFT, among others.
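One way to picture an “insert geophysics here” interface with functional programming support is to treat each tool as a plain function from trace to trace, so that a processing flow is just a composition of functions. The sketch below is an assumption about what such a design could look like, not the SparkSeis API: the `Trace` and `Tool` types, the `scale` and `clip` tools, and `run_flow` are all hypothetical names.

```python
from functools import reduce
from typing import Callable, Iterable, List

Trace = List[float]              # one seismic trace as a list of samples
Tool = Callable[[Trace], Trace]  # hypothetical tool signature: trace in, trace out


def scale(factor: float) -> Tool:
    """Build a tool that multiplies every sample by a constant."""
    return lambda trace: [sample * factor for sample in trace]


def clip(limit: float) -> Tool:
    """Build a tool that clamps every sample to the range [-limit, limit]."""
    return lambda trace: [max(-limit, min(limit, sample)) for sample in trace]


def compose(*tools: Tool) -> Tool:
    """Chain tools left to right into a single trace-to-trace function."""
    return lambda trace: reduce(lambda current, tool: tool(current), tools, trace)


def run_flow(traces: Iterable[Trace], *tools: Tool) -> List[Trace]:
    """Apply the composed flow to every trace.

    On a cluster, this loop is the part that would be handed to Spark,
    for example as a map over an RDD of traces.
    """
    flow = compose(*tools)
    return [flow(trace) for trace in traces]


processed = run_flow([[1.0, -2.0, 3.0], [0.5, 0.5, 0.5]], scale(2.0), clip(4.0))
```

Because each tool is an ordinary function with no knowledge of the framework around it, a geophysicist could add a new one without touching the data movement code, which is the appeal of a lightweight design like the one Osborne describes.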
Another key to actually using the Spark-based tooling is making it multi-environment, something that goes beyond the Databricks cloud approach. The team has built a conventional flow-based system, uses Jupyter notebooks for flexibility, and can mix and match Scala and Python cells as they explore new ideas and prototype. Spark itself provides scalability, allowing ConocoPhillips to read and operate on large datasets. The team also built an interactive GUI app.
Look out, proprietary oil and gas software vendors. Companies want scalability, flexibility, and specificity out of their tooling, and while open source options are nothing new (Spark has always had some sway in oil and gas), the next big hurdle is integrating machine learning. Going forward, it will be more important to have a seamless platform that takes into account high performance hardware resources, scalable codes, and the ability to have data at the ready for AI/ML. It’s a race to see who can pull together a package that does all of this.
A demo of the Spark functionality and the tooling they’ve built around it is featured here. Registration is required, but it’s free. https://rice2021oghpc.rice.edu/technical-program/