Data security has always been a key concern as organizations look to leverage the operational and cost efficiencies that come with cloud computing. Huge volumes of critical and sensitive data often are in transit and distributed among multiple systems, and increasingly are being collected and analyzed in cloud-based big data platforms, putting them at higher risk of being hacked and compromised.
Even as encryption methods and security procedures have improved, the data is still at risk of attack through vulnerabilities such as access pattern leakage via memory or the network. It’s the threat of an attack via access pattern leakage that a group of researchers from UC Berkeley wanted to address when they developed an “oblivious” distributed data analytics platform that leverages the hardware enclave technology available in Intel’s Software Guard Extensions (SGX). The sensitive nature of much of the data that is being collected and analyzed in the cloud – from medical and financial data to user information like emails and shopping histories – makes it an attractive target for cyber-criminals, so the need to protect the data in such cloud environments is growing.
“Many systems run rich analytics on sensitive data in the cloud, but are prone to data breaches,” the researchers wrote in a recent paper introducing Opaque, their cloud-based oblivious analytics platform. “Hardware enclaves promise data confidentiality and secure execution of arbitrary computation, yet still suffer from access pattern leakage. … To truly secure the data, the computation should be oblivious: i.e., it should not leak any access patterns.”
The risk of access pattern leakage comes primarily from the memory or network level. At the memory level, such leakage happens when a “compromised OS is able to infer information about the encrypted data by monitoring an application’s page accesses. Previous work has shown that an attacker can extract hundreds of kilobytes of data from confidential documents in a spellcheck application, as well as discernible outlines of jpeg images from an image processing application running inside the enclave,” they explain.
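To make the memory-level problem concrete, here is a minimal Python sketch, not taken from the Opaque paper, that contrasts a direct lookup with a linear-scan lookup. The function names and the `trace` list (a stand-in for what an observer of page accesses would see) are assumptions made for illustration, and plain Python offers no real constant-time guarantees.

```python
def leaky_lookup(table, secret_index, trace):
    """Direct lookup: touches only the slot named by the secret index."""
    trace.append(secret_index)  # the observer learns the secret index
    return table[secret_index]


def oblivious_lookup(table, secret_index, trace):
    """Linear-scan lookup: touches every slot, in the same order, for any secret."""
    result = 0
    for i, value in enumerate(table):
        trace.append(i)  # the observer sees 0, 1, 2, ... every time
        # Arithmetic select instead of a branch: mask is 1 only at the secret slot.
        mask = int(i == secret_index)
        result = mask * value + (1 - mask) * result
    return result


table = [10, 20, 30, 40]
trace_a, trace_b = [], []
oblivious_lookup(table, 1, trace_a)
oblivious_lookup(table, 3, trace_b)
# trace_a and trace_b are identical, so the access pattern says nothing about the index.
```

The oblivious version pays for its uniform access pattern by reading the whole table on every lookup, a small-scale version of the overhead that oblivious systems have to manage.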
At the network level, actions like sorting or hash-partitioning create network traffic whose pattern depends on the data being processed. Hackers can glean information from that traffic even if the data and messages sent over the network are encrypted. The authors cited a study that showed that an attacker who could see the metadata of network messages – such as source and destination, rather than the content itself – in a MapReduce computation was able to identify the age group, marital status and birthplace in some rows from a census database. With Opaque, the researchers wanted to create a platform that would offer such security guarantees as computation integrity and obliviousness.
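A small Python sketch shows why encrypted traffic can still leak. It is an illustration under stated assumptions rather than anything from the paper or the cited study: the `shuffle_sizes` helper, the CRC32-based partitioner and the sample keys are all hypothetical. The function counts how many rows each worker would receive in a hash-partitioned shuffle, which is the kind of metadata a network observer can see without decrypting anything.

```python
import zlib
from collections import Counter


def shuffle_sizes(rows, num_workers):
    """Count how many rows each destination worker receives under hash partitioning."""
    sizes = Counter()
    for key, _payload in rows:
        # Deterministic hash of the key picks the destination worker.
        dest = zlib.crc32(key.encode("utf-8")) % num_workers
        sizes[dest] += 1
    return [sizes[w] for w in range(num_workers)]


# Payloads are opaque to the observer; only the per-destination counts are visible.
skewed = [("flu", b"ciphertext")] * 90 + [("rare-condition", b"ciphertext")] * 10
observed = shuffle_sizes(skewed, num_workers=4)
# All 90 rows with the common key land on one worker, so the counts expose the skew.
```

Because every row with the same key goes to the same worker, the per-worker message counts mirror the distribution of the underlying values even though no payload is ever read.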
A key to Opaque – or Oblivious Platform for Analytics QUEries – was deciding to implement the oblivious functionality at the query optimization layer, where big data workloads such as complex graph analytics and machine learning can be expressed. The group leveraged the Apache Spark big data processing engine, putting Opaque atop Catalyst, the Spark SQL query optimizer. No changes to Spark were needed, and only a few extensions to Catalyst.
The researchers also needed to figure out how to make Opaque more efficient in delivering access pattern protection, because other oblivious computation frameworks carry high overhead. ObliVM, for example, not only came with high overhead but also was not built for distributed workloads, while GraphSC, which is made for oblivious parallel graph computation, comes with a 105x slowdown. To improve on this, the researchers developed new distributed relational operators that simultaneously protect against memory and network access pattern leakage, along with new query planning techniques – rule- and cost-based – that improve the performance of oblivious computation.
The analytics platform was implemented using Intel SGX atop Spark SQL, and it can run in three modes: encryption mode, which provides data encryption and authentication along with guarantees that the computation executed correctly; oblivious mode, which adds oblivious execution that eliminates access pattern leakage; and oblivious pad mode, which extends oblivious mode by also preventing size leakage. Secure enclaves in processors – like Intel’s SGX and AMD’s Memory Encryption – are areas in the chips for protected execution designed to keep particular data and developer code from being disclosed or modified.
Opaque doesn’t modify Spark or Spark SQL, but it does move the query planner from the server to the client side “because a malicious cloud controlling the query planner can result in incorrect job execution,” the researchers note. “However, we keep the scheduler on the server side, where it runs in the untrusted domain. We augment Opaque with a computation verification mechanism to prevent an attacker from corrupting the computation results. The Catalyst planner resides in the job driver and is extended with Opaque optimization rules.”
“Opaque contributes a set of distributed oblivious relational operators as well as an oblivious query optimizer…we show that Opaque is three orders of magnitude faster than state-of-the-art specialized oblivious protocols.”
The job driver – which includes the Catalyst planner that is extended via the Opaque optimization rules – takes a particular task and creates an encrypted directed acyclic graph (DAG) and a unique job identifier. The input data is split into partitions, with each partition given its own identifier.
The security guarantees are key to Opaque. In encryption mode, the platform provides a self-verifying integrity protocol: if the client successfully verifies the results it receives from the computation, it knows the result was not affected by an attacker. In the two oblivious modes, Opaque guarantees oblivious execution around memory, disk and network access for each sensitive SQL operator.
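The article does not describe how Opaque's integrity protocol is built, so the following is only a generic Python sketch of the underlying idea: a result is accepted only if it carries a valid authentication tag bound to the job it belongs to. The shared key, the `job_id` string and both function names are assumptions for illustration, not Opaque's actual protocol.

```python
import hashlib
import hmac


def tag_result(key: bytes, job_id: str, result: bytes) -> bytes:
    """Produce a MAC over the job identifier and the result bytes (trusted side)."""
    message = job_id.encode("utf-8") + b"|" + result
    return hmac.new(key, message, hashlib.sha256).digest()


def verify_result(key: bytes, job_id: str, result: bytes, tag: bytes) -> bool:
    """Client-side check: accept the result only if the tag matches."""
    expected = tag_result(key, job_id, result)
    return hmac.compare_digest(expected, tag)  # constant-time comparison


key = b"hypothetical-shared-secret"
tag = tag_result(key, "job-42", b"aggregate=1234")
ok = verify_result(key, "job-42", b"aggregate=1234", tag)
tampered = verify_result(key, "job-42", b"aggregate=9999", tag)
```

Binding the tag to the job identifier means a valid result from one job cannot be replayed as the answer to another, which is one reason a per-job identifier is useful in this kind of scheme.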
“These are operators taking as input at least one sensitive table or intermediate results from a set of operators involving at least one sensitive table,” the authors wrote. “Opaque does not hide the computation/queries run at the server or data sizes, but it protects the data content. In oblivious mode, the attacker learns the size of each input and output to a SQL operator and the query plan chosen by Catalyst, which might leak some statistical information. The oblivious pad mode … hides even this information by pushing up all filters and padding the final output to a public upper bound, in exchange for more performance overhead.”
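The padding idea in oblivious pad mode can be sketched in a few lines of Python. This is a simplified illustration, not Opaque's operator: `padded_filter`, the `DUMMY` marker and the sample rows are hypothetical, and a real system would also encrypt the dummies so they are indistinguishable from real rows and would run the filter itself obliviously.

```python
DUMMY = None  # placeholder row; would be an encrypted dummy in a real system


def padded_filter(rows, predicate, public_upper_bound):
    """Filter rows, then pad the output to a fixed, publicly known length."""
    kept = [row for row in rows if predicate(row)]
    if len(kept) > public_upper_bound:
        raise ValueError("upper bound is too small for this result")
    return kept + [DUMMY] * (public_upper_bound - len(kept))


patients = [
    {"age": 34, "diagnosis": "flu"},
    {"age": 71, "diagnosis": "diabetes"},
    {"age": 52, "diagnosis": "flu"},
]
flu = padded_filter(patients, lambda r: r["diagnosis"] == "flu", 3)
diabetes = padded_filter(patients, lambda r: r["diagnosis"] == "diabetes", 3)
# Both outputs have length 3, so the output size no longer reveals how many rows matched.
```

The cost is visible even in this toy version: every result is as large as the bound, which is the extra performance overhead the authors mention.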
The researchers examined Opaque using SQL, machine learning and graph analytics workloads. Single-system experiments ran on SGX hardware powered by Intel’s quad-core Xeon E3-1280 v5 chips with 64GB of RAM, and distributed experiments ran on a five-node cluster of SGX machines powered by Intel’s quad-core Xeon E3-1230 v5 processors, also with 64GB of RAM. They used the Big Data Benchmark and PageRank.
In the cluster experiments, encryption mode’s performance ranged from 52% faster to 3.3 times slower. Oblivious mode came with 1.6 times to 46 times the overhead, due to SGX not being designed for big data analytics processing. The Enclave Page Cache (EPC), an encrypted cache of memory pages, is small relative to the main memory. If a page is removed from the EPC, it is decrypted, then re-encrypted under a different key and stored in main memory. In addition, an encrypted page in main memory is decrypted again when accessed. This movement in and out of the EPC creates large overhead, though that should ease in future generations of SGX, which promise a larger EPC and lower costs.
They also compared Opaque with GraphSC on PageRank, with Opaque showing significant performance gains (2,300 times) and general SQL functionality. “While obliviousness is fundamentally costly, we show that our new query optimization techniques achieve a performance gain of 2–5x,” they wrote.