Oracle Puts Together RDMA, Bare Metal for HPC

Oracle was famously behind the cloud computing curve, with co-founder and then-CEO Larry Ellison several years ago dismissing it as little more than an empty tag that was more on par with fashion trends than anything serious in the tech world. Since that time, the company has furiously been trying to make up lost ground with its Oracle Cloud services in a highly competitive and increasingly crowded market that is dominated by Amazon Web Services and includes other high-profile players like Microsoft Azure and Google Cloud, not to mention massive global companies like Baidu.

Still, Oracle hasn’t shied away from the challenge and in entering the market, the giant enterprise software company put a focus on infrastructure, particularly bare metal, as a way to differentiate from other public cloud providers.

“The journey for us started two to three years ago, when Larry announced bare metal in the cloud,” Karan Batta, senior principal product manager for Oracle Cloud Infrastructure, tells The Next Platform. “When we sat down to design what our vision was going to be, the thing that really matters today is that most people think that planning for infrastructure is done or that there’s no more money to be made in infrastructure, that there’s no more innovation there. If that was true, then I would say that 100 percent of everything would be running in the cloud today, but as you probably know, most people are still running on-premise datacenters.”

Oracle began its cloud efforts with bare metal as a fundamental concept, according to Batta.

“With bare metal, we truly meant from a sense of true multi-tenant bare metal, meaning that a customer could come in, sign up with their credit card and start a true bare-metal instance,” he says. “To them it just looks like another instance that could be a virtual machine, that could be bare metal, it just so happens that we have a bare-metal instance. Essentially it gives them on-premises levels of performance. If performance was the initial goal, it just so happens that, ‘Hey look, with bare metal you also get flexibility.’ By flexibility, I mean things like, if you’re a large enterprise and you’re running virtualized environments, then you can literally lift and shift your hypervisor or your virtualized environment directly onto our bare metal.”

The company’s push into the cloud has come with the usual jabs at competitors from Ellison, in particular AWS. Oracle last year announced it was changing its licensing policies in a way that drove up the cost of running the company’s widely used databases on AWS. In addition, Ellison has blasted AWS and others, saying their services are more expensive and less secure, and noting that they all use Oracle databases among their services. AWS recently hit back, with its consumer business turning off its Oracle data warehouse in early November and moving customers to AWS’ own Redshift data warehouse. The consumer business eventually will have 88 percent of its Oracle databases and 97 percent of critical system databases shifted to AWS’ Aurora relational database and DynamoDB NoSQL database. AWS CEO Andy Jassy in a tweet called the move the “latest episode of ‘uh huh, keep talkin’ Larry.’”

None of that has slowed Oracle’s efforts to build out its cloud infrastructure. At last month’s Oracle OpenWorld show, the company unveiled new security technologies in the Oracle Cloud Infrastructure, including a new web application firewall, protection against distributed denial-of-service (DDoS) attacks and an integrated cloud access security broker (CASB) to enforce secure configurations.

At this week’s SC18 supercomputing show in Dallas, the company is adding to its cloud offering with new instances aimed squarely at the HPC crowd. The new instances, which are part of Oracle’s new Clustered Network, comprise bare-metal servers running a low-latency, high-bandwidth RDMA network on top of the infrastructure. The advantage of RDMA is that it’s fast and secure and it doesn’t have to run through the operating system, instead forming a direct link between compute nodes. The result is that HPC organizations can run their big and complex mission-critical workloads in the cloud while getting the same performance they see in their on-premises datacenters, Batta says.

Leveraging RDMA will give organizations the performance levels needed for such use cases as car-crash simulations, DNA sequencing and reservoir simulation in oil exploration.

“You’ll essentially be able to glue together types of instances and use them for these types of mission-critical workloads,” he says. “It’s completely scalable, you can pay for it by the core by the hour, you can add nodes and delete nodes and essentially glue these things together and you have a 100 gig, single-digit microsecond latency between these nodes, and you’ll be able to run these very intensive workloads on our cloud.”

The initial instances will come with 36 Intel Xeon cores running at 3.7 GHz, 6.4 TB of NVM-Express drives and the ability to attach up to a petabyte of block storage, which Batta says will drive performance to almost 500,000 IOPS per instance. The network bandwidth will be 100 Gb/sec and latency from one system to another will be as low as 1.5 microseconds, with pricing starting at 7.5 cents per core per hour. The new instances will be available in the U.S. and European regions first, expanding to other regions in the future.
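To put the quoted figures in perspective, the arithmetic can be sketched in a few lines of Python. The core count, starting price and network bandwidth come from the numbers above; the cluster size and run time in the usage example are hypothetical inputs, and actual billing may differ from this simple per-core, per-hour model.

```python
from decimal import Decimal

# Figures quoted in the article.
CORES_PER_INSTANCE = 36
PRICE_PER_CORE_HOUR_USD = Decimal("0.075")  # "starting at" price, so a lower bound
NETWORK_GBITS_PER_SEC = 100


def instance_cost_per_hour(cores=CORES_PER_INSTANCE, price=PRICE_PER_CORE_HOUR_USD):
    """Hourly cost of one instance when every core is billed at the same rate."""
    return cores * price


def cluster_cost(nodes, hours):
    """Cost of running `nodes` identical instances for `hours` hours (hypothetical inputs)."""
    return nodes * hours * instance_cost_per_hour()


def gbits_to_gbytes_per_sec(gbits):
    """Convert a link speed in gigabits per second to gigabytes per second."""
    return gbits / 8


if __name__ == "__main__":
    print(f"One instance: ${instance_cost_per_hour():.2f} per hour")
    print(f"16 nodes for 24 hours: ${cluster_cost(16, 24):.2f}")
    print(f"Link speed: {gbits_to_gbytes_per_sec(NETWORK_GBITS_PER_SEC)} GB/sec")
```

At the starting price, a single 36-core instance works out to $2.70 per hour, and the 100 Gb/sec link corresponds to 12.5 GB/sec before any protocol overhead.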

The plan going forward will be to expand RDMA into the company’s GPU instances so that data can be shared between CPUs and GPUs in heterogeneous instances, he says. Eventually it will grow even more into the larger architectural environment, so “what you’ll be able to do is essentially spin up a GPU instance, spin up an Exadata instance, and then you’ll be able to share data between the two because you want to add data by being able to run some machine learning or AI-based workloads. That’s the true motivation for customers, to be able to run these in the cloud.”

Bringing RDMA into the Oracle Cloud Infrastructure hasn’t been an easy lift, according to Batta. Oracle has been working on it for the past couple of years to address demand from HPC companies.

“What makes it really hard is datacenter planning and placement,” he says. “Additionally, [there’s] all the software infrastructure stack that comes on it. Most cloud providers are virtualized, so that means you have to have interops, you have to have pass-through mechanisms to run some of these workloads. Because we have bare-metal RDMA, it means you can run any framework you like. We don’t differentiate. And it makes it super easy for customers to compare and contrast between their on-prem [and cloud environments]. The other reason why it’s very hard is because of the placement because latency is such a factor. You literally have one hop, but if your cable’s too long you can double your latency. It’s basically physical placement, so we truly at the datacenter level have designed it in such a way so that it’s flexible and it’s scalable. We consider RDMA as a first-class citizen in our datacenter all the way up to a customer experience in our web console. Other cloud providers haven’t thought of it in such a way. They’ve just thought about it as, ‘Hey, it’s just another instance and it’s just going to be another virtual machine and, cool, we can launch it.’”
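Batta’s point about cable length can be illustrated with a rough back-of-the-envelope calculation. The sketch below assumes signals travel through a cable at roughly two-thirds the speed of light, which is a common rule of thumb and not a figure from Oracle, and it treats the 1.5 microsecond latency quoted above as a one-way number. Switch and adapter delays are ignored.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458
VELOCITY_FACTOR = 0.67  # assumption: typical for optical fiber, not an Oracle figure


def propagation_delay_us(length_m, velocity_factor=VELOCITY_FACTOR):
    """One-way propagation delay, in microseconds, over a cable of `length_m` meters."""
    seconds = length_m / (velocity_factor * SPEED_OF_LIGHT_M_PER_S)
    return seconds * 1e6


def cable_length_for_delay_m(delay_us, velocity_factor=VELOCITY_FACTOR):
    """Cable length, in meters, whose propagation delay alone equals `delay_us`."""
    return delay_us * 1e-6 * velocity_factor * SPEED_OF_LIGHT_M_PER_S


if __name__ == "__main__":
    quoted_latency_us = 1.5  # node-to-node latency quoted in the article
    for length_m in (10, 100, 300):
        delay = propagation_delay_us(length_m)
        print(f"{length_m:>4} m of cable adds about {delay:.2f} microseconds")
    doubling_length = cable_length_for_delay_m(quoted_latency_us)
    print(f"About {doubling_length:.0f} m of cable would add another {quoted_latency_us} microseconds")
```

Under these assumptions, each meter of cable costs about five nanoseconds, so a run of a few hundred meters is enough to add a delay comparable to the quoted latency itself. That is consistent with physical placement mattering as much as Batta says.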

The work of getting the new instances up and running will be worth the effort, according to Batta. The opportunity in the cloud is significant, with every customer that runs Oracle also running HPC workloads. He sees the competition coming less from other cloud providers and more from on-premises clusters and datacenters.

“These industries from an HPC standpoint are going to be in the range of billions of dollars by 2020, just on-prem,” Batta says. “From a cloud standpoint, the opportunity is in the multi-of-billions. The real value these customers want out of the cloud is the hardware innovation. Most of these on-prem clusters are five years old or three years old, and unlike a lot of the cloud providers, these enterprise customers can’t afford to refresh their hardware every six to 12 months. We’re trying to provide that value to these customers because they are really going to benefit from better hardware, better networking, better storage.”

1 Comment

  1. Lets face it!! Oracle support truly sucks!!!!! If Larry doesnt give priority to fixing oracle support, whatever products he creates will not have an edge. No wonder players like Amazon are taking over Oracle. SUPPORT is the LIFELINE!! But they are very slow to respond even for a SEVERITY 1 ticket. And forget about speaking with a manager. And their shift system, oh boy!!, Every time i worked with a support person (not one exception so far), they spend more time in handing off/reviewing the tickets than actually supporting the customer. I am ranting because of my very poor experience with oracle support personnel. Larry, please fix the support first!!!!
