The oil and gas industry has been on the cutting edge of many waves of computing. For several decades, supercomputers have been used to model oil reservoirs, both in planning the development of an oil field and in quantifying the stored reserves of a field, and therefore the future possible revenue stream of the company.
Oil companies can’t see through the earth’s crust to the domes where oil has been trapped, and it is the job of reservoir engineers to eliminate as much risk as possible from the field so the oil company can be prosperous and profitable. This is particularly important when oil prices fall dramatically, as they have in recent years, because there is less margin for error, and while the majors and their partners up and down the supply chain have cut back on their spending to find and develop oil fields, they invest in the downturn and wait for the upturn.
To get a sense of the state of the art of reservoir modeling, The Next Platform had an extended conversation with Vincent Natoli, founder and CEO of Stone Ridge Technology, which has developed a new generation of modeling software called ECHELON that is on the cutting edge of GPU acceleration and is showing excellent performance on hybrid CPU-GPU iron. Natoli and Sumit Gupta, vice president of high performance computing and analytics for IBM Systems group, gave us a deep dive on the issues that reservoir engineers are wrestling with, and how various technologies can be employed to meet these challenges.
TPM: This is the kind of situation that always warrants doing a before and after comparison. I want to understand what the workload is and the results you have had on the Power8 “Minsky” server with direct NVLink to the GPU. I assume the performance is good.
Vincent Natoli: It is good. We don’t have an apples-to-apples comparison just like that because we don’t have a big cluster of X86 nodes to run on because we’re a small company. IBM was very generous to allow us to run on its cluster, so we were able to do that there and make use of Minsky machines, but what we can compare to is what other people have published with CPU code. The big difference here with our code is that it’s developed from scratch to run completely on GPUs.
Let me start from the beginning.
We have set a performance milestone in this area of reservoir simulation, which is one of the two really important HPC applications for the oil and gas industry. The first one is seismic and the other one is reservoir simulation. In the first one, you are kind of finding where the oil is, in the second one you are figuring out how to get it out of the ground. So our code is called ECHELON. It is a high performance petroleum reservoir simulator, and it distinguishes itself because it is written from the very beginning to run on GPUs. We did do this billion cell calculation, and, I have to say, we were prompted a little bit by an article that you guys did a little while ago on Exxon’s calculation. Exxon did a billion cell calculation using essentially the entire Blue Waters facility at the National Center for Supercomputing Applications.
They used 717,000 processors and 22,000 computers, and we sat around in our little relaxation area and were thinking, can we do this? And what would it look like? And we did a back-of-the-envelope calculation and figured we can do this. We thought it would take about six weeks of work to get the code to be able to handle that large a system. But, in the end we did it on 30 Minsky nodes, with a total of 120 Nvidia P100 GPU accelerators, and we did the whole run in about 90 minutes. This is against 45 years of production of a gigantic billion cell reservoir, that was kind of created to look like one of the big Middle East reservoirs. The point was, though, that Exxon, with all due deference to their group, which is a fantastic group, and they have their own priorities and motivations as a large company, but we just didn’t see doing reservoir simulation on almost essentially 1 million processors as something most companies can relate to. It’s not accessible, and it’s certainly, in my opinion, not an efficient way to do it.
If you work it out, 1 billion cells on 700,000 processors, that’s 1,300 cells per processor. So it turns out they’re only really running in L3 cache, they’re not even using main memory. They’re using an entire cluster to basically use the L3 cache of the AMD “Bulldozer” Opteron chip, which is what the Cray nodes use. And also, since reservoir simulation is memory bound, after two or three, maybe four, cores on a single CPU chip, you don’t get much more out of it. You can add more cores, but you’re not gonna get more performance.
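As a rough check on the division described here, the short Python sketch below uses only the round figures quoted in the interview (1 billion cells, and either 700,000 or 717,000 processors). The function name is made up for illustration. The result comes out near 1,400 cells per processor, the same ballpark as the 1,300 quoted in conversation.

```python
# Back-of-the-envelope check of cells per processor for a billion-cell model.
# All inputs are the round figures quoted in the interview.
TOTAL_CELLS = 1_000_000_000


def cells_per_processor(total_cells, processors):
    """Average number of grid cells owned by each processor (hypothetical helper)."""
    return total_cells / processors


for processors in (700_000, 717_000):
    share = cells_per_processor(TOTAL_CELLS, processors)
    print(f"{processors:>7,} processors -> {share:,.0f} cells each")
```

Either way, each processor owns on the order of a thousand cells, which is the point being made about the working set fitting in cache.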
TPM: So it was kind of like a stunt then, it’s exactly the wrong machine to do this workload.
Vincent Natoli: Well it’s part of a way of thinking.
And there’s this contention in HPC now between scaling up and scaling out. So, scaling out is this idea popularized by Google. The idea that cores are cheap, you just write your code so it can be massively parallel, and you just scale out with gigantic clusters, which is what Google does with MapReduce. And that was the thinking in the early 2000s and the mid 2000s because that’s what people saw up the road, lots of cheap cores.
One thing they didn’t see, however, was the emergence of the GPU in 2007, when Nvidia enabled people to compute on GPUs using CUDA, and opened up an entirely new platform. And so I think the real story here – it is important to energy companies and to the larger HPC community – is that there has been this divergence, very significant divergence in performance between your standard Xeon chip and the Nvidia GPU. So at this point in time, you’re comparing state of the art Nvidia, which is the P100, against a Skylake chip, and it’s about a 10X on two things we care about in HPC: memory bandwidth and flops. That makes a huge difference in performance.
Not only that, but with Minsky you can stuff four Nvidia cards in there, so you get a total of 2.88 terabytes a second. I mean, that’s an enormous amount of memory bandwidth. Just to match that, you’re going to need 18 standard nodes. Reservoir simulation is pretty typical of scientific applications – most scientific applications are going to be memory bound – what you really want to buy is memory bandwidth, and you want to get it in the densest configuration possible. Minsky delivers that for us, together with the Nvidia GPUs. So, there is a significant latent performance gap. It’s latent because you need software to enable it.
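The bandwidth figures above can be unpacked with a few lines of Python. This is only arithmetic on the numbers quoted (four GPUs, 2.88 TB/sec per node, 18 standard nodes to match). The per-GPU and per-CPU-node values are derived from those quotes, not measured, and the variable names are illustrative.

```python
# Aggregate GPU memory bandwidth of one four-GPU node, and the per-node
# bandwidth implied by the "18 standard nodes" comparison in the interview.
GPUS_PER_NODE = 4
NODE_BANDWIDTH_GBS = 2880      # 2.88 TB/sec quoted for one Minsky node, in GB/sec
EQUIVALENT_CPU_NODES = 18      # quoted number of standard nodes needed to match it

# Derived values (assumptions that follow from the quoted figures).
per_gpu_gbs = NODE_BANDWIDTH_GBS / GPUS_PER_NODE
implied_cpu_node_gbs = NODE_BANDWIDTH_GBS / EQUIVALENT_CPU_NODES

print(f"Per-GPU bandwidth:          {per_gpu_gbs:.0f} GB/sec")
print(f"Implied per-CPU-node value: {implied_cpu_node_gbs:.0f} GB/sec")
```

In other words, the comparison assumes a standard two-socket node delivers on the order of 160 GB/sec of memory bandwidth.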
The only thing that we really did was we were clever enough early on to realize that this was going to be a big deal in HPC. We didn’t start with the CPU code. We wrote, from scratch, a code to run on GPU, and to be the fastest code in the world, making use of every capability that the GPU has. Because we did that, we can demonstrate this very graphically, or very visibly, both in terms of performance and in the space that we require for our calculations.
There are a couple of calculations we can compare to. The published ones are all from very large companies like Saudi Aramco and Exxon. Aramco has done the most work in this, and in their last published result they’re using about 500 nodes. Everybody’s using about 500 nodes except for Exxon. Exxon could use 500 nodes, of course, but it would presumably run slower. The other thing about the Exxon calculation is they didn’t release any information. There was no timing, they didn’t talk about how it scaled, which would have been, actually, very interesting. But with the Aramco calculation they did release information. It’s about 500 nodes, we use 30, so we’re using an order of magnitude fewer compute servers, or I like to think of it as instances of the operating system, because that’s kind of a level of complication that you’re going to deal with. And they took about 20 hours, and we’re about 90 minutes. We’re using an order of magnitude fewer resources and getting an order of magnitude faster time. I think that’s very significant. I think that for most energy companies that aren’t supermajors – if you just happen to not have a national supercomputing center lying around to run your calculations on – this is a better solution because 30 Minsky nodes is accessible. You can get to that in the IBM Cloud and other cloud providers.
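To put rough numbers on the two "order of magnitude" claims, here is a small Python sketch using only the approximate figures quoted above (about 500 nodes and 20 hours for the Aramco run, 30 nodes and 90 minutes for the ECHELON run). The node-hours comparison is an extra derived quantity, and it ignores differences in model, hardware generation, and cost per node.

```python
# Ratios between the two billion-cell runs, from the approximate quoted figures.
ARAMCO_NODES = 500
ARAMCO_HOURS = 20.0
ECHELON_NODES = 30
ECHELON_HOURS = 1.5

node_ratio = ARAMCO_NODES / ECHELON_NODES
time_ratio = ARAMCO_HOURS / ECHELON_HOURS
# Node-hours combine both effects into one resource-usage figure.
node_hours_ratio = (ARAMCO_NODES * ARAMCO_HOURS) / (ECHELON_NODES * ECHELON_HOURS)

print(f"Fewer nodes by a factor of      {node_ratio:.1f}")
print(f"Shorter runtime by a factor of  {time_ratio:.1f}")
print(f"Fewer node-hours by a factor of {node_hours_ratio:.0f}")
```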
The other thing is that most people aren’t doing billion cell calculations. Typically it’s in the tens of millions, and even one to ten million cells is probably typical for most energy companies these days, and you can do that on one Minsky server.
TPM: When you’re doing these simulations in the range of 1 million to 10 million cells, do you do an ensemble type of calculation where you change some of the parameters? Weather simulation does that. When they run weather codes it’s not just one giant piece of code running, but dozens or hundreds of copies of code running across a system, and then they kind of average the answer to get what they think the best model might be for this particular weather pattern. Do you do the same thing? And I would guess, if you do, you have a modest cluster, you do them in series – run it this way, now run it that way, now run it this way – and you figure out what you think the right answer is, as opposed to doing them all in parallel across one giant machine. Is that an accurate description of what you do?
Vincent Natoli: Yes, the oil companies do that. It’s called uncertainty quantification.
There is no single model for the subsurface of the earth, because there’s a lot of variability in what we know. The best thing you can say is there’s a probability distribution, for example, of permeability. There’s a probability distribution of the geology. What makes more sense is to generate, like you said, an ensemble of models, and then to take statistical averages over them, and that is what people have always liked to do. Although for the longest time, people didn’t really do that, because the computational burden was so high. I mean, up until recently it could easily take a day or two to run a single model. So no one was really thinking about how we were going to run a thousand models. That wasn’t really a possibility. There’s a new generation of codes now, and Exxon’s code is one of them, that take advantage of massive parallelism. Ours is part of a new generation of codes, but we’ve gone a different route, working on the GPU instead.
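The ensemble idea can be sketched in a few lines of Python. Everything specific here is an assumption for illustration: the log-normal permeability distribution, its parameters, and the `toy_recovery` function, which is a made-up stand-in for a full reservoir simulation and has no physical meaning. The point is only the workflow: sample the uncertain inputs, run one model per sample, and summarize the spread of outcomes.

```python
import random
import statistics


def toy_recovery(permeability_md):
    """Hypothetical stand-in for one full simulation run (not real physics)."""
    return permeability_md / (permeability_md + 100.0)


def run_ensemble(n_models, seed=0):
    """Sample an uncertain input n_models times and run one toy model per sample."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_models):
        # Assumed log-normal permeability with a median of roughly 100 millidarcies.
        permeability = rng.lognormvariate(4.6, 0.5)
        results.append(toy_recovery(permeability))
    return results


results = run_ensemble(1000)
mean = statistics.mean(results)
stdev = statistics.stdev(results)
deciles = statistics.quantiles(results, n=10)   # nine cut points: 10%, 20%, ..., 90%
p10, p50, p90 = deciles[0], deciles[4], deciles[8]

print(f"mean={mean:.3f} stdev={stdev:.3f} P10={p10:.3f} P50={p50:.3f} P90={p90:.3f}")
```

Because every ensemble member is independent, the thousand runs can go in series on a small cluster or side by side on a large one, which is why the runtime of a single model matters so much.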
TPM: What is ECHELON written in? Is it a Fortran or C++ code?
Vincent Natoli: It’s C++ and Nvidia CUDA.
TPM: And are you making use of the NVLink ports on the Power systems machine? I assume that is part of how you’re getting the bandwidth in and out?
Vincent Natoli: Yes.
TPM: Have you been able to turn NVLink on and off and see what effect that has on performance? Because that’s an interesting thing.
Vincent Natoli: I agree that that is an interesting thing, and makes even more of a difference when you’re doing a big model like this, because now we have 30 of these Minsky servers that are connected together. Internally they’re using NVLinks. Minsky is also unique in that it uses the NVLink between the CPUs and the GPUs, not just between the GPUs. That is something that we’re going to investigate, but I don’t have hard numbers for you right now, because I don’t think we can just turn that off. You can’t just not use the NVLink and use PCI because the GPU card is on NVLink.
TPM: That’s fair. And how did you network the nodes together, were they 100 Gb/sec InfiniBand or Ethernet?
Sumit Gupta: This is all InfiniBand. I think you kind of touched upon this, and I was trying to insert myself in that discussion: there was quite a lot of value that you got out of NVLink, but it was after the optimizations that Stone Ridge did.
Vincent Natoli: There were six weeks of work, and part of that was doing a good job of overlapping communication and computation, so that you can send messages but you’re also computing at the same time. And also, kind of working with the NVLink transport layer and getting to understand that and what it can do for us. But we didn’t see any problems with that. And actually it’s pretty amazing that we can do 1 billion cells on just 30 nodes, with that fabric. I think things worked out really well with that.
TPM: What was the memory configuration, and did you use flash storage on this thing?
Sumit Gupta: There is no flash, and the memory is probably 256 GB per node, Vinny?
Vincent Natoli: Yeah. I think that’s right. The memory wasn’t as important to us because we were running everything on the GPU. But it is when we’re doing initialization, obviously. And the Power8 chips are actually really cool because they have higher bandwidth than the Intel Xeon line. So they’re very fast. And initialization when you’re doing a model that big can actually become a bottleneck, you know, reading the files and setting up the data and then sending it over to the GPU.
I wanted to expand a little bit on this scale up versus scale out, because one of the big differences shows up when you’re scaling out on thousands of cores. In Aramco’s case they scaled out on 500 nodes, and each one probably had about 20 cores, so that’s like 10,000 cores, and each one of those has an MPI domain, so they ended up chopping up the model into 10,000 tiny pieces, and all of those pieces have to do some communication. Not with everybody but with neighbors.
But when I compare that with what we did on Minsky, we basically had 120 domains, because each GPU is a domain. So our domains are much bigger, which is good, because we’re doing a lot of calculation on those domains, and every once in a while we’re communicating. The communication that we do, by the way, is really, really fast over the NVLink. From an efficiency point of view, I think there’s a big advantage to using these kinds of fast nodes. And then the extreme case with the Blue Waters facility is you have 717,000 MPI processes going on, MPI ranks, all communicating with their neighbors. It’s an enormous amount of calculation. I mean, it’s an enormous amount of communication between them, and not a lot of calculation. As I said, it’s only 1,300 cells per processor.
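One way to see why fewer, larger domains mean relatively less communication is a surface-to-volume estimate. The Python sketch below is a simplification under stated assumptions: the billion cells are split into equal cubic subdomains, and only the outermost one-cell-thick layer of each cube takes part in neighbor exchange. Real reservoir grids and real decompositions are irregular, so treat the percentages as illustrative only. The domain counts (120, 10,000, 717,000) are the ones quoted in the interview.

```python
# Surface-to-volume estimate for splitting a billion-cell grid into equal cubes.
TOTAL_CELLS = 1_000_000_000


def boundary_fraction(total_cells, domains):
    """Return (cells per domain, fraction of those cells on the cube's surface).

    Assumes cubic subdomains and a one-cell-thick boundary layer (illustrative).
    """
    cells = total_cells / domains
    edge = cells ** (1.0 / 3.0)              # cube edge length, in cells
    interior = max(edge - 2.0, 0.0) ** 3     # cells not touching any face
    return cells, (cells - interior) / cells


for domains in (120, 10_000, 717_000):
    cells, fraction = boundary_fraction(TOTAL_CELLS, domains)
    print(f"{domains:>7,} domains: {cells:>12,.0f} cells each, "
          f"{fraction:.1%} on the boundary")
```

Under these assumptions, a few percent of the cells sit on a boundary with 120 domains, while with hundreds of thousands of domains nearly half of every domain is boundary.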
TPM: There’s a lot of thought and no action.
Vincent Natoli: Yes, and there’s one more nuance to that, which is important from an algorithmic point of view. If you want to be able to scale out – and by the way, it is an achievement to even get your code to run on 700,000 processors, so I do hand that to the Exxon group certainly – it forces you to choose algorithms that are easily parallelized over 700,000 processors. In this case, it leads you to choose weaker solver algorithms, because the complex solver algorithms, the algorithms that we’re using, are not as straightforward to parallelize over that massive number of cores, hundreds of thousands of cores. And when I talk about weak and strong algorithms, it’s a question of how fast it can converge to a solution. So with a weaker algorithm they are doing a tradeoff, saying “ok this parallelizes really easily, there’s a lot of work we can do independently here, but we’re going to have to do more work,” because you have a weaker solver algorithm. That is a tradeoff that people make. But, it does lead you to, in my opinion, this kind of extreme case, where you’re taking up an entire machine to solve a problem that we could do on 30 nodes.
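The tradeoff between easy parallelism and convergence speed can be illustrated with two textbook iterations. The interview does not name the solvers used by either code, so Jacobi and Gauss-Seidel are stand-ins chosen here, not a description of ECHELON or of Exxon's simulator. The Python sketch solves a small 1D Poisson-style system (an assumed toy problem). In Jacobi every unknown is updated from old values only, so all updates are independent. In Gauss-Seidel each update uses the freshest neighbor value, which creates a sequential dependency but converges in fewer sweeps.

```python
def jacobi_sweep(x, b):
    """Every update reads only the previous iterate: trivially parallel."""
    n = len(x)
    new = x[:]
    for i in range(n):
        left = x[i - 1] if i > 0 else 0.0
        right = x[i + 1] if i < n - 1 else 0.0
        new[i] = (b[i] + left + right) / 2.0
    return new


def gauss_seidel_sweep(x, b):
    """Each update reuses the value just computed for its left neighbor."""
    n = len(x)
    new = x[:]
    for i in range(n):
        left = new[i - 1] if i > 0 else 0.0       # already updated this sweep
        right = new[i + 1] if i < n - 1 else 0.0  # still the old value
        new[i] = (b[i] + left + right) / 2.0
    return new


def sweeps_to_converge(sweep, n=20, tol=1e-8, max_sweeps=100_000):
    """Count sweeps until the largest change between iterates drops below tol."""
    b = [1.0] * n
    x = [0.0] * n
    for k in range(1, max_sweeps + 1):
        new = sweep(x, b)
        change = max(abs(p - q) for p, q in zip(new, x))
        x = new
        if change < tol:
            return k
    return max_sweeps


jacobi_sweeps = sweeps_to_converge(jacobi_sweep)
gauss_seidel_sweeps = sweeps_to_converge(gauss_seidel_sweep)
print(f"Jacobi: {jacobi_sweeps} sweeps, Gauss-Seidel: {gauss_seidel_sweeps} sweeps")
```

The more parallel method needs noticeably more sweeps to reach the same tolerance, which is the "more work" side of the tradeoff described above.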
TPM: I agree. This is the battle, the battle we’re going to see play out this year.
Vincent Natoli: And also just from Exxon’s point of view, I’m sure they would say that they have to weigh the cost of recoding and porting to a new hardware platform, and that’s true. But, I think it’s also the industry’s job to point out how much performance you’re leaving on the table. There’s a cost to doing it, but there’s also a cost to not doing it.
Sumit Gupta: I think the perspective is, for most companies this translates directly to more revenue. The more you model here, the more oil you’re going to extract, and therefore the more efficient you are. Especially in these days of low oil prices, getting more efficiency out of these oil wells is a big benefit. And if you can use technology to do that, then it’s a big value.
What Stone Ridge ran was a very large model, a billion cells. Most real customers, or real oil fields out there, are going to be in the 10 million and 100 million cell range, and it’s not even going to be 30 servers, it’s going to be maybe ten servers that you need for that. And that really opens up the aperture for the entire oil and gas market to take advantage of this technology to improve their oil production, and therefore, in these times of low oil prices, really maximize their revenue, maximize their output, and improve their efficiency.
TPM: Is Stone Ridge, are you new to this business? Not personally, but the company, is this a brand new code and a brand new company you’ve set up to do this, or have you written codes all along? What’s the story there?
Vincent Natoli: I started the company in 2005, and we started doing consulting work in HPC. Very quickly, in 2007, we got excited about GPUs, and we did a lot of GPU porting. We wrote GPU codes from scratch for companies, a lot in oil and gas for seismic codes, but we also worked a lot in bioinformatics, we worked a little bit in finance, we did a little work five or six years ago in FPGAs. So, we’re really excited about this whole idea of accelerators, and about five years ago we really hit on this reservoir simulator. We were doing a project with Marathon. And, to make a long story short, we were helping them with their simulator and making modifications for GPU, and we convinced them to partner with us and allow us to develop a new simulator from scratch, where they get rights to the code, but we have commercial rights. So Marathon was an early partner in this, and they’ve been using the code for two years, and that’s how we got our start. I dropped everything else because this was really taking off and was very exciting, and we thought we could make a big difference in this area, because there’s a big differential between the performance we could deliver and the kind of performance people were used to.
TPM: Do you intend to sell this software, or run it on the IBM Cloud and sell a service? How do you sell this thing?
Vincent Natoli: We license the software as seats, and it can run on a system as simple as a workstation with one GPU, it can run on a single node, it can run on a whole cluster. And Sumit and I have been talking about cloud implementations and other ways to make it easier for companies to get access to it.
It’s interesting you should mention cloud, because it was one thing I was going to make note of. Before the oil downturn of the last two years, which we’re just kind of emerging from in the last six months, when you talked about cloud to oil companies they’d get all defensive and put up kind of a shield and start muttering words like proprietary data. I think that more than anything else, they just didn’t want to deal with it, because the security safeguards that cloud organizations have are certainly as good as, if not better than, what oil companies have themselves. But after the downturn, it’s a totally new landscape, and companies are looking for any way that they can get more efficiency and save cost, and cloud does have a lot of advantages. It’s an elastic resource. Reservoir simulation is often very bursty. Sometimes you have a big demand for a month, then you don’t use it for a few weeks, then there’s a big demand again. It’s nice to have a system that can adapt to your needs. And also, since hardware changes on a timescale of one and a half years, it’s frustrating. You buy a cluster, and then you’ve got it for five years as it depreciates, and meanwhile there’s new hardware out in a year and a half that’s twice as fast. With the cloud you can stay on top of that.
TPM: I mean there’s always going to be one cloud that’s putting in the new stuff, and four other clouds that are dragging their feet and not putting it in, because they just bought last year’s model.
Vincent Natoli: But they’re going to price it accordingly too. The market is going to drive that price.
TPM: Well to be more accurate, they’re always putting some of their fleet out with the latest stuff if they get enough business for it. You have the Skylakes and soon Voltas and there’s other things out there, and there will be Power9 soon. So have you gotten your hands on Power9 and Volta to test that yet? Or is it too early for that?
Vincent Natoli: It’s a little too early. I’m not even aware there’s a Volta out there. But we usually do get access to the new Nvidia hardware, and that’s another dimension to this exciting story. Very briefly, our code’s performance is linearly proportional to the memory bandwidth of the Nvidia cards. So when P100 came out, and it had two and a half times more bandwidth than the K40, our P100 times were two and a half times better than our K40 times. It’s that direct. We fully expect that when Volta comes out – whatever that is, 50 to 80 percent more bandwidth than P100 – we’re going to continue to see that gap grow between the performance of our code and the performance of all of these CPU based codes. So, for example, our hour and a half for our billion cells, if we’re on a Volta cluster that would go under 60 minutes, I’m sure.
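The projection above follows from a simple proportionality. The Python sketch below assumes, as stated in the interview, that runtime scales inversely with GPU memory bandwidth and nothing else changes. The 50 to 80 percent figures are the speaker's expectation for Volta, not measured values, and the function name is illustrative.

```python
def projected_minutes(baseline_minutes, bandwidth_gain):
    """Runtime if performance scales linearly with memory bandwidth.

    bandwidth_gain is the fractional increase, e.g. 0.5 for 50 percent more.
    """
    return baseline_minutes / (1.0 + bandwidth_gain)


BASELINE_MINUTES = 90.0   # quoted billion-cell runtime on P100

for gain in (0.5, 0.8):
    minutes = projected_minutes(BASELINE_MINUTES, gain)
    print(f"{gain:.0%} more bandwidth -> about {minutes:.0f} minutes")
```

At the low end of the expected range the run lands right at the 60 minute mark, and at the high end it comes in around 50 minutes.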
TPM: It’s all about the bandwidth, and time and time again, when I talk to people in this business they say the bulk of codes are memory bandwidth bound. And the reason why Knights Landing was architected the way it was, was because they were trying to do the same thing in a slightly different way. I don’t know what the right answer is, but I keep seeing more people interested in doing the fat node machine because, and this is a story I’m working on now and I’ll run it past you, but you’ve got GPU acceleration now, you’ve got machine learning acceleration for training, you’ve got HPC simulation. All three of those things can very easily run on the exact same hardware, and in some cases you’re going to want to run the analytics and the database, or the machine learning and the HPC, because they’re going to merge these things. They’re going to teach computers how to watch the simulations and that will be the application. It won’t be that you model the weather, so much as you throw out models and you have a machine learning algorithm watch how the weather evolves, and let it predict what it thinks the weather is going to be.
I don’t believe in magic. Some days I wonder if I ought to start. But it strikes me that this fat node thing is the way that interesting things are going to be done.
Vincent Natoli: I think the fat node just makes logical sense, for some of the reasons I just gave you. You’re going to have less communication. It’s the difference between, let’s say, in one extreme, I might have 100,000 cores all doing some separate little piece of work. Well, why not gang them into 100 separate units of 1,000, and then those 1,000 are communicating on die, so it’s really low power, low latency, high bandwidth communication between them. That just makes sense. But what’s holding people back is the switching costs. Some codes are easier to do.
For example, the oil and gas industry was one of the earliest to adopt GPUs because it was such a beautiful fit for seismic, and they burn millions of cycles on seismic every year, or hundreds of millions of cycles on seismic. And seismic is a beautiful code that can be reduced to maybe a couple hundred lines of C, a kernel of just raw compute that takes 99.9 percent of your time. If I had to sit down and design a really great algorithm to be accelerated by GPU, it would probably look a lot like seismic. Now reservoir simulation, and you were talking about climate modeling and computational fluid dynamics, those are all a lot harder. Seismic is hard, but it’s containable. Reservoir simulation, computational fluid dynamics, climate modeling have hundreds of kernels, big kernels, and if you’re going to do it, you’re going to do every one of them on GPU. Because otherwise you get killed by Amdahl’s Law, and sloshing data back and forth between the CPU and the GPU. So, that’s why it took longer. That’s why companies didn’t want to do it.
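Amdahl's Law makes the contrast between seismic and reservoir simulation concrete. The Python sketch below applies the standard formula, speedup = 1 / ((1 - p) + p / s), where p is the fraction of runtime that is accelerated and s is the speedup of that fraction. The 99.9 percent figure is the one quoted for seismic. The 10X kernel speedup and the other fractions are assumptions chosen for illustration, and the formula ignores CPU-GPU transfer time, which would make partial ports look even worse.

```python
def amdahl_speedup(accelerated_fraction, kernel_speedup):
    """Overall speedup when only part of the runtime is accelerated."""
    serial_fraction = 1.0 - accelerated_fraction
    return 1.0 / (serial_fraction + accelerated_fraction / kernel_speedup)


KERNEL_SPEEDUP = 10.0   # assumed GPU-versus-CPU speedup of the ported kernels

for fraction in (0.999, 0.90, 0.50):
    overall = amdahl_speedup(fraction, KERNEL_SPEEDUP)
    print(f"{fraction:.1%} of runtime on GPU -> {overall:.2f}x overall")
```

With one kernel covering 99.9 percent of the runtime, nearly all of the 10X carries through. Port only half of a many-kernel code and the overall gain drops below 2X, which is why every kernel has to move to the GPU.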
And, in fact, I think there was a time not too long ago, maybe four or five years ago, where people said, GPUs, they’re really good for things like seismic, and Monte Carlo – things that are kind of naturally parallel – but they’re never going to be good for serious number crunching. So, the codes that we’re talking about are all similar in the sense that they’re solving coupled partial differential equations on a grid. It’s different partial differential equations, because it’s different physics in CFD, climate modeling, and structural mechanics. It turns out the big picture of the codes is very similar. In the center of the code somewhere, there’s a huge linear solve. It’s probably wrapped by a huge non-linear solve. Then there’s different non-linear solver approaches depending on the particular discipline you’re working in. Then there’s a lot of work to calculate the different coefficients in the solver, then updating. That’s where all your complexity comes in, and your kernels, some of them are naturally parallel, and some of them are not. But for big companies, I think, it’s an investment to make a decision to do that.
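The code structure described here, a linear solve wrapped by a non-linear solve with coefficient calculation and an update step around it, can be shown in miniature with Newton's method. The Python sketch below is a toy under stated assumptions: the two-equation system is made up for illustration, and a direct 2x2 solve stands in for the large sparse iterative solver a real simulator would use. It is not a description of how ECHELON is implemented.

```python
def residual(x):
    """Hypothetical non-linear system F(x) = 0, standing in for discretized PDEs."""
    u, v = x
    return [u * u + v - 3.0, u + v * v - 5.0]


def jacobian(x):
    """Coefficient calculation: partial derivatives of the residual."""
    u, v = x
    return [[2.0 * u, 1.0], [1.0, 2.0 * v]]


def solve_2x2(a, rhs):
    """Inner linear solve (Cramer's rule); a simulator uses a sparse solver here."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [
        (rhs[0] * a[1][1] - a[0][1] * rhs[1]) / det,
        (a[0][0] * rhs[1] - rhs[0] * a[1][0]) / det,
    ]


def newton(x, tol=1e-10, max_iterations=50):
    """Outer non-linear loop: evaluate, linearize, solve, update."""
    for iteration in range(max_iterations):
        f = residual(x)
        if max(abs(component) for component in f) < tol:
            return x, iteration
        dx = solve_2x2(jacobian(x), [-component for component in f])
        x = [xi + di for xi, di in zip(x, dx)]
    return x, max_iterations


solution, iterations = newton([1.0, 1.0])
print(f"u={solution[0]:.6f}, v={solution[1]:.6f} after {iterations} iterations")
```

In a production code each of these four functions expands into many GPU kernels, and the inner linear solve dominates the runtime.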
Sumit Gupta: We have presented to you many times our view of the accelerated computing future. Having these fast interfaces between the CPU and GPU, to enable workloads like this which are not naturally parallel and which are a mixture of sparse computations and data, is what NVLink and OpenCAPI provide. I think you’re starting to see the beginning of the vision we outlined to you a year ago. That’s one of the reasons why we’re really strongly partnering with Stone Ridge and other software partners like them.
Vincent Natoli: The ironic thing is, or maybe it’s not ironic, the hardware changes very rapidly, and the software has the word soft in there, so you would think that it’s malleable, but it actually changes a lot slower, because the basis of this code is so complex, and there’s so much effort that goes into it, and it’s so specialized too, that it is hard to switch platforms. So in the mid-2000s, when companies started to write reservoir simulators, there were no GPUs. They wrote their codes thinking the future was a whole bunch of inexpensive cores, so their codes did map quite well to it. But, in the meantime, because it took them literally eight to ten years to develop the codes, the whole hardware landscape changed under their feet. Now, they have some difficult decisions to make. I see this because companies are thinking the idea of rewriting entire codes is very daunting, so they’re thinking of porting pieces over one by one, and so on.
TPM: Why wouldn’t they look at that situation and say, why are we writing the code? If it’s so strategic for us, why can’t we just use this code that these smart guys at Stone Ridge have just invented?
Vincent Natoli: Well that’s what I’m trying to convince them, you know?
TPM: But even still, buy yourself time. Even if the next iteration of stuff is going to run on neural networks and squirrels running on little mouse wheels, whatever, they could skip a generation and buy themselves some time, and I’m surprised they wouldn’t give you a spin. I mean, this is how everybody stopped writing their own EDA software a million years ago.
Vincent Natoli: Fewer and fewer companies are doing it because of a couple of things. One, it turns out that parallel programming is very complicated, even on CPU, actually especially on CPU. There’s three levels of parallelism, right: between the nodes you’ve got MPI, on the node some kind of threading (OpenMP), and then for each core you have vectorization. You can’t just take domain experts. In the 1980s and 1990s this worked beautifully. They took domain experts – like you’re a reservoir engineer and you know a little programming – and you could go in and write a nice Fortran code that runs on one core and you could actually do a pretty good job. But, you can’t do that today. There’s too many disciplines that overlap in this. There’s the applied math, there’s the physics. There’s the mapping of the physics of whatever you’re trying to model to the mathematics. There’s the mapping of the math to algorithms, and then the algorithms to software.
TPM: And then the software to the hardware, the hardware’s not sitting there perfectly easy, there’s something you need to do.
Vincent Natoli: Right, that’s the last part, that it’s complicated to get performance. A lot of people could write a kernel and it would run. I like to say this: if we did this experiment in the 1980s, and we told a bunch of physicists to go in a room and write this compute kernel, there would be a distribution of performance, but it would be kind of peaked. If you did it today, there could be two orders of magnitude difference between the worst and the greatest, because coding is that complicated. Like I said, there’s three levels of parallelism on CPUs, and then there are GPUs too, and then you’ve also got your, you know… GPUs in some ways are a little easier because there’s really two levels. You write CUDA for the GPU and there’s MPI for between the GPUs. So, I don’t know, in the beginning of GPU computing a lot of people were saying, oh it’s too complicated, you need specialists to do it. Having done this for a while, it feels like it’s not as complicated as writing for CPU.
TPM: No, you’re writing for a distributed Cray vector machine.
Vincent Natoli: It is a language with a natural parallelism to it. It’s a language that expresses parallelism better, so it’s easier to implement. It’s a more natural mapping, I should say. And you have more control at the low level, like you can control what’s in shared memory and what’s in registers, whereas on the CPU you depend a lot on the compilers to do things correctly.