The move to integrating AI into current operations and finding its role in entirely new applications at Royal Bank of Canada (RBC) is similar to what we’re seeing among other large-scale enterprises. Before you roll your eyes and click away because you see something about enterprise AI, read on for just a moment. Because it’s not about the workload or even GPUs. It’s about all the various performance pieces that go along with that shift and what they mean for a larger view of enterprise systems in the more encompassing sense.
Deep learning begins small, but the rethink it forces, from often custom infrastructure to network capabilities and new demands on storage, pushes a larger-scale imagining of where similar compute, storage, and network performance could alter legacy-driven (essential) applications. In short, while AI might not be a widespread set of mission-critical production use cases yet, it is sparking new thinking about architecture enterprise-wide. In some cases, it’s thinking for the first time about GPU acceleration for more than just training; in others, it’s a recognition of how storage environments will need to evolve to keep pace with increased compute capacity, heterogeneity of systems and workloads, and so on.
All of the above (and more) are on the table for RBC as it looks to its future infrastructure and diverse application bases. What they’re doing now might signal how similar companies in banking in particular might buy and use big infrastructure. Big banks mean big legacy and while many are unlikely to scrap their mainframes and large enterprise software systems, the cutting edge inspired by hyperscale, AI, and even some HPC will lead the charge in the next few years.
RBC is a great example of the early sea change. The banking giant built its own AI research and development center that started with some GPUs integrated by Dell, HPE, and IBM but quickly came to the realization that going directly to the source—in this case, Nvidia—for the full-stack with integrated compute and networking made more sense. This third iteration of their burgeoning AI infrastructure is now chewing on some actual production workloads serving various parts of RBC’s business, with one V100-equipped DGX box for model development and testing and another for production. All of this has been looped into a private cloud with an OpenShift base with the ability to burst into AWS, Google, or Azure when needed.
There is nothing really remarkable about this infrastructure evolution. We’re seeing this elsewhere in large-scale enterprise AI where the test/dev cluster expands and eventually becomes part of operations. It’s not a fast process, but based on what Mike Tardif, Senior VP of Tech Infrastructure at RBC, tells us, it’s an eye opener for more than just AI training and inference. They’re looking at the GPUs and what they have entailed from a storage and networking perspective as representative of where they’d like to go down the road, especially for demanding segments of the business, including capital markets.
Like many banks that have been around for decades, RBC’s infrastructure has plenty of old code and systems to tow for the long haul. There are still plenty of mainframes, CPU-only systems, and standard storage and networking gear. “95% of our environment is client server and all Intel for the most part,” he says. Integrating AI into these workflows will be difficult, but the Borealis AI research team at RBC is thinking about how this might work. Even further, with GPUs entering the mix, Tardif is seeing that they may have a role outside of the isolated clusters in the Borealis AI testbed in accelerating enterprise applications and databases. He says that it’s still early days and wonders if his teams of “traditional developers even know that capability [broader GPU acceleration of enterprise apps] is even there and what they would do with it.”
The big challenge of all of this is full integration. The AI operations are part of a standalone GPU grid but the data needed is on mainframes, in Teradata, Hadoop, and elsewhere. In their DGX environment there is an R&D and production division of clusters. “The reason we chose Nvidia DGX is because with the snap-in configurations we can grow. OpenShift and containers were an important move as well because in the future if we need more elasticity we can burst into a public cloud.” He adds that the Borealis AI environment is in its third GPU implementation. “As we grow clusters there are other folks, in capital markets, for instance, that we used to buy smaller GPUs for, so rather than disparate systems if we could have a shared firewalled GPU farm that would be ideal.”
Tardif made another important point, one that we think might be representative of that sea change noted earlier:
“In the past the GPU was just the chip. In talking with Nvidia execs about where they’re going, we see that at some point [GPU computing] becomes a white box, commoditized. If you look at gaming datacenters, it’s just rows of chips without sheet metal around them, just a farm of circuit boards. We’re thinking the future will be racks more like that. Maybe we’ll bump to public cloud and rent GPUs for a few hours or go even deeper where we are now with DGX, but there’s a lot to think about, especially since the GPUs we have are heavily utilized already.”
Tardif is having a harder time seeing a fit for the traditional OEMs, both for AI hardware now and, looking ahead, for some broader workloads. “In the first couple of iterations with Borealis and AI we had standard boxes from Lenovo, HPE, Dell, and even tried an IBM one with some GPUs. But knowing where AI is going we needed something more scalable. With the DGX it’s now a computer in a box and works directly with everything.”
The point is, sure it’s not bleeding edge for the most part, but no enterprise banking IT shop really is. The other more important point is that it’s changing. And while AI is at the helm of that charge for increasing modernization, it’s just a catalyst for thinking about infrastructure overall rather than some sweeping set of workloads that will change business operations. Not yet, at least.
For example, with the new AI testbed composed of DGX clusters, Tardif knew that the traditional NetApp and EMC installations they rely on elsewhere in the business wouldn’t be able to keep up with the DGX. They looked to all-flash, finding Pure Storage’s blades the right fit. And for an organization that only recently updated its infrastructure overall from 1Gb/s to 10Gb/s, the real power of 100Gb/s is starting to become clearer now that they’re seeing it play out in their Borealis environment.
The overall infrastructure for RBC is vast and widespread with a large environment in Canada (where the Borealis team is centered) and 5X football field-sized datacenters elsewhere in Canada. They also have 60 colo centers around the world. As Tardif noted earlier, they do use the cloud for bursting, with no real loyalty among AWS, Azure, and Google, especially for the capital markets teams’ risk calculations and other workloads.
In short, GPUs and AI are pushing a full-stack re-envisioning of their environment, bit by bit. All of this has been enabled by containerization and OpenShift and other Red Hat tools, Tardif says. But performance and possibility are pushing the company forward in some unexpected directions. This isn’t surprising. Consider this on a small AI test/dev lab front. Heavily utilized GPUs for training/retraining, then production, mean the storage has to keep pace. The networks have to keep pace. Integration becomes more important than ever and the need to pick up work easily and whip around to other datacenters in nice containerized units is also essential.
While RBC is still in the early stages with all of these things, their journey is noteworthy because it shows the power of a large customer thinking of Nvidia as an integrator, as a datacenter provider, not just a chip company. That’s a big deal, and it’s one that its partners in storage and now with Mellanox can run with. It’s a strong story and while it’s AI-centric now, as more applications get the GPU acceleration treatment (for instance, GPU-accelerated databases) it will push for a new kind of infrastructure.
And now Nvidia owns that from top to bottom (minus the storage piece, but there are plenty of partners eager for those Nvidia partnerships to solve that problem).