AI Infrastructure Needs Systems Level Thinkers

If we are talking about infrastructure, it might seem obvious to take a systems-level view. But the introduction of AI and machine learning, with all the novel accelerators and new architectures, has distracted from that bigger picture. Processor choices, which are finally here, certainly matter, but the implications and impacts are less clear, and certainly less appealing to talk about. But we are going to anyway.

To start, it is worth summarizing a few trends as they pertain to datacenters of a certain scale.

The first two are not new but are worth repeating to set up what follows. First, companies that are doing interesting work to begin with are being pushed into evaluating (if not already putting into production) AI and machine learning projects. And second, when it comes to thinking about infrastructure, they are looking to their hyperscale higher-ups for lessons on how to build efficient, high-performance systems specifically for those emerging workloads.

Again, not necessarily a newsflash, but here is what is different and interesting. Those hyperscale companies that have set the infrastructure pace, resplendent with accelerators (novel and general purpose), custom networking and storage, and presumably admirable price/performance payoffs for all of this hard work, can afford to do what the big enterprises have a harder time investing in, especially if their core business is not in delivering products or services with a big machine learning angle.

And this leads to a secondary set-up point. Pretty soon, every business is going to have to have an AI angle. Sometimes the use case will be unique, need big training clusters, and change the way banks or insurance or healthcare operate. But at this early stage, most will be doing things that have an eye-roll factor, à la my inbox full of pitches for the next revolution in AI-assisted hair color selection. Anyway. This is different than the hype during the big data days when enterprises followed hyperscalers down the Hadoop path, for instance. Because when done right, AI applications, early as they are, do yield results, even if the real ROI of those limited applications has not been fully realized.

The point is, enterprise and even research organizations are following the hyperscalers down this path. But how this all gets done from a systems perspective is completely different. From big deltas in system component prices to skilled teams plucked from the top ranks, it is just not as simple to roll out the kind of at-scale ML initiatives a Facebook or Google or Amazon does, first because many businesses are still at the competitive edge versus core competency stage with AI, and second, because enterprises cannot build the kind of homegrown, highly engineered, sophisticated, and (very relatively) cheap systems their hyperscale kin can.

This leads us to another trend, and one that bodes well for the entire server infrastructure ecosystem. Edge programs aside, when it comes to putting AI in production, the cloud is less of an option at scale. The cost side of that is not a newsflash either, particularly for data movement and compute-heavy training. And while doing inference on vanilla CPUs is an option in the cloud, getting those trained (and retrained) models humming on cloud instances, especially if near-real-time is the goal (and it often is for most current vision/image/video/text applications in AI), does not necessarily always make sense.

Some off-the-record conversations with household name companies these past couple of weeks at various events sparked this article. The unexpected thing that has happened for some of these companies, which we might call “lower tier” hyperscale companies with web-delivered services in the social and sharing economy spaces, is that while they built their businesses on AWS and other cloud platforms, they have done the math on certain AI initiatives and are having to learn how to build and maintain infrastructure. It’s all about cost, but it’s also about performance and the ability to fine-tune the bang for the buck while also retooling how training and inference happen. In other words, those cloud natives who didn’t want to think about the underlying hardware and wanted flexibility are now, you guessed it, thinking about the infrastructure for flexibility’s sake. And the price differences, of course.

There is a problem for these particular ex-cloud natives in this process. They want to customize their infrastructure and do a lot of interesting things to get max performance but it turns out building systems like Google does is not simple, cheap, or even the right thing for those jobs at whatever lesser scale they are working from. And on the other end, the OEMs are scrambling to make generalized AI platforms, which mostly look like the same GPU-dense servers we’ve seen for years now albeit with the latest graphics cards and some memory enhancement and some pre-packaged services to help on-board with AI via hooks to frameworks (TensorFlow, etc). And at the high end for those who really want to invest, there are packaged appliances like the DGX machines from Nvidia.

On that note, there are a couple more trends that spawn from this. Unfortunately, these are a little harder to generalize. They are, however, the reason why we keep pushing out stories about once-esoteric things like NVMe over fabrics and what is on the horizon for phase change memory, and how trusty old parallel file systems are falling short for AI, and how, basically, the way folks build infrastructure is different than it has ever been.

We are going to drive this home even more during our interview and panel-based event in May (more on that here) but what is clear is that one cannot go down a single path to build out AI infrastructure by just thinking about how the newest processor will drive new capability. And it is not even as simple as accounting for the added complexity of putting GPU-heavy nodes into the mix. With accelerator-driven training in particular, it is not possible to look at the system as anything but a system. The impacts of these decisions on data movement, storage and fast access, and how to create shared pools of memory so that all the money invested in those GPUs can be teamed up: all of this matters. In other words, we are used to living in a server world where the focus is on the processor and network. For AI at scale, the systems-level view is the only one that makes sense.

Oddly enough, the best lesson for this outside of hyperscale is something like the Summit supercomputer at Oak Ridge National Lab. The big challenge in the design of that system was creating the ultimate balance between compute capability and capacity, and being able to maximize both of those under serious power limitations. It is this kind of systems-level thinking that is needed again (and we have a lot on Summit as a machine, as an integrated whole, here). Due to the HPC and AI double duty designed into the system, it might not be an exact parallel for enterprise shops looking to integrate AI, but it is a starting point for all the considerations, starting with the beefy GPUs but really ending where those meet the network and then the storage, the systems software, and then looping back again. It’s a headier thing than the vendors and mainstream tech press give it credit for and it’s the actual way to think about these things intelligently and holistically, at least from what we understand.

Sure, it is possible to get things all neatly bundled and work in performance through NVMeOF tricks or in software or by spending big on networking options. But if it is true that the lower tier hyperscalers are emulating what their big brothers (and sisters) are up to in the datacenter, that might mean more creativity in systems again. And lo, there are even non-Intel CPUs to consider, at long last. It is also possible to invest big in training with a GPU-laden DGX but trying to network those together also leaves some inefficiencies on the table.

So, the question is, what are people doing at scale and why? What did they do before this? What were the limitations? How do they think about ROI if AI is still just a freshly prototyped initiative with a lot of gusto but a hard-to-prove return? Do different people make the decisions across the hardware stack (i.e., the compute and storage and network aren’t all decided in a systems-level way)? What is the benefit of rolling your own clusters for AI to get max benefit (and max pain in the ass factor)? And how do these scale, do they need to, what is the training versus inference imbalance in CAPEX investment and how is that reconciled and to what end product or service? Oh, and like, who is going to buy custom ASICs just for training when there’s a general purpose option (or two, if you count FPGAs) other than those who can afford to take a risk without a roadmap? Inference might be a different story but how valuable does that application get before you move off standard CPU? And when do you have too much compute and no way to feed it?

The vendors will say that it is simple to bundle the entirety of a system designed for AI training and inference and deliver it turn-key-ish. The makers of accelerators will say that it just takes a phalanx of cores to get the job done and the rest will follow. The storage makers say that they have “re-architected” for AI, in some cases doing so to the point that they have created something less attractive to their existing base; in other cases, they’ve just rebranded with AI as the banner term for a post-disk future. The networks are the networks; those are not so simple to change, but there are a slew of new options. The creators of AI frameworks say these things are extensible and easy to use. The pundits tell us AI will eat the world and will surpass the GDP of most nations. The VCs and investors tell us that everything new with a Stanford grad as CTO/co-founder is worth a billion dollars.

But the users?

The ones that have been at this AI experimentation game for a while have a tale to tell about the evolution of systems, not just chip choices or accelerator capabilities. They are IO starved and don’t see a rapid solution to counterbalance all that compute. The ones who are new, who prototyped on the cloud and are being tasked with on-prem production systems, are in a push-pull over price for an application set that in some cases does not have a clear ROI to balance the CAPEX on hardware. And then there are the early experimenters. Time will tell where they take their systems, but one thing is for sure: they are increasingly on-prem as well.

The good news, for us here at TNP and what we focus on, is that infrastructure matters again. More accurately, on-prem does once again. The pendulum shifts back, and it will go back again, but for now, there are a lot of questions.

We will be talking in person and on-stage to the people who have made these decisions and asked these questions as they built systems of their own—as well as to the people that build these systems to understand who and what they are building for.

Please join us. And it all ends in a happy hour with us at the end of the day, so there’s that.
