DOE AI Expert Says New HPC Architecture Is Needed

Artificial intelligence is taking center stage in the IT industry, fueled by the massive growth in the data being generated and the increasing need in HPC and mainstream enterprises for capabilities ranging from analytics to automation. AI and machine learning address a lot of the demands coming from IT.

Given that, it’s not surprising that the view down the road is that spending on such technologies will only increase. IDC analysts are forecasting that global revenue in the AI space, including hardware, software, and services, this year will hit $341.8 billion — a 15.2 per cent year-over-year increase — and will jump another 18.8 per cent in 2022 and break the $500 billion mark by 2024.
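As a rough check on what those percentages imply, the arithmetic can be sketched as follows. The revenue figure and the two growth rates are the ones quoted above, with "this year" read as 2021 because the next increase is dated 2022. The constant growth rate across 2023 and 2024 is an illustrative assumption of this sketch, not part of the IDC forecast, and the variable names are hypothetical.

```python
# Figures quoted from the IDC forecast in the paragraph above.
revenue_2021 = 341.8   # billions of US dollars ("this year", read as 2021)
growth_2021 = 0.152    # year-over-year increase into 2021
growth_2022 = 0.188    # forecast increase for 2022
target_2024 = 500.0    # the mark IDC expects to be broken by 2024

# Revenue implied for the prior year by the 15.2 per cent increase.
implied_2020 = revenue_2021 / (1 + growth_2021)

# Revenue implied for 2022 by the 18.8 per cent increase.
projected_2022 = revenue_2021 * (1 + growth_2022)

# Assumption (not from IDC): growth is constant across 2023 and 2024.
# This is the minimum annual rate needed to reach the target by 2024.
required_growth = (target_2024 / projected_2022) ** 0.5 - 1

print(f"Implied 2020 revenue:   ${implied_2020:.1f} billion")
print(f"Projected 2022 revenue: ${projected_2022:.1f} billion")
print(f"Growth needed per year in 2023 and 2024: {required_growth:.1%}")
```

Under that assumption, the forecast implies roughly $406 billion in 2022, and growth of about 11 per cent a year after that would be enough to pass $500 billion in 2024.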

Datacenter hardware OEMs and component makers have worked furiously for the past several years to build AI, machine learning and related capabilities into their offerings, and public cloud providers are offering wide ranges of services dedicated to the technologies.

However, a problem with all of this — the way AI is being used and the underlying infrastructure that supports it — is that much of it is an evolution of what has come before it and is aimed at solving problems in the relative near future, according to Dimitri Kusnezov, deputy under secretary for AI and technology at the Department of Energy (DOE). In addition, much of the development and innovation has been transactional — it’s a big and fast-growing market with a lot of revenue and profit opportunities, and IT executives are aiming to get a piece of it.

But the highly complex simulations that will need to be run in the future and the amount and kind of data that will need to be processed, stored and analyzed to address the key issues in the years ahead, from climate change and cybersecurity to nuclear security and infrastructure, will stress current infrastructures, Kusnezov said during his keynote address at this week’s virtual Hot Chips conference. What’s needed is a new paradigm that can lead to infrastructures and components that can run these simulations, which in turn will inform the decisions that are made.

“As we’re moving into this data-rich world, this approach is getting very dated and problematic,” he said. “Once you make simulations, it’s a different thing to make a decision and making decisions is very non-trivial. … We created these architectures and those who have been involved with some of these procurements know there will be demands for a factor of 40 speed-up in this code or ten in this code. We’ll have a list of benchmarks, but they’re really based historically on how we have viewed the world and they’re not consonant with the size of data that is emerging today. The architectures are not quite suited to the kinds of things we’re going to face.”

The Department Of Everything

In a wide-ranging talk, Kusnezov spoke about the broad array of responsibilities that DOE has, from overseeing the country’s nuclear arsenal and energy sector to protecting classified and unclassified networks and managing the United States’ oil reserves — which include a stockpile of 700 million barrels of oil. Because of this, the decisions the Department makes often come from questions raised during urgent situations, such as the Fukushima nuclear disaster in Japan in 2011, various document leaks by WikiLeaks and the COVID-19 pandemic.

These are immediate situations that require quick decisions and often don’t have a lot of related modeling data to rely on. With 17 national labs and a workforce of almost 100,000, the DOE has become the go-to agency for many different crises that occur. In these situations, the DOE needs to develop actionable and realistic decisions that have high consequences. To do this, the agency turns to science and, increasingly, AI, he said. However, the infrastructure will need to adapt to future demands if the DOE and other organizations are going to be able to solve societal problems.

The Energy Department has been at the forefront of modern IT architecture, Kusnezov said. The launch by Japan of the Earth Simulator vector supercomputer in 2002 sent a jolt through the US scientific and technology worlds. Lawmakers turned to the DOE to respond, and the agency pursued systems with millions of processing cores, heterogeneous computing (leading to the development of a petaflop system in 2008 that leveraged the PlayStation 3 graphics processor) and the development of new chips and other systems.

“Defining these things has always been for a purpose,” he said. “We’ve been looking to solve problems. These have been the instruments for doing that. It hasn’t been just to build big systems. In recent years, it’s been to create the program for exascale systems, which are now going to be delivered. When you face hard problems, what do you fall back on? What do you do? You get these tough questions. You have technologies and tools at your disposal. What are the paths?”

Traditionally that has been modeling and measuring, techniques that first arose with the Scientific Revolution in the mid-1500s. Since the rise of computers in the last century, “when we look at the performance goals, when we look at the architectures, when we look at the interconnect and how much memory we put in different levels of cache, when we think about the micro kernels, all of this is based on solving equations in this spirit,” Kusnezov said. “As we’ve delivered our large systems, even with co-processors, it has been based deliberately on solving large modeling problems.”

Now simulations are becoming increasingly important in decision making for new and at times immediate problems. The simulations not only have to help drive the decisions that are made, but there also has to be some guarantee that the simulations and the resulting options and decisions are actionable.

This isn’t easy. The big problems of today and the future don’t always come with the large body of historical data that traditional modeling relies on, which introduces a level of uncertainty that needs to be included in calculations.

“Some of the things we have to validate against you can’t test,” Kusnezov said. “We use surrogate materials in simulated conditions, so you have no metric for how close you might be there. Calibrations of phenomenology and uncontrolled numerical approximations and favorite material properties and all of these can steer you wrong if you try to solve the Uncertainty Quantification problem from within. There are many problems like that where if you think within your model you can capture what you don’t know, you can easily be fooled in dramatic ways. We try to hedge that by experts in the loop with every scale. We strain architectures and we try and validate broader classes of problems whenever we can. The problem that I have at the moment is that there is no counterpart for these kinds of complex approaches to making decisions in the world, and we need that. And I hope that’s something that eventually is developed. But I would say it’s not trivial and it’s not what’s done today.”

DOE has always partnered with vendors — such as IBM, Hewlett Packard Enterprise and Intel — that build the world’s fastest systems. That can be seen with the upcoming exascale systems, which are being built by HPE and involve components from the likes of Intel. Such partnerships typically involve modifications to software and hardware roadmaps and the vendors need to be willing to adapt to the demands, he said.

In recent years, the Department also has been talking with a broad range of startups — Kusnezov mentioned such vendors as SambaNova Systems, Cerebras Systems, Groq and Graphcore — that are driving innovations that need to be embraced because a commercial IT market that can be measured in the trillions of dollars isn’t going to help solve big societal problems. The money that can be made can become the focus of vendors, so the goal is to find companies that can look beyond the immediate financial gains.

“We have to be doing much more of this because, again, what we need is not going to be transactional,” Kusnezov said. “We have pushed the limit of theory to these remarkable places and AI today, if you look to see what’s going on — the chips, the data, the sensors, the ingestion, the machine learning tools and approaches — they’re already enabling us to do things far beyond — and better — than what humans could do. The discipline of data now, coming late after the push for solving theories, is starting to catch up.”

Systems and components that evolved over the past decades have pushed the limits of theory and experiment for complex problems, and that will expand with exascale computing. But current architectures were not designed to enable scientists to explore both theory and experiment together, he said.

“Decisions don’t live within the data for us,” Kusnezov said. “The decisions don’t live in the simulations either. They live in between. And the problem is from chip designs to architectures, they’ve done remarkable things and they’ve done exactly what we intended them to do from the beginning. But the paradigm is changing. … The kinds of problems that drove the technology curve are changing. As we look now at what’s going on in AI broadly in terms of chips and techniques and approaches, it’s a remarkable breath of fresh air, but it’s being driven by near-term market opportunities [and] specific applications. It might be that we will stumble into the right endpoint, but I don’t want to lose this window of time and the opportunity to say, while we are thinking of altogether new designs for chips and architectures, can we step back just a little bit to the foundations and ask some more fundamental questions of how we can create what we need to merge those two worlds, to inform decisions better [and] new discovery better? It’s going to take some deep reflection. This is where I hope we can go.”
