Hardware is the most tangible part of any system, and it is the aspect of the system that tends to get the most attention. You can touch it, and when it is fresh out of the box you can even smell the crisp newness of the compute. Still, some of the more important aspects of a system are more subtle and are more difficult to tease out and control.
Dave Turek, vice president of exascale systems at IBM, spends a lot of time thinking about these issues as well as the exascale systems that Big Blue is designing. And as you might expect, given IBM’s experience building “Summit” at Oak Ridge National Laboratory, which is one of a number of pre-exascale hybrid AI-HPC supercomputers in the world and, at slightly over 200 petaflops, currently the most powerful of them, Turek is thinking about computing at scale across a number of different fronts and will be discussing these this week at The Next AI Platform event in San Jose.
One of the themes that IBM is contemplating as AI and HPC workloads converge is that systems need a much higher degree of composability than is available today. Coupled to this idea is the need to shift the focus from algorithms to workflows of algorithms that come and go across hardware configurations that are truly ephemeral. Cloud computing appears to offer this because of its massive variety, but it really isn’t as malleable as it appears.
“We have ideas on this that go quite far out in time, but work is commencing on it as well,” explains Turek. “This is not a new problem, it’s just a problem that requires a little more attention right now. People are operating on the problem with really blunt instruments, in this case by answering two questions: Do I need GPUs or not, and if so, how many should I get? And then they come up with a number driven maybe more by budget constraints than anything else. But how often do you think the GPUs are being invoked? How often are the different architectural attributes of the CPUs and GPUs being invoked?”
Being general purpose and covering a lot of workloads means leaving a lot of silicon dark a lot of the time, which is something we cannot afford now that Moore’s Law improvements are slowing, chip progress is slowing with them, and prices are rising.
“We actually have just completed a first draft of what I am calling a new manifesto for the future of high performance computing, which is meant in the broadest sense,” Turek says. “It’s quite simple, and it says the future will be based on bits, qubits, neurons, and an information architecture. Neurons is the AI piece, bits means classical HPC, and qubits is quantum, of course, but it all needs to sit on a sea of an information architecture. Note that I didn’t say data, I said information architecture, which means having the data infused with some sort of ontology so that we can operate on it and make good use out of it. Implicitly, this means we need to denominate work in terms of workflows, not in terms of algorithms, because workflows are more important to the enterprise than the execution of a particular algorithm. And by further implication, we want to have the ability to deliver the right set of components at the right time.”
Another thing that IBM is thinking about as it looks ahead to the future of AI architectures is security, but in a different way from what we might expect. The way neural networks function is a bit of a black box; the way that they come up with answers is not as traceable as it is for an algorithm written in C++ or Fortran. That opacity is bad enough for industries that are used to determinism. But just as important as trying to pick apart why a particular AI application gets the answers that it does is the need to certify that no one injected data into the training sets that would allow someone to steer the results. IBM, says Turek, is exploring the use of blockchain to keep track of the application and data workflows in AI applications.
“What blockchain will do for AI and HPC workflows is allow us to go in reverse, taking the answer and backing up from that to see what happened in different steps of workflows, and thereby give us some sense of the methodology and to check if any mischief occurred along the way – or better still, to make sure no mischief occurs in the first place,” Turek says.
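To make the idea of backing up through a workflow a little more concrete, here is a minimal sketch of a hash-chained log of workflow steps. It is not IBM's design, which the article does not describe, and it is only the tamper-evidence core of a blockchain, with no distribution or consensus. The step names, record fields, and function names are all hypothetical and chosen purely for illustration.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first step (an assumption)


def step_hash(record, prev_hash):
    """Hash a step's record together with the hash of the step before it."""
    payload = json.dumps({"record": record, "prev": prev_hash}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_step(chain, record):
    """Append a workflow step, linking it to the previous step's hash."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append(
        {"record": record, "prev": prev_hash, "hash": step_hash(record, prev_hash)}
    )


def first_tampered_step(chain):
    """Return the index of the first step that fails verification, or None."""
    prev_hash = GENESIS
    for i, entry in enumerate(chain):
        expected = step_hash(entry["record"], prev_hash)
        if entry["prev"] != prev_hash or entry["hash"] != expected:
            return i
        prev_hash = entry["hash"]
    return None


# Hypothetical three-step workflow: ingest training data, train, infer.
chain = []
training_data = b"training-set-v1"  # stand-in for a real data set
append_step(
    chain,
    {"step": "ingest", "data_sha256": hashlib.sha256(training_data).hexdigest()},
)
append_step(chain, {"step": "train", "params": {"epochs": 3}})
append_step(chain, {"step": "infer", "output": "answer"})
```

Because each step's hash covers the hash of the step before it, changing any recorded step after the fact, such as swapping in the fingerprint of a different training set, makes `first_tampered_step` point at the altered step instead of returning `None`.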