Among all of the major suppliers of HPC systems in the world, Lenovo is perhaps uniquely positioned to capture a very large share of the HPC market.
The bulk of the company’s HPC business was initially founded on the System x server business that it acquired from IBM a few years back, plus the related storage and cluster management software that Lenovo licensed from Big Blue as part of the deal. But Lenovo, which serves Chinese hyperscalers and cloud builders, knows a thing or two about scale-out distributed computing and how to get the cost of a machine down to compete. And so it is winning deals and growing its business.
The November 2018 Top500 rankings of machines running the Linpack Fortran benchmark are a case in point of how far Lenovo has come. Drill down into the current list and Lenovo had an aggregate of 234.3 petaflops of Linpack performance (16.6 percent of the total capacity on the list) across 7.74 million cores in 140 systems, for the first time besting Cray, which had 193 petaflops of sustained double precision oomph on Linpack across just over 7 million cores in 49 machines, comprising 13.6 percent of total capacity. For some reason, the “Summit” supercomputer at Oak Ridge National Laboratory and the companion “Sierra” system at Lawrence Livermore National Laboratory are not labeled as IBM machines (they share the label with Mellanox and Nvidia) even though IBM is the prime contractor; if you add these machines into the True Blue base, then IBM has 15 machines on the list with a total of 296.4 petaflops of sustained Linpack performance across 7.6 million cores. The point is, Lenovo is keeping pace with IBM and Cray, and it is staying well ahead of Chinese rivals Inspur, Sugon, and Huawei.
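Those vendor shares follow directly from the aggregate numbers. A quick sketch checks the arithmetic – note that the roughly 1,415 petaflops list total is our back-of-envelope figure implied by the quoted percentages, not a number taken from the list itself:

```python
# Check the Top500 share arithmetic quoted above.
# TOTAL_PF is an assumed list-wide aggregate implied by the quoted
# percentages (234.3 PF / 16.6% and 193 PF / 13.6% both point to it).
TOTAL_PF = 1414.7  # approximate aggregate Linpack capacity, Nov 2018 list

vendors = {
    "Lenovo": 234.3,  # petaflops across 140 systems
    "Cray":   193.0,  # petaflops across 49 systems
    "IBM":    296.4,  # petaflops across 15 systems, Summit and Sierra included
}

for name, petaflops in vendors.items():
    share = 100.0 * petaflops / TOTAL_PF
    print(f"{name}: {petaflops:.1f} PF = {share:.1f}% of list capacity")
```

Running the loop reproduces the 16.6 percent and 13.6 percent shares cited for Lenovo and Cray, and puts the expanded IBM base at roughly a fifth of the list's capacity.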
At this point, says Scott Tease, executive director of high performance computing at Lenovo, his employer has the most globally diverse base of supercomputers, with machines installed in 17 countries. Lenovo can compete in China against its indigenous rivals, and because of IBM’s long history in HPC in North America and Europe, Lenovo can compete in these markets, too. The other Chinese companies want to break into the Western economies, and the Western companies want to break into China. Both get mixed results, and none of them has as many machines on the list as Lenovo. And while this may not seem like a big deal, the fact of the matter is, Tease tells The Next Platform, that many enterprise, government, and academic HPC centers want to see how you are doing on the Top500 list as part of the bidding process on machines, even if they pooh-pooh the Linpack test results themselves. Building big systems demonstrates that you can build big systems.
But as we enter the exascale era, Lenovo is gearing up for big changes. All of this is going to get a lot more complex, and of necessity, as Moore’s Law improvements in chip economics and performance slow down.
“One of the things we talk a lot to customers about is that a straight, processor-based computing platform roadmap is not going to get us to exascale,” says Tease. “It is going to be a mixture of different accelerator technologies – whether they are Nvidia or AMD GPUs, Intel’s futuristic Configurable Spatial Accelerator, or Intel and Xilinx FPGAs – the exascale systems are going to be a mixture of CPUs and accelerators. The angle that we are taking is to leave ourselves pretty open for different partnerships because it seems that no matter where you are at on the globe, there is a different front runner on the technology that is going to drive exascale. In Europe, there is RISC-V and Arm; in China, there is a bunch of homegrown compute plus AMD Epyc; and in the United States, there are hybrid CPU-GPU machines and Intel’s CSA. We are trying to stay open given our global diversity, and we are telling people that whatever investments we make in an exascale product, our goal is to be able to take that product down and sell it to HPC and AI customers of all sizes. It will not be the kind of thing where we design it once and then never sell it again.”
One of the big changes is that the general purpose, Xeon compute substrate that has been the hallmark of HPC clusters for the past decade is seeing competition for the first time. No one is quite sure how this will all play out, but everyone is watching to see who goes first and how big they go.
“There is a massive amount of excitement about AMD coming back into the market because it gives choice, because there is a viable competitor,” Tease explains. “So that’s good. We have not seen interest in new processing technologies in a long time because there really was not anything that could compete with Intel. Then there are Arm processors and RISC-V and a slew of accelerators based on GPUs and FPGAs. Balancing this excitement among customers is some concern about how all of this is going to be consumed. Customers will be able to design machines for specific tasks and specific levels of price/performance, but it is a little bit scary because it widens the aperture of components they have to consider. It all comes down to how much better the price/performance needs to be to lure customers off the beaten path of the Xeon processor.”
That answer will, of course, depend on the applications, the true availability of processors and accelerators at any given time, and their cost.
The other thing that Lenovo is keeping a close eye on as we move into exascale is AI, and specifically machine learning that is done through accelerators.
There is an emerging idea that machine learning (meaning sophisticated statistical condensation of data into information as well as deep learning using neural networks) is going to run side-by-side, be interleaved with, or act as an overlay to the traditional simulation and modeling workloads that comprise scientific and financial HPC as we know it. This will happen because AI, in the broadest sense, will help figure out what to simulate and how to simulate it better than we could by ourselves, and hopefully make better use of all of the flops in a system. The question is this: Will HPC centers be resistant to this convergence with AI?
Lenovo, like other HPC system vendors, is treading carefully with the AI in HPC message.
“When we first started this whole drive to take on AI from within the HPC mission at Lenovo, my assumption was that as we took this message out, the leaders of the HPC centers would embrace AI as the next big thing in HPC that would transform it and give them a control point to revitalize the community,” Tease says. “It hasn’t really been that exciting, and a lot of organizations are still struggling to see how they apply AI in their centers. The good thing for me is that I think AI has the potential to change up how we do technology. Some people think you can just move AI to the cloud and that’s it. But you do HPC and AI where the data exists, and for HPC centers, this is where the data exists, and they are not going to want to take machine learning training and inference away from where the data lives. In the existing simulation and modeling we do today in HPC, we have just a massive amount of data, and we need to use AI to either enhance the results we get or to speed up the time it takes to get results by changing the sequence of steps in an HPC workflow – or both. AI is going to be far more disruptive than cloud ever was, especially for HPC.”
Tease has respect for the IBM Power9-Nvidia Tesla hybrids that are the building blocks of the Summit and Sierra supercomputers that are numbers one and two on the latest Top500 rankings, and the important thing to remember about those systems is that both were created to do both HPC and AI from the get-go. Tease similarly has great respect for the engineering that went into Nvidia’s DGX-2 system, which is thus far aimed mostly at machine learning workloads. That said, Tease thinks that for AI to go mainstream, the architectures are going to have to be more common and less costly.
“To make that next big jump, we will need processors and accelerators to do mixed precision in both floating point and integer, but we can’t be positioning these hybrid machines as just being for AI,” Tease continues. “These concept machines are fantastic if you are a big hyperscaler or cloud builder. But to really take AI mainstream, the products that we use have to be common things that people can use for other tasks – traditional HPC, visualization, virtual desktop infrastructure, accelerated databases, and so on.”
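The mixed precision math Tease refers to pairs low precision multiplies with higher precision accumulation, so that small contributions are not rounded away in a long sum. Here is a minimal software sketch of the idea, using Python’s `struct` module to emulate fp16 arithmetic – the numbers are illustrative, not drawn from any real accelerator:

```python
# A toy illustration of why accumulator precision matters in mixed
# precision math. This is a software sketch, not any vendor's hardware path.
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest representable IEEE 754 half value."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

def dot_mixed(a, b):
    """fp16 inputs and products, but a higher-precision running sum."""
    acc = 0.0  # Python float: 64-bit accumulation
    for x, y in zip(a, b):
        acc += to_fp16(to_fp16(x) * to_fp16(y))
    return acc

def dot_fp16(a, b):
    """fp16 everywhere, including the running sum."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_fp16(acc + to_fp16(to_fp16(x) * to_fp16(y)))
    return acc

# One big product followed by small ones: fp16 spacing near 4096 is 4,
# so the +1.0 updates are rounded away when the accumulator is fp16 too.
a = [64.0] + [1.0] * 10
print(dot_mixed(a, a))  # 4106.0 -- every small addend survives
print(dot_fp16(a, a))   # 4096.0 -- the small addends are lost
```

The same trade-off is why hardware that multiplies in fp16 or int8 but accumulates in fp32 can serve both machine learning and the other accelerated workloads Tease lists.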
When pressed about what that cluster architecture might be, Tease says it will probably be a machine with one or two CPU sockets and a lot of accelerators stuffed into it – and probably not using something as sophisticated as Nvidia’s NVSwitch unless and until it can help accelerate these other workloads. We concur with Tease that HPC centers will sacrifice some absolute performance to get a cluster that can run many different kinds of workloads because that suits their diverse needs. Hyperscalers don’t have to support such a diversity of workloads, so they can have highly tuned machines – just like the upper echelon of HPC centers do.
“As we prove out these technologies and methodologies and we show that there is a business case to run AI, our belief is that you are going to see systems that are far more common – and they look more like a traditional HPC cluster than what data scientists are buying to test their ideas out on. The promise is so big we want it to go everywhere, and the more unique the technology is, the less chance it will be ubiquitous. Multi-use and total cost of ownership are going to win out, just as they did with traditional HPC. When I talk to customers, I talk about scheduling and provisioning and creating a pool of resources. These are the most basic concepts of HPC, but many of these concepts are still foreign to enterprises. If we can package this up in a way that is easy to use and more cost effective than specialized machines, it is going to win out.”