Taking The Pulse Of The Core HPC Market

Timothy Prickett Morgan

3 years ago

Since a big chunk of the IBM HPC team moved over to Lenovo as part of the System x division being acquired by Lenovo back in late 2014, which coincided with when we started The Next Platform, we have made a habit of talking to Scott Tease, executive director of high performance computing at Lenovo, to take the pulse of the volume segment of the HPC space.

This year is no different, but then again, it is, thanks to the coronavirus pandemic, which both helps and hurts the HPC market. The pandemic, perhaps more than any other event in recent years, demonstrates the value of HPC to a large portion of the population as machines relatively large and relatively small have been employed or built from scratch thanks to new funding to come up with vaccines and treatments. But the overall market is somewhat depressed because of the economic effects of the virus, as we have detailed recently, and that hurts. Tease walked us through how Lenovo’s HPC business in particular has been affected by the pandemic and what Lenovo is doing to push its exascale to everyscale strategy.

Timothy Prickett Morgan: How is the HPC business in general right now for Lenovo? Is it up, sideways, down? What is happening out there?

Scott Tease: We just finished our first half, and remember that Lenovo is offset by a quarter and our fiscal year starts on April 1. We were up high single digits year over year. I entered the year thinking that we were going to be in pretty rough waters. But there was still a lot of spending. I could point to a handful of RFQs that have been postponed, but the number has been much smaller than I would expect.

What helped us in the first half was a lot of emergency orders for systems. In some cases it was for VDI clusters for our HPC customers, or traditional HPC clusters to do COVID-19 research. The University of Birmingham is a good example, where we sold them a Ceph cluster to help them track the virus locally and see how it might be mutating in England. We expedited that system, and we did a lot of those kinds of deals and it helped us in the first half.

For the second half of the year, which ends in March, the story is a little bit different: customers are pausing as we come to the end of the Purley generation of CPUs and transition to Whitley. The good news is we have done a lot of deals already on the new “Ice Lake” technology, like at the Max Planck Institute, Korea Meteorological Administration, New York University, Karlsruhe Institute of Technology and the University of Birmingham. These and other deals already signed are going to make for a great year for us next year. The future of technology and the industry is exciting, but there are still bumps in the road we need to navigate.

TPM: You are by no means alone in this, as the recent forecast from Hyperion Research shows. No one is having an easy time in the second half of this year and into early 2021 for exactly the same reasons you cite.

Different topic. What’s the story with pre-exascale and exascale systems in China itself? Does Lenovo have dibs on any of that stuff? We can’t remember the last time we saw a big system in China with the Lenovo brand on it, which seems odd given that Lenovo is a Sino-American conglomerate.

Scott Tease: We compete for them, and we don’t have any of them right now. It seems like most of the big systems in China are using proprietary technologies, and we don’t do a lot of that. We’re pretty centered on Intel and AMD CPUs and Nvidia GPUs, plus InfiniBand and Ethernet for networking. So we probably won’t take part in a lot of the exascale systems in China. For those really big systems, the Chinese government is looking to be a little bit more localized in the technologies that they’re putting into them. That said, I think we’re the biggest HPC provider in China when you look at numbers of systems installed and other metrics.

TPM: The big pre-exascale and now exascale systems get a lot of the noise, but I did the math on the Hyperion forecasts between 2021 and 2024 and if you extract the value of those exascale machines, it represents somewhere between 5 percent and 7 percent of the total sales for hardware, software, and services over those years for between 28 and 38 machines in this class that are expected to be built. The game you are really playing in is for those other 93 percent to 95 percent of the HPC market.

Scott Tease: That is exactly right.

That’s what we call the concept of “From Exascale to Everyscale,” which is that these high end technologies need a unique platform to really unlock their potential. We are talking power densities in HPC and AI systems that currently don’t exist in the standard data center. Lenovo is still going to try to manufacture everything that we use for the high end in a way that you can put it in a standard rack and power it with standard electrical. We realize that the best thing about those upper echelon systems is that they are going to drive the industry forward, but only as long as the technologies can cascade their way down to systems where the average spender can deploy them. That, to me, is the big challenge in all this.

So it is great that we are going to get five or six exaflops systems in the next couple of years, and we will all celebrate that in the HPC community and I personally don’t care who does it. I don’t care who is the first one to cross the barrier. The important thing is not those five or six sites, but for that technology to make its way down from on high so everybody can take advantage of it.

TPM: Well, we are in concert there and this was the founding mission statement of The Next Platform when we founded it six years ago. We believe in that trickle down from HPC, hyperscale, and cloud, and its cross-pollination, too.

That said, and I realize those capability class machines are only a fraction of the market, I still get excited about it and want intense competition. And, to be even more precise, I get irritated on behalf of IBM, Nvidia, and Mellanox that at least one of the three exascale systems funded by the US Department of Energy was not a combination of Power10, which has some really interesting memory technologies embedded in the processor, as well as future Nvidia GPU accelerators and networks. I understand there is a political nature to this all, that we need to keep Intel and Nvidia in check and help foster competition. From what I can see with the future “Frontier” system at Oak Ridge National Laboratory and “El Capitan” at Lawrence Livermore National Laboratory, AMD is offering very, very good bang for the buck and the HPC industry and Lenovo in particular are definitely going to benefit from that.

So, with that, what is your thinking about the combination of AMD CPUs and AMD GPUs going forward with the ROCm environment and Intel with its Xeon SPs and Xe HPC GPU accelerators and oneAPI environment? Are these going to be a credible alternative for realsies for the core HPC market? I think these vendors will, within the next few years, be able to go toe-to-toe and head-to-head on technology. And when that happens, what we get is a price war – which is terrible for the vendors to a certain degree but truly great for customers at least initially as the cost of HPC systems comes way down.

Scott Tease: We live in interesting times, as they say. Now, I don’t want there to be a price war either, because it is never easy when you have so many choices that you can’t keep up. We like checks and balances and choice. So we are really welcoming AMD, we are welcoming Intel’s future “Ponte Vecchio” GPU, too. The market is definitely shifting towards more and more acceleration. We need more choice and we need a more competitive environment. This will push Nvidia harder; they are a really good engineering company and they are going to respond well to competitive pressure.

TPM: Can AMD and Intel catch up? Can they create competitive devices at a competitive price with competitive software and sustain that?

Scott Tease: I think they both have potential to do it. They both have teams that have built ecosystems, so I think it’s possible they could do it. The Instinct MI100 accelerator from AMD is a good product. I think that the software ecosystem still needs a few years to kind of really build out. But when you see them winning deals at some of the tier one sites, that is the first step in building an ecosystem. Nvidia didn’t build it overnight, either, and they didn’t build this ecosystem on their own. They had help. And organizations like Barcelona Supercomputing Center, Cineca, and LRZ – those types of high-end users – they have the capability to create software that’s going to be needed by those ecosystems.

TPM: Do you think the HPC and AI markets expand fast enough that even if it’s a 70-20-10 split or a 60-20-20 split or something like that between Nvidia, AMD, and Intel for HPC compute that Nvidia and Intel still grow? I happen to think that’s the case, and AMD can grow, too. The market expands fast and dilutes market share gains for the upstarts.

Scott Tease: I think it is too. We are going from this time where GPUs were an afterthought on an HPC system to where they represent the vast majority of the compute. And even if HPC organizations are still buying a standard CPU-only system today, they are looking at their next generation and whether it will accommodate accelerators. So I think the market for acceleration will continue to grow and even if there’s competition, Nvidia is still going to benefit. Now, they’re probably getting more price competition, but they’re still going to see growing revenue. We are at the very beginning of GPU adoption here.

TPM: I also think that any market that’s going to command 50 percent or higher gross margins is going to engender competition, and those profits will have to come down a bit somewhere.

Scott Tease: Oh, absolutely.

TPM: The CPU and GPU makers are going to get their rightful share based on how hard they have been working and how good their foundries are. And you know what surprises me is this: AMD’s “Rome” Epyc 7002 CPUs didn’t take off a lot more than they did. But I have a funny feeling that the third generation “Milan” Epyc CPUs are doing pretty good out there in presales across hyperscalers and cloud builders and maybe even an HPC center or two. And yes, the AMD Instinct MI100 is not as good as the Nvidia A100 in some ways, but it is within spitting distance for a lot of people. If you don’t have applications that can take advantage of sparse matrix or Tensor Cores, AMD can beat Nvidia on price/performance.

Scott Tease: The big thing we are keeping an eye on is all of the complexities of all of the systems that are coming out. Nvidia has its NVLink interconnect and its NVSwitch. AMD has got its Infinity Fabric and PCI-Express. Intel is talking about CXL on PCI-Express. How is the GPU talking to the server CPU? There are a lot of differences there, and PCI-Express Gen 4.0 is also going to Gen 5.0 really quickly. It’s going to make things really complicated, and system makers are going to have to make bets because the volume in any given platform is going to be smaller than it would be if there was only one choice. The value will be there, the revenue will be there, but the volume of the platform is going to be small enough that no OEM or ODM is going to want to build duplicate platforms for all of these different variations on the compute theme.

TPM: Well, I don’t envy you having a bunch of low volume, high touch engineering tasks. . . .

Scott Tease: I actually think this will benefit Lenovo, because of our leadership in system design and world-class supply chain. The good news is that the revenue for these things is awesome.

TPM: But you and I know a lot of that revenue is passing through to Intel and Nvidia and soon AMD. And I don’t think that’s right. It’s not healthy for the overall ecosystem. I think the chip makers should leave a little on the table for the people that are actually bending the metal and selling this stuff. I don’t think this makes me a communist, but rather an enlightened capitalist.

Scott Tease: I think it’s all about stewardship of the marketplace. Honestly, you need people, you need partners. You can’t be the only innovator. I think Nvidia has been in this position where they are one of the key innovators in what they do, and they created marketplaces and it took them to leadership in the top 5 percent peak of the pyramid. What they need to work on now is expanding GPU acceleration more broadly, so it is as ubiquitous as a Xeon or an Epyc processor. It’s everywhere. It’s in the channel, and lots of people know how to sell it and they know how to port to it. And that’s where they do need help. And again, I see them moving in that direction. And it’s kind of nice to see the partnership kind of growing and they understand the need to go beyond what they’ve been great at, which is very high touch with the end user client, to the masses. They need partnership with companies like Lenovo and others.

TPM: Last thing. I was surprised that Cineca didn’t stick with Lenovo and picked Atos as the vendor for its “Leonardo” system. I get that there is always a geopolitical climate in HPC.

Scott Tease: In the HPC business, political influence is always part of the game. It’s the nature of the beast. I am sure that the European Union was looking for a diversification of technologies and probably also diversification of vendors as well.

We are fortunate in that we are one of the few companies that is able to be successful in HPC globally. And that is because we act like a local company in each market. So in Europe, we invest as if we were a European company. We do all of our manufacturing in Europe. We have our HPC performance lab in Europe. We have software development in Romania. We manufacture in Hungary. Our decision makers are all in their geographies. It is European leaders running the European team with European investments.

And you have got to do that these days. You can’t run a business from only one country. Most of our supercomputing assets are in North Carolina, and I am pretty vocal that all of our supercomputers are born in Morrisville, outside of Raleigh. They may not have been built there, but they were all born there. We have a supercomputing lab in NC, just outside of Raleigh in Morrisville. And every system recipe we do, every component that we test, goes through that lab. We may build them in Mexico, we may build them in the United States or China or Hungary, but they all had their start right here in NC.

Lenovo has invested heavily in Europe, so we hope to have more good news to come on our continued success there in the future.