Shifting Tides for China’s Next Wave of Supercomputers

While the news of the last few years has clearly indicated that China is at the top of the world when it comes to chart-topping supercomputers, the storyline is evolving—and the picture for high performance computing in Asia is not as rosy as many believe.

To be fair, however, the landscape for high performance computing in China was much more picturesque at this time last year and certainly the one before. The sky appeared to be the limit, with announcements at several of the country’s supercomputer sites, from those dedicated to weather to those crunching larger research problems, about massive upgrades and extensions. But then, as is the case in quite a few other areas inside China over the last several months, the tone changed.

Putting aside the larger discussion of the current economic woes in China, there have been two critical blows that have left the country reeling when it comes to the future of its large-scale supercomputing infrastructure. First, and most publicly discussed, is the fact that Intel is restricted from shipping supercomputing processors to China, a major blow for the world’s fastest supercomputer, Tianhe-2, which was set to be upgraded this year. The upgrade cycle for that machine has been pushed out at least a year and, as The Next Platform reported this summer, it will now use an interesting blend of homegrown architectures, including an accelerator that is based on digital signal processors (DSPs), which have long been a research target at the National University of Defense Technology (NUDT) where the Tianhe-2 machine is based.

But it’s not just the very top machine in China that will feel the burn. The second blow is funding: cutbacks throughout China will ripple through all 19 supercomputer sites, having a big impact on the top ten systems, with the sharpest cuts falling on the top three or four systems, mostly because their upgrades will be the most expensive. Considering that two machines in China were set to get a lift past the 100 petaflop barrier and both were reliant on Intel and government funding, those plans are pushed out a year or more. And another large supercomputer in China, the Sunway BlueLight system, which uses non-Intel (ShenWei) processors, will have a tough time getting its petaflop boost as well due to the funding problems.

As one might logically guess, this coupling of funding issues and Intel restrictions might spur an already burgeoning domestic semiconductor and HPC software program that could be best represented by the forthcoming DSP architecture that will be featured on the upgraded Tianhe-2 machine. Indeed, this appears to be the plan, with ramped-up investments up and down the stack, from broad machine learning and more specific HPC-focused software initiatives for both research and industry to ongoing hardware investments like the Godson processors that are cropping up in a new field of Chinese-built machines. But as Earl Joseph, program vice president for HPC at IDC, tells The Next Platform, the current norm on the hardware front is hard to beat.

“At this point in time, all of those systems that are built with domestic [Chinese] processors as the primary processor have not been doing well from a performance standpoint—and they are also costly, so the price performance is still not quite competitive.”

Even still, there are options in China that aren’t solely reliant on homegrown architectures or Intel parts that may never pass regulatory muster. For instance, consider that IBM has already sold IP rights to China to bring OpenPower-based systems into the Asian supercomputing fold. And while these chips might be based on IBM tech, as more roll into production on Chinese systems, these will be acceptable in comparison to the whole swath of critical infrastructure in the country that was moved off IBM and other U.S. companies (banks in particular) before the IBM and Lenovo deal went down.

On that note, there are other efforts in China to take a bite out of American tech vendors as the pullback from non-Chinese hardware and software for big infrastructure continues. For instance, the Chinese government provides funding to companies to displace services that American outfits like Google and hardware vendors like Cisco, IBM, and EMC offer. This push, called the “De-IOE” movement (the letters stand for IBM, Oracle and EMC), is not to be overlooked and has received a fair bit of attention after Alibaba, Asia’s answer to Amazon, kicked out its IBM and Oracle systems in favor of a homegrown set of approaches in 2013.

As one might imagine, there is something to be said for the true reliability of time-tested systems from these vendors, but the risk now could mean a reward in the form of a more vibrant Chinese server and enterprise software market in the coming years. And Europe is thinking this way as well as it watches what is happening with ARM throughout the world (although for now, of course, ARM is just a license-driven company versus a hardware manufacturer). And not to make too much of the European side point, consider that when Lenovo set up its worldwide applications development center, it wasn’t in China, it was in Stuttgart, Germany, not to mention the fact that Lenovo is part of the European Technology Platform for HPC supported by the European Commission (a government-sponsored vendor organization to create new HPC technologies).

A Quiet Risk: The Reliability Factor

But all of these efforts aside, the point is that, to find their competitive edges, countries are ditching the years of development that went into IBM mainframes, Oracle’s systems and software, and countless other tech products to blaze these new trails. And that is not without its own risk to quality. It may not be a permanent struggle, but as new platforms based on old ideas spin off, they will not be fail-proof (not to suggest any system is free from failures, of course).

And remember the DSP architecture that will be the unexpected accelerator upgrade for what is far and away the fastest supercomputer on the planet? It is an interesting compute possibility–but how will it even perform at scale?

As countries like China and others seek their homegrown approaches to time-worn technologies from the world’s largest hardware and software providers, learning by doing (read as error) is the new development framework. After all, consider the pressures.

If one considers that the vast majority of critical infrastructure in China was running on the one bit of hardware and software that is most time-tested, the mighty mainframe, progress toward homegrown competitive efforts might be lumbering for a time. While it is possible to build equivalent technologies, the years of refinement are not to be discounted, according to Joseph. In the course of conversation, he referenced a presentation at the International Supercomputing Conference from Inspur, the Chinese HPC hardware company that is behind the Tianhe-2 machine, where a national HPC project for the country’s transportation and rail reservation system was able to handle peak demand for one of the world’s largest traveling populations within China, at 95% accuracy. That might sound great for something where there is a margin of error allowed. But consider an airline or travel booking site that would, 5% of the time, improperly handle the reservation. This would be considered unacceptable, of course, but it comes down to the reliability of the systems.
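To put the 95% figure in perspective, here is a minimal back-of-the-envelope sketch. The request volume is a hypothetical number chosen purely for illustration (it is not a figure from the Inspur presentation), the function name is made up for this example, and the 99.9% comparison point is likewise just an assumed alternative.

```python
def expected_mishandled(requests: int, accuracy: float) -> float:
    """Expected number of mishandled requests at a given accuracy rate."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return requests * (1.0 - accuracy)


# Hypothetical request volume, for illustration only (not from the source).
HYPOTHETICAL_REQUESTS = 1_000_000

# Compare the cited 95% rate with an assumed, stricter 99.9% rate.
for accuracy in (0.95, 0.999):
    failures = expected_mishandled(HYPOTHETICAL_REQUESTS, accuracy)
    print(f"{accuracy:.1%} accuracy -> about {failures:,.0f} mishandled requests")
```

Under these assumed numbers, a 5% error rate works out to tens of thousands of botched bookings per million requests, which is the scale of problem the reliability argument is pointing at.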

Still, Chinese companies like Sugon and Inspur are hard at work to deliver the next generation of products that will lock out the need for American tech, even at the risk of having slightly less performance and price attractiveness. However, on the flip side, there are certainly pressures beyond the more recent restrictions on Intel processors. Just as the United States has made it illegal for its own government branches to buy supercomputers from a non-U.S. company, so too does a similar approach stand in other countries, and where it is not formalized in legal language, there is a definite “favoritism” for in-country or in-region vendors. For instance, French supercomputer maker Bull does a large swath of its business in France and Europe (although it has secured systems in Japan as well), and Inspur’s business has not translated outside of China.

So what happens with a company like Lenovo then, which is now the representative force for IBM’s pre-existing X86 HPC customers? Lenovo has made it clear that its focus is right where it has the most chance of success: in China, followed second by Europe, and then the U.S. market. Remember, this is for high performance computing systems versus general consumer devices, but the company will face an uphill battle for even non-government supercomputers in oil and gas, financial services, and other areas as IBM works to bolster its OpenPower story in the U.S. We have already watched this U.S. and China tension play out this year in the context of the $44 million NOAA supercomputer deal, which was originally awarded to IBM but, following the acquisition, moved into home court advantage with Cray taking over the contract.

IDC’s Joseph says that while he has a great deal of belief in what the Chinese supercomputer market will produce in terms of its own technologies in the coming years, it’s an uphill battle, although having Lenovo as a supplier of X86 parts will lighten the burden. Still, he says, the supercomputing world outside of the U.S. is keenly scouting for alternatives to American-based technology giants and this has European and Asian governments in “all ears” mode when it comes to emerging and established options alike.


6 Comments

  1. In China, business may not be so good for IBM’s own branded power8 hardware business, but what about those Chinese companies that have a Power8 license via that ARM style OpenPower licensing! It’s still going to net IBM some IP revenues, and maybe even some software/services revenues from the non governmental (non defense based) Chinese users of any Chinese company that licenses power/power8 CPU designs and sells them in the Chinese marketplace. The restrictions may not work as intended anyway but to only have lasting adverse effect on the US makers of the restricted technology.

    Certainly there will still be users like Google and Alibaba that will maybe contract to have their own power8 based systems built around the IBM power IP. If the third party power8 based systems are adopted by Google, and others through Google’s/others licensing and custom building of their own Power8 server systems then Power8/Power is doing well regardless of one single company’s fortunes. Like ARM holdings has always done, and now IBM, once IBM went towards the licensed IP business model with the power IP/ISA, IBM has ceded total control over what may happen with the Power8 licensees and their uses of the licensed IP. IBM is in a good position anyways because the more power8 licensees there are using the Power IP/ISA the more chances IBM has with their real cash cow, IBM’s software and services that are made to run within that power/power8 ecosystem regardless of who made the power8 based server hardware. IBM is investing in Linux based OS development also and getting its services integrated into the Linux based software ecosystem. So much of the cloud services/server world runs under the Linux based software stack that the underlying hardware dependencies are pretty much abstracted away for any ISA that supports Linux based OSs.

    Lenovo may have IBM’s x86 based server business but really having a business based on a singular CPU ISA is not going to be that advantageous going forward as there will be plenty of SMB, and even enterprise users that will go with whatever CPU ISA can be had at the price point that is best, and can run at least the Linux based software stack. The ARM based server market is still very much in its formative stages, and even in the x86 based market there will be some more competition once AMD gets its x86 Zen based HPC/server SKUs in the marketplace, and these AMD server SKUs will come with their own on interposer based GPU accelerators wired up to those Zen cores via a much faster much wider fully coherent interconnect fabric etched into the interposer’s silicon substrate. Even though AMD has had to push its K12 custom project dates back to 2017 to focus on getting its Zen cores ready, I still see AMD doing with its custom ARM server parts what it is doing with its Zen based server APUs on an interposer. AMD being all in with server APUs will probably be doing up a nice version of its custom K12 ARM cores and Greenland GPU accelerators on an interposer with HBM and probably the first custom ARMv8a ISA based CPU cores that have SMT abilities.

    The Chinese are not the only ones utilizing DSPs. Even Qualcomm on its newer SOC products is making use of its Hexagon Digital Signal Processor on its mobile SKUs by opening up that DSP for applications to use for HSA style workloads via the API on the devices that use the SOC. So the Chinese are not the only ones enamored with the HSA style compute mantra of making use of all the processing capabilities on a platform for any types of compute. I wonder if there will be any types of restrictions on SOC’s that can and do make use of all their various heterogeneous processing resources. What about future systems on an Interposer that could be made of non restricted CPU/SOC dies, only to have those SOC dies able to be grouped on an interposer and wired together as if they were made from a singular monolithic die, with some very wide very efficient coherent connection fabric etched into the interposer’s silicon substrate? That type of modular construction of modular CPU, GPU, DSP and other dies on an interposer is going to become commonplace with extremely powerful processing ability able to be assembled on an interposer and 10s of thousands of traces on the interposer utilized to wire up all these modular units into a pretty powerful system that could very easily out class a restricted Xeon, power9 or other type of monolithic CPU/SOC. The systems on an interposer are going to be pretty hard to regulate if they can be assembled from smaller units that may not be individually restrict-able.

    • I wouldn’t paint such a rosy picture here for IBM, as selling IP as a blue-chip company and selling real chips makes a huge difference in terms of profit and profit margins. Or who do you guess is the bigger company and generates more revenue, Qualcomm or ARM? It isn’t ARM, I can tell you that.

  2. OMG with the IBM only, when it’s not going to be IBM’s overpriced power8 hardware kit that is sold in China, it will be one of those Chinese Power8 licensees at a much better price point. Licensing IP, and selling IP are two completely different things, and those Chinese licensees will be able to undercut Intel’s pricing.

    And while we are on the subject of CPUs in servers, I think that things are not going to be so CPU centric going forward in the server room especially where AMD’s server APUs are concerned and CPUs are piss poor at number crunching compared to the GPU, just count up the total Numbers of FP/int units in even those 18 core Xeons, or 12 core power8’s and compare them to the number of GPU vector, FP/INT resources that will be on those server APUs with their Greenland/Arctic islands GPUs.

    Those AMD ACE units will have multiple times the ratios of FP/INT unit counts as any puny CPU. And we are not talking about GPU cores communicating over PCI, we are talking about a GPU accelerator wired up to the CPU cores with a more direct on die type of connection fabric etched into the interposer’s silicon on those AMD server APUs. We are talking about teraflops of computing power on a server/HPC APU and a much more closely integrated to the Zen CPU cores system for those AMD APUs on an Interposer. Then we can add to that the actual asynchronous compute ability built into those ACE units on AMD’s GCN micro-architecture where executing threads can be preempted, or if there are any serial dependencies or stalls in a running GPU thread, the thread can be very quickly context switched out and another switched in with very little effort for some SMT type of execution in those ACE units.

    The days of the CPU only as the only method of compute are drawing to a close, and as much as Intel would have you believe that the CPU only era is still viable, the entire mobile market, server market, as well as the PC/Laptop market is going to accept the system design tenets of heterogeneous compute, and one need only look at the membership of the HSA foundation to see the level of support there is for heterogeneous compute from phones to supercomputers. AMD is not the only one working on HSA, as the HSA foundation’s members are also starting to introduce SOC/mobile and other server products that make use of ARM and GPGPU compute and the HSA 1.0 standards as defined by the HSA foundation. Did you not read about the Chinese use of heterogeneous compute and their DSPs, or even the post that you replied to mentioning Qualcomm’s use of DSP for heterogeneous compute?

    Too much attention is being paid to CPU cores and the IPC metric, and too little attention is being paid to the GPU on APUs and SOCs and an entire industry’s beginning to make use of the GPU for its raw compute ability. GPUs, DSPs, FPGAs are going to be doing more of the raw grunt work leaving to the CPU the role of managing the OS and services rather than actually crunching the numbers. You can expect that on AMD’s arctic islands GPU/accelerators that a completely new asynchronous compute micro-architecture will have even more of the CPU types of abilities to run compute workloads completely independent of the CPU. Really the whole computing industry is embracing heterogeneous compute and all that can be discussed is the CPU, and the IPC metric. Those Zen cores are not the defining feature of AMD’s server APUs; it’s the GPU and those ACE units that define the true abilities of those APUs on an interposer. It’s time to stop and count the FP/other units on the GPU and then those CPU cores do not really matter going forward for compute.

    P.S. Arm Holdings does not sell chips; they are a design/IP bureau that utilizes the licensed IP business model, in fact it was Arm Holdings and MIPS Computer Systems Inc. that pioneered the licensed business model in the computing world for CPU/SOC designs. One does not compare a design/IP company’s market cap, the proper comparison is to compare the entire ARM based industries’ market cap and R&D resources, so even Intel can not match that comparison. Arm Holdings’ market cap is inconsequential, as again it’s the licensees that are using the IP, and some are developing their own custom micro-architectures that run the ARMv8a ISA. Apple’s custom Cyclone micro-architecture is still twice as wide as Arm holdings’ A72 on the front end instruction decoder section, and wider with the execution pipelines on the receiving end also, Apple only licenses the Armv8a ISA. AMD’s custom K12 will possibly be having SMT capabilities added to its custom micro-architecture ARMv8a ISA running designs.

      • @NonCPUsCentric. You are completely lost in your comment. 6 large paragraphs and basically nothing substantial has been said. CPU are the most violable point in the HPC/Supercomputer market, and they will be for the rest of your natural life. Of course, I agree that changes are coming, but FPGA will play major role in that shift/trend not the ARM or GPU’s. INTC already has foreseen that, plus MU has developed Automata which appears to be exceptionally valuable Co-processor proposition.
        AMD for the moment has only 1 thing in their mind, how to survive another day, not to mention spend money on R&D or develop their technologies further.

        • No one expects ARM based server SKUs to take the HPC/server/supercomputer world over in a night or 5 years. And think long and hard before you criticize any ARM ISA RISC based design, because Power/Power8 is also a RISC design, and IBM and Nvidia have a few supercomputing contracts to prove your statement may not hold for ARM RISC designs also. AMD’s Jim Keller could take the design Ideas around Power8/x86 and bring them to a custom ARMv8a server design and even add SMT to the K12 custom ARM designs that are currently in development. There is one RISC design from IBM that is able to get such an extra wide order superscalar performance in its power8 cores (at 8 processor threads per core SMT) with each Power8 core having 14+ execution pipelines per core to feed from those 8 instruction decoders per core and keep things running along. There is no reason that the ARMv8a ISA can not have a custom micro-architecture created that is an extra wide order superscalar oriented like the power8’s! In fact Apple’s Custom wide order superscalar core in its ARMv8a ISA running Cyclone micro-architecture has more in common with Intel’s i7s resource wise than any of Arm Holdings’ reference designs! The Apple A7:

          CPU Codename Cyclone,
          ARM ISA ARMv8-A(32/64),
          Issue Width 6 micro-ops,
          Reorder Buffer Size 192 micro-ops,
          Branch Mispredict Penalty 16 cycles (14 – 19),
          Integer ALUs 4,
          Load/Store Units 2,
          Load Latency 4 Cycles,
          Branch Units 2,
          Indirect Branch Units 1,
          FP/NEON ALUs 3,
          L1 Cache 64KB I$ + 64KB D$,
          L2 Cache 1MB,
          L3 Cache 4MB,

          I’d like to see this information for the Apple A8/A8x and A9/A9x, but Anand Lal Shimpi is no longer with Anandtech so this level of information is no longer available outside of a pay wall post Apple A7’s introduction. Those P. A.(now Apple) semiconductor folks have done a good job with that ARMv8a ISA and the custom Cyclone.

          One should not discount any of the custom ARM designs where the licensee has only licensed that ARMv8a ISA, and Only the ISA, because Apple’s P. A. semiconductor acquisition has paid off handsomely for Apple, and there are indications that AMD’s custom ARM cores may just have SMT capabilities if you have taken the time to view any of the YouTube video interviews with AMD’s Jim Keller.

          That said and back to some non CPU centric part of the discussion do not discount the membership of the HSA foundation and their CPU/SOC maker members including Imagination technologies(IT) commitment to HSA compute on its GPU cores, Ditto for ARM with its Mali GPU cores, and do read up more about AMD’s GCN GPU asynchronous compute micro-architecture. IT is already offering SMT options for its MIPS design CPU cores and virtualization abilities on its PowerVR GPU cores.

          Intel will have NO GPU micro-architecture comparable with AMD’s level of GPU asynchronous compute ACE unit abilities. Intel will not have any systems on an interposer with the level of CPU and GPU integration that AMD will be bringing online with their new dedicated server APUs. Hell Intel’s SOCs do not yet fully implement Unified memory Addressing as fully as AMD’s Carrizo APUs. Intel is only doing some unification on its extra RAM DIE, in a L4 cache manner, but still not to the HSA 1.0 level of memory addressing/other integration between CPU and GPU.

          We are talking about systems of an Interposer with much more wide parallel traces and effective bandwidth, to HBM, and CPU to GPU. The Interposer will allow for such a level of wide fully cache coherent connection fabrics off die, and from die to die, that could only be achieved previously by having everything on a single monolithic die.

          The Silicon interposer will allow for complete systems to be fabricated from separate dies, but allow for those dies to be wired up on the interposer silicon substrate as if the dies are all on a single monolithic die, with the added benefit of having the individual dies fabricated on the process node/materials design that best suits the individual die’s functionality. There will be an economic incentive and performance incentive for making systems on an interposer, like AMD will be doing with their HPC/Server APU SKUs, as no longer will there be a need for even larger die sizes in order to scale up a system’s hardware without using larger and larger individual monolithic chip dies. The CPU, and GPU/FPGA/DSP/Other dies can be designed to be modular and scale up on the interposer by simply adding more dies to a larger interposer. Silicon interposers will have supplanted PCB boards and other types of modules for the hosting of the processing systems chips in a modular fashion, with some minor exceptions for off interposer non HBM RAM. Larger interposers will be made, and even interposer bridge connectors will be made to connect separate interposers to each other should there be a need.

          P.S. Let’s try and discuss new technology and the new technologies’ implications for the marketplace, and not a business’s financial state because AMD has been going out of business for how many decades now! We all see that AMD is still here, and still innovating.
