CPU Based Exascale Supercomputing Without Accelerators

February 23, 2016 Rob Farber Compute, HPC 9

Intel has been pursuing a long-term, multi-faceted set of investments to create the processors and technologies needed to build CPU-based supercomputers that can deliver exascale levels of performance in a cost and energy efficient fashion. Slated to become operational in 2016, the Trinity and Cori supercomputers will be powered by the next-generation Intel Xeon Phi processor, code name Knights Landing (KNL), booting in self-hosted mode as a standard, single processor node.

These pre-exascale supercomputers will deliver double-digit petascale performance (for example, 10¹⁶ floating-point operations per second, or flop/s) without the use of attached accelerators or coprocessors. They will certainly dispel the myth that the only path to exascale supercomputing is through a heterogeneous node design with bus-attached, independent computational devices.

Aside from delivering leadership-class computational performance, the Trinity and Cori supercomputers will provide the HPC community with concrete data and valuable insights into the productivity and other benefits of a massively-parallel SMP (Symmetric Multi-Processing) supercomputer environment based on CPUs as compared to the current generation of heterogeneous systems such as Tianhe-2 (which utilizes Intel Xeon Phi codename Knights Corner coprocessors) and ORNL’s Titan supercomputer (accelerated by NVIDIA K20x GPUs). The benefits of a CPU-only software environment at this scale will be of particular interest as it eliminates the complexity and performance bottlenecks of the offload data transfers required to run heterogeneous applications on accelerators.

Additionally, the Trinity and Cori systems will validate the energy, cost, and performance of the self-hosted KNL computational nodes and set the stage for the introduction of a variety of other Intel technology investments into the exascale arena, including innovations in memory and storage (e.g. MCDRAM), networking (e.g. Intel Omni-Path Architecture (Intel OPA) and on-chip Intel OPA), and software elements that are part of the Intel Scalable System Framework. In particular, a new non-volatile memory technology co-developed by Intel and Micron called 3D XPoint technology is poised to redefine what is meant by memory and storage in high-performance computer architectures and may profoundly affect the cost, capacity, and performance of future supercomputer designs.

Lessons learned from current petascale leadership-class supercomputers

The scientific, technological, and human benefits provided by the double-digit petascale performance of top-ranked, leadership-class supercomputers like Tianhe-2, Titan, Sequoia, and the K computer at RIKEN (currently the four fastest supercomputers in the world) and other TOP500 supercomputers have initiated a global race to build the first exascale supercomputer by the end of this decade. A petascale supercomputer can deliver 10¹⁵ flop/s (floating-point arithmetic operations per second) while an exascale system will provide 10¹⁸ flop/s.

Lessons learned from current leadership-class supercomputers show that the costs and technological requirements of building a true exascale (10¹⁸ flop/s) machine are substantial and sobering, but achievable thanks to investment in new technologies. Getting there means increasing performance roughly 30x beyond that of the Intel Xeon Phi coprocessor powered Tianhe-2 supercomputer, the fastest supercomputer in the world as of November 2015, which delivers 33.86 petaflop/s (3.386 x 10¹⁶ flop/s).
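The size of that gap can be checked with a few lines of arithmetic. The sketch below uses only the two performance figures quoted above; the variable names are illustrative.

```python
# Performance figures quoted in the article (flop/s).
TIANHE2_FLOPS = 3.386e16   # 33.86 petaflop/s, November 2015
EXASCALE_FLOPS = 1e18      # one exaflop/s

# Factor by which performance must grow to reach exascale.
gap = EXASCALE_FLOPS / TIANHE2_FLOPS
print(f"Required speedup over Tianhe-2: {gap:.1f}x")
```

The result rounds to the "30x" figure used in the text.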

Cost alone will require that many stakeholders participate in the funding and creation of an exascale system – of which early machines are anticipated to cost between $500 million and $1 billion. And that means that any machine built will certainly be required to deliver leadership class performance on a variety of stakeholder workloads. This only makes sense, as packing a supercomputer with devices that only provide floating-point performance to achieve a 10¹⁸ flop/s benchmark goal will be an expensive and meaningless effort if the rest of the hardware does not provide sufficient memory, network, and storage subsystem performance to run real applications faster than existing supercomputers. In short, the stakeholders are going to want to get their money’s worth.

The National Strategic Computing Initiative (NSCI), the Department of Energy’s CORAL initiative, and the Office of Science and the National Nuclear Security Administration’s longer-term investments in exascale computing under the DesignForward high-performance computing R&D program are providing a viable path for the many stakeholders to reach that exascale goal.

What matters then is: (1) how much the hardware cost of those flop/s can be reduced, (2) what can be done with those flop/s, (3) how much flexibility is given to the HPC developer to exploit that performance, and (4) what capabilities are provided to efficiently run important applications that are not flop/s dependent (such as massive graph algorithms).

Balance ratios, discussed in this article, are the metrics used by the HPC community to cut through the hype and make sense of a machine’s overall performance envelope to determine if it can run a desired application-mix efficiently, plus get a general sense of the cost and power requirements needed to procure and run the supercomputer. Examples include cost (dollars per flop/s), power (flop/s per watt) and important subsystem ratios such as memory capacity (bytes per flop/s), memory bandwidth (bytes/s per flop/s), memory transactions per second (memory op/s per flop/s) along with similar ratios for network and storage capabilities.
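As a rough illustration of how such balance ratios are computed, the following minimal sketch derives a few of them from a machine description. The `Machine` fields, the `balance_ratios` helper, and the example numbers are hypothetical placeholders chosen for easy arithmetic; they do not describe any real system.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    """Hypothetical machine description (all fields are illustrative)."""
    flops: float                  # peak or sustained flop/s
    cost_dollars: float           # procurement cost
    power_watts: float            # power draw
    memory_bytes: float           # total memory capacity
    memory_bw_bytes_per_s: float  # aggregate memory bandwidth


def balance_ratios(m: Machine) -> dict:
    """Return a few of the balance ratios named in the text."""
    gflops = m.flops / 1e9
    return {
        "dollars_per_gflops": m.cost_dollars / gflops,
        "gflops_per_watt": gflops / m.power_watts,
        "bytes_per_flops": m.memory_bytes / m.flops,
        "bandwidth_bytes_per_flop": m.memory_bw_bytes_per_s / m.flops,
    }


# Made-up 10 petaflop/s machine: $100M, 10 MW, 1 PB of memory, 1 PB/s bandwidth.
example = Machine(flops=1e16, cost_dollars=1e8, power_watts=1e7,
                  memory_bytes=1e15, memory_bw_bytes_per_s=1e15)
ratios = balance_ratios(example)
for name, value in ratios.items():
    print(f"{name}: {value:g}")
```

Similar ratios for network and storage follow the same pattern: divide the subsystem capability by the flop/s rate.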

Breaking the “exascale requires accelerators” myth with CPUs

Heterogeneity has recently become a buzzword as it has successfully been utilized to increase the flop/s rate of TOP500-winning supercomputers. For example, Tianhe-2 utilizes three Intel Xeon Phi coprocessors per node while the ORNL Titan supercomputer utilizes one accelerator per node. Plugging a coprocessor or accelerator into the PCIe bus of a computational node has both cost and power advantages for the current generation of systems based on commodity motherboards. However, the thermal, power, and size limitations dictated by the PCIe bus standard impose artificial boundaries on what can be physically installed on a PCIe card, including the number of processing elements that deliver flop/s and the amount of memory.

The Intel Xeon Phi family of processors was designed to support both heterogeneous computing and native SMP (Symmetric Multi-Processing). This design duality has given Intel the ability to claim the TOP500 performance crown in the current leadership-class heterogeneous supercomputing environment with Tianhe-2 while also giving customers the ability to take the lead in future systems that discard the hardware and bandwidth limitations of PCIe-based devices and heterogeneous computing environments.

The NNSA Trinity and NERSC Cori supercomputers will take the next step and provide concrete pre-exascale (10¹⁶ flop/s) demonstrations that machines based entirely on self-hosted SMP computational nodes can support a wide variety of stakeholder workloads. Cori will support a broad user base as an open science DOE system serving thousands of applications and hundreds of users. The Trinity system will be used for more targeted weapons stockpile-related workloads. Together, the CPU-based Trinity and Cori supercomputers will break the mindset that heterogeneous computing is required for exascale systems.

In a very real sense, Amdahl’s law for stakeholder applications will dictate the cost and power consumption of future exascale systems since serial sections of code are expensive to process from a thermal, space, and cost of manufacturing standpoint. Parallel sections of code, in particular the SIMD (Single Instruction Multiple Data) regions, can be processed via much more efficient vector units that can deliver high flop/s per dollar and high flop/s per watt performance.
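To make the Amdahl's law argument concrete, here is a minimal sketch of the standard formula, speedup = 1 / ((1 - p) + p / n), where p is the parallel fraction of the runtime and n is the number of parallel processing elements. The function name and the sample values are illustrative assumptions, not figures from Trinity or Cori.

```python
def amdahl_speedup(parallel_fraction: float, n_units: int) -> float:
    """Ideal speedup when `parallel_fraction` of the runtime scales
    perfectly across `n_units` and the remainder stays serial."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if n_units < 1:
        raise ValueError("n_units must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_units)


# Illustrative only: a code that is 95% parallel can never exceed a
# 20x speedup (1 / 0.05), no matter how many units are added.
for n in (16, 1024, 1_000_000):
    print(n, round(amdahl_speedup(0.95, n), 2))
```

The serial fraction caps the achievable speedup, which is why the amount of serial processing capability a design provides weighs so heavily on its cost and power budget.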

Intel is incrementally tuning and refining the sequential processing power of the Intel Xeon Phi processors. Seymour Cray famously quipped, “If you were plowing a field, which would you rather use: two strong oxen or 1024 chickens?” In the exascale era, strong serial processors (i.e. oxen) are prohibitively expensive, which means future supercomputer designs have to provide “just enough and no more” serial processing capability. The Intel Xeon Phi processors used in the Trinity and Cori supercomputers will take the next step towards thermally and cost efficient exascale processors. The KNL processors used in these pre-exascale supercomputers will deliver significantly more serial processing power than the previous generation Intel Xeon Phi coprocessors, code name Knights Corner (KNC), used in Tianhe-2. In contrast, GPUs rely on CPUs to perform any serial processing, thus forcing users to run in a heterogeneous environment.

The phrase “Knowledge is power” takes on a new yet critical meaning in the exascale context as greater parallel processing translates into a higher flop/s per watt ratio, and similarly a lower cost per flop/s ratio. In other words, it will be the parallel floating-point hardware that will make exaflop/s systems possible from a power and cost perspective. A big unknown is whether the new KNL sequential processing capabilities will provide all that is needed for future exascale workloads. The ability to recompile and run on the high-end Intel Xeon E5 v3 processors (formerly known as Haswell) in Cori and Trinity gives interested exascale stakeholders the ability to compare performance and identify whether any additional sequential processing performance features need to be added to future Intel Xeon Phi processors. Similarly, the parallel processing capabilities of the dual per-core vector units on each of the KNL processors can be evaluated to see what changes, if any, will be required to enhance performance for key exascale HPC workloads. Gary Grider (Deputy Division Leader for HPC at LANL) observes, “In general, we seem to be back to trying to figure out how much of our problems are Amdahl Law vs. throughput and vectorization again.”

From a price-performance perspective, Trinity and Cori are both expected to deliver double-digit petascale performance similar to that of Tianhe-2 at roughly 1/3 to 1/5 the cost ($380 million for Tianhe-2, $128 million for Trinity, $70 million for Cori). A more precise ratio as well as the cost per flop/s ratio for the Intel Xeon Phi processor nodes can be determined once these machines are operational and the performance numbers published. Energy consumption is also decreasing (e.g. 17.6 MW for Tianhe-2, a projected 15 MW for Trinity, and a projected 9 MW for Cori).
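The cost and power comparison can be reproduced from the figures quoted above. This is a sketch only: the dictionary layout and the `relative_to` helper are illustrative, and the Trinity and Cori values are the projections cited in the text rather than measured results.

```python
# Figures quoted in the article; Trinity and Cori values are projections.
systems = {
    "Tianhe-2": {"cost_musd": 380.0, "power_mw": 17.6},
    "Trinity": {"cost_musd": 128.0, "power_mw": 15.0},
    "Cori": {"cost_musd": 70.0, "power_mw": 9.0},
}


def relative_to(baseline: str, metric: str) -> dict:
    """Express `metric` for every system as a fraction of the baseline."""
    base = systems[baseline][metric]
    return {name: spec[metric] / base for name, spec in systems.items()}


cost_fraction = relative_to("Tianhe-2", "cost_musd")
power_fraction = relative_to("Tianhe-2", "power_mw")

for name in systems:
    print(f"{name}: {cost_fraction[name]:.2f}x cost, "
          f"{power_fraction[name]:.2f}x power vs. Tianhe-2")
```

Trinity comes out at about a third of the Tianhe-2 cost and Cori at a little under a fifth, consistent with the "roughly 1/3 to 1/5" range.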

The Trinity and Cori machines give the HPC community the opportunity to evaluate whether the self-hosted, SMP design of the Intel Xeon Phi processor powered computational nodes delivers more usable flop/s for key HPC applications. Software will play a key role in the success of the Trinity and Cori supercomputers, which is why the Intel Scalable System Framework includes portable programming standards such as OpenMP 4.0, Intel Cilk Plus, and Intel Threading Building Blocks (Intel TBB). These open standards promise that performant portable codes can be created to exploit the floating-point performance of both SMP and heterogeneous system architectures even at the exascale.

The exascale “too much data” problem

Visualization is a good example of the need to balance exascale floating-point performance against other machine characteristics. Hank Childs, recipient of the Department of Energy’s Early Career Award to research visualization with exascale computers and Associate Professor in the Department of Computer and Information Science at the University of Oregon, notes, “Our ability to generate data is increasing faster than our ability to store it.”

Without performant memory and network subsystems within an exascale supercomputer, it will be difficult or impossible to do something useful with the data generated by those 10¹⁸ flop/s capable processors. Jim Jeffers (Engineering Manager & PE, Visualization Engineering) underscored this challenge in his editorial, CPUs Sometimes Best for Big Data Visualization, when he wrote, “a picture is worth an exabyte”.

The freely available, Intel-developed open source Embree, OSPRay, and OpenSWR libraries already demonstrate the power of a CPU-based, homogeneous SMP computing environment for software defined visualization. In particular, Jeffers highlighted the importance of memory capacity in his example of a single Intel Xeon® processor E7 v3 workstation containing 3 TB (trillion bytes) of RAM that was able to render a 12-billion particle, 450 GB cosmology dataset at seven frames per second. When presenting this example during his IDF15 talk, Software Defined Visualization: Fast, Flexible Solutions For Rendering Big Data, Jeffers commented that it would take more than 75 GPUs each containing 6 gigabytes of on-GPU memory to perform the same scientific visualization task.
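The GPU count follows from simple capacity arithmetic. The sketch below computes the minimum number of devices needed just to hold the dataset, assuming every byte of device memory is usable for data; the helper name is illustrative. Because real renderers also need working memory, this is a lower bound, which is consistent with the "more than 75" wording.

```python
import math


def min_devices_to_hold(dataset_gb: float, device_memory_gb: float) -> int:
    """Lower bound on the device count needed to fit a dataset in device
    memory, ignoring working space and any data duplication."""
    return math.ceil(dataset_gb / device_memory_gb)


# Figures from the article: 450 GB dataset, 6 GB per GPU, 3 TB workstation.
gpus_needed = min_devices_to_hold(450, 6)
workstations_needed = min_devices_to_hold(450, 3000)
print(gpus_needed, workstations_needed)
```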

At exascale levels, in-situ visualization that runs the visualization software on the same hardware that generates the data will probably be a requirement. Other scientific visualization packages such as VisIt and ParaView, along with the Intel Scalable System Framework visualization projects, have already installed hooks in their codes for in-situ visualization. Paul Navratil, manager of the TACC Scalable Visualization Technologies group, echoes the growing view in the HPC community that “exascale supercomputers will have to be performant visualization machines as well as efficient computational engines.” He also notes that it is up to organizations such as TACC to “expand the realm of what is possible for domain scientists so they can use capabilities like in-situ visualization”.

Jim Ahrens (founder and lead of ParaView at Los Alamos National Laboratory) says, “There is a renaissance in visualization and analysis as we figure out how to perform in-situ tasks automatically.” Christopher Sewell (LANL) points out the wide support for the VTK-m joint project that includes LANL, ORNL, Sandia, Kitware, and the University of Oregon, all of whom are working on exploiting the shared-memory parallelism of Trinity as well as other machines to make in-situ visualization readily available to everyone.

Redefining memory and storage balance ratios

New memory technologies such as MCDRAM, or stacked memory, and a new non-volatile memory technology that Intel jointly developed with Micron called 3D XPoint are poised to redefine memory and storage balance ratios. MCDRAM will deliver significantly higher bandwidth than conventional DDR4 memory while 3D XPoint promises, “up to 4x system memory capacity at significantly lower cost than DRAM”, a hundred times lower latency than today’s best performing NAND, and write cycle durability that is 1000x that of NAND.

Redefining what is meant by memory

Succinctly, MCDRAM will be used as high-performance near-memory to accelerate computational performance, while fast, large-capacity far-memory based on conventional DRAM and Intel NVRAM DIMMs using 3D XPoint technology will greatly increase the amount of memory that can be installed on a computational node.

Together, MCDRAM and Intel DIMMs based on 3D XPoint will redefine a number of key supercomputer memory balance ratios including: (1) memory bandwidth (with MCDRAM), (2) memory capacity (with Intel DIMMs based on 3D XPoint technology), and (3) cost per gigabyte of memory. (Note: Intel DIMMs based on 3D XPoint technology will require a new memory controller on the processor.)

Redefining what is meant by storage

Similarly, storage devices based on 3D XPoint technology will redefine storage performance. For example, a prototype Intel Optane technology storage device running at the Intel Developer Forum 2015 (IDF15) delivered a spectacular 5x to 7x performance increase over Intel’s current fastest NAND SSD according to IOMETER, a respected storage performance-monitoring tool.

As with memory, storage based on 3D XPoint technology has the potential to redefine a number of key supercomputer storage balance ratios including: (1) storage bandwidth, (2) IO operations per second, and (3) cost per terabyte.

Other innovative HPC uses

Innovative uses of 3D XPoint memory technology can make a big difference to exascale supercomputing efforts, both by extending the use of in-core algorithms through greater per-node memory capacity and by accelerating the performance of storage-based out-of-core algorithms. 3D XPoint memory can also potentially be used in burst buffers to decrease system cost and accelerate common use cases.

A burst buffer is a nonvolatile, intermediate layer of high-speed storage that can expedite bulk data transfers to storage. Succinctly, economics are driving the inclusion of burst buffers in leadership class supercomputers to fill the bandwidth gap created by the need to quickly service the IO requests of very large numbers of big-memory computational nodes. Checkpoint/restart operations are a common burst buffer use case.

Other technologies that may further reduce the cost of an exascale supercomputer

Intel is working on a host of other projects that will further reduce the cost of an exascale supercomputer. Very briefly, publicly disclosed projects include (but are not limited to):

The Intel Omni-Path Architecture, an element of the Intel Scalable System Framework, allows denser switches to be built which will also help reduce the cost of the internal exascale supercomputer network. In addition, Intel OPA promises a 56% reduction in network latency, a huge improvement that can greatly benefit a wide-range of HPC applications.
A forthcoming version of the Intel Xeon Phi processor code name Knights Landing is planned with ports for Intel’s 100 Gb/sec Intel Omni-Path interconnect on the chip package. This eliminates the cost of external interface ports while improving reliability.
A planned second generation of the Intel Xeon Phi processor code name Knights Hill will be manufactured using a 10nm production process as compared to the Intel Xeon Phi processor code name Knights Landing 14nm process. The result should be an even denser and more power efficient Intel Xeon Phi processor compared to those used in the Trinity and Cori procurements.
Both Knights Landing and Knights Hill Intel Xeon Phi processors include a host of performance improving features such as the hardware scatter/gather capabilities introduced in the AVX2 and AVX-512 instruction sets as well as out-of-order sequential processing.

A broad spectrum of new technologies is redefining machine architectures from processors to memory and network subsystems to storage. The Trinity and Cori supercomputer procurements are poised to take the next step that will provide the HPC community with valuable success stories and lessons learned that will be incorporated in the next generation of – possibly exascale – leadership class supercomputers. In a very real sense, the self-hosted (or bootable) mode of the Intel Xeon Phi family of processors used in the Trinity and Cori supercomputers will concretely demonstrate that heterogeneous computing environments using GPUs and coprocessors are not an exascale requirement. That said, the dual-use Intel Xeon Phi processor design – unlike GPU accelerators – lets the customer decide if they want to build a self-hosted or heterogeneous exascale machine.

Visualization is an excellent use case to consider when trying to understand machine balance and how the exascale “too much data” problem can be addressed. A community-wide effort to support in-situ visualization is in progress so domain scientists can better utilize data from future leadership-class and exascale supercomputers. However, running both the simulation and visualization software on the same computational nodes will stress both memory and network subsystems, which highlights the importance of balanced machine capabilities such as memory capacity and network capability. To meet this need, Intel will support on-chip Intel Omni-Path technology to increase network bandwidth while decreasing both cost and latency. Similarly, cost-effective 3D XPoint technology memory along with high-performance MCDRAM are poised to redefine what is meant by memory and storage in terms of capacity, performance, and cost.

Rob Farber is a global technology consultant and author with an extensive background in HPC and a long history of working with national labs and corporations engaged in both HPC and enterprise computing. He can be reached at info@techenablement.com.

BlackDove says:

February 23, 2016 at 9:04 am

I’m not sure why anyone in the business would expect that a heterogeneous architecture could be useful for exascale, given the nature of the datasets and software that they’ll be using. All the checkpointing would probably be made much more difficult by the heterogeneity.

Besides, HPC specific CPUs currently dominate the heterogenous machines in terms of actual performance in HPCG, HPCG/HPL balance and the balance of bytes/FLOP. Current SPARC XIfx designs are 1:1 HPCG/HPL and have excellent byte/FLOP ratios. SX vectors currently have a 1:1 byte/FLOP ratio as well.

Since exascale is the convergence of HPC and big data, with massive memory amount and bandwidth required, I think its pretty reasonable to think that the SPARC based Flagship2020 computing initiative will produce the first real exascale computer.

K is all SPARC CPUs and is still 4th on the Top500 and 1st on the Graph500. The computational efficiency of those SPARC CPU machines is also greater than 90%, while GPU or current heterogenous Xeon Phis like Tianhe-2 are around 55-65% efficient and perform much better on HPL than HPCG, which is becoming less useful toward exascale.

My personal prediction is this: another SPARC machine using the already developed silicon photonics and 3D memory of some type (HMC has been in use on SPARC XIfx since 2014, long before KNL) will be the first real exascale computer and, like K, it will cost over $1 billion. PrimeHPC FX100 (2014) is already scalable to over 100 PFLOPS with only 512 racks. K was 4 years before that. A greater than 10x performance increase in that short of a time is impressive.

I do find it interesting that the architecture that has demonstrated performance in the form of K, and is currently the most sophisticated HPC architecture in use isnt even mentioned in an exascale CPU article. It gets almost no coverage.

According to an older article on here, Knights Hill may have fewer cores and more memory and memory bandwidth per core, delete DDR4 DIMMs entirely and use only 3D memory(HMC derived probably) making it look a lot less like KNL and more like SPARCfx. Interesting that they’d go back in a sense.

Integrated interconnects like Tofu2(which is partially optical but not photonic) are already alleviating a large amount of the bottleneck that interconnects pose. If Tofu3 includes silicon photonics it should be interesting to compare to Omnipath with silicon photonics on Knights Hill/Skylake Purley.

- OranjeeGeneral says:
  
  February 24, 2016 at 5:12 am
  
  Interesting to see that someone bets on SPARC; I definitely wouldn’t. SPARC’s future with a DB company behind it has always been flaky. Sure, there has been commitment, but how long will it last? Especially since the hardware manufacturing game is getting more and more expensive and the ROI will get lower, especially if you manufacture at such a low scale as Oracle does. But I agree, I never bought into the hype of heterogeneous architecture. The Xeon Phi and AMD APU/Fusion approaches look far more reasonable. If you need FLOPS, just add a very wide and fast vector unit next to your CPU, because that’s basically what GPUs are.
  
- jimmy says:
  
  July 14, 2016 at 11:26 am
  
  “HPC specific CPUs currently dominate the heterogenous machines in terms of actual performance in HPCG, HPCG/HPL balance”
  
  That is an incorrect statement, there are several GPU systems that have some of the highest (top 5) scores on the HPCG list.
  
  Check your facts before making embarrassing faulty statements.
  
  - BlackDove says:
    
    July 27, 2016 at 12:35 pm
    
    You did not quote the entire sentence, and apparently took the totality of what I said out of context. What I said is not faulty, and if it is, please refute it with something more than a vague “check your facts, but i wont bother to give any facts to refute your statement” reply.
    
    The balance of HPCG to HPL performance is what i was specifically referring to, not just HPCG performance. Yes there are some GPU systems like Tsubame that do well in HPCG. I never claimed there arent.
    
    However, the ratio of HPCG to HPL performance is usually one way or the other, and few systems that are not HPC specific CPU systems do BOTH well.
    
    Since this article was published, it has come to light that the first real exascale system is likely to be ARM instead of SPARC, but will still be an all CPU architecture.
    
    Time will tell which architecture makes for the first real exascale system. I doubt it will be a conventional heterogenous system or something as ridiculously FLOPS heavy as TaihuLight.
    
    If you look beyond one or two benchmarks deeper into the architecture and real world performance that can be used to do useful work, HPC specific architectures typically have the highest computational efficiency, overall memory bandwidth and they can make the best use of their FLOPS, even if theyre not at the top of the Top500.
    
    Even Nvidia has shifted their GPU architecture with GP100 to be much more HPC specific, and resembles the approach that Intel is adopting as well, which is becoming a “manycore processor with high bandwidth near memory”.
    
    Where did we see that architecture first? Oh wait…
    
Barry Bolding says:

February 23, 2016 at 2:21 pm

Just a small note… Trinity and Cori are Cray XC Systems. Coral is a Cray Shasta system. Titan is a Cray XK7 system. The ability to drive future exascale performance is a system-level problem, not just a processor-level problem.

John Barr says:

February 24, 2016 at 8:53 am

Without explicitly saying that heterogeneity is bad, the tone of the article suggests that this is the case. However, for most workloads, heterogeneity is good as you can use different components in an HPC system to best execute code sections requiring massively parallel or serial support. If an Exascale system is to be more than a one trick pony it must have heterogeneous computing elements in order to support the varying compute requirements of a spectrum of applications.

jimmy says:

July 12, 2016 at 8:06 pm

I also found the tone to be very biased against widely used and proven to be successful heterogenous computing.

This article essentially reads like a puff piece written by Intels marketing 🙂

Rob fails to mention a few important points:

-> The currently largest planned HPC systems in the world are IBM+Nvidia GPU based (Sierra and Summit).
-> Knights Landing is very different from regular Xeons (memory model, VERY wide vector units) and requires serious software efforts to be utilized.
-> Knights Landing is also sold in the accelerator format via PCI Express; many vendors opt for this as they want compute-heavy nodes.
-> Nvidia based Pascal HPC systems are no longer bound by the PCIe bottleneck due to the new super fast NVLink interconnect (up to 200 GB/s).
-> Nvidia now offers extremely compute-heavy nodes with 8 Pascal GPUs without any PCIe bottlenecks; this has tremendous impact on the node interconnect infrastructure, saving a lot of money and power.

- BlackDove says:
  
  July 28, 2016 at 12:49 am
  
  The article is stating facts, both current and historical.
  
  The largest planned system is most definitely Summit, Sierra or Aurora. The largest planned unclassified system is currently the ARM CPU based Post K(name not final) system by Riken and Fujitsu. Over $1 billion is already allocated to upgrade the facility and replace K(which itself was $1.2 billion).
  
  Intel does not have, and has not had, the most advanced HPC architecture in a very long time. Cray Shasta using Intel is catching up to Fujitsu, who currently has the most advanced and effective architecture deployed(K, PrimeHPC FX100).
  
  The heterogenous systems to date using Nvidia GPUs have been relatively cheap because the GPUs were cheap to manufacture. Nvidia got serious with GP100 and the cost of a P100 is over 2x the price of a K80. Knights Landing isnt cheap either.
  
  If you want something thats useful for more than running HPL to impress people who dont understand anything more than Top500 placement you need an advanced and usually expensive architecture. The Summit and Sierra computers use very expensive nodes compared to Titan or other conventional heterogenous architectures.
  
  - BlackDove says:
    
    July 28, 2016 at 12:49 am
    
    *should read “most definitely NOT
    
