CPU Based Exascale Supercomputing Without Accelerators
February 23, 2016 Rob Farber
Intel has been pursuing a long-term, multi-faceted set of investments to create the processors and technologies needed to build CPU-based supercomputers that can deliver exascale levels of performance in a cost and energy efficient fashion. Slated to become operational in 2016, the Trinity and Cori supercomputers will be powered by the next-generation Intel Xeon Phi processor, code name Knights Landing (KNL), booting in self-hosted mode as a standard, single processor node.
These pre-exascale supercomputers will deliver double-digit petascale performance (for example, 10^16 floating-point operations per second, or flop/s) without the use of attached accelerators or coprocessors. They will certainly dispel the myth that the only path to exascale supercomputing is through a heterogeneous node design with bus-attached independent computational devices.
Aside from delivering leadership-class computational performance, the Trinity and Cori supercomputers will provide the HPC community with concrete data and valuable insights into the productivity and other benefits of a massively-parallel SMP (Symmetric Multi-Processing) supercomputer environment based on CPUs as compared to the current generation of heterogeneous systems such as Tianhe-2 (which utilizes Intel Xeon Phi codename Knights Corner coprocessors) and ORNL’s Titan supercomputer (accelerated by NVIDIA K20x GPUs). The benefits of a CPU-only software environment at this scale will be of particular interest as it eliminates the complexity and performance bottlenecks of the offload data transfers required to run heterogeneous applications on accelerators.
Additionally, the Trinity and Cori systems will validate the energy, cost, and performance of the self-hosted KNL computational nodes and set the stage for the introduction of a variety of other Intel technology investments into the exascale arena, including innovations in memory and storage (e.g. MCDRAM), networking (e.g. Intel Omni-Path Architecture (Intel OPA) and on-chip Intel OPA), plus software elements that are part of the Intel Scalable System Framework. In particular, a new non-volatile memory technology co-developed by Intel and Micron called 3D XPoint technology is poised to redefine what is meant by memory and storage in high-performance computer architectures and may profoundly affect the cost, capacity, and performance of future supercomputer designs.
Lessons learned from current petascale leadership-class supercomputers
The scientific, technological, and human benefits provided by the double-digit petascale performance of top ranked, leadership-class supercomputers like Tianhe-2, Titan, Sequoia, and Riken (the current four fastest supercomputers in the world) and other TOP 500 supercomputers have initiated a global race to build the first exascale supercomputer by the end of this decade. A petascale supercomputer can deliver 10^15 flop/s (floating-point arithmetic operations per second) while an exascale system will provide 10^18 flop/s.
Lessons learned from current leadership-class supercomputers show that the costs and technological requirements of increasing performance a further 30x, from the 33.86 petaflop/s (or 3.386 x 10^16 flop/s) delivered by the Intel Xeon Phi coprocessor powered Tianhe-2 supercomputer (the fastest supercomputer in the world as of November 2015) to a true exascale 10^18 flop/s machine, are substantial and sobering, but achievable due to investment in new technologies.
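The size of that gap can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted above (33.86 petaflop/s for Tianhe-2 and the 10^18 flop/s exascale target); the constant and function names are illustrative choices, not part of any tool.

```python
# SI scale factors for flop/s, as used in the text.
PETA = 1e15  # petascale: 10^15 flop/s
EXA = 1e18   # exascale: 10^18 flop/s

# Tianhe-2 performance as quoted in the text (November 2015).
TIANHE2_FLOPS = 33.86 * PETA


def speedup_needed(current_flops: float, target_flops: float = EXA) -> float:
    """Return the factor by which performance must grow to reach the target."""
    if current_flops <= 0:
        raise ValueError("current_flops must be positive")
    return target_flops / current_flops


if __name__ == "__main__":
    print(f"Tianhe-2: {TIANHE2_FLOPS:.3e} flop/s")
    print(f"Factor to exascale: {speedup_needed(TIANHE2_FLOPS):.1f}x")
```

The result is a little under 30, which is where the "further 30x" figure comes from.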
Cost alone will require that many stakeholders participate in the funding and creation of an exascale system – of which early machines are anticipated to cost between $500 million and $1 billion. And that means that any machine built will certainly be required to deliver leadership class performance on a variety of stakeholder workloads. This only makes sense, as packing a supercomputer with devices that only provide floating-point performance to achieve a 10^18 flop/s benchmark goal will be an expensive and meaningless effort if the rest of the hardware does not provide sufficient memory, network, and storage subsystem performance to run real applications faster than existing supercomputers. In short, the stakeholders are going to want to get their money’s worth.
The National Strategic Computing Initiative (NSCI), the Department of Energy’s CORAL initiative, and the Office of Science and the National Nuclear Security Administration’s longer-term investments in exascale computing under the DesignForward high-performance computing R&D program are providing a viable path for the many stakeholders to reach that exascale goal.
What matters then is: (1) how much the hardware cost of those flop/s can be reduced, (2) what can be done with those flop/s, (3) how much flexibility is given to the HPC developer to exploit that performance, and (4) what capabilities are provided to efficiently run important applications that are not flop/s dependent (such as massive graph algorithms).
Balance ratios, discussed in this article, are the metrics used by the HPC community to cut through the hype and make sense of a machine’s overall performance envelope to determine if it can run a desired application-mix efficiently, plus get a general sense of the cost and power requirements needed to procure and run the supercomputer. Examples include cost (dollars per flop/s), power (flop/s per watt) and important subsystem ratios such as memory capacity (bytes per flop/s), memory bandwidth (bytes/s per flop/s), memory transactions per second (memory op/s per flop/s) along with similar ratios for network and storage capabilities.
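As a rough illustration of how such ratios are formed, the sketch below derives four of them from a handful of headline figures. The `Machine` class, its field names, and every number in the example are hypothetical and chosen only to make the arithmetic easy to follow; they do not describe any real system.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    """Headline figures for a machine (all example values are hypothetical)."""
    peak_flops: float         # floating-point rate, flop/s
    cost_dollars: float       # procurement cost, dollars
    power_watts: float        # power draw, watts
    memory_bytes: float       # total memory capacity, bytes
    memory_bw_bytes_s: float  # aggregate memory bandwidth, bytes/s


def balance_ratios(m: Machine) -> dict:
    """Compute the balance ratios named in the text from the raw figures."""
    return {
        "dollars_per_flops": m.cost_dollars / m.peak_flops,
        "flops_per_watt": m.peak_flops / m.power_watts,
        "bytes_per_flops": m.memory_bytes / m.peak_flops,
        "bytes_s_per_flops": m.memory_bw_bytes_s / m.peak_flops,
    }


if __name__ == "__main__":
    # Hypothetical machine: 10 Pflop/s, $100M, 10 MW, 1 PB RAM, 2 PB/s bandwidth.
    example = Machine(
        peak_flops=1e16,
        cost_dollars=1e8,
        power_watts=1e7,
        memory_bytes=1e15,
        memory_bw_bytes_s=2e15,
    )
    for name, value in balance_ratios(example).items():
        print(f"{name}: {value:.3g}")
```

Network and storage ratios follow the same pattern: divide the subsystem's capacity or rate by the machine's flop/s.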
Breaking the “exascale requires accelerators” myth with CPUs
Heterogeneity has recently become a buzzword as it has successfully been utilized to increase the flop/s rate of TOP 500 winning supercomputers. For example, Tianhe-2 utilizes three Intel Xeon Phi coprocessors per node while the ORNL Titan supercomputer utilizes one accelerator per node. Plugging a coprocessor or accelerator into the PCIe bus of a computational node has both cost and power advantages for the current generation of systems based on commodity motherboards. However, the thermal, power, and size limitations dictated by the PCIe bus standard impose artificial boundaries on what can be physically installed on a PCIe card, including the number of processing elements that deliver flop/s and the amount of memory.
The Intel Xeon Phi family of processors was designed to support both heterogeneous computing and native SMP (Symmetric Multi-Processing). This design duality has given Intel the ability to claim the TOP 500 performance crown in the current leadership class heterogeneous supercomputing environment with Tianhe-2 while also giving customers the ability to take the lead in future systems that discard the hardware and bandwidth limitations of PCIe based devices and heterogeneous computing environments.
The NNSA Trinity and NERSC Cori supercomputers will take the next step and provide concrete pre-exascale (10^16 flop/s) demonstrations that machines based entirely on self-hosted SMP computational nodes can support a wide variety of stakeholder workloads. Cori will support a broad user base as an open science DOE system serving thousands of applications and hundreds of users. The Trinity system will be used for more targeted weapons stockpile-related workloads. Together, the CPU-based Trinity and Cori supercomputers will break the mindset that heterogeneous computing is required for exascale systems.
In a very real sense, Amdahl’s law for stakeholder applications will dictate the cost and power consumption of future exascale systems since serial sections of code are expensive to process from a thermal, space, and cost of manufacturing standpoint. Parallel sections of code, in particular the SIMD (Single Instruction Multiple Data) regions, can be processed via much more efficient vector units that can deliver high flop/s per dollar and high flop/s per watt performance.
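Amdahl's law makes the cost of serial code concrete: if a fraction p of the runtime can be parallelized across n workers, the speedup is 1 / ((1 - p) + p / n), which can never exceed 1 / (1 - p). The function below is a minimal sketch of that formula; the function name and the sample fractions are illustrative and not measurements of any application.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Speedup predicted by Amdahl's law for a given parallel fraction."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # Illustrative parallel fractions, each run on 1024 workers.
    for p in (0.50, 0.95, 0.99):
        print(f"p={p:.2f}: {amdahl_speedup(p, 1024):.1f}x on 1024 workers")
```

With p = 0.95, for example, the speedup is capped at 20x no matter how many workers are added, which is why the remaining serial sections weigh so heavily on the design.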
Intel is incrementally tuning and refining the sequential processing power of the Intel Xeon Phi processors. Seymour Cray famously quipped, “If you were plowing a field, which would you rather use: two strong oxen or 1024 chickens?” In the exascale era, strong serial processors (i.e. oxen) are prohibitively expensive, which means future supercomputer designs have to provide “just enough and no more” serial processing capability. The Intel Xeon Phi processors used in the Trinity and Cori supercomputers will take the next step towards thermally and cost efficient exascale processors. The KNL processors used in these pre-exascale supercomputers will deliver significantly more serial processing power than the previous-generation Intel Xeon Phi coprocessors, code name Knights Corner (KNC), used in Tianhe-2. In contrast, GPUs rely on CPUs to perform any serial processing, thus forcing users to run in a heterogeneous environment.
The phrase, “Knowledge is power” takes on a new yet critical meaning in the exascale context as greater parallel processing translates into a higher flop/s per watt ratio, and similarly a lower cost per flop/s ratio. In other words, it will be the parallel floating-point hardware that will make exaflop/s systems possible from a power and cost perspective. A big unknown is if the new KNL sequential processing capabilities will provide all that is needed for future exascale workloads. Literally having the ability to recompile and run on high-end Cori and Trinity Intel Xeon E5 v3 processors (formerly known as Haswell) gives interested exascale stakeholders the ability to compare performance and identify if any additional sequential processing performance features need to be added to future Intel Xeon Phi processors. Similarly, the parallel processing capabilities of the dual per core vector units on each of the KNL processors can be evaluated to see what changes, if any, will be required to enhance performance for key exascale HPC workloads. Gary Grider (Deputy Division Leader for HPC at LANL) observes, “In general, we seem to be back to trying to figure out how much of our problems are Amdahl Law vs. throughput and vectorization again.”
From a price-performance perspective, Trinity and Cori are both expected to deliver double-digit petascale performance similar to that of Tianhe-2 at roughly 1/3 to 1/5 the cost ($380 million Tianhe-2, $128 million Trinity, $70 million Cori). A more precise ratio, as well as the cost per flop/s ratio for the Intel Xeon Phi processor nodes, can be determined once these machines are operational and the performance numbers are published. Energy consumption is also decreasing (e.g. Tianhe-2 17.6 MW, Trinity projected 15 MW, Cori projected 9 MW).
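The comparison can be reproduced from the quoted figures. The sketch below computes each system's cost and power relative to Tianhe-2 and, for Tianhe-2 alone, the dollars per flop/s and flop/s per watt implied by its 33.86 petaflop/s result cited earlier in this article. Per-flop/s ratios for Trinity and Cori are deliberately left out because their performance numbers are not yet published; the dictionary layout and function names are illustrative.

```python
# Cost (millions of dollars) and power (megawatts) as quoted in the text.
# The Trinity and Cori power figures are projections.
SYSTEMS = {
    "Tianhe-2": {"cost_musd": 380.0, "power_mw": 17.6},
    "Trinity": {"cost_musd": 128.0, "power_mw": 15.0},
    "Cori": {"cost_musd": 70.0, "power_mw": 9.0},
}

# Tianhe-2 performance as quoted earlier in the article.
TIANHE2_FLOPS = 33.86e15


def relative_to(baseline: str, name: str, key: str) -> float:
    """Return one system's figure as a fraction of the baseline system's."""
    return SYSTEMS[name][key] / SYSTEMS[baseline][key]


def tianhe2_ratios() -> tuple:
    """Return (dollars per flop/s, flop/s per watt) for Tianhe-2."""
    dollars = SYSTEMS["Tianhe-2"]["cost_musd"] * 1e6
    watts = SYSTEMS["Tianhe-2"]["power_mw"] * 1e6
    return dollars / TIANHE2_FLOPS, TIANHE2_FLOPS / watts


if __name__ == "__main__":
    for name in ("Trinity", "Cori"):
        cost = relative_to("Tianhe-2", name, "cost_musd")
        power = relative_to("Tianhe-2", name, "power_mw")
        print(f"{name}: {cost:.2f}x the cost, {power:.2f}x the power of Tianhe-2")
    dollars_per_flops, flops_per_watt = tianhe2_ratios()
    print(f"Tianhe-2: ${dollars_per_flops * 1e9:.2f} per Gflop/s, "
          f"{flops_per_watt / 1e9:.2f} Gflop/s per watt")
```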
The Trinity and Cori machines give the HPC community the opportunity to evaluate whether the self-hosted, SMP design of the Intel Xeon Phi processor powered computational nodes delivers more usable flop/s for key HPC applications. Software will play a key role in the success of the Trinity and Cori supercomputers, which is why the Intel Scalable System Framework includes portable programming standards such as OpenMP 4.0, Cilk+, and Intel Threading Building Blocks (Intel TBB). These open standards promise that performant portable codes can be created to exploit the floating-point performance of both SMP and heterogeneous systems architectures even at the exascale.
The exascale “too much data” problem
Visualization is a good example of the need to balance exascale floating-point performance against other machine characteristics. Hank Childs, recipient of the Department of Energy’s Early Career Award to research visualization with exascale computers and Associate Professor in the Department of Computer and Information Science at the University of Oregon notes, “Our ability to generate data is increasing faster than our ability to store it”.
Without performant memory and network subsystems within an exascale supercomputer, it will be difficult or impossible to do something useful with the data generated by those 10^18 flop/s capable processors. Jim Jeffers (Engineering Manager & PE, Visualization Engineering) underscored this challenge in his editorial, CPUs Sometimes Best for Big Data Visualization, when he wrote, “a picture is worth an exabyte”.
The freely available, Intel-developed open source Embree, OSPRay, and OpenSWR libraries already demonstrate the power of a CPU-based, homogeneous, SMP-based computing environment for software defined visualization. In particular, Jeffers highlighted the importance of memory capacity in his example of a single Intel Xeon® processor E7 v3 workstation containing 3 TB (trillion bytes) of RAM that was able to render a 12-billion particle, 450 GB cosmology dataset at seven frames per second. When presenting this example during his IDF15 talk, Software Defined Visualization: Fast, Flexible Solutions For Rendering Big Data, Jeffers commented that it would take more than 75 GPUs each containing 6 gigabytes of on-GPU memory to perform the same scientific visualization task.
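The GPU count follows from simple capacity arithmetic, sketched below using the figures quoted above. The function name is illustrative, and the calculation only counts the memory needed to hold the dataset itself; it assumes nothing about working buffers or data duplication, so it is a lower bound.

```python
import math

# Figures quoted in the text (decimal units, 1 TB = 1000 GB).
DATASET_GB = 450
GPU_MEMORY_GB = 6
WORKSTATION_RAM_GB = 3000


def devices_needed(dataset_gb: float, per_device_gb: float) -> int:
    """Minimum number of devices whose combined memory holds the dataset."""
    if per_device_gb <= 0:
        raise ValueError("per_device_gb must be positive")
    return math.ceil(dataset_gb / per_device_gb)


if __name__ == "__main__":
    print("GPUs needed:", devices_needed(DATASET_GB, GPU_MEMORY_GB))
    print("Workstations needed:", devices_needed(DATASET_GB, WORKSTATION_RAM_GB))
```

Holding the raw data alone takes 75 such GPUs; any additional memory required for rendering pushes the count higher, which is consistent with the "more than 75" figure.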
At exascale levels, in-situ visualization that runs the visualization software on the same hardware that generates the data will probably be a requirement. Other scientific visualization packages such as VisIT and Paraview, along with the Intel Scalable System Framework visualization projects, have already installed hooks in their codes for in-situ visualization. Paul Navratil, manager of the TACC Scalable Visualization Technologies group, reflects the growing view in the HPC community that “exascale supercomputers will have to be performant visualization machines as well as efficient computational engines.” He also notes that it is up to organizations such as TACC to “expand the realm of what is possible for domain scientists so they can use capabilities like in-situ visualization”.
Jim Ahrens (founder and lead of Paraview at Los Alamos National Laboratory) says, “There is a renaissance in visualization and analysis as we figure out how to perform in-situ tasks automatically.” Christopher Sewell (LANL) points out the wide support for the VTK-M joint project that includes LANL, ORNL, Sandia, Kitware, and the University of Oregon, all of whom are working on exploiting the shared-memory parallelism of Trinity as well as other machines to make in-situ visualization readily available to everyone.
Redefining memory and storage balance ratios
New memory technologies such as MCDRAM, or stacked memory, and a new non-volatile memory technology that Intel jointly developed with Micron called 3D XPoint are poised to redefine memory and storage balance ratios. MCDRAM will deliver significantly higher bandwidth than conventional DDR4 memory while 3D XPoint promises, “up to 4x system memory capacity at significantly lower cost than DRAM”, a hundred times lower latency than today’s best performing NAND, and write cycle durability that is 1000x that of NAND.
Redefining what is meant by memory
Succinctly, MCDRAM will be used as high-performance near-memory to accelerate computational performance, while fast, large-capacity far-memory based on conventional DRAM and Intel NVRAM DIMMs using 3D XPoint technology will greatly increase the amount of memory that can be installed on a computational node.
Together, MCDRAM and Intel DIMMs based on 3D XPoint will redefine a number of key supercomputer memory balance ratios including: (1) memory bandwidth (with MCDRAM), (2) memory capacity (with Intel DIMMs based on 3D XPoint technology), and (3) cost per gigabyte of memory. (Note: Intel DIMMs based on 3D XPoint technology will require a new memory controller on the processor.)
Redefining what is meant by storage
Similarly, storage devices based on 3D XPoint technology will redefine storage performance. For example, a prototype Intel Optane technology storage device running at the Intel Developers Forum 2015 (IDF15) delivered a spectacular 5x to 7x performance increase over Intel’s current fastest NAND SSD according to IOMETER, a respected storage performance-monitoring tool.
As with memory, storage technology using 3D XPoint technology has the potential to redefine a number of key supercomputer storage balance ratios including: (1) storage bandwidth, (2) IO operations per second, and (3) cost per terabyte.
Other innovative HPC uses
Innovative uses of 3D XPoint memory technology can make a big difference to exascale supercomputing efforts, both by extending the use of in-core algorithms through greater per-node memory capacity and by accelerating the performance of storage-based out-of-core algorithms. 3D XPoint memory can also potentially be used as burst buffers to decrease system cost and accelerate common use cases.
A burst buffer is a nonvolatile, intermediate layer of high-speed storage that can expedite bulk data transfers to storage. Succinctly, economics are driving the inclusion of burst buffers in leadership class supercomputers to fill the bandwidth gap created by the need to quickly service the IO requests of very large numbers of big-memory computational nodes. Checkpoint/restart operations are a common burst buffer use case.
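The bandwidth gap can be illustrated with a back-of-the-envelope model in which the time to write a checkpoint is simply its size divided by the bandwidth of the tier that absorbs it. This is a deliberate simplification that ignores latency, contention, and the later drain to permanent storage, and every number in the example is hypothetical rather than taken from Trinity, Cori, or any other system.

```python
def write_time_seconds(checkpoint_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Idealized time to write a checkpoint at a sustained bandwidth."""
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return checkpoint_bytes / bandwidth_bytes_per_s


if __name__ == "__main__":
    # All values below are hypothetical, for illustration only.
    checkpoint_bytes = 1e15      # 1 PB of node memory to save
    file_system_bw = 1e12        # 1 TB/s to the parallel file system
    burst_buffer_bw = 4e12       # 4 TB/s to the burst buffer tier

    direct = write_time_seconds(checkpoint_bytes, file_system_bw)
    buffered = write_time_seconds(checkpoint_bytes, burst_buffer_bw)
    print(f"Direct to file system: {direct:.0f} s")
    print(f"Via burst buffer:      {buffered:.0f} s")
```

In this model the compute nodes can resume as soon as the burst buffer has absorbed the checkpoint, while the slower copy to the file system proceeds in the background.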
Other technologies that may further reduce the cost of an exascale supercomputer
Intel is working on a host of other projects that will further reduce the cost of an exascale supercomputer. Very briefly, publicly disclosed projects include (but are not limited to):
- The Intel Omni-Path Architecture, an element of the Intel Scalable System Framework, allows denser switches to be built, which will also help reduce the cost of the internal exascale supercomputer network. In addition, Intel OPA promises a 56% reduction in network latency, a huge improvement that can greatly benefit a wide range of HPC applications.
- A forthcoming Intel Xeon Phi processor code name Knights Landing is planned with ports for Intel’s 100 Gb/sec Intel Omni-Path interconnect on the chip package. This eliminates the cost of external interface ports while improving reliability.
- A planned second generation of the Intel Xeon Phi processor code name Knights Hill will be manufactured using a 10nm production process as compared to the Intel Xeon Phi processor code name Knights Landing 14nm process. The result should be an even denser and more power efficient Intel Xeon Phi processor compared to those used in the Trinity and Cori procurements.
- Both Knights Landing and Knights Hill Intel Xeon Phi processors include a host of performance improving features such as the hardware scatter/gather capabilities introduced in the AVX2 and AVX-512 instruction sets as well as out-of-order sequential processing.
A broad spectrum of new technologies is redefining machine architectures from processors to memory and network subsystems to storage. The Trinity and Cori supercomputer procurements are poised to take the next step that will provide the HPC community with valuable success stories and lessons learned that will be incorporated in the next generation of – possibly exascale – leadership class supercomputers. In a very real sense, the self-hosted (or bootable) mode of the Intel Xeon Phi family of processors used in the Trinity and Cori supercomputers will concretely demonstrate that heterogeneous computing environments using GPUs and coprocessors are not an exascale requirement. That said, the dual-use Intel Xeon Phi processor design – unlike GPU accelerators – lets the customer decide if they want to build a self-hosted or heterogeneous exascale machine.
Visualization is an excellent use case to consider when trying to understand machine balance and how the exascale “too much data” problem can be addressed. A community-wide effort to support in-situ visualization is in progress so domain scientists can better utilize data from future leadership-class and exascale supercomputers. However, running both the simulation and visualization software on the same computational nodes will stress both memory and network subsystems, which highlights the importance of balanced machine capabilities such as memory capacity and network capability. To meet this need, Intel will support on-chip Intel Omni-Path technology to increase network bandwidth while decreasing both cost and latency. Similarly, cost-effective 3D XPoint technology memory along with high-performance MCDRAM are poised to literally redefine what is meant by memory and storage along with capacity, performance and cost.
Rob Farber is a global technology consultant and author with an extensive background in HPC and a long history of working with national labs and corporations engaged in both HPC and enterprise computing. He can be reached at firstname.lastname@example.org.