Inside Six of the Newest Top 20 Supercomputers
November 14, 2016 Nicole Hemsoth
The latest listing of the Top 500 rankings of the world’s most powerful supercomputers has just been released. While there were no big surprises at the top of the list, there have been some notable additions to the top tier, all of which feature various elements of supercomputers yet to come as national labs and research centers prepare for their pre-exascale and eventual exascale systems.
We will be providing a deep dive on the list results this morning, but for now, what is most interesting about the list is what it is just beginning to contain at the top–and what that signals for the next wave of supercomputers. While there are only two new systems in the top 10 for this 48th ranking of the Top 500, one in the U.S. and one in Japan, we wanted to take a closer look at these new entrants to the top 20 to see where the trends of top-tier systems lie—and look ahead at what might topple those positions in the 20 slots over the next year.
What the machines below have in common (and what sets them apart from the longer-standing placeholders on the list) is a reliance on the Knights Landing processor architecture. Many machines debuted over the last two years with the Xeon Phi coprocessor, but Knights Landing is a standalone processor (with no offload model), and it arrives alongside the Omni-Path interconnect to keep pace with that capability and innovations on the storage side, including burst buffers for more efficient, higher performance data flow. Together, these keep pushing up the overall performance profile of the Top 500 list. The new entries also favor Cray machines, particularly the XC40, which as of today has a new Pascal GPU sibling, the XC50.
Let’s begin with the highest performance, highest ranking systems that are newcomers to the list. The associated number is their ranking on the overall list. Keep in mind when you see these theoretical peak numbers that the top machine in the world, the Sunway TaihuLight in China, is capable of 125.4 petaflops (with an HPCG result that shows 0.3% of that peak achieved) and the next machine down (#2), the Tianhe-2, also in China, is capable of 54.9 peak petaflops (with 1.1% of that peak achieved on HPCG).
In June, we took an in-depth look at the Cori supercomputer, which is in its second phase at NERSC. To refresh, the machine is a Cray XC40 with 1,630 Intel Xeon “Haswell” processor nodes and 9,300 Xeon Phi Knights Landing nodes. Part of what makes Cori interesting is that it is among the first to take advantage of burst buffer technology for application acceleration via a 1.5 petabyte Cray “DataWarp” burst buffer.
In its inaugural Linpack benchmark run to determine Top 500 placement, the system was listed with 27 petaflops of theoretical peak performance. In addition to ranking in the #5 slot on the overall Top 500, the system also ran the HPCG benchmark, where it came in at the same place, although recall this is a much smaller list: only 100 systems ran the benchmark to determine placement on the more data movement-centric list. It might not sound impressive, but the machine was able to reach 1.3% of its theoretical peak on HPCG (keep in mind that even the top system on that benchmark, the K Computer in Japan, only hits the 5.3% mark).
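To put those percent-of-peak figures in perspective, they can be converted back into approximate sustained HPCG rates. The following is a minimal sketch that uses only the peak and percentage numbers quoted above; the function name is illustrative, and the results are rounded estimates rather than official HPCG scores.

```python
def implied_hpcg_petaflops(peak_pflops, pct_of_peak):
    """Sustained rate implied by a peak figure and a percent-of-peak value."""
    return peak_pflops * pct_of_peak / 100.0


# (theoretical peak in petaflops, percent of peak reached on HPCG), as quoted in the text
systems = {
    "Tianhe-2": (54.9, 1.1),
    "Cori": (27.0, 1.3),
}

implied = {
    name: implied_hpcg_petaflops(peak, pct)
    for name, (peak, pct) in systems.items()
}

for name, pflops in implied.items():
    print(f"{name}: roughly {pflops:.2f} petaflops sustained on HPCG")
```

Even for machines with tens of petaflops of peak, the implied HPCG rate lands well under a single petaflop, which is why a figure like 1.3% is less alarming than it first sounds.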
“In Phase 2 of the Cori implementation, we face a number of opportunities and challenges,” NERSC’s Jack Deslippe said in June. Deslippe leads the team that is helping prepare scientific applications to run well on Cori Phase 2. “The biggest challenge facing the NERSC staff is the same as the one facing our users – how to get user code optimized for the new system. If you compare the Cori Phase 2 system with our current system, Edison, you’ll see there are some pretty striking differences,” he continues. “For example, Edison with its Intel Ivy-Bridge processors has 12 cores per CPU and 24 virtual cores (from hyperthreading) per CPU. Cori, on the other hand, has several times the number of physical cores per CPU and virtual cores.”
While we tend to be able to gather information about U.S. national lab supercomputers easily, not quite as much was available about Oakforest-PACS in its earlier stages. The Fujitsu machine has just edged out another novel system in Japan, the K Computer, which has achieved consistently high Linpack, HPCG, and Green 500 results since it entered the rankings back in 2011. Oakforest-PACS is now the fastest supercomputer in Japan and, at almost 25 petaflops peak, is on the heels of Cori.
Oakforest-PACS sports a different architecture than the K Computer (which is SPARC based with the Tofu interconnect) and instead relies on the 2U PRIMERGY CS1640 M1 node architecture (8,208 nodes in total, each with a single Knights Landing processor) and the Omni-Path interconnect. The DDN storage system consists of a 26 PB capacity file system and a 940 TB fast file cache system.
While the system is ranked at #6 on the Top 500, its performance on the HPCG benchmark is #3 out of the 100 systems that ran for the November rankings. At last check, the percentage of peak performance achieved on HPCG was not listed (the list is still being developed at the time of publication) but is expected to be between 0.3% and 1%.
This is not to diminish the achievement, but the real exascale heat gets turned up with the next, ARM-based incarnation of the K Computer, something we have been watching develop this year.
#11 Weather Modeling/Forecasting System at UK Met Office
Like other major weather forecasting centers, the UK Met Office makes major investments in its systems, often in pairs (one system for production, another for backup and research). At #11 on the list is the newest such machine: an unnamed 8.1 petaflop super that uses the newest Broadwell processors on a Cray XC40 system with the Aries interconnect. While this is a noteworthy success, it highlights a larger trend in non-accelerated weather supercomputers: Cray is dominating here. This will get more interesting as other weather centers follow CSCS’s lead in adding XC50 nodes (with Pascal P100) to bring machine learning elements into the larger simulation workflow.
Cray has historically been a very research and development focused company, which is the top reason why it has managed to succeed in specific areas like weather, says the company’s Barry Bolding. Ten years ago, the company made its first significant investment in weather experts to hone the systems for numerical weather prediction and other models. The efforts of this team kicked off one of its first large weather system deals, at the Korea Meteorological Administration (where Cray is still supplying supercomputers), and the business has since grown to include an ever-increasing share of the weather systems market.
At the time of the Korean weather system, Cray had a ten percent slice of the weather market worldwide. A decade later, they are hands-down the top supplier of weather forecasting machines with what Bolding says is between a 60-70% share—a fact that continues to drive their investments in continued research and development in that area.
One of Cray’s largest weather supercomputers to date, at the European Centre for Medium-Range Weather Forecasts (ECMWF), occupies both numbers 38 and 39 on the Top 500 (even though it is a matching set), and in South Korea, a dual weather cluster rests at numbers 216 and 217 to power the Korea Meteorological Administration’s forecasting efforts (along with a slightly larger, newer cluster, the Uri machine, which is an XC40). Earlier this year, Cray announced another significant weather win at NOAA in the U.S., another for the Met Office in the UK for $156 million, and topped another milestone by announcing a contract worth up to $53 million for a new Cray XC40 supercomputer at the Bureau of Meteorology in Australia.
The Marconi cluster in Italy is unique for a few reasons. The most obvious is that it is based in Italy, which currently has only five systems on the Top 500. One of those on the last list was the first phase of Marconi, a Lenovo machine with Broadwell and Omni-Path that had a theoretical peak of 2 petaflops in June.
As we reported at ISC, when the first phase of the Marconi machine was complete for Linpack benchmark runs, the “Marconi” NextScale system, based on Intel Broadwell Xeon E5 processors and fired up just in time to make the June Top 500 supercomputer rankings, is the first big deal that Lenovo closed as Lenovo.
The Marconi system also has the distinction of being the largest system in the world based on Intel’s Omni-Path follow-on to InfiniBand, although the 180 petaflops “Aurora” system at Argonne National Laboratory that is expected in 2018 will be about six times larger than the Marconi system when it is finally completed years hence. The initial phase of the Marconi system, which was announced in April of this year, has 1,512 nodes with a total of 54,432 Broadwell Xeon cores, a peak double precision performance of 2 petaflops, and a sustained Linpack performance of 1.72 petaflops. By the end of the year, a massive chunk of compute based on Intel’s “Knights Landing” Xeon Phi processors will be added, with a total of 250,000 cores and 11 petaflops peak in this section. By July 2017, a third phase of the Marconi project will add another 7 petaflops of compute, almost certainly based on Intel’s “Skylake” Xeon E5 v5 processors, pushing the peak performance of the Marconi system up to the 20 petaflops range. But Lenovo’s Scott Tease says there is an outside chance that it could be based on “Knights Hill” Xeon Phi processors, as the Aurora system at Argonne will be.
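The Marconi figures are easy to cross-check against one another. Here is a small sketch using only the numbers quoted above; the variable names and phase labels are ours, and nothing here comes from Cineca or Lenovo tooling.

```python
# Peak petaflops per Marconi phase, as quoted in the text.
phase_peak_pflops = {
    "phase 1 (Broadwell Xeon)": 2.0,
    "phase 2 (Knights Landing)": 11.0,
    "phase 3 (planned for July 2017)": 7.0,
}
total_peak_pflops = sum(phase_peak_pflops.values())

# Phase 1 Linpack efficiency: sustained performance divided by peak.
phase1_linpack_pflops = 1.72
phase1_efficiency = phase1_linpack_pflops / phase_peak_pflops["phase 1 (Broadwell Xeon)"]

# Cores per node in phase 1, checking that the division is exact.
phase1_cores, phase1_nodes = 54_432, 1_512
cores_per_node, leftover_cores = divmod(phase1_cores, phase1_nodes)

print(f"Total planned peak: {total_peak_pflops:.0f} petaflops")
print(f"Phase 1 Linpack efficiency: {phase1_efficiency:.0%}")
print(f"Phase 1 cores per node: {cores_per_node} (remainder {leftover_cores})")
```

The three phases sum to the 20 petaflops quoted for the full system, and the first phase works out to 36 cores per node, which would be consistent with two 18-core sockets per node, although the socket count is not stated here.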
Over the longer haul, Cineca will have a follow-on system that will see it push the performance of its flagship system to somewhere between 50 petaflops and 60 petaflops by 2020. The combined investment for Marconi and its follow-on is a mere €50 million – and that is for both phases.
Argonne National Lab already has a leading supercomputer, Mira, and will be home to the 2018 Aurora supercomputer. In the interim, however, a system to help teams bridge the gap was needed and Theta was born. What is interesting here is that the peak for Theta is not as high as Mira’s, but teams at Argonne are far more interested in real application performance than theoretical top performance. Accordingly, they have worked with Cray to develop the 8.5 petaflop system, based on Knights Landing and the Aries interconnect across its 3,240 nodes.
The team has a great deal of work to do to get past years of investments in BlueGene systems. As we reported in April, 2015, with BlueGene buried, Argonne is looking to new partnerships with Cray and Intel to power their next generation of applications and internal work on tweaking large systems for energy efficiency and reliability. The early work begins with Theta before hitting full production swing with Aurora, and this means a steep learning curve for groups at Argonne that have tuned for BlueGene over the years—and who still plan on using some of the key operational tools that were optimized for IBM systems.
This will serve as a large testbed cluster in advance of Aurora, in part to help the Argonne teams make an architectural leap out of their comfort zone, and by necessity. Now that IBM has formally, but quietly, moved on from its massively parallel BlueGene system, it leaves IBM-centric labs like Argonne in the cold—and after that sort of abrupt vendor exit, it stands to reason that they would look beyond Big Blue to support their new supercomputing ambitions.
Argonne has been a notable consumer and supporter of IBM BlueGene systems since the very beginning, starting with the formation of the BlueGene Consortium with IBM in 2004, all the way through successive generations of the system up to Mira, a leading-class BlueGene/Q that went into production in 2013. And as one might imagine, it’s not easy for the teams at Argonne who have worked on BlueGene machines for a large part of their careers to watch the line fade into darkness in favor of the push for OpenPower.
“The L to P to Q on the Blue Gene were evolutionary changes, they were changes, but not significant. Jumping now to the Knights series of Intel processors for Theta means an early chance for people to start porting their codes to that architecture, which we hope will make moving to Aurora much easier,” said Bill Allcock, director of operations at Argonne Leadership Computing Facility.
The National Center for Atmospheric Research in the United States is going to be replacing its current 1.5 petaflops “Yellowstone” massively parallel Xeon system with a kicker based on future Xeon chips from Intel that will weigh in at an estimated 5.34 petaflops and offer the weather and climate modeling research organization lots more oomph to run its simulations.
The HPE/SGI Cheyenne will be housed in the same Wyoming Supercomputing Center where the Yellowstone system was installed in 2012. The center is located in Cheyenne, Wyoming, hence the name of the system. These two petaflops-class machines will run side-by-side for a while until the newer one is fully operational.
Cheyenne will have 145,152 cores, about twice as many as Yellowstone could deploy on workloads. But the performance improvement on real-world weather and climate modeling applications is expected to be larger than that. Cheyenne, at 5.34 petaflops, has about 3.6 times the peak performance of Yellowstone, at 1.5 petaflops. “We are projecting that our workloads will support 2.5X probably,” says Anke Kamrath, director of the Operations and Services Division at NCAR. “Just because you make the processors a little faster does not mean you can take advantage of all of the features.”
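The gap between peak and projected application gains can be made concrete with a couple of divisions. This is a rough sketch built only on the figures quoted above; the variable names are illustrative.

```python
# Figures as quoted in the text.
cheyenne_peak_pflops = 5.34
yellowstone_peak_pflops = 1.5
projected_workload_speedup = 2.5  # NCAR's projection for its own workloads

# How much faster Cheyenne is on paper.
peak_ratio = cheyenne_peak_pflops / yellowstone_peak_pflops

# Share of the on-paper gain that NCAR expects to see on real workloads.
share_of_peak_gain = projected_workload_speedup / peak_ratio

print(f"Peak ratio: {peak_ratio:.2f}x")
print(f"Projected workload gain is about {share_of_peak_gain:.0%} of the peak gain")
```

In other words, NCAR is counting on roughly 70 percent of the theoretical improvement showing up in its actual modeling runs.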
The plan is to have around 20 percent of the nodes have 128 GB of main memory, with the remaining 80 percent being configured with 64 GB, allowing for different parts of the cluster to run applications with differing needs for memory. The machine will have a total of 313 TB of memory, and that’s a little more than twice the aggregate main memory of Yellowstone, which stands to reason.
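The memory total can be roughly reproduced from the node split. The sketch below assumes the node count implied by the rack figures quoted elsewhere in this article (144 nodes per rack across 28 racks), treats the 20/80 split as exact, and uses decimal terabytes; all three are our assumptions, so the result is only an approximation.

```python
# Node count derived from the rack figures in the article (an assumption here).
nodes_per_rack, racks = 144, 28
total_nodes = nodes_per_rack * racks

# "Around 20 percent" of nodes get 128 GB; the rest get 64 GB.
large_nodes = round(0.20 * total_nodes)
small_nodes = total_nodes - large_nodes

total_gb = large_nodes * 128 + small_nodes * 64
total_tb = total_gb / 1000  # decimal terabytes

print(f"{total_nodes} nodes: {large_nodes} x 128 GB + {small_nodes} x 64 GB")
print(f"Aggregate memory: about {total_tb:.0f} TB")
```

This lands at about 310 TB, slightly under the quoted 313 TB, which suggests the share of 128 GB nodes is a little above an even 20 percent.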
A system upgrade is not built into the Cheyenne deal, but the 9D enhanced hypercube topology that SGI and partner Mellanox Technologies have created for the ICE XA system from SGI allows for an easy upgrade if NCAR just wants to expand Cheyenne at some point. (The hypercube topology allows for nodes to be added or removed from the cluster without shutting the cluster down or rewiring the network.)
“Things change so much, who knows about the future,” concedes Kamrath. “We could use Knights Landing, there are ARM processors coming out. I expect for our next procurement there will be a much higher diversity of things because there is more competition coming, which is a good thing.”
The ICE XA design crams 144 nodes and 288 sockets into a single rack, and Cheyenne will have 28 racks in total. The ICE XA machines, which debuted in November 2014 and which have had a couple of big wins to date, offer a water-cooled variant and NCAR will be making use of this to increase the efficiency of the overall system. In fact, NCAR expects that Cheyenne will be able to do more than 3 gigaflops per watt, which is more than three times as energy efficient as the Yellowstone machine it replaces.
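The rack, core, and efficiency figures for Cheyenne hang together, as a quick calculation shows. This is a sketch using only the numbers quoted in this article; the power figure is a rough ceiling that assumes the 3 gigaflops per watt applies to the 5.34 petaflops peak rate, which the source does not specify.

```python
# Rack-level figures as quoted in the text.
racks = 28
total_nodes = 144 * racks
total_sockets = 288 * racks

# Cores per socket, checking that the quoted core count divides evenly.
total_cores = 145_152
cores_per_socket, remainder = divmod(total_cores, total_sockets)

# Rough power ceiling: peak rate divided by the minimum quoted efficiency.
peak_flops = 5.34e15                 # 5.34 petaflops
min_flops_per_watt = 3e9             # "more than 3 gigaflops per watt"
max_power_megawatts = peak_flops / min_flops_per_watt / 1e6

print(f"{total_nodes} nodes, {total_sockets} sockets, {cores_per_socket} cores per socket")
print(f"Power at peak would be under about {max_power_megawatts:.2f} MW")
```

The node and socket counts imply two-socket nodes with 18 cores per socket, and the efficiency target implies a draw of less than roughly 1.8 megawatts under these assumptions.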