Architecting Storage And Compute At Exascale

When IBM’s Summit supercomputer was officially unveiled in June, some hailed it as the first exascale system because of its peak performance in applications that made heavy use of GPU acceleration at low precision. It is perhaps more accurately described as a 200 petaflops system, based on its double precision performance, and although this makes it the most powerful computer in operation today, the industry has some way to go to deliver a true exascale system.

The road to sustainable exascale performance will prove a tough challenge because simply scaling the Summit architecture to five times as many nodes may not deliver the five times boost in performance to reach exaflops level, and because any deployment has to fit within monetary and energy budget constraints.
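One way to see why five times the nodes need not mean five times the performance is Amdahl's law, a textbook model rather than anything the article's sources cite. The sketch below assumes a fixed-size problem in which only part of the work speeds up with node count; the function name and the parallel fractions are illustrative assumptions.

```python
def amdahl_speedup(parallel_fraction: float, scale: float) -> float:
    """Speedup when only the parallel part of a job runs `scale` times faster.

    Amdahl's law: the serial remainder (1 - parallel_fraction) does not
    benefit from the extra nodes at all.
    """
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / scale)


if __name__ == "__main__":
    # Hypothetical parallel fractions, chosen only for illustration.
    for fraction in (0.90, 0.99, 0.999):
        speedup = amdahl_speedup(fraction, 5)
        print(f"parallel fraction {fraction:.3f}: 5x nodes -> {speedup:.2f}x")
```

Under this model, a code that is 90 percent parallel gains only about 3.6X from five times the nodes, and real systems also pay communication costs that the model ignores.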

Nevertheless, the prize of exascale is already drawing investment in Europe, Asia, and the United States, thanks to big data driving advances in machine learning and AI that are feeding into industry sectors as different as medicine and autonomous vehicles. And the thing about machine learning is that the more data you have available to train your model, the better it performs, so bigger is most definitely better.

Meanwhile, AI, data analytics, and traditional HPC simulation workloads are converging, and there is a requirement for a system powerful enough to do all these together, says Chris Lindahl, product marketing director of supercomputing products at Cray.

“AI is really important for future needs, but there is still a need to do the traditional model simulation,” says Lindahl, “and you can use those things together really effectively. So you might end up with a simulation built on an analysis that an AI algorithm has run, and being able to use the output of the simulation to feed back in and retrain the network will also be important.”

This means that the use cases and opportunities for exascale could be huge. And while there will only be a tiny number of sites capable of operating such a system initially, this will quickly change. When the first petaflop machines were introduced, it took just a couple of years for this to become a fairly standard unit of performance, and Lindahl predicts the same will happen with exascale.

Cray, which is partnering with Intel on one of the three exascale projects being funded by the US Department of Energy, believes it could build such a system now with existing technology, but the limiting factors are power and budget constraints. In fact, scaling up today’s architectures to exascale would lead to a power consumption of over 100 megawatts, whereas the target energy budget is 20 megawatts to 40 megawatts.
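The power figures above translate directly into an efficiency target. The sketch below is plain arithmetic on the numbers quoted in the paragraph, taking one exaflops as 10^18 floating point operations per second; the function name is an illustrative assumption.

```python
EXAFLOPS = 1e18  # floating point operations per second


def required_gflops_per_watt(power_megawatts: float, flops: float = EXAFLOPS) -> float:
    """Energy efficiency needed to sustain `flops` within a given power budget."""
    watts = power_megawatts * 1e6
    return flops / watts / 1e9


if __name__ == "__main__":
    # 100 MW is the scaled-up estimate; 40 MW and 20 MW bound the target budget.
    for megawatts in (100, 40, 20):
        efficiency = required_gflops_per_watt(megawatts)
        print(f"{megawatts:>3} MW -> {efficiency:.0f} gigaflops per watt")
```

In other words, a 20 megawatt budget demands five times the energy efficiency of a 100 megawatt machine delivering the same exaflops.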

So what kind of architecture can we expect to see in such systems? It is fair to assume that they will use a mix of processor types, just as many supercomputers already rely on a combination of CPUs and GPU accelerators. Lindahl believes that this will expand as customers start using technologies such as FPGAs and specialized hardware to support AI functions.

Among the complicating factors is that future systems may have to use lower clock frequencies in order to keep power consumption down, implying that chips with a larger number of cores may be required to push performance up.

This in turn impacts the available memory bandwidth and capacity per core, which could favor processors with a larger number of memory channels such as AMD’s Epyc or IBM’s Power9 processors. However, Cray’s partnership with Intel on its exascale project means that it is likely to use Xeon parts. Intel is said to be developing new chips based on the Xeon Scalable architecture to replace the “Knights Hill” Xeon Phi that would have powered the “Aurora” supercomputer.

Exascale is likely to require more nodes than current petaflop systems, which puts the spotlight on how you interconnect them all. IBM’s Summit uses dual-rail EDR 100 Gb/sec InfiniBand links connecting its nodes in a fat-tree or folded-Clos topology, but growing such a network means adding another level or more to the tree, increasing the number of hops and thus the latency between nodes.
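The cost of adding a level can be estimated with the standard sizing rule for a full-bisection fat tree built from identical switches. The sketch below is a simplification and not a model of Summit's actual fabric: it assumes a single rail, 36-port switches, and a node count of 4,608 used purely as a starting point.

```python
def fat_tree_max_hosts(radix: int, levels: int) -> int:
    """Hosts supported by a full-bisection fat tree of `radix`-port switches.

    Every level below the top splits its ports half down, half up;
    the top level uses all of its ports downward. Assumes an even radix.
    """
    return 2 * (radix // 2) ** levels


def fat_tree_max_switch_hops(levels: int) -> int:
    """Worst-case number of switches a packet crosses (up to the top and back)."""
    return 2 * levels - 1


def levels_needed(hosts: int, radix: int) -> int:
    """Smallest number of levels that can attach `hosts` endpoints."""
    levels = 1
    while fat_tree_max_hosts(radix, levels) < hosts:
        levels += 1
    return levels


if __name__ == "__main__":
    radix = 36           # assumed switch port count
    base_nodes = 4608    # assumed starting node count
    for nodes in (base_nodes, 5 * base_nodes):
        levels = levels_needed(nodes, radix)
        hops = fat_tree_max_switch_hops(levels)
        print(f"{nodes} nodes: {levels} levels, up to {hops} switch hops")
```

Under these assumptions, growing the node count fivefold pushes the tree from three levels to four and the worst-case path from five switches to seven.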

One alternative is Cray’s Dragonfly, a hierarchical topology in which the nodes within each group are connected using all-to-all links and each group is also wired to all the other groups, a design intended to scale linearly in both performance and cost. The “Aurora” system would have used this topology with Omni-Path links, a technology that is due to get an upgrade to 200 Gb/sec speeds.
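How far such a topology can stretch follows from a few parameters. The sketch below uses the commonly cited balanced dragonfly parameterization (p endpoints per router, a = 2p routers per group, h = p global links per router); these symbols and the balance rule are assumptions for illustration and do not describe any particular Cray product configuration.

```python
def balanced_dragonfly(p: int) -> dict:
    """Size a maximal balanced dragonfly from p endpoints per router.

    a routers per group are fully connected to each other, and each router
    contributes h global links, so a group can reach a * h other groups.
    """
    a = 2 * p                  # routers per group (balance rule a = 2p)
    h = p                      # global links per router (balance rule h = p)
    radix = p + (a - 1) + h    # endpoint + local + global ports per router
    groups = a * h + 1         # one global link to every other group
    nodes = p * a * groups
    return {"radix": radix, "groups": groups, "nodes": nodes}


if __name__ == "__main__":
    for p in (4, 8, 16):
        size = balanced_dragonfly(p)
        print(f"p={p:>2}: radix {size['radix']:>2}, "
              f"{size['groups']:>3} groups, {size['nodes']:>6} nodes")
```

Because every group has a direct link to every other group, a minimal route crosses at most one local link, one global link, and one more local link, regardless of how many groups there are.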

The growing importance of AI and big data also has implications for the storage layer, which will have to scale up by an order of magnitude in performance and capacity to keep the compute nodes fed with data. This means having a full exabyte of storage capacity, with requirements for throughput of about 15 TB/sec, compared with 1 TB/sec to 3 TB/sec for current systems.

HPC environments have traditionally relied on large numbers of hard disks accessed via a parallel file system such as Lustre or IBM’s Spectrum Scale (GPFS) to meet both the capacity and throughput requirements. These are optimized for large streaming reads and writes, but newer applications have different access patterns, with machine learning characterized by an unpredictable mix of random and sequential accesses of various sizes, which can greatly impact the performance of a parallel file system.

The obvious answer is to use flash-based storage. Flash has much lower and more predictable latency than spinning disks and read and write speeds can be between 10X and 30X faster. But the cost per gigabyte of flash is still about 10X higher than that of rotating disk, a situation unlikely to change in the near future. This is because, while the cost of making flash chips has been falling for several years, disk vendors have kept on improving the areal density of disk platters, so the cost per gigabyte is coming down nearly in parallel with that of flash media.

In reality, a hybrid storage stack combining flash and disk is going to be used even in the exascale arena. This could take a similar form to burst buffers, which were introduced in HPC environments to speed checkpointing, but which are now finding a broader use as a new primary storage layer between the compute nodes and disk storage.
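The economics behind a hybrid stack are easy to sketch. Taking the roughly 10X per-gigabyte premium for flash quoted above as the only input, the function below gives the media cost of a mixed tier relative to an all-disk system of the same capacity; the function name and the flash fractions tried are illustrative assumptions.

```python
def relative_media_cost(flash_fraction: float, flash_premium: float = 10.0) -> float:
    """Media cost of a hybrid stack relative to all-disk at the same capacity.

    `flash_fraction` is the share of total capacity held on flash;
    `flash_premium` is the per-gigabyte price ratio of flash to disk.
    """
    disk_fraction = 1.0 - flash_fraction
    return disk_fraction + flash_fraction * flash_premium


if __name__ == "__main__":
    # Hypothetical flash shares, chosen only for illustration.
    for fraction in (0.0, 0.05, 0.10, 1.0):
        cost = relative_media_cost(fraction)
        print(f"{fraction:>4.0%} flash -> {cost:.2f}x the all-disk media cost")
```

Putting a tenth of the capacity on flash roughly doubles the media bill, whereas going all-flash multiplies it by ten, which is why a thin flash layer in front of disk is the more plausible design.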

Under this scenario, the application would read from and write to the flash layer of the storage stack. Those small-scale random reads and writes, characteristic of new-style workloads like machine learning, would be absorbed by the flash, with the data being committed back to the disk layer at the end of the job.
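The write-back behavior described here can be illustrated with a toy model. The class below is a hypothetical sketch, not any vendor's burst buffer API: it keeps small writes in an in-memory dictionary standing in for flash, then commits them to a second dictionary standing in for disk in one ordered pass at the end of the job.

```python
class BurstBuffer:
    """Toy write-back buffer: absorb random writes, flush once in order."""

    def __init__(self) -> None:
        self.flash = {}        # block offset -> data not yet on disk
        self.disk = {}         # block offset -> data committed to disk
        self.flush_passes = 0  # number of streaming passes made to disk

    def write(self, offset: int, data: bytes) -> None:
        # Small random writes land on flash; rewrites replace the old copy.
        self.flash[offset] = data

    def read(self, offset: int):
        # Serve the newest copy: flash first, then fall back to disk.
        if offset in self.flash:
            return self.flash[offset]
        return self.disk.get(offset)

    def flush(self) -> None:
        # End of job: commit everything in offset order, as one streaming pass.
        for offset in sorted(self.flash):
            self.disk[offset] = self.flash[offset]
        self.flash.clear()
        self.flush_passes += 1


if __name__ == "__main__":
    buffer = BurstBuffer()
    for offset in (7, 2, 9, 2, 4):          # out-of-order writes, one rewrite
        buffer.write(offset, f"block-{offset}".encode())
    buffer.flush()
    print(sorted(buffer.disk), buffer.flush_passes)
```

The disk layer sees a single ordered pass instead of five scattered writes, which is the access pattern parallel file systems on hard disks handle best.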

Flash has made a minimal impact in the HPC space so far, according to Cray’s director of storage product management Larry Jones, because most of the focus for flash has been on accelerating enterprise applications like databases, rather than delivering large chunks of sequential I/O.

This will change with the upcoming PCI-Express 4.0 NVM-Express devices, which will allow flash storage to move from a throughput of about 2 GB/sec seen with SAS SSDs to 6 GB/sec per device, “and that really changes and broadens the applicability of flash to HPC,” Jones says.
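A back-of-the-envelope count shows why the per-device figure matters. The sketch below divides the roughly 15 TB/sec throughput target mentioned earlier by the two per-device rates quoted here, using decimal units; it ignores network, controller, and data-protection overheads, so the results are lower bounds rather than sizing guidance.

```python
import math


def devices_needed(target_tb_per_sec: float, device_gb_per_sec: float) -> int:
    """Minimum device count to reach an aggregate throughput (1 TB = 1000 GB)."""
    return math.ceil(target_tb_per_sec * 1000 / device_gb_per_sec)


if __name__ == "__main__":
    target = 15  # TB/sec, the exascale storage throughput cited in the article
    for label, rate in (("SAS SSD", 2), ("PCI-Express 4.0 NVM-Express", 6)):
        print(f"{label}: at least {devices_needed(target, rate)} devices")
```

Tripling the per-device rate cuts the minimum device count from 7,500 to 2,500 for the same aggregate throughput.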

The challenge is ensuring that the data the application requires is in the fast flash tier of storage when needed. This is similar to automated storage tiers in an enterprise environment, but the issues in an exascale deployment are exacerbated by the size of the data sets.

“If I have to spend a couple of hours moving my data from the disk tier to the flash tier so I can run my application, that probably doesn’t make a lot of sense, because you are wasting compute time,” says Jones.
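The scale of the problem Jones describes is simple to estimate. In the sketch below the 10 PB data set is a hypothetical figure, while the 1 TB/sec and 15 TB/sec rates are the current and target throughputs quoted earlier in the article; the function assumes the transfer runs at the full aggregate rate.

```python
def staging_hours(dataset_tb: float, bandwidth_tb_per_sec: float) -> float:
    """Hours needed to copy a data set between tiers at a sustained rate."""
    seconds = dataset_tb / bandwidth_tb_per_sec
    return seconds / 3600


if __name__ == "__main__":
    dataset_tb = 10_000  # hypothetical 10 PB working set
    for rate in (1, 15):  # TB/sec
        hours = staging_hours(dataset_tb, rate)
        print(f"{rate:>2} TB/sec -> {hours:.2f} hours to stage {dataset_tb} TB")
```

At 1 TB/sec the hypothetical data set takes close to three hours to stage, time during which the compute nodes waiting on it would sit idle.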

This could be solved through several different approaches: policy-based software that moves data based on things like the age of the file; transparent tiering that works as a cache, trying to anticipate what data is needed; or scheduled tiering that works with the job scheduler to make sure that data is pre-loaded before a job and purged from the flash layer afterwards, ready for the next job.
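Of the three approaches, scheduled tiering is the easiest to sketch. The classes below are hypothetical and do not correspond to any real scheduler or file system interface: a flash tier with fixed capacity stages in a job's input files before it runs and purges them afterwards to make room for the next job.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    inputs: dict  # file name -> size in TB


@dataclass
class FlashTier:
    capacity_tb: float
    resident: dict = field(default_factory=dict)  # file name -> size in TB

    def used_tb(self) -> float:
        return sum(self.resident.values())

    def stage_in(self, job: Job) -> bool:
        """Pre-load a job's inputs; refuse if they would not fit."""
        missing = {f: s for f, s in job.inputs.items() if f not in self.resident}
        if self.used_tb() + sum(missing.values()) > self.capacity_tb:
            return False
        self.resident.update(missing)
        return True

    def purge(self, job: Job) -> None:
        """Drop a finished job's inputs from flash to free space."""
        for name in job.inputs:
            self.resident.pop(name, None)


if __name__ == "__main__":
    tier = FlashTier(capacity_tb=10)
    first = Job("climate", {"mesh.dat": 6})
    second = Job("training", {"images.rec": 6})
    print(tier.stage_in(first))    # fits in an empty tier
    print(tier.stage_in(second))   # does not fit alongside the first job
    tier.purge(first)
    print(tier.stage_in(second))   # fits once the first job is purged
```

A real implementation would overlap the stage-in for the next job with the tail of the current one, so that the copy does not cost compute time.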

Meanwhile, it has been mooted that scaling applications beyond a certain size may call for a new approach, ditching parallel file systems like Lustre altogether and instead using object storage, which is regarded as better suited to the kind of unstructured data used for big data analytics and AI workloads, and which can scale almost limitlessly within a single namespace.

The drawback is that object stores are typically accessed using REST APIs rather than the file system protocols used by the many existing applications still operated by HPC sites. Fortunately, Jones believes that Lustre will evolve to meet these new requirements without requiring customers to sacrifice compatibility, at least for the first exascale systems.

Whatever architecture the first exascale systems adopt, it is clear that a holistic system-wide approach is increasingly necessary in order to deliver the level of performance required. For example, the impact of moving data around such a large system could outweigh gains in compute performance, so algorithms may need to minimize this.

According to Lindahl, this means developing a deep understanding of how a calculation is made, where the data is coming from, and how you are going to get it to where it needs to be in the most efficient way. Just looking at the signaling rates of all the components will not deliver the step up in compute power that applications will soon need.

“For us, we really want to make sure that to get to exascale we’re looking at that overall system performance,” he says.
