Staying On The Cutting Edge At TACC
December 3, 2015 Timothy Prickett Morgan
By definition, the national HPC labs are on the very bleeding edge of supercomputing technology, which is necessary given the scope and scale of the problems they are trying to solve through simulation and analysis and enabled by the largesse of their budgets. A handful of other supercomputer centers are hotbeds of experimentation and innovation, and have a diversity of systems that rivals the national labs. The Texas Advanced Computing Center is one of them.
It is a delicate balance that HPC centers like TACC have to strike. Like every other IT organization, TACC has to fight for funding to get new machinery and then run it at the highest utilization possible for a few years to extract that value back out of the system. Machines roll in, and machines roll out, and every new box not only does more work, but often has a new way of accomplishing that work.
The Next Platform had a chat with Tommy Minyard, director of advanced computing systems at TACC, about the experimental nature of the center and what compute, storage, and networking technologies are being embraced now to run its HPC workloads for the next several years.
TACC has two new systems that are coming online now. The first is “Lonestar 5,” a $10 million Cray XC40 system rated at 1.2 petaflops, with 1,252 nodes and 30,048 cores, that enters production use this month. That system is being equipped with 5 PB of “Wolfcreek” SFA14K storage arrays from DataDirect Networks running the Lustre file system. The Lonestar 5 machine was acquired to support the University of Texas system and other Texas institutions, including Texas Tech and Texas A&M. It is of a slightly more modest scale than the $27.5 million “Stampede” cluster, which was funded by the National Science Foundation, uses a mix of Xeon and Xeon Phi compute elements, and is still the big HPC workhorse at TACC. Stampede was built by Dell, which hails from the region of course, and has 6,400 PowerEdge server nodes. The system has 522,080 cores across its Xeons and Xeon Phis, with 2.2 petaflops coming from the former and 7.4 petaflops coming from the latter. It has 14 PB of storage dedicated to it, based on a homegrown Lustre cluster.
The second new machine, which we talked about here, is called “Hikari,” and it is a 432-node, water-cooled Apollo 8000 system from Hewlett Packard Enterprise that is experimenting with high voltage power distribution to the racks, hot water cooling, and 100 Gb/sec EDR InfiniBand. This machine is more modest than either Stampede or Lonestar 5, with over 10,000 cores and over 400 teraflops of double precision number crunching.
Why Not Just Upgrade?
It is logical to ask why TACC didn’t just upgrade Stampede instead of adding a second system, particularly with Intel getting ready to ramp its “Knights Landing” Xeon Phi processors and coprocessors in the first half of next year.
“The reason we wanted Lonestar 5 now is that it keeps a good cadence with the Xeon processors,” says Minyard. “We have Intel’s Sandy Bridge Xeons in Stampede right now, and Lonestar 5 uses the new Haswell Xeon processors. The performance improvements we see just at the processor level are definitely substantial and justify getting another system.”
This stands to reason. The Sandy Bridge Xeons could do 8 double precision floating point operations per clock per core, while the Haswell Xeons can do twice that at 16 flops per clock per core. Add in the fact that you can get more cores for the money and more cores on the die, and you can dial up the flops per socket quite a bit. In this case, the Xeon part of Stampede is using 8-core Xeon E5-2680 processors, each rated at 172.8 gigaflops running at its base 2.7 GHz clock speed. The Lonestar 5 machine is using 12-core Xeon E5-2680 v3 processors, which run at 2.5 GHz with their AVX vector units running at 2.1 GHz, and each delivers 403 gigaflops peak. That is a big jump per socket.
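The per-socket arithmetic above can be checked with a short sketch. The function name and variables are illustrative, not anything TACC or Intel publishes. Note that the 172.8 gigaflops figure works out to 8 cores × 8 flops × 2.7 GHz. The two-sockets-per-node figure is inferred here from the node and core counts quoted earlier, and the idea that the 1.2 petaflops rating was computed at the 2.5 GHz base clock is likewise an inference from the numbers, not something the source states.

```python
def peak_gflops(cores: int, flops_per_clock: int, ghz: float) -> float:
    """Theoretical peak double precision gigaflops for one socket."""
    return cores * flops_per_clock * ghz


# Sandy Bridge Xeon E5-2680: 8 cores, 8 flops per clock per core.
sandy_bridge = peak_gflops(8, 8, 2.7)

# Haswell Xeon E5-2680 v3: 12 cores, 16 flops per clock per core,
# evaluated at the 2.1 GHz AVX clock and at the 2.5 GHz base clock.
haswell_avx = peak_gflops(12, 16, 2.1)
haswell_base = peak_gflops(12, 16, 2.5)

# Lonestar 5 as quoted in the article: 1,252 nodes and 30,048 cores.
nodes, total_cores = 1252, 30048
cores_per_node = total_cores // nodes      # inferred from the quoted totals
sockets_per_node = cores_per_node // 12    # 12 cores per socket

# Whole-system peak in petaflops at each clock (1 petaflops = 1e6 gigaflops).
system_pflops_base = nodes * sockets_per_node * haswell_base / 1e6
system_pflops_avx = nodes * sockets_per_node * haswell_avx / 1e6

print(f"Sandy Bridge socket: {sandy_bridge:.1f} GF")
print(f"Haswell socket:      {haswell_avx:.1f} GF (AVX clock)")
print(f"Per-socket ratio:    {haswell_avx / sandy_bridge:.2f}x")
print(f"Lonestar 5 peak:     {system_pflops_base:.2f} PF (base clock), "
      f"{system_pflops_avx:.2f} PF (AVX clock)")
```

These are theoretical peaks only; sustained performance on real applications will be lower and depends on the code.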
Firing up Lonestar 5 this month does not mean, however, that Stampede is going to be neglected. “Stampede is about to be three years old, and in terms of HPC systems it is already getting a little long in the tooth,” says Minyard. “We are extremely interested in the Knights Landing processor and what is going to happen to that. We have a plan in to NSF right now to augment the existing Stampede system with additional Knights Landing nodes. We had in our original plan for Stampede intended to put Knights Landing cards in the system, and Intel has since changed their plan of record and cards are not going to be available in the timeframe we had planned. So we are adjusting our plan to use the self-hosted nodes that will be available hopefully in the first half of next year.”
When asked if TACC would be one of the first HPC centers out the door with production Knights Landing chips, Minyard quipped that he hoped so. The plan is to have somewhere around 500 nodes using standalone Knights Landing processors that get grafted onto the Stampede machine.
“We have already announced our early science program, and as soon as we get the hardware in we are going to work with some very specific groups that use a lot of time on existing NSF systems and work on porting and tuning their applications for the Knights Landing processor,” Minyard explains. “We are actually coordinating that effort with the US Department of Energy so we are not duplicating effort on the same codes; they will work on a specific set of applications and our groups will work on a different set and share all the work on these important codes that run on all of the systems both at the DOE and at the NSF.”
Those 500 Knights Landing nodes will be linked together using the new Omni-Path interconnect from Intel, which was launched at the SC15 supercomputer conference last month as a kicker to its True Scale InfiniBand. These nodes will be bridged into Stampede and the Stockyard storage using Lustre routers.
“It will be new technology, but this is one of the roles we play, venturing into the new technologies and figuring out how to get them to work and sharing those results with everybody else so they don’t have to reinvent the wheel. The SFA14K is a similar thing: It is new technology, new hardware, and we are putting it through its paces right now.”
For Lonestar 5, TACC is using DDN’s edition of the Intel Lustre Enterprise Edition stack for the system’s scratch storage. This is distinct from the future GS series appliances from DDN that will be announced next year, pairing Lustre and the SFA14K hardware. (Thus far, DDN has announced the base SFA14K arrays and the flash-based IME14K application and file system accelerator, but not Lustre or GPFS appliances.) The DDN storage is linked to the Cray XC40 through a dozen two-port InfiniBand interface cards that are made by Cray to link to storage and possibly other systems. In this case, the InfiniBand cards bridge the storage with the internal Aries fabric that is used for lashing compute elements together. Each one of those twelve InfiniBand blades has Lustre routers embedded on it to route the Lustre traffic to the Aries fabric.
Like most HPC centers, TACC sends out requests for proposal that allow for the bidding of the compute cluster separately from the storage clusters with the systems it builds, and it seeks multiple bids for each part of the system to get the best performance and bang for the buck.
“We did evaluate a couple of different options, and we did talk to Cray about Sonexion hardware and we also discussed potentially doing our own setup, since we have done our own file systems as we did with Stampede, where we took a bunch of servers with a bunch of disks and set them up as a Lustre file system. In this case, we wanted to have a somewhat proven technology with Lustre, and the work that we did with DDN on “Stockyard,” our global file system, definitely went a long way towards us going with them for the Lonestar 5 file system. But the decision we made was based on the price and performance that was being offered on the DDN gear.”
Stockyard is a central repository of user application data, rather than the scratch data that is generated as the production HPC systems run simulations and models and that is stored on Lustre file systems. The Stockyard global file system is the first such central repository that TACC has had since the early 1990s, when it used Fibre Channel SANs for this purpose. But two years ago, as researchers were looking to use datasets on various applications that ran on different systems at TACC, the center decided to reincarnate the application data repository idea, but this time going with a Lustre cluster instead of SANs. In that case, TACC chose the latest-greatest storage from DDN, its SFA12K arrays, and put together a 20 PB setup. Lustre routers are used to link Stampede, Lonestar 5, and other systems like the flash-heavy “Wrangler” system, which we told you all about here, to the Stockyard repository.
“The routers are one of the nice things about using Lustre,” says Minyard. “If we need to scale up performance, we just add more routers and control the amount of bandwidth each one of the clusters gets based on the number of routers they have into the Stockyard system.”
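The router-based scaling Minyard describes can be sketched as a simple proportional model. Everything numeric here is hypothetical: the router counts per cluster and the per-router throughput are placeholders, not TACC figures, and the model assumes bandwidth scales linearly with router count while ignoring any limits on the storage servers themselves.

```python
def cluster_bandwidth(routers: dict, gb_per_sec_per_router: float) -> dict:
    """Aggregate bandwidth into the shared file system for each cluster,
    assuming every router contributes the same throughput."""
    return {name: count * gb_per_sec_per_router
            for name, count in routers.items()}


# Hypothetical router counts; the article does not give the real ones.
routers = {"stampede": 16, "lonestar5": 8, "wrangler": 4}

# Hypothetical sustained throughput of a single Lustre router, in GB/sec.
per_router = 5.0

bandwidth = cluster_bandwidth(routers, per_router)
for name, gb_per_sec in bandwidth.items():
    print(f"{name}: {gb_per_sec:.0f} GB/sec")

# Giving a cluster more routers raises its share of bandwidth in this model.
routers["lonestar5"] += 4
print(cluster_bandwidth(routers, per_router)["lonestar5"])
```

In this simplified view, the router count is the single knob that sets each cluster's ceiling, which is the appeal Minyard points to.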
The Stockyard global file system has about 5 PB of its 20 PB of total capacity crammed with data, even after TACC imposes quotas of 1 TB per user and maybe 20 TB to 30 TB per group of users that are sharing data, just to give you a sense of the number of datasets that are being herded there. (The Lustre scratch file systems embedded in the HPC clusters do not have quotas, but TACC does purge files that are older than 60 days.) Three years from now, when Stockyard is replaced with a new global file system, TACC will have a monumental data movement problem. Replacing an old compute cluster with a new one is relatively easy, because the amount of data that has to be moved over is small (or, in this case because of the repository, close to zero).
We have heard it before in many different contexts in system architecture, and we will say it again: Compute is easy, data movement is hard.
Precisely how long will it take to move all the data off Stockyard to a new repository? “That’s a good question,” says Minyard with a laugh. “It will take quite a bit of time. Hopefully not a year, if we can get really good transfer rates for the data. We will definitely have multiple streams of data from one system to another.”
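To get a feel for why the answer is measured in months rather than days, here is a back-of-the-envelope sketch. The transfer rates and stream counts are hypothetical, chosen only for illustration. It uses decimal units (1 PB = 1,000,000 GB) and assumes that multiple streams scale perfectly, which real migrations rarely achieve.

```python
SECONDS_PER_DAY = 86_400


def transfer_days(petabytes: float, gb_per_sec: float, streams: int = 1) -> float:
    """Days needed to move a dataset at a sustained per-stream rate,
    assuming ideal linear scaling across parallel streams."""
    total_gb = petabytes * 1_000_000  # decimal units: 1 PB = 1e6 GB
    return total_gb / (gb_per_sec * streams) / SECONDS_PER_DAY


# The article puts Stockyard at about 5 PB used out of 20 PB of capacity.
for petabytes in (5, 20):
    for streams in (1, 10):
        # 1 GB/sec per stream is a hypothetical sustained rate.
        days = transfer_days(petabytes, gb_per_sec=1.0, streams=streams)
        print(f"{petabytes:>2} PB, {streams:>2} stream(s) at 1 GB/sec each: "
              f"{days:6.1f} days")
```

Under these assumptions, a full 20 PB repository on a single 1 GB/sec stream would take most of a year, which is why Minyard stresses multiple streams and good transfer rates.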
Data gravity issues are not the only challenge with this central repository approach that TACC is trying again. If Stockyard is down, then all of the applications on its production HPC systems are affected, which would not be the case if each machine had its own dedicated application data storage. But, says Minyard, downtime on Stockyard is infrequent, and the benefit of not having multiple copies of data scattered around its systems (which would need to be moved for different processing and which would take up space and cost money) outweighs the occasional inconvenience.
On another storage front, TACC is also keeping an eye on flash-based burst buffers, which are available from both Cray and DDN. “We are actually evaluating the Infinite Memory Engine on Stampede, our other large scale system,” Minyard reveals. “That system has already been in place and we had the opportunity to get some of the DDN burst buffer technology. The main reason for testing it there is the scale. We have a lot more clients and a lot more I/O that happens on Stampede, and we are much more familiar with the InfiniBand fabric for running Lustre and IME over.”
When and if Lonestar 5 needs a burst buffer somewhere down the road, you can bet that Cray, with its DataWarp buffer, and DDN, with its IME14K, will both be trying to win that deal.