When it comes to U.S. Department of Energy national labs with ambitions to build large capability supercomputers, Oak Ridge, Argonne, Los Alamos, Berkeley, and a few others tend to make the Top 500 grade. While all of these centers are focused on particular science objectives, some of the other national laboratories put their computational emphasis on departmental, domain-specific clusters rather than on large, shared resources like Titan at Oak Ridge.
Over the next five years, however, there may be a new entry on the list of national labs with a significant focus on large-scale computing. Although Brookhaven National Lab in New York is not expected to push Top 500 barriers with a massive pre-exascale machine, its focus is shifting away from isolated departmental clusters (and reliance on other large DOE systems) in favor of building a much larger system, with a plan to continuously upgrade on that foundation and with an eye on emerging architectures to fit its scientific emphasis on energy, materials science, and physics.
As one might imagine, these focus areas are computationally intensive, but according to Kerstin Kleese Van Dam, who has just been tasked with heading a new computational science initiative at the lab, they are also data-intensive, a clear hint at how the center might think about acquiring future systems to support the effort. In a few weeks, she says, an RFP will go out for an upcoming capability system that will include provisions for approximately 200 nodes with two GPUs each, most likely from Nvidia, to support early exascale ambitions for chemistry and materials science codes, among others.
Although the RFP won’t appear for a few weeks yet, if one figures on a fairly common high performance computing cluster with Nvidia Tesla K80 GPU accelerators and the latest Haswell E5-series processors (say, a 12-core version running at 2.3 GHz), 200 nodes would place such a system somewhere around #120 to #160 on the Top 500 list of the world’s most powerful supercomputers. (That would be roughly 600 teraflops of LINPACK performance and in excess of 900 teraflops of peak theoretical performance at double precision.) While we certainly can’t guess the full specs before the vendor and device selections have taken place, this is nonetheless notable because, so far, Brookhaven National Lab doesn’t have a single machine on the list, making it one of the only national labs without a ranking.
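Those teraflops figures can be reproduced with a quick back-of-envelope calculation. The parts below are the hypotheticals sketched above, not confirmed specs: dual-socket 12-core Haswell nodes, two K80s per node, the K80 rated at its base-clock 1.87 teraflops double precision per card, and a typical ~65 percent LINPACK efficiency assumed for a cluster of this class.

```python
# Back-of-envelope peak estimate for the hypothetical 200-node configuration.
# All figures are assumptions from the scenario above, not RFP specs.

NODES = 200
SOCKETS, CORES, GHZ = 2, 12, 2.3   # dual-socket, 12-core Haswell E5 at 2.3 GHz
FLOPS_PER_CYCLE = 16               # AVX2 with FMA, double precision, per core
K80_TFLOPS_DP = 1.87               # per card, base clock (no GPU Boost)
GPUS_PER_NODE = 2
HPL_EFFICIENCY = 0.65              # assumed LINPACK efficiency

cpu_tf = SOCKETS * CORES * GHZ * FLOPS_PER_CYCLE / 1000  # TF per node from CPUs
gpu_tf = GPUS_PER_NODE * K80_TFLOPS_DP                   # TF per node from GPUs
peak_tf = NODES * (cpu_tf + gpu_tf)

print(f"Theoretical peak: {peak_tf:.0f} TF")                  # ~925 TF
print(f"Estimated LINPACK: {peak_tf * HPL_EFFICIENCY:.0f} TF")  # ~600 TF
```

With boost clocks enabled the K80 is rated closer to 2.9 teraflops per card, which would push the peak well past a petaflop; the base-clock numbers are what line up with the "in excess of 900 teraflops" estimate.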
Kleese Van Dam says she is less concerned about having a machine on the list than about having a system prepared for the lab’s unique analytics and simulation requirements. While she said they will eventually run LINPACK once the mystery machine is delivered to determine its rightful place, broader HPC benchmarks are less of a priority than micro-benchmarks for key materials science, nanotechnology, and high energy physics codes.
What is especially interesting is that this machine, while useful as a capacity system to feed the needs of multiple users, is modest compared to what is set to come over the next five years. In addition, the state of New York has chipped in funds for a future technologies push that will lead Brookhaven to develop next-generation systems using novel architectures. Kleese Van Dam says that given the lab’s status as a CERN and LHC collaborator on the ATLAS project (Brookhaven is the Tier 1 center for the United States), teams at the center have expertise using FPGAs for data ingest and stream processing. As part of her larger goal of expanding the center’s supercomputing emphasis, she is pulling together domain experts in computational science and mathematics from across divisions. These include a number of FPGA programmers who she says will be looking into how the devices might be used as accelerators for HPC applications, something that is emerging now in a few other research arenas.
Among the other future technologies the lab will be considering are solid state drives throughout the system. There are numerous workflows where data from one task should be kept close at hand so it can feed into the next step, an area that is a good fit for SSDs. She also described how the center is watching technologies like burst buffers to help with larger I/O issues, and noted that another area of interest is parallel file systems that can offer better data movement capabilities.
Another interesting technology area mentioned was neuromorphic computing, something that has found a place in some fringe research so far but is expected to grow over the next few years, particularly in areas that feed into the larger LHC and ATLAS experiments.
“Down the road, we are watching what happens with things like burst buffers and neuromorphic computing, and specifically, we’re watching what is happening with Omni-Path, with the DataWarp technology from Cray, with IBM and how they will work with FPGAs, and of course SSDs for larger workflows. For the capacity cluster we’re putting the request out for, we will be staying conservative, but we are very much interested in exploring and being forward looking.”
In her several decades at high performance computing centers in both research and industry, Kleese Van Dam has overseen a number of procurements and has a good sense of how to design large-scale infrastructure programs that match application requirements with technologies, even when those technologies are on the bleeding edge. Before her appointment to direct the computational science effort at Brookhaven, she was Chief Scientist and Team Lead for Data Services at Pacific Northwest National Laboratory (where she held several positions over the years) and was the computing director at University College London. She also worked with German car manufacturers in the late 1980s and early 1990s to develop custom modeling and simulation capabilities that took auto companies beyond the commercial codes most firms purchase. All of this was preceded by extensive work with supercomputing and weather codes. In short, there are few people with a more well-rounded background to lead an ambitious project to take a DOE center well known for its science and make it one that is highly respected for its cutting-edge HPC work.
In terms of the forthcoming RFP for Brookhaven’s 200-node cluster, Kleese Van Dam says she has worked with many vendors over the years and understands well what works and what doesn’t. The primary motivator in the selection process is finding a vendor that “won’t just put the machine on the floor and walk away.” Brookhaven is looking for a partner to develop specific codes for specific programs, and that takes a strong connection with the vendor to tune, optimize, and get the best performance out of a machine. She pointed to past experiences with companies like Cray, which she says stood out for its tuning and involvement, and also noted strong relationships at Brookhaven with IBM over the years (after all, they are right up the road, so to speak) and, more recently, Intel.
“Our focus is really going to be on the data analysis side of high performance computing, especially for streaming, multi-source data, and using that in combination with modeling and simulation,” Kleese Van Dam says. This is not surprising given the many research facilities that present just those sorts of problems: the new National Synchrotron Light Source, the Relativistic Heavy Ion Collider, and the Center for Functional Nanomaterials will all contribute to the scope of applications running on future systems. “We will also be focused on the programming models to see what architectures are best suited for the very heterogeneous workloads we have. These are still open research questions, but we are looking ahead for Brookhaven to be committed to these goals over the next years.”