The Bespoke Supercomputing Architecture That Stood the Test of Time

Nicole Hemsoth Prickett

5 months ago

In the history of computing, there has been an endless push and pull between the need for general-purpose versus fine-tuned custom systems and software.

While general purpose will, by nature, prevail on leadership-class HPC, the work done in the meticulous world of ASIC design and system and software optimization eventually filters into architectural thinking. At present, ultra-specialization will likely come back around again for specific use cases in AI. One can argue it already has, if the first wave of AI chip startups was any indication.

When it comes to special-purpose supercomputing, the go-to example is the Anton supercomputer architecture, a custom system dedicated to solving (and dramatically speeding up) complex problems in molecular dynamics with levels of speed and fidelity impossible even on top exascale supers.

Anton (and its founding father) were front of mind this year as the architecture was formally recognized at the Supercomputing Conference (SC23) as a Test of Time Award winner. David Shaw, founder of research firm D.E. Shaw, accepted the award and spoke at length about the evolution of the Anton system architecture and its algorithms, which have morphed to meet the times since the system’s 2008 unveiling.

Shaw represents a departure from tradition across the board. The Test of Time Awards have largely centered on academic achievements, but Shaw took a circuitous route to computational biology. After finishing his PhD at Stanford and teaching computer science at Columbia (while working on the NON-VON parallel system architecture), Shaw joined Morgan Stanley in the mid-1980s. He then founded a hedge fund, D.E. Shaw & Company, which focused initially on optimized trading algorithms, before founding D.E. Shaw Research.

His work, whether for trading or grand-scale science challenges, emphasized speed and optimization on large parallel systems.

The prototype system was first described in a 2007 ACM paper, which claimed that the massively parallel machine “should be capable of executing millisecond-scale classical MD simulations of such biomolecular systems.”

The original paper also explained how the system, which was set to emerge in 2008 with “512 identical MD-specific ASICs that interact in a tightly coupled manner using a specialized high-speed communication network” could “dramatically accelerate those calculations that dominate the time required for a typical MD simulation.”

As it turned out, Anton 1 could do everything the D.E. Shaw team outlined then. After a decade-plus of work on both the system architecture and the application, D.E. Shaw has, as of 2023, six drugs in human clinical trials. Two were developed independently from concept to trial and four others were developed with Relay Therapeutics, which focuses on protein dynamics for identifying new drug candidates.

“Our long-term goal has always been to design new molecules that can serve as medications, which is something we’re finally doing, just in the last few years,” Shaw told the SC23 audience. He says that while his team is exploring the intersection of machine learning with the future of drug discovery, the supercomputing architecture piece has “been core all along and is still the most central.”

A central component of the Anton story over the years has been speed and scale, not just of the architecture (Anton 3 is 2X more powerful than the first generation) but of its ability to get the time scales of MD simulations down so far that complex interactions can be observed well enough to design highly targeted treatments for a range of conditions. Getting simulations down to the millisecond in 2008 was groundbreaking, but Anton 3 is in the 1-2 femtosecond range. As Shaw explained at SC23:

“In 2008, the fastest supercomputers of the time had simulated about 1/10th of a microsecond of time in a day. The longest that had ever been done was 10 microseconds and that lasted weeks or even months but was heroic computations. Many of the most important biological phenomena, the kind relevant to potential pharma design, all took place on scales of 10 microseconds so we were orders of magnitude away from where we needed to be.”
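
To put those figures in perspective, here is a rough back-of-the-envelope sketch in Python. It uses only the 0.1 microsecond-per-day rate from Shaw’s remarks; the function name and the target lengths are illustrative assumptions, and it ignores everything except the simple division.

```python
# Back-of-the-envelope only: how long would a run take at a fixed simulation rate?
# The 0.1 us/day rate is the 2008 figure quoted above; the function name and
# target lengths are illustrative assumptions, not D.E. Shaw Research numbers.


def days_to_simulate(target_us: float, rate_us_per_day: float) -> float:
    """Wall-clock days needed to simulate `target_us` microseconds."""
    if rate_us_per_day <= 0:
        raise ValueError("rate must be positive")
    return target_us / rate_us_per_day


RATE_2008_US_PER_DAY = 0.1

for label, target_us in [("10 microseconds", 10.0), ("1 millisecond", 1000.0)]:
    days = days_to_simulate(target_us, RATE_2008_US_PER_DAY)
    print(f"{label}: about {days:,.0f} days ({days / 365:.1f} years)")
```

At that rate a 10-microsecond run works out to roughly 100 days, consistent with the “weeks or even months” Shaw describes, while a full millisecond would take on the order of decades.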

Those timescales meant that there wasn’t much to see other than molecules vibrating rather than the big changes that would allow for real discoveries.

Shaw demoed simulations that highlighted the value of ultra-fine timescales in drug discovery. In one, a protein shifts around, leaving gaps and pockets that the targeted medication can find and worm its way into. In another, he showed a hidden target which didn’t appear to have a binding entry point until simulations showed that there was an opportunity, something that wouldn’t have been possible if not for the ability to see activity that only lasted a very short time.

As it turns out, protein dynamics is exactly what the Anton machines do best. Protein folding, one of the key discoveries of our burgeoning century, opened doors for scientists beyond pharma. In fact, one of the most famous names in that arena, John Jumper, was working at D.E. Shaw Research during the early days of Anton development and went on to organize the AlphaFold project.

“New algorithms running on a conventional supercomputer would have been too slow. And a new supercomputing architecture running conventional algorithms would have been too slow.”

Shaw says that when the massively parallel Anton 1 machine emerged with its custom ASIC and special baked-in capabilities for particle interactions, it “allowed a dramatic increase in simulation length because of its speed—it was 100X faster than the fastest general purposes of the time and allowed continuous, millisecond-long simulations of proteins” which meant many new behaviors and interactions were observed for the first time ever.

Most of the chip area of the original machines (Anton 1 and 2) was dedicated to specialized math that homed in on the most computationally expensive parts of MD simulations. That meant there were (and still are) tradeoffs, including a lack of flexibility and programmability. It was “a very inflexible bunch of fast, stupid logic and nothing was programmable,” Shaw says. “We were brutal to the people who were designing that embedded software and put high priority on having it run fast.”
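
For readers who have not seen the inner loop of an MD code, the sketch below shows the kind of pairwise particle-interaction math that typically dominates the runtime. It is a generic, textbook Lennard-Jones force calculation with a distance cutoff in reduced units, not Anton’s pipeline or D.E. Shaw Research’s force field; the function name, the parameters, and the choice of potential are assumptions made purely for illustration.

```python
# Illustrative only: a naive O(N^2) pairwise force loop of the sort MD codes
# spend most of their time in. Lennard-Jones in reduced units is an assumed
# stand-in; it is NOT the Anton hardware pipeline or its force field.


def lj_forces(positions, cutoff=2.5, epsilon=1.0, sigma=1.0):
    """Return (forces, pairs_evaluated) for a list of (x, y, z) positions."""
    n = len(positions)
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    pairs_evaluated = 0
    cutoff_sq = cutoff * cutoff

    for i in range(n):
        for j in range(i + 1, n):
            # Displacement vector from particle j to particle i
            d = [positions[i][k] - positions[j][k] for k in range(3)]
            r_sq = sum(c * c for c in d)
            if r_sq > cutoff_sq:
                continue  # skip pairs beyond the interaction range
            pairs_evaluated += 1

            s6 = (sigma * sigma / r_sq) ** 3  # (sigma / r)^6
            # F_i = 24 * eps * (2 * (sigma/r)^12 - (sigma/r)^6) / r^2 * d
            scale = 24.0 * epsilon * (2.0 * s6 * s6 - s6) / r_sq
            for k in range(3):
                forces[i][k] += scale * d[k]  # force on i
                forces[j][k] -= scale * d[k]  # equal and opposite on j
    return forces, pairs_evaluated


# Three particles: two close together, one far outside the cutoff
forces, pairs = lj_forces([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 0.0, 0.0)])
print(pairs, forces[0], forces[1])
```

Every in-range pair needs the same small, fixed recipe of multiplies and adds, which is the sort of regular arithmetic that lends itself to being hardwired rather than run on programmable cores.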

The dataflow nature of the Anton systems meant that data went right to where it was needed; it didn’t stop along the way or pop to global memory. There’s plenty of memory on the systems but it’s distributed across the chip, which meant high bandwidth and low latency. “At the interchip level, we had some app-specific ways of minimizing latency and the overall throughput but overall, we always had the luxury of doing that because we knew what algorithms we were trying to speed.”

By the way, if this architectural discovery sounds familiar, it is of course happening among all the AI chip players who, in some ways, also have the luxury of a defined workload to optimize around.

With the introduction of Anton 2 in 2013, teams reported that the machine was an order of magnitude faster than the original, with support for 15X as many atoms. They were able to add better flexibility and programmability and support for more accurate physical models with some new algorithms. The capacity, or total number of atoms, was a big deal for potential discoveries, but it meant more data movement between chips and further refinement of communication strategies.

With Anton 3 last year, D.E. Shaw pushed its 512-node machine into public view, showcasing its ability to simulate biological systems at unprecedented scale, in the ballpark of millions of atoms. “There were a number of architectural changes in this machine due to changes in underlying technologies, including different rates of advancement in processing versus communication parameters,” Shaw said, but it meant a new world of discoveries, including those that led to the drugs in clinical trials now.

The following slide highlights the improvements compared to traditional HPC architectural elements (GPU/CPU).

The x axis is the size of the biological system; the simulation speed (microseconds per day) is on the y axis. As expected, the curves show what we see elsewhere: as system size goes up, performance goes down (more interactions to simulate).
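
The y-axis unit is easy to reconstruct. As a sketch, assuming a fixed integration timestep and a measured wall-clock time per step (both values below are hypothetical and not taken from the slide), microseconds per day is just a unit conversion:

```python
# Unit conversion behind a "microseconds per day" axis. The 2 fs timestep and
# the per-step wall-clock times are hypothetical values chosen for illustration;
# they are not measurements from Anton or from any GPU/CPU system.

SECONDS_PER_DAY = 86_400
FS_PER_US = 1e9  # 1 microsecond = 1e9 femtoseconds


def us_per_day(timestep_fs: float, wall_seconds_per_step: float) -> float:
    """Simulated microseconds per wall-clock day."""
    steps_per_day = SECONDS_PER_DAY / wall_seconds_per_step
    return steps_per_day * timestep_fs / FS_PER_US


# As a system grows, each step takes longer, so simulated time per day falls.
for wall_us in (1.0, 4.0, 16.0):
    speed = us_per_day(timestep_fs=2.0, wall_seconds_per_step=wall_us * 1e-6)
    print(f"{wall_us:>5.1f} us of wall time per step -> {speed:7.1f} us/day")
```

The conversion makes the shape of the curves intuitive: anything that lengthens a single timestep, whether more interactions or more communication, lowers the simulated time achieved per day in direct proportion.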

Anton 3 had some of the largest jumps architecturally, including refactoring the ASIC layout to minimize increasingly expensive cross-chip communication by moving to a tile-based architecture that combines sub-tiles for both the “hardwired” particle interaction pipelines and also programmable and more flexible processing units.

Designers also added specialized “bond calculators” or modules that sped up some of the slower parts of working with different bonded atoms and the interactions they propagate. “We could save area, energy, and get speed with this chunk of hardware to take the load off other parts of the chip,” Shaw explains. The team also worked on some novel data compression techniques but ultimately, communication proved the bottleneck. “We learned about our calculations and looked in detail at the underlying physics to find redundancies and opportunities but we have the luxury of looking at just one application.”

“Historically, it’s been hard to have special purpose machines compete with general purpose supercomputers. It was great that Anton got 100X for this application—exciting to us. But the part that relates most to our long-term goal was more to do with the underlying science, learning about molecular systems and curing people. That’s what we wanted to do.”