Inside the Manycore Research Chip That Could Power Future Clouds

For those interested in novel architectures for large-scale datacenters and complex computing domains, this year has offered plenty of fodder for exploration.

From a rise in custom ASICs to power next-generation deep learning, to variations on FPGAs, DSPs, and ARM processor cores, to advances in low-power processors for webscale datacenters, it is clear that the Moore’s Law death knell is clanging loudly enough to spur faster, broader action.

At the Hot Chips conference this week, we analyzed the rollout of a number of new architectures (more on the way as the week unfolds), but one that definitely grabbed our attention is the Piton manycore research processor, a low-power, scalable processing element that Princeton researchers expect could boost cloud and webscale datacenter efficiency and performance.

Michael McKeown, a Princeton researcher focused on energy-efficient datacenter models and processors, presented Piton, a 25-core academic manycore processor that can scale (so far) to systems of up to 8,000 chips with shared memory across arbitrary cores in the system.

“Piton is not designed as a single chip concept—it is a full system with shared memory,” McKeown explained. The memory sharing is enabled by a concept his team has coined “coherence domains”: the grouping of cores in the system into domains within which coherence is maintained. What is perhaps most interesting for large-scale cloud providers, aside from the energy efficiency, is a clever feature baked in that allows providers to charge users on a shared system based on their actual memory bandwidth usage.
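How that billing might work in practice is not spelled out in the talk, but the basic idea (count each tenant's memory traffic in hardware and bill against the counter) can be sketched in a few lines of C. Everything below, including the counter layout, the price per gigabyte, and the attribution of requests to a tenant's coherence domain, is illustrative rather than drawn from the Piton design.

```c
/* Illustrative sketch only: how a provider might turn per-tenant memory
 * traffic counters into a bill. The counter layout and rates are invented
 * for illustration and are not taken from the Piton design. */
#include <stdio.h>
#include <stdint.h>

#define NUM_TENANTS 4

/* Hypothetical per-tenant counters, e.g. incremented by hardware on each
 * off-chip memory request attributed to that tenant's coherence domain. */
static uint64_t bytes_transferred[NUM_TENANTS];

static void record_mem_request(int tenant, uint64_t bytes)
{
    bytes_transferred[tenant] += bytes;
}

/* Convert measured traffic into a charge at an assumed price per GB. */
static double bill_for(int tenant, double dollars_per_gb)
{
    return (double)bytes_transferred[tenant] / (1ULL << 30) * dollars_per_gb;
}

int main(void)
{
    /* Simulate some traffic: 64-byte cache lines crossing the chip boundary. */
    for (int i = 0; i < 1000000; i++)
        record_mem_request(i % NUM_TENANTS, 64);

    for (int t = 0; t < NUM_TENANTS; t++)
        printf("tenant %d: %.4f GB -> $%.6f\n", t,
               (double)bytes_transferred[t] / (1ULL << 30),
               bill_for(t, 0.05));
    return 0;
}
```

The appeal for providers is that counters tied to coherence domains let them bill on measured traffic rather than on provisioned capacity.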


As seen in the architecture overview, this SPARC ISA, tile-based design is connected by a 64-bit on-chip network. The on-chip networks and coherence protocol extend off the chip, so the same interconnect runs both on and off chip, which, along with the “coherence domains” concept, enables the system-wide shared memory.
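For readers who have not worked with 2D mesh NoCs, the sketch below shows dimension-ordered (X-then-Y) routing across a 5x5 grid of tiles. It is a textbook routing scheme for this class of network, included only to illustrate how packets hop from tile to tile; it is not a description of Piton's actual router.

```c
/* Sketch of dimension-ordered (X-then-Y) routing on a 5x5 tile mesh.
 * This is a common scheme for 2D mesh NoCs; it is shown here only to
 * illustrate how a packet hops between tiles, not as Piton's exact router. */
#include <stdio.h>

#define MESH_DIM 5

typedef struct { int x, y; } tile_t;

/* Return the next hop from 'cur' toward 'dst': move in X first, then Y. */
static tile_t next_hop(tile_t cur, tile_t dst)
{
    if (cur.x != dst.x)
        cur.x += (dst.x > cur.x) ? 1 : -1;
    else if (cur.y != dst.y)
        cur.y += (dst.y > cur.y) ? 1 : -1;
    return cur;
}

int main(void)
{
    tile_t cur = {0, 0}, dst = {4, 3};

    printf("(%d,%d)", cur.x, cur.y);
    while (cur.x != dst.x || cur.y != dst.y) {
        cur = next_hop(cur, dst);
        printf(" -> (%d,%d)", cur.x, cur.y);
    }
    printf("\n");
    return 0;
}
```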

The chip comprises 25 tiles connected via a 5x5 2D mesh. It is built on IBM’s 32 nm SOI process (IBM fabs), with a 36 mm² die (6 mm x 6 mm), 460 million transistors, a 1 GHz target clock frequency, and a 208-pin QFP wire-bonded package with epoxy encapsulation. The team has already received silicon and tested it in the lab; images from that test implementation can be found below.

The team is also targeting throughput and energy efficiency, and accordingly uses a multithreaded core with an energy-efficient “drafting mode” on top, which cuts down on switching activity and instruction cache accesses. At its simplest, drafting mode schedules threads running similar code onto the same multithreaded core and aligns their execution points so that they issue identical instructions.
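As a rough software analogy of the drafting idea (the real mechanism lives in the core's fetch hardware, and the thread traces below are invented), consider two hardware threads whose program counters occasionally line up: whenever they sit at the same instruction, a single instruction fetch can serve both.

```c
/* Rough software analogy of the "drafting" idea: when two threads sit at the
 * same instruction, the instruction is fetched and decoded once and reused.
 * This models the energy argument only; the real mechanism is in hardware. */
#include <stdio.h>
#include <stdint.h>

#define NUM_THREADS 2

int main(void)
{
    /* Hypothetical program counters of two hardware threads over 8 cycles. */
    uint32_t pc[8][NUM_THREADS] = {
        {0x100, 0x100}, {0x104, 0x104}, {0x108, 0x200},
        {0x10c, 0x204}, {0x110, 0x110}, {0x114, 0x114},
        {0x118, 0x118}, {0x11c, 0x11c},
    };

    int fetches_without_drafting = 0, fetches_with_drafting = 0;

    for (int cycle = 0; cycle < 8; cycle++) {
        fetches_without_drafting += NUM_THREADS;
        /* With drafting, aligned threads share one instruction fetch. */
        fetches_with_drafting += (pc[cycle][0] == pc[cycle][1]) ? 1 : NUM_THREADS;
    }

    printf("icache fetches without drafting: %d\n", fetches_without_drafting);
    printf("icache fetches with drafting:    %d\n", fetches_with_drafting);
    return 0;
}
```

The saving scales with how often the threads stay aligned, which is why grouping similar code onto the same core matters.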

Each tile is built around an OpenSPARC T1 core, which provides a six-stage in-order pipeline; two-way multithreading for better throughput and to hide memory latency; and L1 instruction and data caches. The L2 cache structure is interesting as well: it is a distributed cache shared by all the tiles, with a 64 KB slice per tile for 1.6 MB of aggregate cache per chip.

Other features of the L2 cache include four-stage dual pipelines, a five-cycle hit latency, an integrated directory cache, a 64-byte line size, and shared SRAM macros that extend between the two pipelines. As more becomes available about this chip in paper form, we expect more detail on how the cache and the coherence domains interplay.
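One common way to organize a distributed shared L2 like this is to interleave 64-byte lines across the per-tile slices with a simple address hash. The modulo mapping below is purely illustrative; Piton's actual home-slice mapping is not spelled out in the presentation.

```c
/* Sketch of how a distributed shared L2 can interleave 64-byte lines across
 * the 25 per-tile slices. The modulo hash below is a common textbook choice,
 * used here only for illustration; Piton's actual mapping may differ. */
#include <stdio.h>
#include <stdint.h>

#define NUM_TILES   25
#define LINE_BYTES  64

/* Pick the tile whose L2 slice "homes" a given physical address. */
static unsigned home_tile(uint64_t paddr)
{
    return (unsigned)((paddr / LINE_BYTES) % NUM_TILES);
}

int main(void)
{
    uint64_t addrs[] = { 0x0, 0x40, 0x80, 0x1000, 0xdead0040 };

    for (unsigned i = 0; i < sizeof addrs / sizeof addrs[0]; i++)
        printf("addr 0x%lx -> L2 slice on tile %u\n",
               (unsigned long)addrs[i], home_tile(addrs[i]));
    return 0;
}
```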

The chipset FPGA shown can be any FPGA board with an FMC connector. On it, the team has programmed a chip bridge that takes the virtual channels off chip and decodes them on the FPGA side, where the traffic is sent down to the north and south bridges.


McKeown says OpenPiton is scalable to half a billion cores, with configurable cache sizes and configurable NoC topology for further research.

The team has several chips in its labs now and is developing the OpenPiton open source release with RTL and testing tools, simulation infrastructure, and FPGA synthesis capabilities. A side goal, McKeown says, is to target several FPGAs at different price points for comparison. If Hot Chips presentations were two hours long, we would still be hanging on for more detail.

As it stands now, we will follow up with the Piton team later this year to see how different FPGAs change the performance and efficiency profiles (and thus cost), and what any of the cloud vendors they’ve talked to had to say about the ability to charge for memory bandwidth usage, a potential game changer for both IaaS vendors and end users.

 
