Details Emerge On China’s 64-Core ARM Chip
September 1, 2016 Timothy Prickett Morgan
While the world waits for the AMD K12 and Qualcomm Hydra ARM server chips to join the ranks of the Applied Micro X-Gene and Cavium ThunderX processors already in the market, it could be upstart Chinese chip maker Phytium Technology that gets a brawny chip into the field first and also gets traction among actual datacenter server customers, not just tire kickers.
Phytium was on hand at last week’s Hot Chips 28 conference, showing off its chippery and laptop, desktop and server machines employing its “Earth” and “Mars” FT series of ARM chips. Most of the interest that people showed was in the server variants, which are both based on variants of the “Xiaomi” core design that the company has cooked up based on ARMv8 intellectual property licensed from ARM Holdings. There is chatter that one of the three Chinese exascale machines, which we wrote about here, will employ a future Phytium processor, but we were unable to confirm this with the Phytium executives at the event. What we can tell you is that the first engineering samples of the two Earth ARM chips, the FT-1500A/4 and the FT-1500A/16, as well as the one Mars ARM chip, the FT-2000/64, are back from Taiwan Semiconductor Manufacturing Corp and that we saw systems running the Kylin Linux operating system (a variant of Canonical’s Ubuntu) at the Hot Chips event.
Lin Deng, a chip engineer at Phytium who helped design the Earth and Mars chips, told The Next Platform that the engineering samples were being tested by its own labs as well as a bunch of other customers, including Chinese search engine giant Baidu, the latter of which is no surprise at all. Baidu is looking for every processing advantage it can get for its diverse and hyperscale workloads, as we have written about many times. Both the Earth and Mars chips are etched with the mature 28 nanometer processes from TSMC, not its latest 16 nanometer FinFET processes, and it will be interesting to see what Phytium will do when it can cram lots more transistors on a die as well as shrink the die size to improve yields. But that is probably 18 to 24 months out from now. In the meantime, Phytium is focused on improving yields for its Earth and Mars chips, which will be in volume production either in the fourth quarter of this year or the first quarter of next year, according to Deng.
The basic architecture of the Mars ARM server processor was unveiled at Hot Chips 27 last year, which we covered in detail here. The company did not say much about the Earth ARM variants at that time. Although Phytium was not presenting at this year’s event, the company did provide us with considerably more information on the Earth and Mars chips and did some show and tell with actual machines employing the ARM motors.
The Feeds And Speeds Of Earth And Mars
The low-end Earth chip is the FT-1500A/4, which has four FTC660 generation Xiaomi cores on its die. The chip runs at between 1.5 GHz and 2 GHz, and has 2 MB of L2 cache spread across the cores and an additional 8 MB of L3 cache shared by the cores. The entry Earth chip, which is aimed at desktops, laptops, and lightweight server workloads like web serving, email serving, storage arrays and clusters, has two DDR3 memory controllers running at 1.6 GHz, which deliver an aggregate of 25.6 GB/sec of memory bandwidth. The chip, which has 1,150 pins, has one 1 Gb/sec Ethernet interface and a PCI-Express 3.0 controller that can express itself as two x16 or four x8 interfaces to peripherals. The FT-1500A/4 has a maximum power draw of a mere 15 watts. Again, this is with a 28 nanometer process, and one has to wonder how low the power could go with a 16 nanometer process while also boosting the core count and maybe even the clock speeds.
The FT-1500A/16 variant of the Earth chip uses the same cores, and with four times the number of Xiaomi FTC660 cores, it has four times the L2 cache spread across those cores at 8 MB. The L3 cache on this fatter Earth chip stays the same at 8 MB of total capacity. This bigger Earth chip has four DDR3 memory controllers running at 1.6 GHz, for a total of 51.2 GB/sec of memory bandwidth, and it has two 1 Gb/sec Ethernet ports coming off the on-die network controller and the same PCI-Express controller. The FT-1500A/16 chip has 1,944 pins, and it has a maximum power of 35 watts.
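The memory bandwidth figures quoted for the two Earth chips follow from simple arithmetic, sketched here in Python. The sketch assumes that each DDR3 controller drives a single 64-bit (8-byte) channel at 1,600 million transfers per second; that is consistent with the quoted totals but was not spelled out by Phytium, and the function and constant names are ours.

```python
# Rough check of the quoted memory bandwidth figures for the Earth chips.
# Assumption (not confirmed by Phytium): each DDR3 controller drives one
# 64-bit channel at 1,600 million transfers per second.

TRANSFERS_PER_SEC_M = 1600  # DDR3-1600, millions of transfers per second
BYTES_PER_TRANSFER = 8      # 64-bit data bus


def peak_bandwidth_gb(controllers: int) -> float:
    """Aggregate peak memory bandwidth in GB/sec (decimal gigabytes)."""
    per_controller_mb = TRANSFERS_PER_SEC_M * BYTES_PER_TRANSFER  # MB/sec
    return controllers * per_controller_mb / 1000


if __name__ == "__main__":
    for name, controllers in [("FT-1500A/4", 2), ("FT-1500A/16", 4)]:
        print(f"{name}: {peak_bandwidth_gb(controllers):.1f} GB/sec")
```

These are theoretical peaks; sustained bandwidth on a real workload will come in lower.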
The Mars FT-2000/64 chip is based on the FTC661 generation of Xiaomi cores, and has the same 512 KB L2 cache per core, but delivers a whopping 128 MB of L3 cache across the 64 cores on the die. The cores on the Mars chip have a design frequency of between 1.5 GHz and 2 GHz, just like the cores in the Earth chips. The Mars processor has sixteen DDR3 memory controllers, which deliver a total of 204.8 GB/sec of memory bandwidth running at 1.6 GHz. The whole shebang needs 2,892 pins and has a maximum power draw of 100 watts. (This is pretty good considering that last year Phytium was estimating it would hit 120 watts at 2 GHz.) I/O bandwidth comes in at 32 GB/sec per socket, and the peak performance using on-core 128-bit SIMD units (they are not vector math units according to Deng) across those cores is 512 gigaflops at double precision at 2 GHz. That is not as dense as the floating point that an Intel Xeon Phi many-core processor or Nvidia Pascal GPU coprocessor delivers, and it is also a lot less than what a Xeon E5 or Xeon E7 in the same thermal band would deliver with far fewer cores (say 14 to 18 cores), but it is also not too shabby for a straight-up, general purpose ARM CPU.
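One way to arrive at the 512 gigaflops figure is sketched below in Python. It rests on assumptions that Phytium did not confirm: that each core has one 128-bit SIMD unit holding two double-precision values, and that the unit can issue a fused multiply-add (counted as two floating point operations per lane) every cycle. The names in the code are ours.

```python
# Back-of-envelope peak double-precision throughput for the FT-2000/64.
# Assumptions (ours, not Phytium's): one 128-bit SIMD unit per core that
# issues a fused multiply-add every cycle, counted as two flops per lane.

SIMD_BITS = 128
DOUBLE_BITS = 64
FLOPS_PER_FMA = 2


def peak_gflops(cores: int, clock_ghz: float) -> float:
    """Peak double-precision gigaflops under the assumptions above."""
    lanes = SIMD_BITS // DOUBLE_BITS         # two doubles per SIMD register
    flops_per_cycle = lanes * FLOPS_PER_FMA  # four flops per core per cycle
    return cores * clock_ghz * flops_per_cycle


if __name__ == "__main__":
    print(f"At 2.0 GHz: {peak_gflops(64, 2.0):.0f} gigaflops")
    print(f"At 1.5 GHz: {peak_gflops(64, 1.5):.0f} gigaflops")
```

Under these assumptions the 2 GHz result matches the quoted peak, and the bottom of the design frequency range would yield proportionally less.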
Here is what the Xiaomi FTC661 core looks like:
Each core has 32 KB of L1 data cache and 32 KB of L1 instruction cache, which are linked to the external L2 cache segments through prefetch units. The cores are a custom design and not based on the Cortex-A57 or Cortex-A72 designs. The cores and caches are connected by a 2D mesh network, called Hawk, which runs at 2 GHz and which we think can be extended out to do NUMA-style multiple socket clustering, although Phytium has not confirmed this.
In the Mars design, four cores share a 2 MB L2 cache segment as a basic building block and two of these blocks, which Phytium calls panels, hook together with two directory control units and a routing cell. With the Mars FT-2000/64 chips, eight panels, each with eight cores, are cookie-cuttered onto the die. These cores are surrounded by the sixteen memory controllers and two PCI-Express controllers, but a lot of the surrounding transistors are dedicated to the additional eight proprietary extension interfaces, called Logical Interface Units or LIUs, that have a number of interesting functions.
The LIUs interface to the cache and memory chip (CMC) units, which include the memory controllers and interfaces to the cache memory; memory tops out at 1 TB per socket, with two memory channels per controller and presumably one DIMM per channel. The routing cells at the heart of each Mars panel link the CMC to the directory control units (DCUs) and on into the L2 caches and up into the Xiaomi cores. Each CMC weaves together four banks of L3 cache memory with a total of 16 MB of capacity and 2 MB extra for ECC data scrubbing, and each has two DDR3 memory controllers. The interface between the Xiaomi core panels and this CMC is proprietary, and it looks like it will have multiple uses. It provides 19.2 GB/sec of read and write bandwidth per interface.
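The building blocks described above add up to the headline figures for the chip, as this short Python tally shows. It assumes one CMC hangs off each of the eight LIUs, which fits the totals but is our reading rather than something Phytium stated, and the aggregate LIU bandwidth simply multiplies the per-interface figure at face value. All names are ours.

```python
# Tally of the Mars FT-2000/64 building blocks.
# Assumption (ours): one CMC is attached to each of the eight LIUs.

CORES_PER_L2_GROUP = 4    # four cores share one L2 segment
L2_MB_PER_GROUP = 2
GROUPS_PER_PANEL = 2      # two groups form a panel
PANELS = 8
LIUS = 8
L3_MB_PER_CMC = 16
CONTROLLERS_PER_CMC = 2
LIU_GB_PER_SEC = 19.2     # quoted bandwidth per interface


def mars_totals() -> dict:
    """Derive chip-level totals from the per-block figures."""
    groups = PANELS * GROUPS_PER_PANEL
    cores = groups * CORES_PER_L2_GROUP
    l2_mb = groups * L2_MB_PER_GROUP
    return {
        "cores": cores,
        "l2_mb": l2_mb,
        "l2_kb_per_core": l2_mb * 1024 // cores,
        "l3_mb": LIUS * L3_MB_PER_CMC,
        "memory_controllers": LIUS * CONTROLLERS_PER_CMC,
        "liu_gb_per_sec": LIUS * LIU_GB_PER_SEC,
    }


if __name__ == "__main__":
    for key, value in mars_totals().items():
        print(f"{key}: {value}")
```

The tally reproduces the 64 cores, 512 KB of L2 per core, 128 MB of L3, and sixteen memory controllers quoted for the chip.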
Here is how the Mars chips and those LIUs and CMCs will be implemented for a general purpose server:
All of these functions are implemented on the Mars chip, but it looks like Phytium is also looking at hooking special accelerators directly onto the system-on-chip in some fashion too through these LIUs. Take a gander here:
Phytium is not making any commitments about what these accelerators might be, but there is a pretty good chance that it might be interfaces to the Matrix2000 DSP that was revealed by the National University of Defense Technology a little more than a year ago as the kicker to Xeon Phi accelerators after the US government slapped an embargo on the national supercomputer center in Tianjin. NUDT is building an ARM-based exascale machine for the Tianjin supercomputing center, too, and it is quite possible that the future system will marry a Phytium many-core processor with the Matrix2000 DSPs using the proprietary interfaces. These interfaces would be roughly analogous to the 25 Gb/sec links that underpin the NVLink and New CAPI accelerator links on IBM’s Power9 processor, which were also talked about at Hot Chips.
We also think, as we have said, that the Hawk interconnect will be extended with other NUMA glue chips to allow Phytium’s customers to build NUMA systems with multiple Mars chips all glued together in a shared memory architecture. This is what Phytium is no doubt referring to when it talks about enabling the creation of “mainframe computers.” It certainly does not mean making clones of IBM, Unisys, Siemens, NEC, or Hitachi mainframes based on ARM chips, but that would be fun if it could be done. And it surely does not mean having a machine with just a single socket, as it was demonstrating at Hot Chips.
Here is the prototype FT-2000/64 system that Phytium was showing off:
As you can see, the machine had a dozen disk drives in a 2U chassis and a single Mars FT-2000/64 processor on the system board. The machine was configured with eight memory cards, four on each side of the processor, as you can see in this zoom shot:
With eight cards, needing to get to 1 TB means putting 128 GB on each card. This could be easily done with two memory channels per card, but these look like custom memory cards with memory buffer chips like IBM does with the Power8 and top-end Power9 chips and Intel does with the Xeon E7s. Phytium gave us the impression last year that a lot of the functionality of the CMC chips was on the Mars system-on-chip package, but to our eye it looks like there is a generic port coming off the Mars chip that can be expressed as main memory controllers and cache controllers or as memory controllers and other kinds of accelerator links.
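The per-card capacity is a one-line division, sketched here in Python for anyone who wants to try other configurations. The sketch takes 1 TB as 1,024 GB and treats the two channels per card mentioned above as a hypothetical layout, not a confirmed one; the function names are ours.

```python
# Capacity needed per memory card to reach 1 TB on one socket.
# Assumptions (ours): 1 TB is taken as 1,024 GB, and the channel count per
# card is a hypothetical parameter, not a confirmed design detail.

SOCKET_CAPACITY_GB = 1024
MEMORY_CARDS = 8


def gb_per_card(total_gb: int = SOCKET_CAPACITY_GB,
                cards: int = MEMORY_CARDS) -> int:
    """Capacity each card must carry to reach the socket total."""
    return total_gb // cards


def gb_per_channel(card_gb: int, channels_per_card: int = 2) -> int:
    """Capacity each channel must carry for a given card layout."""
    return card_gb // channels_per_card


if __name__ == "__main__":
    card = gb_per_card()
    print(f"Per card: {card} GB")
    print(f"Per channel with two channels: {gb_per_channel(card)} GB")
```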
As for the performance of the Mars chips, they are coming in a bit lower than expected.
Last year, Charles Zhang, the company’s director of research, who unveiled the Earth and Mars chips by telephone because he was restricted from leaving China at the last minute, said that on the early benchmarks (presumably from very early samples but possibly from simulations) the Mars chip was able to do about 10 GB/sec on the STREAM Triad memory bandwidth test with eight cores activated and scaled up linearly to around 80 GB/sec of bandwidth with all 64 cores on the die humming. On the SPEC_CPU2006_base processor benchmarks, the Mars chip has a rating of 19.2 on integer math and 17.8 on floating point math running a single copy of the benchmark. Putting 64 copies of the benchmark on a single Mars chip and running the SPEC_CPU2006_rate tests on the Mars chip, it attained a rating of 672 on integer math and 585 on floating point math.
In the presentation being shown this year at Hot Chips, Phytium said that the Mars chip with 64 cores running at 2 GHz was able to hit 570 on the integer and 482 on the floating point parts of the SPEC_CPU2006_rate test. That is a 15.2 percent hit on expected integer performance and a 17.6 percent decline on expected floating point performance.
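The shortfall percentages come from comparing this year's measured ratings against last year's figures, as this small Python sketch shows. The function name is ours, and the inputs are the ratings quoted in the text.

```python
# Percentage shortfall of measured SPEC_CPU2006_rate results against the
# figures Phytium gave a year earlier.

def percent_decline(expected: float, measured: float) -> float:
    """Shortfall of measured versus expected, as a percentage of expected."""
    return (expected - measured) / expected * 100


if __name__ == "__main__":
    print(f"Integer: {percent_decline(672, 570):.1f} percent")
    print(f"Floating point: {percent_decline(585, 482):.1f} percent")
```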
With engineering samples in hand and a volume ramp with mature TSMC processes over the next three to nine months, Phytium has a good chance of capturing some mindshare among system makers and hyperscalers, particularly in China. As you can see from the listing of partners above, Lenovo and Inspur are partners, and both are serious players in servers, particularly in the Middle Kingdom. We do not claim to have any knowledge of the other players listed above, but the combination of Baidu plus Lenovo and Inspur makes Phytium as much of a contender as any of the other ARM server chip players at this point.