Inside China’s Homegrown 64-Core ARM Big Iron Chip
August 25, 2015 Timothy Prickett Morgan
A little-known upstart Chinese chip maker called Phytium Technology was set to use the Hot Chips 27 conference in Silicon Valley as a coming out party of sorts for its 64-bit ARM server processors, but the company’s director of research, Charles Zhang, was not permitted to come to the event because of visa issues. Zhang was nonetheless able to get the information about the 64-core “Mars” ARM processor that Phytium has been developing for three years to the attendees.
According to the Hot Chips conference organizers, Phytium was not able to get a video of Zhang’s presentation through the Middle Kingdom’s firewalls, and an email carrying the video file that Zhang recorded to stand in for him also did not make it through. In the end, Zhang called in from China and gave his presentation over the sound system at the Flint Center at De Anza College, where Hot Chips is being held, presumably calling in over Skype or some similar service. (Isn’t technology wonderful?)
Not much is known about Phytium, which was founded in 2012 in Guangzhou, a port city in Guangdong province that lies northwest of Hong Kong, much further away than Shenzhen, which is just north of Hong Kong and another hotbed for technology. Phytium also has an office in Tianjin, which is where massive explosions rocked the city last week, killing dozens and knocking the Tianhe-1A hybrid CPU-GPU supercomputer offline. The company’s web site has very little information on it and looks like something that was created twenty years ago, but does say that the company is working on “HPC Server” technology aimed at the Chinese market, although Zhang quipped over the phone line that if any companies outside of China were interested in talking to Phytium about using its ARM server chips, Phytium was interested in talking to them.
Zhang said that Phytium aspires to be a leading-edge processor and ASIC maker in the Chinese IT sector and specifically that it will be working on two classes of ARM-based processors: one aimed at scale-up machines and another one aimed at scale-out machines used in hyperscale and cloud computing. Zhang referred to the former as “mainframe servers” and the latter as “Internet servers,” which is terminology that probably sounds a bit funny to our ears because both are old-fashioned ways of describing scale up and scale out architectures. But you get the idea.
The two families of processors that are under development by Phytium are called Mars and Earth. Mars is the one aimed at high-end, scale up architectures that are typified by mainframes, Unix servers based on RISC or Itanium engines, and the bigger Xeon E7 machines that have auxiliary chipsets from Hewlett-Packard, SGI, Lenovo, and a few others. As you can see from the chart above, the Mars ARM processors are aimed at systems that need to access large chunks of memory and have high bandwidth into memory and I/O to run workloads across coherent memory that spans lots of processor sockets. It is not clear at all how many sockets the Mars chip from Phytium will span, but presumably it is at least a dozen and perhaps as many as 16 or 32 sockets if the company wants to deliver the kind of big iron that used to make IBM and HP a pretty good living in China until a few years ago when the country started fostering indigenous suppliers for processors and systems.
Not much is known about the Earth ARM processors that Phytium is developing, and when asked by The Next Platform for more information on them, Zhang said that he was not authorized to talk about them at this time. What we can see from the chart above is that the Earth ARM processors will be aimed at scale out clusters and offer more modest performance than the Mars cores and be focused more on low cost, high power efficiency, and dense server configurations. Oddly enough, both the Mars and Earth processors will deliver “high bandwidth memory access,” according to Zhang’s presentation. Presumably the Earth processors from Phytium will be based on the 64-bit ARM architecture as the Mars chips are.
The Mars ARM server chips are based on a core design called Xiaomi, which is also the name of the world’s third largest smartphone maker, a Beijing company that uses ARM processors made by MediaTek and Qualcomm in its devices and, as far as we know, does not make its own ARM processors. The choice of the Xiaomi core name is no doubt significant, and we will learn its meaning soon enough. (It could be that Xiaomi has aspirations in the datacenter as it does in phones, much as Qualcomm does, and is somehow funding the effort. No one knows.) In any event, here is the basic overview of the Mars processor, which is compatible with the 64-bit ARMv8 architecture, which presumably means that Phytium is a full licensee of the ARM architecture, as Applied Micro, AMD, Broadcom, Cavium Networks, and presumably Qualcomm are.
The Mars chip is organized in a hierarchy, with a block of circuits called a panel holding four blocks of two cores, for eight cores per panel. Each group of four cores, one at the top of the panel and one at the bottom, shares an L2 cache, with cache coherence on the panel and across the eight panels on the complete chip managed by two director control units (DCUs). Each L2 cache has 2 MB of capacity, for a total of 4 MB per panel and 32 MB across all eight panels on the die. Each Xiaomi core has 32 KB of L1 instruction cache and 32 KB of L1 data cache.
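The arithmetic behind that hierarchy is easy to check. Here is a minimal Python sketch that multiplies out the figures quoted above; the constant and function names are ours, chosen for illustration, and are not Phytium terminology.

```python
# Figures quoted in the article for the Mars cache hierarchy.
# All names here are illustrative, not Phytium's own.
PANELS_PER_CHIP = 8
CORE_BLOCKS_PER_PANEL = 4      # each panel holds four blocks...
CORES_PER_BLOCK = 2            # ...of two cores each
L2_CACHES_PER_PANEL = 2        # one L2 per group of four cores
L2_MB_PER_CACHE = 2
L1I_KB_PER_CORE = 32
L1D_KB_PER_CORE = 32


def cache_totals():
    """Multiply out the per-panel figures to whole-chip totals."""
    cores_per_panel = CORE_BLOCKS_PER_PANEL * CORES_PER_BLOCK
    cores = cores_per_panel * PANELS_PER_CHIP
    l2_per_panel_mb = L2_CACHES_PER_PANEL * L2_MB_PER_CACHE
    l2_total_mb = l2_per_panel_mb * PANELS_PER_CHIP
    l1_total_kb = cores * (L1I_KB_PER_CORE + L1D_KB_PER_CORE)
    return {
        "cores_per_panel": cores_per_panel,
        "cores": cores,
        "l2_per_panel_mb": l2_per_panel_mb,
        "l2_total_mb": l2_total_mb,
        "l1_total_kb": l1_total_kb,
    }


if __name__ == "__main__":
    for name, value in cache_totals().items():
        print(f"{name}: {value}")
```

The totals match the 64 cores, 4 MB of L2 per panel, and 32 MB of L2 per die cited in the presentation; the aggregate L1 figure is our own sum and was not stated by Phytium.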
The interesting thing about the Mars design, aside from the fact that it has 64 cores on a single die, is a set of features called the cache and memory chips, or CMC. This name implies that the CMCs are external to the Mars die, but they are not. (Perhaps calling it a cache and memory controller would be better, and in this case, more accurate.) Each panel on the Mars die has its own CMC, which weaves together four banks of L3 cache memory with a total of 16 MB of capacity and 2 MB extra for ECC data scrubbing. The CMC has two DDR3 memory controllers, each supporting 800 MHz DDR3 memory and together they deliver 25.6 GB/sec of memory bandwidth. The interface between the ARM panels and this CMC is proprietary. The routing cells at the heart of each Mars panel link the CMC to the DCU and on into the L2 caches and up into the Xiaomi cores. The cores support both 32-bit and 64-bit modes and also sport 128-bit SIMD instructions. Add it all up, and the Mars chip delivers about 512 gigaflops of peak double precision floating point performance and a memory bandwidth of 204 GB/sec across its sixteen DDR3 channels and an I/O bandwidth of 32 GB/sec across its two PCI-Express 3.0 x16 controllers.
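Those headline numbers can be roughly reconstructed from the per-unit figures. The Python sketch below does so under several assumptions of ours that Phytium did not spell out: that each core retires four double precision flops per cycle (two 64-bit lanes in the 128-bit SIMD unit, each doing a fused multiply-add), that each DDR3 channel is the standard 64 bits wide and transfers data twice per 800 MHz clock, and that the PCI-Express figure counts one direction at 8 GT/sec per lane with 128b/130b encoding.

```python
# Back-of-envelope check of the Mars peak figures.
# ASSUMPTIONS (ours, not confirmed by Phytium): 4 DP flops/cycle/core,
# 64-bit DDR3 channels, and PCIe bandwidth counted in one direction.
CORES = 64
CORE_CLOCK_GHZ = 2.0
DP_FLOPS_PER_CYCLE = 4          # assumed: 2 lanes x fused multiply-add

CMCS = 8                        # one CMC per panel
DDR_CHANNELS_PER_CMC = 2
DDR_CLOCK_MHZ = 800
TRANSFERS_PER_CLOCK = 2         # double data rate
BYTES_PER_TRANSFER = 8          # assumed: 64-bit channel

PCIE_CONTROLLERS = 2
PCIE_LANES = 16
PCIE_GT_PER_SEC = 8.0           # PCI-Express 3.0 signaling rate per lane
PCIE_ENCODING = 128 / 130       # 128b/130b line encoding


def peak_gflops():
    return CORES * CORE_CLOCK_GHZ * DP_FLOPS_PER_CYCLE


def ddr_channel_gb_per_sec():
    megabytes = DDR_CLOCK_MHZ * TRANSFERS_PER_CLOCK * BYTES_PER_TRANSFER
    return megabytes / 1000


def cmc_gb_per_sec():
    return ddr_channel_gb_per_sec() * DDR_CHANNELS_PER_CMC


def chip_memory_gb_per_sec():
    return cmc_gb_per_sec() * CMCS


def pcie_gb_per_sec():
    per_lane = PCIE_GT_PER_SEC * PCIE_ENCODING / 8   # bits to bytes
    return per_lane * PCIE_LANES * PCIE_CONTROLLERS


if __name__ == "__main__":
    print(f"peak DP:      {peak_gflops():.0f} gigaflops")
    print(f"per channel:  {ddr_channel_gb_per_sec():.1f} GB/sec")
    print(f"per CMC:      {cmc_gb_per_sec():.1f} GB/sec")
    print(f"chip memory:  {chip_memory_gb_per_sec():.1f} GB/sec")
    print(f"PCI-Express:  {pcie_gb_per_sec():.1f} GB/sec")
```

Under those assumptions the sums come out at 512 gigaflops, 25.6 GB/sec per CMC, and 204.8 GB/sec for the chip, which line up with the quoted figures, while the PCI-Express estimate lands just under the 32 GB/sec Phytium cited.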
The Xiaomi cores clock at 2 GHz, and they have a superscalar architecture for their instruction pipeline that implements out-of-order execution, as RISC processors for servers have done for a long time. The 2D mesh interconnect that links the panels together with a protocol called Hawk runs at 2 GHz, and the CMC runs at 1.5 GHz. The cores run at 0.9 volts and the uncore I/O areas of the chip run at 1.8 volts. The chips are implemented in a 28 nanometer process (we don’t know whose), and the chip measures 25.2 millimeters by 25.38 millimeters; it is unclear how many transistors are on this beast, but we know it will have around 3,000 pins and burn at around 120 watts.
On early benchmark tests, the Mars chip was able to do about 10 GB/sec on the STREAM Triad memory bandwidth test with eight cores activated and scaled up linearly to around 80 GB/sec of bandwidth with all 64 cores on the die humming. On the SPEC_CPU2006_base processor benchmarks, the Mars chip has a rating of 19.2 on integer math and 17.8 on floating point math running a single copy of the benchmark. If you fire up 64 copies of the benchmark and run the SPEC_CPU2006_rate tests on the Mars chip, it gets a rating of 672 on integer math and 585 on floating point math.
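To put those results in context, the short Python sketch below compares them with the chip's peak figures. It is a rough reading of ours, not an official SPEC or STREAM metric: the single-copy and rate scores are not strictly comparable, and the 204.8 GB/sec peak is simply the sixteen DDR3 channels at 12.8 GB/sec each. The function names are hypothetical.

```python
# Rough efficiency ratios from the early Mars benchmark numbers.
# These ratios are our own illustration, not official metrics.
PEAK_MEMORY_GB_PER_SEC = 204.8   # sixteen DDR3 channels x 12.8 GB/sec
COPIES = 64                      # one benchmark copy per core


def stream_fraction_of_peak(measured_gb_per_sec):
    """Share of theoretical memory bandwidth that STREAM Triad reached."""
    return measured_gb_per_sec / PEAK_MEMORY_GB_PER_SEC


def rate_scaling(rate_score, single_copy_score, copies=COPIES):
    """Return (speedup over one copy, speedup divided by copy count)."""
    speedup = rate_score / single_copy_score
    return speedup, speedup / copies


if __name__ == "__main__":
    print(f"STREAM Triad: {stream_fraction_of_peak(80.0):.0%} of peak")
    for label, rate, single in (("integer", 672, 19.2),
                                ("floating point", 585, 17.8)):
        speedup, efficiency = rate_scaling(rate, single)
        print(f"{label}: {speedup:.1f}x on {COPIES} copies, "
              f"{efficiency:.0%} scaling efficiency")
```

On these figures, STREAM Triad reaches a bit under 40 percent of peak memory bandwidth, and 64 copies of the SPEC tests deliver roughly 35 times the single-copy integer score and about 33 times the floating point score, or a little more than half of perfect scaling.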
If the Mars chip has on-die NUMA or SMP clustering, this was not revealed. But it almost certainly does have such features or it cannot be considered a building block for big iron machinery. Presumably this Hawk cache coherency network that links together the ARM core panels, the CMCs, and the PCI-Express controllers on the die can be extended out across multiple processor sockets. It is also a wonder why Ethernet or InfiniBand ports are not on the Mars chip, but perhaps that is coming with the next generation.
“This is a good beginning,” Zhang said at the end of his presentation. “In the next few years, we will be adding a more powerful core.” This follow-on Mars core will have a more aggressive branch predictor, multithreading, more aggressive instruction-level parallelism, and a wider SIMD unit. The power efficiency will also be increased, memory bandwidth will be boosted, and more RAS features will be added. All of this will presumably be enabled through a process shrink at Phytium’s fab partner, and if that is Taiwan Semiconductor Manufacturing Co, that could mean a jump straight from 28 nanometers over 20 nanometers to 16 nanometers.
It does not look like Phytium is interested in building its own systems, but rather wants to sell its Mars and Earth ARM processors to others who do make machines to sell to customers. Inspur has invested in an Itanium-based big iron machine called the K1 that runs the K-UX variant of Red Hat Enterprise Linux, and the company must be looking around for an indigenous alternative to Itanium with that processor clearly being sunsetted by Intel and Hewlett-Packard. (Although they never talk about it.) While Chinese server makers can get behind the OpenPower effort and join with Suzhou PowerCore to build Power8-based machines, making the case for ARM is just as easy, given its dominance in smartphones and tablets.
To make the case for the Mars and Earth processors, Zhang showed a table of server revenue figures for China and for the entire world. As you can see, Inspur, Lenovo, Huawei Technology, and Sugon – all indigenous companies in China with aspirations outside of the Middle Kingdom – do well in China. (Huawei was growing more modestly in the first quarter.) HP and Dell do reasonably well in China, with Dell doing a lot better than HP (thanks in part to Dell’s partnerships with hyperscalers Tencent, Alibaba, and Baidu). What you can also see is that most of IBM’s server business in China was due to its System x division, which is now part of Lenovo; without X86 machines, IBM’s Power Systems and mainframe businesses drop into the Others category that is shrinking fast in China. IBM has bet big that China will get behind OpenPower, but Phytium is betting that ARM has a better chance.
It is always good to have options, we say. And if Applied Micro, AMD, Cavium Networks, Broadcom, and Qualcomm can’t build processors that appeal to scale up and scale out customers, maybe Phytium can. It is not at all clear when Phytium intends to tape out, much less deliver Mars processors.