While a lot of the applications in the world run on clusters of systems with a relatively modest amount of compute and memory compared to NUMA shared memory systems, big iron persists and large enterprises want to buy it. That is why IBM, Fujitsu, Oracle, Hewlett Packard Enterprise, Inspur, NEC, Unisys, and a few others are still in the big iron racket.
Fujitsu and its reseller partner Oracle (server maker, database giant, and application powerhouse) have made a big splash at the high end of the systems space with a very high performance processor, the Sparc64-XII, and a new line of Sparc M12 systems that employ it. Fujitsu is aiming the Sparc64-XII not just at those large enterprises that still want Unix-based systems for their mission critical applications, but also at its existing customers, and maybe every once in a while new customers, who are looking at running deep learning, traditional HPC, and analytics workloads natively against the data stored in these big iron systems.
In The Moore’s Law Slow Lane, By Design
In many ways, the remaining chip makers who create the motors inside NUMA-style big iron are lucky. Their customers, who run their most critical transaction processing and database systems on this big iron, are extremely risk averse and are therefore loath to change platforms unless absolutely necessary. Such decisions to change platforms are usually made for political rather than technical or economic reasons, and they are few and far between each year. These customers also have predictable workloads, which makes capacity planning easier; they tend to see transactions rise in concert with economic growth and as they add a few applications here and there to their vast portfolios of software, which can number hundreds to thousands of distinct applications.
This all makes it fairly easy for companies like Fujitsu to plan out their processor roadmaps, and frankly, to stretch out the time between processor generations. Back in the day, big iron systems had processor upgrades every two years or so, and then it stretched to three years, and now it is even longer. With so much memory and processing available in prior generations of machines, Fujitsu did not feel like it had to rush through its roadmap. On the commercial side of its Sparc64 server lineup, the four-core Sparc64-VII processors etched with 65 nanometer processes debuted in July 2008 and its kicker, the Sparc64-VII+, came out in December 2010 with some clock speed enhancements (up to 3 GHz) and larger L2 caches to boost performance; it was sold by Oracle as the Sparc M3 processor, by the way. There was no commercial-grade Sparc64-IX processor, but the sixteen-core Sparc64-X, implemented in 28 nanometer processes, made its debut in September 2012 as a kind of unified architecture for the HPC variants (the Sparc64-VIIIfx and Sparc64-IXfx, which did not have on-chip NUMA interconnects and which had other HPC-specific instructions). The “Athena” systems that Fujitsu created with these chips were sold by Fujitsu itself and by Oracle. The sixteen-core Sparc64-X+ followed in August 2013 with the Athena+ systems and a clock speed burst to 3.5 GHz on that same 28 nanometer process, plus some other tweaks. There has not been so much as a peep out of Fujitsu since then, except a move to the ARM architecture and back towards custom HPC processors for the Post-K exascale machine.
You might note that there was no Sparc64-XI processor, just a Sparc64-X+ chip three and a half years ago, and if you want to be really honest about it, it has been four and a half years between process nodes for Fujitsu with regard to the commercial Sparc64 processors. That is a long time, but still better than the ill-fated “Kittson” Itanium 9700 processors, which were supposed to be implemented in 22 nanometers in a converged Itanium-Xeon E7 socket until Intel put the kibosh on that plan four years ago. The last Itanium chip, the eight-core “Poulson” Itanium 9500, was unveiled in August 2011, so it has been more than five years since there was an upgrade shipping in systems. Even IBM’s Power8 came to market after a longer than usual run for the Power7 and Power7+, which together spanned more than four years from 2010 through 2014, and the gap between Power8 and Power9 is going to be more than three years. (To IBM’s credit, the Power processor jumps have been fairly regular since the dual-core Power4 was launched way back in 2001.)
Alex Lam, vice president of Enterprise Business & Strategy at Fujitsu in North America, knows that the Japanese server maker has been quiet about its commercial Sparc64 chips in recent years, enough to make companies wonder if perhaps it might switch to Oracle M series and S series chips for its Sparc/Solaris platforms and call it a day. But that is not the case, and the jump to ARM for the PrimeHPC platform has not diminished the company’s commitment to Sparc64 for enterprises.
“The PrimeHPC supercomputers are a flagship product and obviously have a halo effect,” Lam tells The Next Platform. “But ultimately, if you look at where the real bread and butter is for the Sparc64 systems, it was on the commercial side. There is definitely a market and a valid use case for these kinds of systems.”
Oracle thinks so, too, and as far as we know is continuing development on its Sparc T8 and M8 processors for midrange and high-end systems, respectively, and is additionally reselling the Sparc64-XII in the new M12 systems under its own brand. Under the arrangement between Oracle and Fujitsu, Oracle peddles its own boxes all over the world, Fujitsu sells its own inside of Japan, and Fujitsu sells in North America and Europe in a way that minimizes bumping heads with Oracle; the two collaborate on Solaris Unix operating system development on the Sparc64 platforms. “There are clear swim lanes on where we sell the products,” says Lam. “It is a very healthy collaborative approach that promotes the Sparc platform in a way that is beneficial to both companies.”
Inside The New Sparc64-XII Processor
Fujitsu is changing up its strategy for big iron systems with the Sparc64-XII and the M12 systems that use them, and the choices it is making are interesting in that they are going in exactly the opposite direction of what IBM is doing with the Power8 and future Power9 chips.
First, the core count on the Sparc64-XII chip is going down, not up. Rather than add more cores to the chip, Fujitsu is adding brawnier cores, with lots more threading and other instructions per clock (IPC) tweaks. Lam says that based on its own internal benchmark test results, the core in the Sparc64-XII chip is delivering 2.5X the performance of the core used in the Sparc64-X+ processor that came out nearly four years ago. This is a big leap in per-core performance, and it will allow many customers who are paying per-core licensing fees for software to lower their budgets for databases, middleware, and applications while getting more throughput on those applications.
The Sparc64-XII cores run at 3.9 GHz, 11.4 percent higher than the 3.5 GHz clock speed on the predecessor Sparc64-X+ cores. Even with a drop from sixteen cores with the Sparc64-X+ to the twelve in the Sparc64-XII, the aggregate throughput per socket is going to rise by a factor of about 1.9X (2.5X per core across three-quarters as many cores). Yes, that was a long time to wait for a near doubling of performance. But for many Sparc64 customers, the wait will be worth it. (The wonder is why Fujitsu did not show off this processor at Hot Chips last summer, as it normally would have.)
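The per-socket figure can be checked with a few lines of arithmetic. What follows is only a rough sketch: it assumes throughput scales linearly with core count, which real workloads rarely do perfectly, and the function names are ours for illustration, not anything from Fujitsu.

```python
def socket_speedup(per_core_gain: float, new_cores: int, old_cores: int) -> float:
    """Estimated per-socket throughput gain.

    Assumption: throughput scales linearly with the number of cores.
    """
    return per_core_gain * new_cores / old_cores


def clock_uplift_percent(new_ghz: float, old_ghz: float) -> float:
    """Percentage increase in clock speed from old_ghz to new_ghz."""
    return (new_ghz / old_ghz - 1.0) * 100.0


if __name__ == "__main__":
    # 2.5X per core (Fujitsu's internal benchmark claim), 12 cores versus 16.
    print(f"Per-socket gain: {socket_speedup(2.5, 12, 16):.3f}X")
    # 3.9 GHz Sparc64-XII versus 3.5 GHz Sparc64-X+.
    print(f"Clock uplift: {clock_uplift_percent(3.9, 3.5):.1f} percent")
```

The result, 1.875X, is what gets rounded to the 1.9X cited above. The clock speed bump accounts for only a small part of the 2.5X per-core gain, so the rest has to come from threading and IPC improvements.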
The Sparc64-XII is implemented in 20 nanometer processes and is manufactured by Taiwan Semiconductor Manufacturing Corp. For the first time, Fujitsu is adding an L3 cache to its processor, something it has eschewed until now because of the large transistor budget it requires. The Sparc64-X+ weighed in at 2.99 billion transistors, but the Sparc64-XII comes in at 5.5 billion transistors, and a lot of that comes through the addition of that L3 cache. But that 32 MB L3 cache is vital in an architecture that is moving up to eight threads per core, up from two threads per core in the past four generations of Sparc64 commercial chips, and keeping all of those threads happily fed with data requires another layer of cache.
(It is interesting to note that IBM’s Power8 chips have L4 cache in the memory controllers on its homegrown memory cards, which are used in its biggest, baddest NUMA boxes.) IBM’s high-end Power9 chip will similarly have a dozen cores per chip when operating with eight threads, and there is a version with four threads per core that will scale to 24 cores per socket for systems that need lots of compute, less main memory, and limited NUMA expansion.
Each of the cores in the Sparc64-XII processor has 64 KB of L1 instruction cache and 64 KB of L1 data cache, plus a 512 KB L2 cache. The core implements various SIMD instructions and also has a new decimal floating point math unit, which comes in handy counting money. (The Power chip has had decimal math capability for several generations.) The chip has various accelerators for cryptographic and hashing algorithm processing, and supports AES, DES, 3DES, RSA, DSA, DH, and SHA protocols natively in hardware.
The Sparc64-XII chip has four DDR4 memory controllers, which support memory speeds of up to 2.4 GHz; the Sparc64-X and Sparc64-X+ chips supported older DDR3 memory and had only two memory controllers. The memory bandwidth per socket has risen by a factor of two along with the number of memory controllers, to 153 GB/sec, but the main memory capacity has been kept the same at 1 TB per chip using 64 GB memory sticks. (The prior Sparc64-X+ machines had 32 sticks at 32 GB to reach that capacity.) To balance out memory performance, the Sparc64-XII chip has two DDR4 memory channels per controller and two DIMMs per channel for a total of sixteen sticks per socket.
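The memory numbers hang together if you work them through. The sketch below assumes that the "2.4 GHz" figure means DDR4-2400, that is, 2,400 million transfers per second, and that each channel moves 8 bytes (64 bits) per transfer; both are standard for DDR4 but are our reading rather than something Fujitsu spelled out. The function names are hypothetical.

```python
def dimm_slots(controllers: int, channels_per_controller: int, dimms_per_channel: int) -> int:
    """Number of memory sticks a socket can hold."""
    return controllers * channels_per_controller * dimms_per_channel


def peak_bandwidth_gb_s(channels: int, mega_transfers_per_sec: int, bytes_per_transfer: int = 8) -> float:
    """Theoretical peak bandwidth in GB/sec.

    Assumption: 8 bytes per transfer, i.e. a 64-bit DDR4 channel.
    """
    return channels * mega_transfers_per_sec * bytes_per_transfer / 1000


if __name__ == "__main__":
    slots = dimm_slots(controllers=4, channels_per_controller=2, dimms_per_channel=2)
    channels = 4 * 2
    print(f"DIMM slots per socket: {slots}")
    print(f"Capacity with 64 GB sticks: {slots * 64} GB")
    print(f"Prior generation, 32 sticks at 32 GB: {32 * 32} GB")
    print(f"Peak bandwidth: {peak_bandwidth_gb_s(channels, 2400):.1f} GB/sec")
```

Under those assumptions, eight channels at 2,400 MT/sec come to 153.6 GB/sec, in line with the 153 GB/sec quoted, and sixteen 64 GB sticks give the same 1,024 GB (1 TB) as the earlier 32 sticks at 32 GB.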
The Sparc64-XII has twice as many PCI-Express controllers, and has four of them integrated with eight lanes of traffic each for a total of 32 lanes; that is twice the peripheral bandwidth. The NUMA interconnect runs at 25 Gb/sec, the same speed that NUMA links ran at with the Sparc64-X+ chips, but the Sparc64-X chips ran at only 14.5 Gb/sec. (By the way, peripheral interconnects on the Power9 chip run at 25 Gb/sec, but the NUMA links only run at 16 Gb/sec.)
The Sparc64-X+ interconnect implemented a glueless four-socket M10 node and then allowed for up to sixteen of these to be linked together over routers for a total of 64 sockets and 64 TB of memory. With the Sparc64-XII machines, Fujitsu is stepping back to a two-socket glueless node, which is sold as the M12 server, and allows up to sixteen of these to be linked together to create a system with a total of 32 sockets and 32 TB of main memory.
The funny bit is that Fujitsu has cut the socket count and maximum memory back in the high-end M12-S system by half moving from the Sparc64-X+ to the Sparc64-XII, just like IBM cut back from 32 sockets and 16 TB of memory with the Power7 in the Power795 to sixteen sockets and originally 16 TB with the Power8 in the Power E880. (IBM did boost this to 32 TB of maximum capacity with a very pricey 128 GB memory stick back in January 2016.) Oracle can scale its “Bixby” interconnect to 96 sockets and 96 TB or, with fatter DIMMs, even 192 TB, but the Sparc M5 and Sparc M6 systems topped out at 32 sockets and 32 TB, and the Sparc M7 machines launched in October 2015 expanded to 64 sockets but only with 32 TB of maximum memory. It is unclear what Oracle will do with the future Sparc M8 systems it is creating and which will compete against Power9 and Fujitsu M12-S.
Clearly, enterprises are not looking for machines that can do more than 32 TB in a single footprint, or IBM, Fujitsu, and Oracle would be selling them. Or maybe they are just too expensive.
The Sparc M12-2 “Athena++” server that Fujitsu is building and that Oracle is reselling has two of the 3.9 GHz Sparc64-XII processors, supports up to 2 TB of main memory, and has 11 PCI-Express 3.0 slots; with expansion units, it can fan out to a total of 71 PCI-Express slots sharing the bandwidth for peripherals. It has a four-port 10 Gb/sec Ethernet card embedded in the system board, and significantly, the M12-2 system has an aggregate memory bandwidth of 306 GB/sec, which is pretty good and absolutely competitive with the Power8 platform. The chassis has room for eight 2.5-inch drives, which can be 600 GB or 900 GB SAS disks or 400 GB eMLC SAS SSDs.
If you plan on doing NUMA scalability beyond two sockets, then Fujitsu cranks the clock speed on the cores to 4.25 GHz (that’s a 21.4 percent boost over the 3.5 GHz clock speed on the Sparc64-X+) and allows customers to pick configurations with 2, 8, or 32 sockets and 2 TB, 8 TB, or 32 TB maximum memory configurations using 64 GB memory sticks. (If you want to use less capacious 16 GB or 32 GB memory sticks, you can.) This beast can bring to bear 384 cores and 3,072 threads on a single application running against that big block of shared memory, and this Sparc64 interconnect will have much higher bandwidth and lower latency than an Ethernet or InfiniBand network – at least for a while, anyway.
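For those who want to see how the three configurations add up, here is a small sketch built from the per-socket figures cited in this article: twelve cores, eight threads per core, and 1 TB of memory per socket using 64 GB sticks. The constants and the function name are ours for illustration only.

```python
CORES_PER_SOCKET = 12   # Sparc64-XII core count
THREADS_PER_CORE = 8    # threads per core on the Sparc64-XII
TB_PER_SOCKET = 1       # sixteen 64 GB memory sticks per socket


def system_totals(sockets: int) -> dict:
    """Aggregate cores, threads, and maximum memory for a given socket count."""
    cores = sockets * CORES_PER_SOCKET
    return {
        "sockets": sockets,
        "cores": cores,
        "threads": cores * THREADS_PER_CORE,
        "memory_tb": sockets * TB_PER_SOCKET,
    }


if __name__ == "__main__":
    # The three configurations Fujitsu offers beyond the two-socket M12-2.
    for sockets in (2, 8, 32):
        print(system_totals(sockets))
```

At 32 sockets this yields the 384 cores, 3,072 threads, and 32 TB mentioned above.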
With the Sparc M10, Fujitsu used water cooling on key components to increase the cooling efficiency of the system and to allow for the components to run a bit hot. With the Sparc M12 systems, Fujitsu has an improved evaporative liquid cooling system that is twice as effective, and which we presume lets it overclock components a bit to deliver that big performance bump.
The Fujitsu Sparc M12-2 and M12-S systems are available now. The Sparc M12-2 base machine with two 3.9 GHz processors, 64 GB of memory, and one 600 GB disk drive costs $49,660. A single node in a Sparc M12-S cluster with two of the 4.25 GHz processors, the same 64 GB of memory, a single disk drive, and the “XB” crossbar interconnect router costs $64,284.
One last thing: Unlike Oracle’s own iron, Fujitsu keeps supporting the Solaris 10 operating system natively on its iron, and this includes the new M12-2 system and M12-S NUMA cluster. Fujitsu customers can run Solaris 10 on bare metal or in Solaris containers on Solaris 11 machines; Oracle only allows Solaris 10 in containers on newer Sparc iron of its own design.