ARM Server Chips Challenge X86 in the Cloud

The idea of ARM processors being used in datacenter servers has been kicking around for most of the decade. The low-power architecture dominates the mobile world of smartphones and tablets as well as embedded IoT devices, and with datacenters increasingly consuming more power and generating more heat, the idea of using highly efficient ARM chips in IT infrastructure systems gained steam.

That was furthered by the rise of cloud computing environments and hyperscale datacenters, which can be packed with tens of thousands of small servers running massive numbers of workloads. The thought of using ARM-based server chips that are more energy-efficient than their Intel Xeon counterparts to run all these servers was enticing.

But despite all the talk, ARM’s path into the datacenter has been bumpy. Calxeda was early to the party, but ran out of money and had to shut its doors. Others, such as Samsung and – it looks like – Broadcom (following the massive $37 billion merger with Avago), have pulled back on plans to manufacture ARM server chips. Broadcom was expected to get its “Vulcan” chip into the market by 2015. Those that have pushed out ARM chips – including AMD, Applied Micro and Cavium – have not seen widespread adoption of their products.

As we’ve discussed, Qualcomm seems to be the most aggressive in its intent to bring out an ARM server chip that can compete with Xeon processors and chip away at Intel’s dominance in the space (Intel holds about 97 percent of the server chip market). When ARM officials first began talking about putting their architecture into datacenter servers, they were eyeing an Intel that was struggling to reduce the power consumption of its x86 chips. ARM seemed a natural alternative.

However, the playing field has changed in recent years. Intel has improved the energy efficiency of its Xeons, and other players – in particular IBM, with its OpenPower effort, and AMD, with its upcoming x86 Zen chips – are also working to become another option for businesses that are looking for a second source of silicon, not only to drive down prices and fuel innovation through competition, but also to protect themselves in the event of supply chain problems.

Still, ARM officials have seen some momentum behind their efforts. Fujitsu last year announced it was ditching the SPARC architecture in favor of 64-bit ARMv8-A SoCs for the next generation of its K supercomputer, which is the seventh-fastest system in the world, according to the Top500 list. The goal is to improve the performance-per-watt of the new system. Once operational, the Post-K supercomputer will be an exascale system 100 times faster than the current system. More recently, Qualcomm this month announced a joint venture with China’s Guizhou province named Huaxintong Semiconductor Technology, which is developing an ARM-based server chip for the Chinese market. In addition, the Mont-Blanc Project in Europe is working with Cavium and system-maker Bull, which is owned by Atos, to build a prototype exascale computer using Cavium’s ThunderX2 ARM-based SoCs.

The rise of mobile and cloud computing and the growth of infrastructure-as-a-service (IaaS) also hold out hope for energy-efficient architectures like ARM. It is at this intersection in the rapidly evolving IT landscape that two researchers from India recently tested the 64-bit ARM architecture against an x86 chip in running data analytics workloads. Jayanth Kalyanasundaram and Yogesh Simmhan from the Department of Computational and Data Sciences at the Indian Institute of Science in Bangalore ran tests pitting a server powered by AMD’s year-old ARM-based A1170 SoC against one based on the chipmaker’s x86-based Opteron 3380. While there have been numerous studies of 32-bit ARM processors for various workloads, there had been no research around ARM64 chips and how they handle cloud-based applications, particularly big data workloads, the researchers wrote in their work titled “ARM Wrestling with Big Data: A Study of ARM64 and x64 Servers for Data Intensive Workloads.”

“Since energy consumption by servers forms the major fraction of the operational cost for cloud data centers, ARM64 with its lower energy footprint and server-grade memory addressing has started to become a viable platform for servers hosted by Cloud providers,” Kalyanasundaram and Simmhan wrote. “This is particularly compelling given that scale-out (rather than scale-up) workloads are common to Cloud applications, and the growing trend of containerization as opposed to virtualization.”

The two researchers used a SoftIron Overdrive 3000 server powered by an eight-core, 2GHz A1170 chip with 16GB of RAM, a 1TB Seagate Barracuda HDD with a 64MB cache and Gigabit Ethernet. The system ran an openSUSE Linux distribution and a BTRFS file system. The x86 system was also a single cluster node, but with the 2.6GHz, eight-core Opteron 3380 processor. The server had a similar configuration: 16GB of RAM, a 256GB SSD for the operating system partition, the same Seagate 1TB HDD and Gigabit Ethernet. It ran the CentOS 7 Linux distribution, with the EXT4 file system for the SSD and BTRFS for the HDD. Both systems used the OpenJDK v7 compiled for x64 and ran Hadoop v2.7.3 in pseudo-distributed mode.

The tests used Intel’s HiBench Big Data benchmark suite running a variety of workloads, from web search and Hive queries to machine learning and reducer parallelism tuning. The researchers also analyzed the energy efficiency of each system while running the various benchmarks.

The detailed findings can be found in the study, but the results are encouraging for cloud-based players considering ARM-based systems for their environments and for chip vendors that are developing ARM-based SoCs and hoping the cloud will give them another avenue into the datacenter server market. According to the researchers, there was comparable performance between the two servers when running integer-based workloads and jobs with smaller floating-point sizes. The ARM server was dinged when running larger floating-point applications due to its slower floating-point unit (FPU) coprocessor. However, “with tuning Hadoop to expose data parallelism, the ARM64 server can come close to the performance of the x64 server, which is limited by having a faster FPU shared by pairs of cores,” they wrote.

As for energy efficiency, the ARM server had a base power load three times smaller than that of the x86 system, with a similar reduction in load when running the big data workloads. The ARM server also had similar benefits when looking at the energy-delay product (EDP) – which captures both compute performance and power efficiency – with a 50 to 71 percent advantage over the x64 system. The two researchers plan to expand the study to include a better understanding of disk IO performance and a deeper dive into the relative performances of the FPUs, as well as the impact of containerization and virtualization and how the systems run other big data workloads for stream processing and graph analytics.
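For readers unfamiliar with the metric, the energy-delay product is conventionally defined as the energy a job consumes multiplied by the time it takes, so a slower machine is penalized even if it draws less power. The Python sketch below illustrates that conventional definition only. The study may compute EDP differently, the function names are hypothetical, and the power and runtime figures are invented for illustration rather than taken from the paper.

```python
def energy_delay_product(avg_power_watts: float, runtime_seconds: float) -> float:
    """Conventional EDP: energy (joules) times delay (seconds)."""
    energy_joules = avg_power_watts * runtime_seconds
    return energy_joules * runtime_seconds


def edp_advantage(edp_a: float, edp_b: float) -> float:
    """Fractional EDP reduction of system A relative to system B.

    A positive result means A has the lower (better) EDP.
    """
    return 1.0 - edp_a / edp_b


# Made-up example: a server that draws a third of the power but runs
# the same job 50 percent slower. These are NOT measurements from the study.
arm_edp = energy_delay_product(avg_power_watts=20.0, runtime_seconds=150.0)
x64_edp = energy_delay_product(avg_power_watts=60.0, runtime_seconds=100.0)

print(f"ARM EDP: {arm_edp:.0f} J*s")
print(f"x64 EDP: {x64_edp:.0f} J*s")
print(f"ARM EDP advantage: {edp_advantage(arm_edp, x64_edp):.0%}")
```

Because runtime enters the product twice, a threefold power saving in this invented case shrinks to a 25 percent EDP advantage once the longer runtime is taken into account.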

The ARM architecture will continue to face a tough fight to become the preferred alternative to Intel in the datacenter. The best opportunity it had was several years ago, and the competitive landscape has grown more crowded since then. But studies like the one done by Kalyanasundaram and Simmhan will give cloud providers and hyperscale companies reasons to consider the architecture.

4 Comments

  1. Why does the study not use Xeon processors, which, as this article notes, have a 97% MSS? And besides, the comparison ARM vs. x86 is fundamentally flawed since that would be like comparing cars vs. motorcycles. It depends on which specific vehicle you’re using.

    • Yes, very odd. Current AMD Opterons are known to be neither fast nor power-efficient. That’s why they have basically zero share in the server market.

  2. “First they ignore you, then they laugh at you, then they fight you, then you win.”

    Six years after the first commercial ARM server development, the ZT Systems (PHYTEC) R1801e network storage appliance with up to 16 discrete ST SPEAr ARM9s at 80w system power, the first encompassing applications report arrives: the Indian Institute of Science’s ARM Wrestling with Big Data, January 21, 2017.

    Follows a history of focused assessment; Tirias, MySQL Database using Thunder X, November 14, 2016; Anandtech, Investigating Cavium Thunder X June 15, 2016; Linley Group, X-Gene 3 Challenges Xeon E5, April 2016; Journal of Physics, Frederic Pinel, University of Luxembourg, Dissertation; Energy-Performance Optimization for Cloud, November 27, 2014, HSOC Benchmark for an ARM Server May 13, 2014; University of Edinburgh, Energy Efficiency of SOC Based Processors; April 23, 2013; Calxeda’s own ECX1000 1.1 GHz v E3 1240 3.3 GHz, June 21, 2012 and all I can say is that’s running at Intel speed.

    Now all ARM consortium has to do is beat Intel fabrication production cost : price, and the customer concerns.

    Intel must be concerned with the antitrust and competitive potentials having hunkered down to barricade Data Center Group Xeon in commercial pricing at 82% first tier discount off 1K. Intel statement on DCG revenue lag has little to do with actual demanders appears Intel diversion.

    Assessment Broadwell Xeon E5 2600 v4 EP DP –

    Checked against Intel 2016 DCG revenue divided by analyst Xeon unit total production volume for determining per unit average price. Also, analyst q4 2016 broker channel inventories holding report by product category volumes, for calculating broker holding’s 1K revenue value adjusted to reflect Intel 2016 percent of division revenues statement determines Intel price discount level.

    Summary – 33% of run or 10,775,736 units of BW v4 14nm production are priced less than full run Marginal Cost $160 is among the competitive development hurdles. Intel on Ivy and Haswell volumes has signaled to first tier dealers all margin values rung from less than 16 cores. Now sit in channels reverberating for other than cloud procurement. On amount of Intel surplus enterprise procurement secures the open market price advantage.

    Note 1- BW v4 Average Marginal Cost @ $160 is $31 more than 22 nm Haswell per unit of production cost.

    Note 2 – Average Price of BW v4 Cost calculated below @ $174 represents the non weighed price on grade SKUs, and is not profit maximized.

    Industry total revenue displacement on Broadwell E5 2600 v4 dumping is competitive cost entry barrier; $4,441,803,931, estimate a good 2x five year ARM system constituent development cost.

    Q4 2016 Intel x86 broker market inventory holdings report here:

    http://seekingalpha.com/article/4033057-intel-another-threat-emerges-zen

    ARM fabricators and design producers;

    Key; E5 26xx v4 Grade SKU, Intel 1K Price, 1st Tier Customer Price, Intel profit or (loss).

    Approximately 32,852,212 units of production.

    FOUR CORE

    2623, 3.0 GHz, 10 MB L3, 85w; $444, $79.92, ($80.08)
    2637, 3.5 GHz, 10 MB L3, 135w; $996, $179.28, $19.28 above cost

    SIX CORE
    2603, 1.7 GHz, 15 MB L3, 50w; $213, $38.34 = Variable Cost, ($121.66)
    2643, 3.4 GHz, 15 MB L3, 135w, $1552, $279.36, $119.36 profit > cost

    EIGHT CORE
    2608L, 1.6 GHz, 20 MB L3, 50w; $363, $65.34, ($94.66)
    2609, 1.7 GHz, 20 MB L3, 85w; $306, $55.08, ($104.92)
    2620 2.1 GHz, 20 MB L3, 85w; $417, $75.06, ($84.94)
    2667 3.2 GHz, 20 MB L3, 135w; $2057, $370.26, $210.26 profit > cost

    TEN CORE
    2618L 2.2 GHz, 25 MB L3, 75w; $779, $140.22, ($19.78)
    2630L 1.8 GHz, 25 MB L3, 55w; $612, $110.16, ($49.84)
    2640 2.4 GHz, 25 MB L3, 90w; $939, $169.02, $9.02 above cost
    2689 3.1 GHz, 25 MB L3, 165w; $2723, $490.14, $330.14 competitive profit level for Intel

    TWELVE CORE
    2628L 1.9 GHz, 30 MB L3, 75w; $1364, $245.52, $85.52 profit > cost
    2650 2.2 GHz, 30 MB L3, 105w; $1166, $209.88, $49.88 profit > cost
    2687W 3.1 GHz, 30 MB L3, 160w; $2141, $385.38; $225.38 competitive profit level

    FOURTEEN CORE
    2648L 1.8 GHz, 35 MB L3, 75w; $1544, $277.92, $117.92
    2650L 1.7 GHz, 35 MB L3, 105w; $1329, $239.22, $79.22
    2658 2.3 GHz, 35 MB L3, 105w; $1832, $329.76, $169.76
    2660 2.0 GHz, 35 MB L3, 105w; $1445, $260.10, $100.10
    2680 2.4 GHz, 35 MB L3, 120w; $1745, $314.10, $154.10
    2690 2.6 GHz, 35 MB L3, 125w; $2090, $376.20, $216.20

    SIXTEEN CORE
    2683 2.1 GHz, 40 MB L3, 120w; $2424, $436.32, $276.32
    2697A 2.6 GHz, 40 MB L3, 145w; $2702, $486.36, $326.36 competitive profit level

    EIGHTEEN CORE
    2695 2.1 GHz, 45 MB L3, 120w; $2424, $436.32, $276.32
    2697 2.3 GHz, 45 MB L3, 145w; $2702, $486.36, $326.36 competitive profit level

    TWENTY CORE
    2698 2.2 GHz, 50 MB L3, 135w; $3226, $580.68, $420.68 just below economic profit point for Intel

    TWENTY-TWO CORE
    2696 2.2 GHz, 55 MB L3, 150w; $4115, $740.70, $580.70 entering economic profit points for Intel
    2699 2.2 GHz, 55 MB L3, 145w; $4115, $740.70, $580.70
    2699R 2.2 GHz, 55 MB L3, 145w; $4569, $822.42, $662.42
    2699A 2.4 GHz, 55 MB L3, 145w; $4938, $888.84, $728.84

    Marginal Cost of 24 (22 core) master on production economic total revenue total cost assessment (before marginal cost of sort and dice) = $511.

    Science rarely leads to objects of replication, but objects for further articulation and specification under new and more stringent conditions.

    “Be the change you wish to see in the world.”
    “Which is the more powerful, the elephant or beehive”.

    Mike Bruzzone, Camp marketing

  3. Good to know! Now get on to asking Lisa Su about AMD’s custom K12 CPU IP running the ARMv8A ISA. Do this every week until there is an answer! The real question about ARM should be put to those who are responsible for reporting on the ARM market! Such as: where are the complete rundowns on all the custom micro-architectures that are engineered to run the ARMv8A ISA? Let’s get the full rundown on the custom ARM makers’ exact implementations in CPU hardware, such as cache levels, instruction decoders, floating point units (per core), integer units (per core), branch units, address generation units, etc., on a per-core basis. With some nice figures on branch misprediction penalties, reorder buffer size, cache algorithms and implementations (victim cache, write-through cache, cache exclusivity) and cache associativity to lower level cache and main memory.

    Another very important question for AMD is what about SMT and AMD’s K12 custom ARM micro-architecture. Let’s start looking at all the features that IBM Power8/Power9 RISC ISA designs have that any custom ARM micro-architecture based on the ARM RISC ISA should be able to implement, ditto for any Intel or AMD x86 features that can be done using the ARMv8A ISA.

    Those ARM Holdings Scalable Vector Extensions (SVE) look interesting for the HPC markets. ARM and Fujitsu announced these SVE extensions to the ARMv8-A architecture, with Fujitsu planning an exascale system built around the ARMv8A and SVE ISA extensions.
