Multi socket servers have been around since the dawn of enterprise computing. The market and technology has matured into 2-socket x86 and multi core highly integrated processors. In this piece, I hope to answer some questions about how we landed here and what is the likely future.
Large symmetric multiprocessor systems (SMP) once ruled the enterprise, much like the dinosaurs roamed the earth. This was required due to many factors (1) process technology only allowed single core CPUs (2) large scale networks did not exist, or at least were not economically viable (3) applications were oriented around terminal – mainframe experience. These factors prevented the notion of large scale out systems and applications like we have today. As those factors changed, enterprise started to shift from large SMP machines to scale out systems of moderate socket counts (2 & 4 socket predominately). Process technology and large networks disrupted the economics and continued to where we are today with scale out 2-4 socket with many-core CPUs.
But why did we stop at 2-4 socket systems? Why have we not made the last transition to 1-socket? There have been several reasons in the past, but I’m not sure they are valid anymore. As we made the transition from mainframes to scale out, the CPUs were single core, and the cost economics of sharing north bridges, south bridges, and memory controllers made a ton of sense. With the dawn of multi core and the integration of memory controllers there was still a valid argument to share IO devices, south bridge, PSUs, fans, etc.
But today we are entering the era of high power, highly integrated many-core CPUs and the argument to share what’s left – IO, PSU, fans, etc is becoming very weak. In fact, it is leading to power and performance challenges. The CPUs today and moving forward are equal in power to 2 to 4 of the CPUs of the past – we are talking 250W+ CPUs with 32 cores and higher power/core count in the future. This has led to the demand for more memory, more NVMe drives… and thus more power per U. Each socket has so much performance (many cores) you really should *not* share IO between sockets, in fact doing so negates performance – read on. Years ago, in our high volume 2U server the top PSU was 750 Watts and now we have standard PSU offerings up to 2400 Watts – our power per U has gone up 3x. This is leaving racks half empty and data center hot spots.
Dual socket servers are creating performance challenges due to Amdahl’s law – that little law that says the serial part of your problem will limit performance scaling. You see, as we moved to the many core era, that little NUMA link (non-uniform memory access) between the sockets has become a huge bottle neck – basically we have entered an era where there are the too many cooks in the kitchen. With 28, 32, 64 or more cores on each side trying to reach across the NUMA link to share resources like memory, and IO, not to mention cache coherency traffic, the socket size is blowing up. In order to keep the IO and memory bandwidth per core above water, there is a fight for NUMA link pins – and remember, more pins equals more power. For a balanced system though, the priority is core memory bandwidth first, then core IO bandwidth, and then lastly NUMA bandwidth – so we have a growing problem in multi-socket systems and it is the last priority to get pins. As an example, in a 2-socket system, the bandwidth to IO off CPU0 vs CPU1 can drop 25% while the latency can go up 75% during stress vs native traffic from CPU0 to the IO. See figures 1, 2, 3 – in figure 2 higher is better (bandwidth), while in figure 3 lower is better (latency). In this test we simply used STREAM to drive increasing stress and then measured network bandwidth and latency from CPU0 and CPU1.
Figure 1: Typical 2 socket server
Figure 2: Network Bandwidth comparison from CPU0 vs CPU1
Figure 3: Network Latency comparison from CPU0 vs CPU1
Have you ever had strange application performance issues that look networking related? The perceived networking problem may not be a networking problem but rather a network bandwidth/latency issue created inside the server as shown above.
Another point to consider, Figure 4 shows the per core metric (memory/IO bandwidth, core count, cache size, etc) from 2003 through 2018. What you’ll notice is that at the per core metric view, memory bandwidth and IO bandwidth per core have been falling since mid-2011. And keep in mind this does not factor in the IPC (instructions per clock) improvements of the cores from 2003 through 2018. To fix this we need more pins and faster SERDES (PCIe Gen4/5, DDR5, Gen-Z), 1-socket enables us to make those pin trade-offs at the socket and system level.
Figure 4: Standard 2S metrics per core
Why has this happened and why now? Like I said it really comes down to Amdahl’s law, or too many cooks in the kitchen, and the pin count struggle. So, what do we do? Many in the industry, especially the high performance minded, work around the problem a couple of ways: (1) establish manual affinity of the applications you care about the most to the socket with the IO, or (2) double up on NICs and avoid the problem. When you think of cloud native in the future and 1000s of containers, affinity control is not a long-term answer. And if you fix the problem with dual NICs you’ve essentially soft partitioned a 2-socket server into 2 1-socket systems – and received nothing in return other than a bunch of NUMA link pins, a larger socket, higher power (2x NICs) and higher cost (2xNICs) … to avoid the problem.
There is another answer –1-socket servers could rule the future – and here’s why.
Robert’s top 10 List for why 1-socket could rule the future.
- More than enough cores per socket and trending higher
- Replacement of underutilized 2S servers
- Easier to hit binary channels of memory, and thus binary memory boundaries (128, 256, 512…)
- Lower cost for resiliency clustering (less CPUs/memory….)
- Better software licensing cost for some models
- Avoid NUMA performance hit – IO and Memory
- Power density smearing in data center to avoid hot spots
- Repurpose NUMA pins for more channels: DDRx or PCIe or future buses (CxL, Gen-Z)
- Enables better NVMe direct drive connect without PCIe Switches (ok I’m cheating to get to 10 as this is resultant of #8)
- Gartner agrees and did a paper. (https://www.gartner.com/doc/reprints?id=1-680TWAA&ct=190212&st=sb).
Let’s take this one by one with a little more detail.
There are more than enough cores per socket and trending higher – does anyone disagree? And if you look at all the talk of multi-chip packaging, more core per socket is the path till we figure something new, and move away from the Von Neuman Architecture.
Replacement of underutilized 2S servers – again, an easy one; why have a 2-socket system with 8-12 cores per socket when you could replace with single socket 24+ core and not degrade performance below what you paid.
Easier to hit binary channels of memory, and thus binary memory boundaries (128, 256, 512…) – if we free up pins we can more easily stay on 2,4 or 8 channels of memory and not have to deal with those odd 3, 6, 9 channels. This will reduce the situations where we either don’t have enough memory (384) or too much (768) when we needed 512. These are binary systems after all.
Lower cost for resiliency clustering (less CPUs/memory….) – this is an easy win, thinking about 3 node HCI clusters, we need a 3rd node to be a witness essentially. If this was deployed with 1-socket systems, we only need 3 CPUs instead of 6.
Better software licensing cost for some models – while ISV licensing varies, some license by socket and some by core count. Gartner did a paper – Use Single-Socket Servers to Reduce Costs in the Data Center – which showed about a 50% licensing cost reduction for VMware vSAN. I would not count on ISV licensing models as long-term reason though, ISVs tend to optimize licensing models over time. If I were an ISV, I would license by core with a multi socket kicker due to increased validation/testing as you scale beyond a single socket.
Avoid NUMA performance hit – IO and Memory – I think I pretty much covered that above in the graphs. Data and facts are hard to ignore. Keep in mind, we need flat NUMA per socket CPUs to get this ideal result. Avoiding NUMA can also eliminate a potential application performance problem inside the server that looks like a networking problem.
Power density smearing in data center to avoid data center hot spots – this is an interesting area to think about and may not register for all enterprises. Due to power per rack U rising as explained above, we are creating hot spots and less than full racks. Not to mention server configuration limitations – how many times have you wanted A+B+C+D+E in a server and found it was not a valid configuration due to power and/or thermals? If we break a 2-socket server apart we can get better configuration options of NVMe drives, domain specific accelerators, etc… and then spread the heat out across more racks – aiding in better utilization of the power and cooling distribution in a data center.
Repurpose NUMA pins for more channels of: DDRx or PCIe or future buses (CxL, Gen-Z) – this is an easy win and you can see the results in the Dell PowerEdge R7415 and R6415 already. AMD EPYC provides this flexibility and Dell EMC was able to provide this advantage to our customers.
Enables better NVMe direct drive connect without PCIe Switches (ok I’m cheating to get to 10 as this is a result or #8) – this is a direct repercussion of repurposing NUMA links and exactly what we did on the PowerEdge R7415 and R6415 to allow high count NVMe drives without added cost, power, or the performance impact of PCIe switches.
Gartner agrees: Gartner did a paper – Use Single-Socket Servers to Reduce Costs in the Data Center.
So, there you have it, a bit of history, some computer architecture fundamental laws at play, the results of some testing and finally a top 10 list of why I am bullish on the future of 1-socket servers.