Why Aren’t There Software-Defined NUMA Servers Everywhere?

For decades, we have been using software to chop up servers with virtualization hypervisors to run many small workloads on a relatively big piece of iron.

This was a boon for X86 servers during the Great Recession, when server virtualization technology was mature enough to be put into production (although not yet perfect), allowing for server consolidation and higher server utilization and even helping some companies skip one or two generations of server upgrades at a time when they could not really spend the money on new iron.

So how come we don’t also use software to tightly couple many cheap servers together to create chunks of shared memory and associated compute to run workloads larger than a single machine? It is a lot easier to program for a shared memory space than it is for a loosely coupled distributed computing system, so this is a bit perplexing to us. Wouldn’t you like to have a machine with handfuls of sockets and thousands of threads and have low-level software manage resource allocation? Why scale out your applications and databases when you could be lazy and scale up?

The technology to create NUMA systems on the fly has existed for many years – remember Virtual Iron, RNA Networks, and ScaleMP? – and has perhaps culminated with TidalScale, founded back in March 2012 by Ike Nassi, Kleoni Ioannidou, and Michael Berman, which has been taking another run at the idea with its own twists on essentially the same story.

By the way, those other three software companies that created virtual NUMA servers out of commodity X86 machines are all gone.

Virtual Iron, founded in 2001, disappeared into the gaping maw of Oracle in May 2009. RNA Networks, founded in 2006, was eaten by Dell in June 2011, and that server maker messed around with it a bit and then we never heard about it again. And ScaleMP, founded in 2003, was still going pretty strong in the HPC sector with its eponymous NUMA hypervisor when The Next Platform was founded in 2015; ScaleMP was quietly acquired by SAP in June 2021. (There was no announcement of this as far as we are aware.)

Enter TidalScale

Nassi is the heavy hitter at TidalScale and is still chairman and chief technology officer.

After getting bachelor’s, master’s, and PhD degrees in computer science from Stony Brook University in New York, Nassi was an engineer at minicomputer innovator Digital Equipment, and in the 1980s was a vice president at supercomputing NUMA system innovator Encore Computer before joining Apple to help develop its MacOS operating system and languages. (Nassi is one of the creators of the Ada programming language and helped create the Dylan programming language for the Apple Newton handheld.) He founded mesh networking pioneer Firetide in 2001, and then moved on to be chief scientist at the SAP Research arm of the ERP software giant, serving from 2005 through 2011.

Nassi is also a long-time adjunct professor of computer science at the University of California Santa Cruz.

TidalScale has raised $43 million in two rounds of venture funding, and in 2016 brought in Gary Smerdon as chief executive officer. Smerdon is a longtime semiconductor executive from AMD who subsequently did stints at Marvell, Greenfield Networks (acquired by Cisco Systems), Tarari (acquired by LSI Logic, where he stayed in various executive roles for six years), Fusion-io, and Pavilion Data Systems.

We spoke with Smerdon when TidalScale 3.0 was launched, back when we were doing Next Platform TV during the height of the coronavirus pandemic. And to our chagrin, we did not give the company proper coverage of its TidalScale 4.0 release last year. But we are making up for that now with the TidalScale 4.1 release, which has just come out.

The company launched its initial HyperKernel and related system management stack for virtual NUMA servers in 2016 and followed up with a 2.0 release in 2017, but it was the TidalScale 3.0 product in September 2019 that improved the scalability of the NUMA machines that could be composed with HyperKernel, supporting a maximum of 64 TB of memory across a software-defined NUMA server and a maximum of 12 TB in any node in the NUMA cluster, out of the 128 TB that is addressable. That addressability ceiling is a limit of the Xeon SP server architecture from Intel, which at the moment is the only CPU that TidalScale supports.

There is nothing, in theory, that keeps TidalScale from supporting AMD Epyc, IBM Power, or Arm server CPU architectures with its HyperKernel. But in practice, a startup has to focus, and despite all of the competition Intel is facing, the Xeon SP architecture is still the dominant one in the datacenter.

In a TidalScale NUMA cluster, the server nodes do not have to have the same configuration in terms of compute or memory. With TidalScale 3.0, a maximum of 128 sockets was permitted across the NUMA cluster. Within these maximums, you can create NUMA clusters with a wide variety of server nodes in an asymmetrical fashion, or you can build them the way hardware-based NUMA system makers like HPE and IBM do, with every node in the cluster being the same.

With the TidalScale 4.0 release, all of these capacities were doubled up. So you could have a maximum of 24 TB for any given node, up to 256 sockets, and up to 128 TB of total memory across a software-defined NUMA cluster.
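
To make those ceilings concrete, here is a minimal sketch in Python (ours, not any actual TidalScale API or tool) that checks a proposed asymmetrical mix of nodes against the 4.0 limits of 24 TB per node, 256 sockets, and 128 TB of total memory:

    # Hypothetical sketch: validate a proposed software-defined NUMA cluster
    # against the TidalScale 4.0 ceilings cited above. The node list format and
    # the function are illustrative, not a real TidalScale interface.

    MAX_TB_PER_NODE = 24
    MAX_SOCKETS = 256
    MAX_TOTAL_TB = 128

    def check_cluster(nodes):
        """nodes: list of (sockets, memory_tb) tuples, one per physical server."""
        errors = []
        for i, (sockets, mem_tb) in enumerate(nodes):
            if mem_tb > MAX_TB_PER_NODE:
                errors.append(f"node {i}: {mem_tb} TB exceeds the {MAX_TB_PER_NODE} TB per-node limit")
        if sum(s for s, _ in nodes) > MAX_SOCKETS:
            errors.append(f"more than {MAX_SOCKETS} sockets in the cluster")
        if sum(m for _, m in nodes) > MAX_TOTAL_TB:
            errors.append(f"more than {MAX_TOTAL_TB} TB of total memory in the cluster")
        return errors

    # An asymmetrical mix: eight 2-socket/6 TB nodes plus four 4-socket/12 TB nodes
    # comes to 32 sockets and 96 TB, so this returns an empty list.
    print(check_cluster([(2, 6)] * 8 + [(4, 12)] * 4))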

Pushing Against The Tide, But Also Riding A Rising Wave

In a way, TidalScale is pushing against the hardware tide a little bit. And in several other ways of looking at it, there has never been a better time for an “ubervisor” or a “hyperkernel” or whatever you want to call the kind of hypervisor that Virtual Iron, RNA Networks, ScaleMP, and TidalScale have all individually created.

Moreover, some of the innovations inside of TidalScale’s HyperKernel make it unique compared to its predecessors, and so long as someone doesn’t buy the company and sit on it, there is a chance this software-based NUMA clustering technology could become more pervasive – particularly with PCI-Express, Ethernet, and InfiniBand networks offering high bandwidth and low latency.

First, let’s talk about the nature of compute and NUMA clustering in the datacenter today and how it militates against the need for NUMA clustering in a lot of cases.

For one thing, the few CPU designers left in the datacenter market have been implementing better and better NUMA hardware designs over the decades, and this really has involved adding more and faster coherent interconnects as smaller transistors allowed more things to be crammed onto the die. At this point, there is hardly any overhead at all in a two-socket NUMA system, and the operating systems that run on X86 iron – Linux and Windows Server – have long since been tuned up to take advantage of NUMA. Intel supports four-socket and eight-socket systems gluelessly – meaning without any additional external chipset.

With its upcoming “Sapphire Rapids” Xeon SPs, we expect at least 60 usable cores per socket and eight DDR5 memory controllers that will be able to support maybe 4 TB per socket with 256 GB memory sticks, possibly 8 TB per socket with 512 GB memory sticks. This is a very dense socket compared to the prior several generations of Xeon SPs. And Intel will support eight-way NUMA gluelessly, which should mean supporting up to 480 cores, 960 threads, and 64 TB gluelessly. That’s a pretty fat node by modern standards.
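
The back-of-the-envelope math behind that fat node, with our assumptions spelled out (60 usable cores per socket, two threads per core, and sixteen 512 GB DDR5 sticks per socket, which is two DIMMs per channel on eight controllers), looks like this:

    # Arithmetic for an eight-socket glueless "Sapphire Rapids" node, using the
    # assumptions in the text rather than any published Intel specification.
    sockets = 8
    cores_per_socket = 60          # our expectation for usable cores
    threads_per_core = 2           # Hyper-Threading
    dimms_per_socket = 16          # eight DDR5 controllers, two DIMMs per channel (assumed)
    dimm_gb = 512                  # the bigger memory stick scenario

    cores = sockets * cores_per_socket                        # 480 cores
    threads = cores * threads_per_core                        # 960 threads
    memory_tb = sockets * dimms_per_socket * dimm_gb / 1024   # 64 TB

    print(cores, threads, memory_tb)   # 480 960 64.0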

With its “Denali” Power10-based Power E1080 server, IBM can support 240 cores that do roughly 2X the work of an X86 core, with 1,920 threads (IBM has eight-way simultaneous multithreading on the Power10, as it did with Power8 and Power9 chips), across 16 sockets and up to 64 TB of physical memory across those sockets. IBM has up to 2 PB of virtual addressing on the Power10 architecture, so it can do more memory whenever it feels like it, and as far as we know, it doesn’t feel like it and is not going to make 512 GB differential DIMMs based on DDR4 memory; 256 GB DDIMMs are the max at the moment.

These hardware-based NUMA servers do not come cheap, which is why HPE and IBM still make them – often to support databases underneath massive, back office enterprise application stacks powered by relational databases. The SAP HANA in-memory database is driving a lot of NUMA iron sales these days.

But the expense and rigid configuration of hardware NUMA clusters is what presents an opportunity for TidalScale even as each socket is getting more powerful and a certain amount of NUMA is built into the chips. And so does an invention at the heart of the TidalScale HyperKernel called the wandering virtual CPU.

In hardware-based NUMA architectures, which provide a shared memory space that is carved up by a server virtualization hypervisor, virtual machines are configured with compute, memory, and I/O and are pinned to those resources. Any data outside of that VM configuration has to be paged into that VM, and this creates latency.

With the HyperKernel, Nassi and his team created a new kind of virtual CPU, one that is not a program running on a carve-up hypervisor, but rather a low-level computational abstraction that has a set of registers and pointers assigned to it. This is a very small abstraction – a page or two of memory that fits inside of an Ethernet packet. There are zillions of these virtual CPUs created as applications are running, and instead of moving data to a fat virtual machine, these virtual CPUs flit around the server nodes and across the networks linking the nodes in the software-defined NUMA cluster, moving to where the data is that they need to perform the next operation in an application.

This wandering vCPU architecture is described in detail in this whitepaper, but here is the crux of it: “Each vCPU in a TidalScale system moves to where it can access the data it needs, wherever that data physically sits. The larger the TidalScale system, the more physical CPUs and RAM are available for its wanderings, and the more efficient the wandering vCPU becomes compared to the other options for processing lots of data.”
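
To illustrate the idea, and only the idea, here is a toy sketch of what a migratable vCPU context might look like: a tiny, serializable bundle of register state that chases the guest-physical page it needs rather than pulling that page across the wire. Every class and function name here is ours, not the HyperKernel’s.

    import pickle

    # Toy illustration of the wandering vCPU concept: the processor context is a
    # small, serializable bundle that moves to the node owning a page, instead of
    # the page being copied to the node running the vCPU. Names are invented.

    class VCpuContext:
        def __init__(self, registers, instruction_pointer, stack_pointer):
            self.registers = registers                  # general-purpose register file
            self.instruction_pointer = instruction_pointer
            self.stack_pointer = stack_pointer

        def serialize(self):
            # Small enough, in principle, to ride inside a single Ethernet frame.
            return pickle.dumps(self.__dict__)

    def send_to_node(node, payload):
        # Stand-in for the real network transport between cluster nodes.
        print(f"shipping {len(payload)}-byte vCPU context to node {node}")

    def on_page_miss(vcpu, page, page_owner_table, current_node):
        """When a vCPU touches a page that lives elsewhere, ship the vCPU there."""
        target = page_owner_table[page]
        if target != current_node:
            send_to_node(target, vcpu.serialize())      # a few hundred bytes of state
            return target                               # execution resumes remotely
        return current_node

    # Example: guest-physical page 0x42 lives on node 3, so the vCPU hops there.
    ctx = VCpuContext(registers=[0] * 16, instruction_pointer=0x1000, stack_pointer=0x7FFF)
    print(on_page_miss(ctx, 0x42, {0x42: 3}, current_node=0))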

Efficiencies are gained, in part, with machine learning algorithms that study computation and data access patterns on running applications and databases, so that vCPUs can be aggregated and dispatched to data, much like a vector math unit can bundle up data and chew on it in parallel.
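
We do not know what TidalScale’s machine learning actually looks like under the hood, but the flavor of the decision the HyperKernel has to make can be caricatured in a few lines: given recent access patterns, is it cheaper to pull a page to the vCPU or to send the vCPU to the page? The costs and the policy below are invented for illustration.

    # Caricature of the placement decision, with invented costs and thresholds.
    VCPU_MIGRATION_COST_US = 5     # assumed: shipping ~1 KB of register state
    PAGE_PULL_COST_US = 20         # assumed: copying a 4 KB page plus coherence work

    def place(recent_faults_on_remote_node, recent_sharers_of_page):
        # If several vCPUs on different nodes keep touching this page, copying it
        # back and forth would thrash, so move the vCPU to the data instead.
        if recent_sharers_of_page > 1:
            return "migrate vCPU"
        # If this vCPU has been repeatedly faulting on pages held by that remote
        # node, one hop amortizes many future accesses; otherwise pull the page.
        if recent_faults_on_remote_node * PAGE_PULL_COST_US > VCPU_MIGRATION_COST_US:
            return "migrate vCPU"
        return "pull page"

    print(place(recent_faults_on_remote_node=8, recent_sharers_of_page=1))  # migrate vCPU
    print(place(recent_faults_on_remote_node=0, recent_sharers_of_page=1))  # pull page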

The net of this is that customers who need the shared memory space of a big NUMA server, to run a fat database management system or in-memory analytics, for instance, can build one from normal Xeon SP servers and Ethernet switching and save money over hardware-based NUMA systems.

“Customers have told us that we lower their overall cost by around 50 percent,” Smerdon tells The Next Platform. “And versus IBM Power Systems, I believe the all-in savings is much higher – probably on the order of 60 percent to 75 percent lower cost overall.”

We can think of other use cases, like the ones that ScaleMP was chasing a decade ago. In a lot of HPC clusters, there are hundreds or thousands of two-socket X86 nodes and then dozens of fat memory X86 nodes based on heftier NUMA iron for those parts of the workload that require more memory capacity to run well.

Imagine if these fat nodes could be configured on the fly, as needed, from commodity two-socket servers. Better still, you could make a cluster of only really fat memory nodes, or a mix of skinny and fat memory nodes, as needed. As workloads change, the NUMA cluster configuration could change.

Surviving Increasing Memory Instability

What is the most unreliable part of a server? Its processor? Its system software? Its network? Nope. It’s main memory. And as that main memory is getting denser and denser, it is getting easier for a cosmic ray or alpha particle to flip bits and cause all kinds of havoc.

DRAM is getting less and less reliable, says Smerdon, as the capacity on a DRAM chip increases, citing a figure of 5.5X lower reliability from an AMD study from 2017. At the same time, the amount of main memory in the server has gone up by a factor of 10X, and weirdly, because Intel had to tweak memory controllers to support Optane 3D XPoint main memory, one of the things it did was back off error correction bits so it could shoehorn Optane into the architecture, which resulted in a higher rate of runtime uncorrectable errors in memory for the Xeon SP chips. When you multiply that all out, says Smerdon, memory has a failure rate in a Xeon SP server that is about 100X higher than it was a decade ago.
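
Roughly reconstructing that arithmetic (the 5.5X and 10X figures are the ones cited above; the residual factor attributed to the relaxed error correction is our inference to reach the ~100X total, not a number Smerdon gave us):

    # Rough reconstruction of the failure-rate multiplication described above.
    dram_reliability_factor = 5.5     # less reliable per DRAM chip as density rises
    memory_capacity_factor = 10.0     # roughly 10X more DRAM per server than a decade ago
    combined = dram_reliability_factor * memory_capacity_factor      # 55X
    ecc_tradeoff_factor = 100.0 / combined                           # ~1.8X, inferred

    print(combined, round(ecc_tradeoff_factor, 2))   # 55.0 1.82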

And there are all kinds of horror stories relating to this, which cannot be easily mitigated by saying that these errors can be covered by the fault tolerance mechanisms inherent in modern distributed computing platforms. Without naming names, Smerdon says one hyperscaler had to replace 6,000 servers with over 40 PB of main memory because of high failure rates. (Small wonder, then, that the AMD Epyc architecture is eating market share among the hyperscalers and cloud builders.)

The good news is that DRAM correctable errors, power supply and fan failures, and network and storage loss of connection and device failures all have a high probability of occurring and they all provide some kind of warning by way of system telemetry that a crash is coming. And on a virtual NUMA cluster with many nodes, when such a warning occurs, you therefore have a chance to move workloads and data off that server to an adjacent machine before that crash happens. You fix the server, or replace it, plug it in, let the HyperKernel take over, and you are back in action.
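
In operational terms, what is being described is an evacuate-on-warning loop. Here is a hedged sketch of what that looks like; the telemetry fields, thresholds, and the evacuation call are placeholders of our own devising, not TidalGuard’s actual interface:

    import time

    CORRECTABLE_ERROR_THRESHOLD = 1000   # correctable DRAM errors per hour (assumed)

    def node_telemetry(node):
        """Placeholder: a real system would query BMC and EDAC error counters."""
        return {"correctable_errors_per_hour": 0, "fan_ok": True, "psu_ok": True}

    def looks_like_impending_failure(t):
        return (t["correctable_errors_per_hour"] > CORRECTABLE_ERROR_THRESHOLD
                or not t["fan_ok"] or not t["psu_ok"])

    def evacuate(node, cluster):
        """Placeholder: move vCPUs and memory pages off the suspect node."""
        print(f"draining {node} onto the remaining {len(cluster) - 1} nodes")

    def watch(cluster, interval_s=60):
        # Poll each node; when telemetry hints at a coming failure, drain it so
        # the server can be fixed or swapped and then rejoin the NUMA cluster.
        while True:
            for node in cluster:
                if looks_like_impending_failure(node_telemetry(node)):
                    evacuate(node, cluster)
            time.sleep(interval_s)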

That, in a nutshell, is what the TidalGuard feature of TidalScale 4.1, just released, does. Downtime is expensive, so avoiding it is important. Smerdon says that 85 percent of corporations need to have a minimum of “four nines” of availability for their mission critical systems, and that means having redundant and replicated systems. This is expensive, but worth it given that a third of the enterprises polled by TidalScale say that an hour of downtime costs $1 million or more, and even smaller firms say they lose at least $100,000 per hour that their systems are offline.

By using TidalGuard on the HyperKernel, the mean time to repair a system is about ten minutes, a factor of 10X to 1,000X improvement over other methods, and customers can add another “two nines” of availability to their infrastructure.
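
The nines arithmetic is worth spelling out, since it is what justifies the redundancy spend. Pairing it with the survey figures quoted above (our multiplication, their numbers):

    # Downtime-per-year arithmetic behind the "nines", paired with the survey's
    # $1 million per hour figure for the largest shops.
    MINUTES_PER_YEAR = 365 * 24 * 60

    def downtime_minutes_per_year(nines):
        return MINUTES_PER_YEAR * 10 ** (-nines)

    for n in (4, 5, 6):
        print(f"{n} nines: about {downtime_minutes_per_year(n):.1f} minutes of downtime per year")

    # Going from four nines (~52.6 minutes) to six nines (~0.5 minutes) saves
    # roughly 52 minutes a year, which at $1 million per hour is on the order of
    # $870,000 per year, before counting the cost of the redundant systems.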

And TidalGuard can also take on a problem we haven’t seen a lot of talk about, which is server wear leveling in the datacenter.

“You know you have to rotate your tires, and we are all familiar with the fact that flash memory has to balance out wear leveling or it doesn’t work,” explains Smerdon. “Now servers – their main memory, really – are wearing out at an ever-increasing rate, but there has been no monitoring of this. But we can monitor this in our clusters, and based on the usage health, we are able to estimate amongst a pool which servers have the most life left and prioritize usage on them first.”

And with this feature, you can start to predict when you will need a new fleet and how long it might last.
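
A minimal sketch of that fleet rotation idea, with an invented remaining-life score standing in for whatever TidalGuard actually measures:

    # Sketch of server "wear leveling": rank machines by an estimated remaining-life
    # score and put new work on the healthiest first. The scoring formula is an
    # invented stand-in for TidalGuard's real health telemetry.

    def remaining_life_score(server):
        # Fewer correctable errors and fewer powered-on hours imply more life left.
        return 1.0 / (1 + server["correctable_errors"] + 0.001 * server["power_on_hours"])

    fleet = [
        {"name": "node-a", "correctable_errors": 12, "power_on_hours": 30000},
        {"name": "node-b", "correctable_errors": 0,  "power_on_hours": 8000},
        {"name": "node-c", "correctable_errors": 3,  "power_on_hours": 22000},
    ]

    # node-b comes out on top, so it gets prioritized for new workloads, while
    # node-a is the first candidate for retirement when the fleet is refreshed.
    for server in sorted(fleet, key=remaining_life_score, reverse=True):
        print(server["name"], round(remaining_life_score(server), 4))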

Interestingly, reliability, uptime, and agility are the main reasons companies are deploying TidalScale today, and scalability and cost savings are secondary priorities.

If software-defined NUMA clustering finally does take off – and we think it could be an important tool in the composable infrastructure toolbox – then TidalScale could become an acquisition target. And given how long it has been in the field and that we have not heard of widespread deployments of its technology, we would bet that TidalScale would be eager to be picked up by a much larger IT supplier to help push adoption more broadly.

If Intel has not put some of those error correction bits back into the DRAM with the Sapphire Rapids Xeon SPs, it might be interested in acquiring TidalScale to help correct this problem. But that would mean admitting that it created a problem in the first place, so that seems unlikely.

Given that AMD is limited to machines with one and two sockets, it may be interested. The Arm collective could also use virtual NUMA servers, too, as could one of the cloud suppliers not wanting to buy distinct four-socket and eight-socket servers for their fleets. (Amazon Web Services has been playing around with TidalScale internally, and no doubt other hyperscalers and cloud builders have, too.) VMware, looking for a new angle on server virtualization, might be interested, and ditto for server makers like HPE, Dell, and Lenovo.


Comments

  1. ScaleMP was acquired by SAP two years ago.

    With this omission of fact and your glossing over the apparent disappearance of the most successful company in the field, you have effectively justified the rest of an incidental or even irrelevant technology diorama. CXL is going to drive SoftNUMA into OS layered products (shows my age and system leanings but I’m unabashed) or the kernel.

    That acquisition was motivated either by the needs of HANA or to quash the technology.

    All the other SoftNUMA acquisitions seem very clearly to have been about protecting hardware sales.

    Given that SoftNUMA remains a serious potential disruption to both classical scale up and K8s et.al. scale out high margin and consulting rich business units, I think that there’s a genuine argument against investing in another SoftNUMA deployment without contractual commitments including source escrow.

    This has been the dismayed conclusion my organization arrived at, and although not insignificant, our sensitivities aren’t comparable to those of the customers you need to make a difference.

    Intel has been taking an imo divisive and commercially perilous course deprecating >2S chipsets for the most recent generations. I’m privately convinced that this has been conducted in concert with software vendors and the switch to per core licensing. Your recent article on POWER10 per core system cost performance tells well shy of half the story. SoftNUMA is presently dead in the water without industry wide reconsideration of licensing. A genuine pity. Expect the technology to reappear both higher and lower in the stack, in future practical implementations.

  2. As a footnote to a major undercurrent of my earlier comment, I am convinced that anyone entering a SoftNUMA contender in the future needs to be equipped with real antitrust chops. And probably the commensurate congress dollars and a substantial budget for sponsoring academic research into the micro and macro economic effects of licensing.

  3. What a bunch of hype. Hyper-V has been NUMA aware for a decade. Yeah, hate MS, but they have bested all other OSes for being NUMA aware and high thread count scheduling for a decade. If you don’t understand how those are intertwined, you need to think a bit more. You also need to think about non-blocking atomic updates and how that interacts with distributed memory access.

    The big boys have been working this for thirty years.

    • I am well aware of being NUMA aware. But I am not talking about that. I am talking about software that creates the NUMA cluster, not the software that is embedded in the OS kernel or hypervisor that figures out how to cordon off workloads so they have locality in a NUMA region.

      You need to be less mean.

    • Enlighten us with your software stack, then, and how it does what Virtual Iron, RNA Networks, or ScaleMP did.

  4. Reading the whitepaper I have to wonder about whether special compilers or data design are involved to take advantage of this.

    Being able to send the registers/stack from one physical machine to another is interesting. Just pick up on the new machine and access the data needed on that machine. But how do you design some locality into the system to prevent a situation where the virtual CPU is hopping around between physical hardware every 4th instruction or so because the data needed for a tight block of code is distributed all over the place? This would slow down execution tremendously.

    Sure, a fair number of workloads likely exhibit locality. But not all do. Can static compiler/analysis tools even reason about this problem given that things can change dynamically at runtime? Years ago when dealing with NUMA architectures with large memory spaces I did encounter situations where even within a single physical box the costs of migrating a process across physical sockets to get to the memory it needed could strain a system in interesting ways. Here those costs involve hops across a network which are orders of magnitude slower.

    And then there is the question of stack frames. Does every machine have to have the same binaries installed with the same jump locations available so a process can get to the code it needs regardless of which machine it’s on? Is there a fixup task that needs to be performed to deal with this? Does stack frame layout randomization interfere with the ability to move cpu state around between machines and would the need to turn that off create security problems?

    I’m sure others have thought about all this. Would be interesting to know more about the details. Creating a uniform machine model with a huge memory and compute space that transcends the capabilities of a single machine yet seamlessly shares a single binary sounds fascinating.

    • We built Popcorn Linux in Academia. But we use DSM, which means we move memory instead of CPU. However, processes can migrate to other machines to scale out.

  5. Interesting effort. But it brought back a memory for me: At the beginning of our attempt decades ago to get our local management to step up to ccNUMA (cache-coherent NUMA) – you know which OS – our third line manager – whom we highly respected – said in so many words “Cool, I’d love to do it. But, instead bring me a solution that allows me to do the same thing over multiple clustered nodes.” Our answer was “Yes, of course, sir” but all the while we were thinking “It’s all about the relative latency, stupid!”. (With all due respect, Timothy.)

    Yes, it certainly can be done, just as this company is doing. But, just as you say, you want software to perceive this as being just one big NUMA (and I assume ccNUMA) with a single address space per process, with threads spanning the processors of multiple nodes. Yes, you can have software create the illusion of a single address space over multiple nodes. But there is one more aspect of ccNUMA that is critical to latency. It does not matter how many copies of a block of data there are across all of the processors; the cache coherency of those caches makes it look as though there is exactly one version of it. From the software’s point of view, there is only ever one block, one location for that data. EVER. Sure, software can create that last illusion as well, simply by creating exactly one instance of a data block over all of the non-NUMA nodes of the IO-linked (or non-cache coherent) cluster. If the data is changing, the data block is on one node at a time. (If not, a copy can be made and reside on multiple.) If data isn’t changing, this solution can work great, just like normal distributed computing. Change it frequently and all performance hell breaks loose.

    Again, from the application’s point of view, it can be made to work. And, yes, the performance of the links between systems has gotten to be remarkably fast, just as you have been reporting for years. But now picture your application wanting to rapidly change many data blocks, just as you would want to do transparently in a ccNUMA system. A block is not on your system, so it needs to be found, a global lock of sorts needs to be used to change ownership, software IO needs to be invoked to move the data, and then you get told of its availability, all functionally transparent to you. But now picture, during the time that this is occurring, it gets pulled away from you by one and then N others all competing in exactly the same way. ccNUMA gets into wars like this but resolves them rapidly.

    When we were looking at it, we were thinking DBMS as an example. It’s not just the data in the DB that is important here. It is all of the supporting structures as well, and they are all being accessed rapidly by scads of threads modifying those common structures. More power to this company, but as with distributed computing, there needs to be a design in the application that takes the extra latency into account. For the right usage, a win. Use it wrong, well, I sure hope I am wrong for the programmers using this system, for I don’t see this as a solution transparent for them as assumed for a ccNUMA system.

    • We built Popcorn Linux [1] in Academia using a DSM (we also have support for a distributed hypervisor on top of KVM that distributes vCPUs all over the infrastructure). If the application running atop is serial or exhibits little memory sharing between its threads, it works great. However, for multi-threaded software with lots of memory sharing, you’re screwed! There is an enormous slowdown, and it’s implemented using InfiniBand RDMA.

      [1] http://www.popcornlinux.org/

  6. Tidalscale was the first of these players to commercialize an inverse hypervisor. Ideas that did similar work in the memory field existed as software defined memory controller (ScaleMP since 1993), SW Distributed Shared Memory (Treadmarks, etc), HW DSM (HPE Superdome and Dragonhawk etc), and even a single system image cluster JVM published by Assaf Schuster and colleagues at Technion and IBM Haifa [https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.43.422] in 2000.

    I invested in TidalScale because commercialization requires working with existing workloads (imposing neither new consistency models as DSMs did nor excessive code optimization burden as NUMA and HW DSMs did), creating a reasonable control plane, productizing around a value proposition beyond out-of-the-box performance (which is hard), supporting customers, and creating and delivering to roadmaps, all of which TidalScale has done better than anyone else that went before them. And trust me, because I was on the other side of these technologies for a while (meaning not a purveyor or investor but rather a customer or potential acquirer).
