Flexibility is a kind of strength. It is just more subtle than brute force.
If we had to identify one thing that plagues all systems, whether we are talking about transportation systems or food production networks or the datacenters of the millions of companies in the world, it is that the elements of those systems are static, which forces you to do capacity planning and to overprovision capacity, in its many shapes and sizes, ahead of time.
What if we didn’t have to do this anymore? What if a server were truly logical, malleable, almost infinitely configurable from the point of view of applications? Meaning that you are not just cutting something big into smaller pieces, as both server virtualization and containerization do, packing together distinct workloads that might otherwise have run on distinct machines to drive up utilization across the whole system, but rather composing even bigger machines on the fly when they are needed. What if this kind of composability were available for compute, networking, and storage?
This is the future we are, hopefully, heading towards – perhaps in the next next platform, or the next next next platform. But close enough down the road that we can see it on the horizon.
Back in the day, people didn’t talk about servers. Before the commercialization of the Internet and the distributed computing technologies it embodied, and before the emergence of something called information technology, people in the data processing industry talked about systems, and by that they largely meant a single, self-contained collection of computing, storage, and networking devices that worked collectively to perform one or more tasks. Perhaps running a database management system and processing transactions online during the day, and then doing batch runs of billing statements or analytical reports to run the business at night, when the orders were not coming in because people were sleeping. The term server is apt, and a server is generally understood to be part of a collective, although there are lots of small businesses where a single server can be the whole system for the company.
That metal server skin implies a certain self-contained implementation, with a certain amount and kind of capacity for compute, storage, and networking. There are ranges for each of these, depending on how many CPU sockets and I/O lanes it supports, but it is a relatively fixed amount. You can glue a bunch of these together with NUMA chipsets to create progressively larger systems, with more of everything, but there are limits to this, and it is also very, very expensive to create big shared memory systems. A lot of HPC, AI, and data storage workloads either create or chew on huge datasets that are so large they cannot be contained in a single server, so they are spread across dozens or hundreds or thousands of machines in a cluster, glued together in a fairly loose way by MPI or some other workload and memory sharing mechanism. Other workloads, like web serving and application serving, are embarrassingly parallel in nature and do not need such coupling at the compute level, but do at the storage or database access layers. But ultimately, IT shops are trying to figure out what metal-skinned machine of a certain capacity – or multiple such machines – are needed to run a particular workload.
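To make that loose coupling concrete, here is a minimal sketch, assuming Python and the mpi4py bindings, of how a dataset too big for any one server gets sharded across ranks in a cluster, with only a small reduction result crossing the interconnect; the record count and the arithmetic are purely illustrative.

```python
# Minimal sketch of loose coupling via MPI: each rank (one per node or core)
# chews on its own shard of a dataset too big for any single server, and only
# the tiny reduction result crosses the cluster interconnect.
# Assumes the mpi4py package; run with: mpiexec -n 8 python shard_sum.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # which worker this process is
size = comm.Get_size()   # how many workers are in the cluster

TOTAL_RECORDS = 10_000_000                # stand-in for a dataset far bigger than one node
shard = range(rank, TOTAL_RECORDS, size)  # this rank's slice of the records

local_sum = sum(i % 7 for i in shard)     # stand-in for real per-shard compute

# The only coupling point: a collective that combines the per-node partials.
global_sum = comm.allreduce(local_sum, op=MPI.SUM)

if rank == 0:
    print(f"result computed across {size} loosely coupled ranks: {global_sum}")
```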
In the future, you won’t care about any of that capacity math. You will have disaggregated (or, as the case may be, perhaps not) and composable infrastructure, so you can link the elements of a system together like a Tinker Toy, which is really just a kind of flow chart with its own elemental radix.
In this future, you don’t overprovision at the server level (but perhaps a little at the datacenter level), and you don’t even create preconfigured virtual machine instance types as the public clouds do today. We had thousands of possible physical server configurations from the OEMs and ODMs, and now we have thousands more instance types from the large cloud builders and heaven knows how many VM types among on-premises cloudy gear. This has been progress of a sort, since cloudy infrastructure helps drive up utilization of components to 60 percent or maybe even 70 percent of peak, leaving room for spikes but also letting something like 30 percent to 40 percent of the money used to buy the underlying hardware go up the chimney. Or the water chiller, we suppose, since datacenters don’t really have chimneys.
Instead, in this glorious future we are dreaming about, the center of the universe will be a DPU, which virtualizes network access to compute and storage engines. No, we didn’t say that backwards. And hanging off that DPU will be serial processors with fat but slow memory that we call CPUs. These serial CPU accelerators will have a mix of DDR and PMEM memory, likely put into CPU DIMM slots, and some of them may even have a mix of small capacity, fast HBM memory hanging off their pins to accelerate certain functions. These DPUs will have PCI-Express ports, and maybe even PCI-Express switch complexes embedded on them, which will allow the DPUs to connect directly to banks of those serial CPU accelerators as well as to banks of parallel GPU accelerators with their own HBM memory or to dataflow FPGA accelerators with their own DDR or HBM memory. The PCI-Express switch fabric will also link these devices to local storage, such as NVM-Express flash or Optane or other PCM persistent memory, and ideally, this storage will be accessible over the PCI-Express network locally and across the node interconnect fabric (even though the server node itself may disappear as a concept) as well. There may be giant banks of main memory, which some have called a memory server, running remotely from all of these machines, linked to the “nodes” by fast optical interconnects and the Gen-Z protocol or perhaps something like what IBM is doing with the Power10 processor and its memory area network.
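To picture that DPU-centric layout, here is a minimal sketch in Python of how such a node might be described as data, with the DPU as the hub and everything else hanging off it; the class, device, and link names are our own hypothetical shorthand, not any vendor’s inventory API.

```python
# Illustrative-only model of the DPU-centric node described above; the class
# names and device names are hypothetical, not a real management API.
from dataclasses import dataclass, field

@dataclass
class Device:
    name: str        # e.g. "cpu0", "gpu3", "flash7", "mempool0"
    kind: str        # "cpu", "gpu", "fpga", "nvme", "memory"
    memory_gb: int   # DDR/PMEM/HBM capacity local to the device
    link: str        # how it reaches the DPU: "pcie", "ethernet", "genz"

@dataclass
class DPUFabric:
    dpu: str
    devices: list = field(default_factory=list)

    def attach(self, dev: Device) -> None:
        # Everything is reached through the DPU, which virtualizes network
        # access to the compute and storage engines, not the other way around.
        self.devices.append(dev)

    def inventory(self, kind: str) -> list:
        return [d for d in self.devices if d.kind == kind]

fabric = DPUFabric(dpu="dpu0")
fabric.attach(Device("cpu0", "cpu", 512, "pcie"))          # DDR plus PMEM in DIMM slots
fabric.attach(Device("gpu0", "gpu", 80, "pcie"))           # HBM on package
fabric.attach(Device("fpga0", "fpga", 32, "pcie"))
fabric.attach(Device("flash0", "nvme", 4000, "pcie"))      # NVM-Express flash
fabric.attach(Device("mempool0", "memory", 8192, "genz"))  # remote memory server

print([d.name for d in fabric.inventory("gpu")])   # -> ['gpu0']
```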
Let’s talk about that remote memory idea for just a minute. When most of us think about disaggregated and composable infrastructure, we think about the physical disaggregation of CPUs, memory, storage, and I/O, and then the recomposition of them using something that sits in the space between firmware and middleware. While it would be useful to break CPU memory free from the CPUs, for a lot of esoteric reasons having to do with how proprietary a CPU feels about its memory (and, feeling the same way ourselves, we understand this tenacity), this is the hardest link to break and the one that will be broken last.
We have come to the realization that maybe all of these components can be crammed into a server just like before, but be composable over PCI-Express and high speed InfiniBand or Ethernet switching just the same. With NVM-Express flash, for instance, you can reach external flash with essentially the same latency as locally attached flash inside of a server. So who cares where it sits? The same will hold true for other devices and the various interconnects. So maybe you can create an application server with just CPUs and not much else, a storage server with some CPUs and flash, an accelerator server that has space for lots of GPUs or FPGAs, and then use composability software that is an amalgam of what GigaIO, Liqid, and TidalScale offer to create NUMA systems of any size and various types of accelerated nodes over the network. The logical “servers” thus created could very well have elements all within one physical server, or spread across many physical servers. This is just as logical – pardon the pun – as putting all CPUs and their memory in one rack, all GPUs in another rack, all FPGAs in their own rack, and all flash in yet another rack and then composing across four racks. As long as the PCI-Express fabric can link them all together logically, who cares where they are physically?
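None of the composability stacks mentioned above expose the same interface, so what follows is only a hedged sketch, built around a hypothetical compose_server() call, of what claiming pooled parts into one logical server might look like, regardless of which physical chassis those parts happen to live in.

```python
# Hypothetical composability layer: this does not mirror the actual GigaIO,
# Liqid, or TidalScale APIs; it only illustrates binding pooled devices into
# a logical server over a PCI-Express or Ethernet fabric.
from dataclasses import dataclass

@dataclass
class PooledDevice:
    dev_id: str
    kind: str        # "cpu", "gpu", "fpga", "nvme"
    chassis: str     # the physical box it sits in; invisible to the consumer
    in_use: bool = False

POOL = [
    PooledDevice("cpu-17", "cpu", "rack1-u04"),
    PooledDevice("cpu-18", "cpu", "rack1-u04"),
    PooledDevice("gpu-03", "gpu", "rack2-u10"),
    PooledDevice("gpu-04", "gpu", "rack2-u10"),
    PooledDevice("nvme-11", "nvme", "rack3-u02"),
]

def compose_server(cpus=0, gpus=0, nvme=0):
    """Claim free devices of each kind and bind them into one logical server."""
    want = {"cpu": cpus, "gpu": gpus, "nvme": nvme}
    claimed = []
    for kind, count in want.items():
        free = [d for d in POOL if d.kind == kind and not d.in_use][:count]
        if len(free) < count:
            raise RuntimeError(f"not enough free {kind} devices in the pool")
        for dev in free:
            dev.in_use = True   # a real fabric would program the switch here
            claimed.append(dev)
    return claimed

# An accelerated "node" spanning three physical chassis, seen as one machine.
node = compose_server(cpus=2, gpus=2, nvme=1)
print({d.dev_id: d.chassis for d in node})
```

Tearing the logical server back down would just mean flipping those in_use flags and reprogramming the fabric, which is the whole point: the binding is software, not sheet metal.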
As we think about this future of fluid infrastructure, there are a few principles that we see emerging.
General purpose is not dead; it’s just not limited to an X86 CPU anymore. There is probably nothing worse than buying something that is very expensive and only fit for one purpose. We believe strongly in any device that can be used for many different jobs, even if not optimally. System architects have to optimize over time, over money, and over workloads, and having the fastest, most dedicated ASIC may not matter as much as having a more general purpose device that can adapt to ever-changing – and quickly changing – situations. The more things that a device can do, the better. That’s why we like CPUs and FPGAs, and why we also like Nvidia’s “Ampere” A100 accelerator, which can do visualization, virtual desktops, machine learning inference, machine learning training, database acceleration, and HPC simulation and modeling all really well.
Software when you can, hardware when you must. Whenever possible, compute, networking, and storage functions should be done in software where reasonable performance can be attained. If you have to accelerate something, use the most generic and malleable compute engine or network ASIC that does the trick. This might mean sticking with a CPU or a GPU for certain functions, or even using an FPGA.
Wean yourself off any closed appliances in your datacenter. This is a corollary to the principle above. In every place you can, break control planes from application and data planes. Use storage or networking or virtualization or containerization layers that span as many architectures as possible. Don’t reward proprietary behavior, and don’t lock yourself in.
Make sure every compute and storage device is overprovisioned with networking. Don’t trap devices inside of whatever box you put them in. Supporting a wide variety of interconnects and protocols broadens the usefulness of a device and drives up its utilization. Stop skimping on networking and realize that networking should cost a quarter of the value of a complete system, because that networking is what will drive utilization from 25 percent or 30 percent to something closer to 60 percent or 70 percent; the back-of-the-envelope sketch after these principles shows how that math works out. You will buy less hardware to do more work in a shorter period of time if you get the interconnects right.
Start experimenting now, and give input to the composability vendors early. This is perhaps the most important thing. With disaggregation and composability still in their infancy, but maturing fast, now is the time to get a handle on this technology before your competitors do. There are benefits that can be gained right now, and you can help drive the system software stack toward the place you are trying to get to.
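Here is the back-of-the-envelope arithmetic behind that networking principle, sketched in Python; the budget, network share, and utilization figures are illustrative assumptions taken from the ranges above, not measured data.

```python
# Back-of-the-envelope: does spending a quarter of the budget on the network
# pay for itself by raising utilization? All numbers are illustrative.

def cost_per_useful_unit(total_spend, network_share, utilization):
    """Dollars paid per unit of capacity that is actually doing work."""
    compute_spend = total_spend * (1 - network_share)   # what buys the engines
    useful_capacity = compute_spend * utilization       # the part not idling
    return total_spend / useful_capacity

budget = 10_000_000  # hypothetical hardware budget in dollars

# Skimpy fabric: cheap interconnect, devices trapped in their boxes.
skimpy = cost_per_useful_unit(budget, network_share=0.10, utilization=0.275)

# Rich fabric: roughly a quarter of spend on networking, utilization near 65%.
rich = cost_per_useful_unit(budget, network_share=0.25, utilization=0.65)

print(f"cost per useful unit, skimpy fabric: {skimpy:.2f}")
print(f"cost per useful unit, rich fabric:   {rich:.2f}")
print(f"roughly {skimpy / rich:.1f}X more work per dollar with the rich fabric")
```

Even with a quarter of the money going to switches and adapters, the rich fabric delivers roughly twice the useful work per dollar under these assumptions, which is the point of the principle.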
Flash is slow as shit so it doesn’t matter, put it anywhere cause you’re gonna wait for IO.
DRAM and ReRAM memory is local to the CPU, and increasingly on-package with HBM, because of wire latency and the very very very high power consumption of long interconnects. If your fancy pants distributed memory takes longer than 20ns and 20pJ/bit for a random read, you can just sod off.
“Flash is slow”
I feel that you are living in the past. The new bottleneck is memory bandwidth.
Achieving 11M IOPS & 66 GB/s IO on a Single ThreadRipper Workstation
https://tanelpoder.com/posts/11m-iops-with-10-ssds-on-amd-threadripper-pro-workstation/
I could not agree more. So many components are fast and relatively cheap. The trick will be doing things that mask those different memory latencies. It’s a NUMA-ish problem, and we can solve it with a mix of very fast, medium fast, and not so fast memory. Just like we have done with the cache hierarchy inside CPUs.
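To put a rough number on how a hierarchy masks the slower tiers, here is the standard average-access-time arithmetic sketched in Python; the latencies and hit rates are illustrative guesses for a hypothetical tier stack, not benchmarks of any particular system.

```python
# Average access time across a hypothetical memory tier stack: if the fast
# tiers catch most of the accesses, the slow tiers barely show in the average.
# All latencies and hit rates below are illustrative, not measured.

tiers = [
    # (tier,                 latency_ns, fraction of accesses served here)
    ("HBM on package",              80,  0.70),
    ("local DDR and PMEM",         150,  0.25),
    ("fabric-attached memory",    1000,  0.04),
    ("NVM-Express flash",        20000,  0.01),
]

average_ns = sum(latency * share for _, latency, share in tiers)
print(f"effective access latency: {average_ns:.0f} ns")
# -> 80*0.70 + 150*0.25 + 1000*0.04 + 20000*0.01 = 333.5 ns
```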