Success can be its own kind of punishment in this world.
Since the dawn of modern computing 130 years ago with tabulating machines derived from looms, there have always been issues of scale when it comes to compute and storage. While all modern businesses worry about their IT infrastructure and how dependent they are on it, there are special classes of systems at organizations that have intense computing and storage demands, and usually severe networking requirements as well. Of necessity, these organizations push the boundaries of what can be done simply because things need to be done.
They have no choice but to figure out some way to do something because one computer is not enough. And so we have myriad ways of making two or more computers look as if they were one. What follows is by no means meant to be an exhaustive list, but rather an illustrative one, and a reminder that no matter what you call it, it is always distributed computing, and it is not new so much as many twists on some very old ideas. You can avoid a whole lot of hype in this world by being rooted a little bit in history.
In the early days of corporate computing, systems were set up in regional and national datacenters, and instead of trying to run all of the work on one machine, the data and the transactions were all handled in a distributed fashion, with summary data being passed back up to corporate headquarters for further processing. This was processing distributed in time and space, not across memories and their attached compute elements. By the way, coupling systems together, whether loosely or tightly or somewhere in between, really should be looked at from the point of view of the memory and not the other way around, as we usually talk about it.
Way back in the day, in the dawn of corporate computing, the venerable System/360 from five decades ago had support for two processors working somewhat in concert, but making two processors sort of appear as one did not really work well until the Extended Architecture (XA) tweaks to the MVS operating system came in the 1970s. The latest System z14 mainframes today employ non-uniform memory access (NUMA) clustering to create a shared memory space across the ten z14 cores on a chip, with two threads per core, and across multiple sockets in a system. Some of those cores are used as I/O or security processors, some as generic compute processors, and the largest System z14 has 32 TB of main memory and a maximum of 170 cores that can do work together at one time, in unison, against that memory. Somewhere around half the compute capacity of this chip complex goes up the chimney to provide that single system image, which is the price paid for a programming model that is easier than deploying a cluster. In reality, most mainframe shops carve such a beast up into logical partitions and run multiple jobs on the machine. They may or may not share data across the partitions, and in a sense, this ends up being the equivalent of a rack of individual machines occupying the same frame. It is actually a main frame, as the name was originally intended.
Of course, many other types of systems had symmetric multiprocessing, or SMP (where each processor has simultaneous access on a memory bus to any chunk of memory), and in the late 1990s they adopted the NUMA approach, where some memory is local to a processor and other memories (tied to remote processors) are reachable from each processor with a certain amount of delay. NUMA is not quite as tight a coupling as SMP, but it does allow for systems to scale a shared memory complex to as many as 128 processors if server makers want to push it. Such machines tend to be very pricey indeed, often costing 5X more than a collection of two-socket boxes with the same aggregate compute capacity.
There is a very good reason why, in the late 1990s, supercomputers stopped being based on SMP and NUMA server nodes with federated and pretty tight clustering technologies to go beyond the usual NUMA limits. It wasn’t just convenient that Beowulf clustering and the Message Passing Interface (MPI), which manages work in the memories across very large scale machines, were invented. It was simply not affordable to do large scale simulation and modeling – which does not require the kind of tight coupling that a relational database doing transaction processing did at the time – in any other fashion than to move to MPI on Linux clusters. And it is ironic that some of the biggest malcontents about this clustering approach in the HPC arena flipped their adversarial bit, the whole industry got behind it in very quick fashion, and by the end of the 1990s, nearly all supercomputing was running Linux on scale-out clusters.
There are so many distributed computing frameworks, sometimes assisted by compute and networking hardware, that have been invented over the years that it is hard to keep track of them all. Long before there were databases and datastores like Google MapReduce and its open source Hadoop/HDFS clone or Memcached, Cassandra, CockroachDB, MongoDB, Couchbase, Redis, and what seems like a zillion others, Tandem Computers had created a distributed, fault tolerant database that competed well against IBM mainframe clusters and went on to run governments and stock brokerages and, among other things, Dell Computer. DEC was way ahead of the competition with its VAXcluster system clustering, which also extended to its Rdb database. Sharding data and parceling compute out to those shards seemed to be a shiny new idea when Google talked about it in 2005, but relational databases had been doing this across disk drives and controllers for decades. The ideas were implemented differently, to be sure, but the central idea – move the compute work to the data, not the other way around – is not a new idea at all. The implementation of that idea was clever and important, but the idea itself was not new, even though Google gets credit for it.
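To make the "move the compute to the data" idea concrete, here is a minimal single-process sketch in Python. It is an illustration only, not how Google, Tandem, or any relational database actually implements it: the function names (`shard_for`, `build_shards`, `local_sum`, `merge`), the use of CRC32 as the placement hash, and the sample records are all assumptions made up for this example. Each shard is reduced to a small partial result where it "lives," and only those partial results are combined.

```python
from collections import Counter
from zlib import crc32


def shard_for(key: str, num_shards: int) -> int:
    """Pick a shard using a stable hash so placement is repeatable (assumed scheme)."""
    return crc32(key.encode("utf-8")) % num_shards


def build_shards(records, num_shards):
    """Distribute (key, value) records across shards by key."""
    shards = [[] for _ in range(num_shards)]
    for key, value in records:
        shards[shard_for(key, num_shards)].append((key, value))
    return shards


def local_sum(shard):
    """Work done next to the data: reduce one shard to per-key totals."""
    totals = Counter()
    for key, value in shard:
        totals[key] += value
    return totals


def merge(partials):
    """Combine the small partial results; the raw records never move."""
    combined = Counter()
    for partial in partials:
        combined.update(partial)  # Counter.update adds counts together
    return combined


# Hypothetical sample data: (region, amount) pairs.
records = [("east", 5), ("west", 3), ("east", 2), ("north", 7), ("west", 1)]
shards = build_shards(records, num_shards=3)
partials = [local_sum(shard) for shard in shards]
totals = merge(partials)
print(dict(totals))
```

In a real cluster each `local_sum` call would run on the node that holds that shard, and only the compact per-key totals would cross the network, which is the whole point of the approach.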
We have wondered for years why the hyperscalers, who have workloads that scale to 10,000 to 50,000 nodes (with a reasonable level of concurrency) and storage systems that scale to exabytes, did not long ago deploy some of the technologies developed by the HPC community. But they are, as the demands of machine learning press upon them, learning. Almost all scale-out frameworks for machine learning are using some form of MPI, or a clone of it, to parcel out work across multiple fat nodes crammed with a few CPUs and a whole lot of GPUs. The machines that are being built using Power9 chips and NVLink have just about every kind of memory sharing possible happening all at once, which is neat. We look forward to seeing NVSwitch clusters of sixteen GPUs married to a pair of Power9 chips, all with shared and cache coherent memory across the CPUs and GPUs, hopefully by ISC18 and certainly by SC18. You can’t do this with a Xeon host, but there may be a possibility to get something similar with a combination of AMD’s Epyc processors and Radeon Instinct GPUs at some point in the future.
A lot of the clustering that goes on in the world is not, by our definition, distributed computing, but is rather more a massively parallel implementation of a two-tier or three-tier server stack with some routing at different levels of the compute. Having a rack of web servers sitting in front of a rack of application servers that are in turn sitting in front of a rack of database servers is not distributed computing as such, any more than running multiple partitions on a mainframe is. We get impressed by the large number of nodes that enterprises and even hyperscalers and cloud builders have for such work, but the computing is not organized and shared in the same way as that of a Hadoop cluster or NoSQL database cluster. It’s just many servers talking to many servers in parallel, but all of the work is distinct. These systems also present a management nightmare, to be sure, and can be as complex as a shared compute or storage cluster. Ditto for virtual machine clusters and even virtual SANs that sometimes run on them. These are still many computers, real or virtual, acting like many computers – not as one.
That brings us to the idea of the scale of distributed computing. Most companies have NUMA servers, so even their isolated systems have some modicum of distributed computing embedded in them. The NUMA in a two-socket server is so good these days that the overhead is nearly invisible – like a factor of 1.95 out of a possible factor of 2.00 – and therefore is utterly transparent. The typical big iron NUMA server has four or eight sockets these days, and maybe for the largest relational database engines in the enterprise, you are talking about 16 or maybe 32 sockets. They might cluster for high availability, and do different work on each half of the cluster, and in the case of the very largest IBM mainframes, they might use Parallel Sysplex database clustering to have a database run across two, four, or eight machines for very big jobs. (Think banks and telcos.) That’s it for the NUMA scaling.
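The two-socket figure quoted above can be turned into a scaling efficiency with a line of arithmetic. The sketch below is just that arithmetic in Python; the function name `scaling_efficiency` is an assumption for illustration, and the only numbers used are the 1.95 and 2.00 from the text.

```python
def scaling_efficiency(measured_speedup: float, sockets: int) -> float:
    """Fraction of ideal linear scaling achieved (1.0 would be perfect)."""
    if sockets < 1:
        raise ValueError("sockets must be at least 1")
    return measured_speedup / sockets


# Figures from the text: a factor of 1.95 out of a possible 2.00 on two sockets.
efficiency = scaling_efficiency(1.95, 2)
overhead = 1.0 - efficiency
print(f"efficiency: {efficiency:.1%}, NUMA overhead: {overhead:.1%}")
```

By that measure a two-socket NUMA server gives up only a few percent of its theoretical capacity to keep memory coherent, which is why the overhead is described as nearly invisible.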
For other workloads, such as risk management and data analytics clusters, enterprises have dozens to perhaps as many as a couple hundred server nodes all running against the same work at the same time. This is neither advanced nor extreme scale as far as we are concerned, but normal.
Going beyond that, with true distributed computing that spans many hundreds to several thousands of nodes, there are very few organizations that are truly operating at scale. It is very rare in the enterprise, and these days, such a distributed computing platform is usually a data lake running Hadoop that is more a glorified file system for unstructured data that feeds into other compute frameworks than it is a distributed real-time compute framework in its own right. There are exceptions, especially for those using SQL overlays atop Hadoop or big Java messaging platforms (based on Tibco or Platform Symphony or similar platforms) to do financial transactions or risk management at volume and speed.
But there is, in our view, a big gap between the normal scale of distributed computing in the enterprise and that being employed by the hyperscalers. There are very few organizations that span that gap. This could change, of course, as more companies want to chew on more data. But thus far, the scale of the problems and the nature of the data have been very different and therefore have required very different distributed computing architectures.
There is one interesting development in distributed computing, and it circles back to the days when workloads were distributed across space and time. You might call this hyperdistributed computing, if you wanted to coin a new phrase – and we don’t want to do that, actually, unless it is warranted. But here is the idea. With the buildout of its wireless network and the backend network that supports it and the applications that run on it, AT&T anticipates that it will be running what could be tens of thousands of OpenStack cloudlets out there on the edge, which will do as much computing as possible for that network before making requests up the line to its main datacenters. This hierarchical computing will be distributed locally and run serially up and down the line, so it is not, in the strict way that we are talking about it, distributed computing across the whole. But there is no saying that applications can’t – and won’t – run asymmetrically across the OpenStack cloudlets in 5G base stations, for instance, and the larger OpenStack clouds back in the datacenters. We will have to see what happens here. But there probably will not be shared memory spread up and down that line, and hence it will not really be distributed computing as we think about it.