The Intertwining Of Memory And Performance Of HPE’s Machine
January 11, 2016 Mark Funk
In the previous article on Hewlett Packard Enterprise’s future system, called The Machine, we talked about the organization of its basic building block, a node of compute and storage as seen again below. Each node had processors, volatile memory, non-volatile memory, and both direct memory linkage and traditional communications interconnects. Both types of memory are byte addressable from the processor’s point of view, which is very handy.
We also observed that, if they were equal in speed, the inclusion of both volatile and non-volatile memory would seem redundant. If accessed at the same speed, does it really matter whether data otherwise acceptable in volatile memory is instead held temporarily in non-volatile memory? In fact, given an otherwise volatile block of data, the energy savings possible from non-volatile memory would have us leaning toward the non-volatile option.
But there is a difference in speed today between volatile DRAM and the various forms of even processor-attached non-volatile memory. That is what we will talk about next, along with how it relates to your perception of The Machine.
Persistent Memory’s Access Latency
And now for a considerable bit of both fact and conjecture.
Even persistent memory is fast, yes, but all performance is relative. In the ideal, all memory, no matter its location and no matter whether it is volatile or non-volatile, is accessible in the same period of time. But reads from and writes to memory take real, and considerable, time; this latency is not a constant, as we will see here and in a subsequent section, and that difference can matter.
What we don’t yet know about The Machine is just how long it takes to
- Read a data block out of even a local node’s persistent memory.
- Return a changed data block to a local node’s persistent memory (and know that it is complete).
That won’t be known with certainty until Hewlett Packard Enterprise locks down both the technology used to support persistent memory and the design of the chips providing connections to it.
Setting persistent memory aside for the moment, we can make a pretty good guess about the access latency of locally attached DRAM; we can assume that The Machine will be using the fastest local buses and the most recent memory and buffering technologies available at the time. Even so, with a fast processor, such DRAM block reads will take on the order of hundreds of processor cycles. Armed with such knowledge, what is the relative latency of a similar access to and from the persistent memory? Is it nearly as fast, or is it considerably slower? If it is just as fast, why bother supporting anything else?
We know about the relative latencies of a number of persistent memory technologies. Relatively speaking, their speed is not too shabby – well faster than flash – but they will likely remain noticeably slower than DRAM. HPE says that it would like some notion of an ideal persistent memory, but adds that today it can get fast or persistent, but not both.
Go back and take a look at the block diagram of a node. If implemented in that manner, what for DRAM is just a point-to-point link to off-chip buffering is for persistent memory an access by way of both a local fabric switch and media controller chips. In fairness, such chips will likely offer buffering for reads, but the writes – unless the media controllers have on-chip NVRAM in their own right – need to make their way out to the persistent memory proper. The point is this: not only may the persistent memory itself be slower, but the multiple chip crossings (and, later, node crossings) add still more latency for both reads and writes. Although we have intentionally not yet touched on it, accessing memory in other nodes adds more latency yet.
Still, given devices like Intel and Micron Technology’s 3D XPoint memory, we should be able to get a pretty good sense of what is reasonably possible when The Machine is generally available. The engineer in me counsels against suggesting anything definite about this latency; until we learn more, we can just assume some considerably higher relative latency. Hints from elsewhere suggest that even today we can expect persistent memory to be anywhere from a few times slower to better than an order of magnitude slower. If true, that is enough of a difference that it needs to be considered; a few tens of percent slower could largely be ignored.
Although such closely comparable latencies would be considered ideal, we like to believe that for HPE’s own performance studies, its emulators – or even its prototypes using DRAM – are adjusting these latencies to help determine what latencies are acceptable and what assists might be needed. Inasmuch as, at this writing, The Machine’s persistent memory is being emulated by DRAM, the persistent memory’s speed remains an unknown.
Whatever the actual additional latency, the existence of cache has a way of hiding it. Does the speed of a memory access – volatile or non-volatile – matter if the data sourced from there is always found in the cache? Still, with persistent memory having multiple times slower latency, the overall performance difference between the two memory types will be noticeable with even a moderate cache miss rate. If the persistent memory is indeed multiple times slower, and if the cache miss-induced access rate were the same for both memory types (i.e., if the cache misses were accessing the persistent memory as often as the DRAM), suffice it for the moment to say that The Machine’s processors would be losing a lot of their capacity just waiting for cache fills to complete. (Gaining back that capacity, even with today’s faster DRAM, is the reason that simultaneous multithreading and hyper-threading exist in today’s processors.)
Let’s pause for a moment and do some math to get a sense of the difference. We’re going to look at it with a few numerical examples, and from there get a better sense of why memory speed matters to system performance. Assume – sans cache misses – that a processor is executing one instruction per processor cycle. Most modern processors can do better than this, but let’s call this normal with 1 cycle per instruction, or 1 CPI. Keeping it simple, let’s next assume
- a cache miss to memory averaging one miss every 400 instructions.
- a DRAM memory access takes 400 cycles.
So, 400 instructions would take 400 cycles, followed by 400 cycles to access memory one time, totaling 800 cycles;
- 800 cycles / 400 instructions = 2 CPI.
Call that our base performance. That is on the fast side, by the way. Now let’s move all of those cache misses to be sourced by a memory taking five times longer to access, call it 2,000 (5 * 400) cycles. So we now have 400 cycles executing instructions as before, plus 2,000 cycles per memory access;
- 2400 cycles / 400 instructions = 6 CPI.
Three times slower. This was just an example to get a point across, but another point is that not all of the memory accesses will be from the slower memory. We will let you do the math for this toy example, but if 10 percent of the cache misses were sourced from the slower memory (so 90 percent from the fast memory), the CPI becomes 2.4. Clearly enough, a still higher latency to the slower memory just aggravates this effect.
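The arithmetic above can be captured in a tiny model. This is our own back-of-the-envelope sketch, not anything from HPE; the function and its default cycle counts are simply the round numbers used in the example.

```python
# A toy model of the cycles-per-instruction (CPI) arithmetic above.
# Defaults match the article's round numbers: one instruction per cycle
# absent misses, one cache miss per 400 instructions, 400-cycle DRAM
# fills, 2,000-cycle persistent-memory fills (five times slower).

def effective_cpi(instructions_per_miss=400,
                  dram_cycles=400,
                  slow_cycles=2000,
                  fraction_slow=0.0):
    """CPI when a given fraction of cache misses hit the slower memory."""
    # Average stall cycles per miss, blended across the two memories.
    avg_miss_cycles = ((1 - fraction_slow) * dram_cycles
                       + fraction_slow * slow_cycles)
    # One cycle per instruction, plus the amortized miss penalty.
    return 1.0 + avg_miss_cycles / instructions_per_miss

print(round(effective_cpi(fraction_slow=0.0), 2))  # all misses to DRAM: 2.0
print(round(effective_cpi(fraction_slow=1.0), 2))  # all misses slow: 6.0
print(round(effective_cpi(fraction_slow=0.1), 2))  # 10 percent slow: 2.4
```

The last line reproduces the 2.4 CPI figure quoted above for a 90/10 split of misses between fast and slow memory.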
But that’s only if the access rate of persistent memory is on a par with that of locally attached DRAM. It needn’t be (we will get to that shortly) and there are some technical tricks that the designers can apply to help offset such a difference.
For example, even on today’s processors, it is typical that a processor core does not just simply wait in the event of a cache miss; it often goes on executing to find the next cache miss, and the ones after that. Supporting this, the processors support a number of queues, which have the purpose of managing multiple concurrently executing cache misses. Additionally, software can hint to the hardware that some number of these queues should be used to initiate cache fills, well in advance of the point where the data blocks are really needed. And as still more magic, this same cache management hardware supports a notion of streams wherein multiple contiguous block accesses can be initiated, filling multiple cache lines as though with a single operation.
Cool, right? But what is the point? Recall that we are talking about persistent storage here. There are analogs found in database management systems accessing the much slower contents of disk drives. Although a database manager can wait on individual records to be read, it is also likely that it will determine ahead of time which pages it needs and then initiate multiple concurrent reads from disk. It will also guess that contiguous pages will be needed and kick those reads off as well. Here, with I/O operations, the tasks, not the processors, are waiting for the data to arrive. The same sort of things would seem possible with blocks of this persistent memory, to buy back some of the otherwise lost processor capacity.
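To see how much capacity such overlapped misses and prefetches could buy back, here is another toy model of our own devising. It idealizes away bandwidth and queue-depth limits and simply assumes the hardware keeps a given number of misses in flight; none of the numbers come from HPE.

```python
# A toy model (ours, not HPE's) of how overlapping cache misses buys
# back processor capacity. With N misses kept in flight concurrently,
# the stall exposed to the processor per miss is roughly 1/N of the
# full latency (idealized: no bandwidth or queue-depth limits).

def cpi_with_overlap(instructions_per_miss=400,
                     miss_cycles=2000,
                     misses_in_flight=1):
    """CPI when up to `misses_in_flight` cache fills proceed concurrently."""
    exposed_stall = miss_cycles / misses_in_flight
    return 1.0 + exposed_stall / instructions_per_miss

for n in (1, 2, 4):
    print(n, cpi_with_overlap(misses_in_flight=n))
# 1 miss in flight: 6.0 CPI; 2 in flight: 3.5 CPI; 4 in flight: 2.25 CPI
```

Even under these generous assumptions, overlap narrows but does not erase the gap relative to the 2.0 CPI DRAM-only base case.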
As a quick side observation: Some persistent memory is claimed to have higher density than DRAM – on the order of 10X. Remarkable. Unbelievable? Taken to its logical extreme, if we can hang 1 TB of DRAM off of a single processor chip, it would seem to follow that we can hang about 10 TB of persistent memory in the same board real estate. And, based on the nodal mockup of The Machine shown earlier, it appears that the board is allocating considerably more real estate per node for persistent memory than for DRAM DIMMs. Given 1 TB of DRAM per node, and counting DIMM slots, a nodal persistent memory maximum of about 30 TB seems plausible. From there, around 30 nodes would yield roughly 1 PB of persistent memory, and that could take up less than one rack of space.
That is our guess of the possible. From the Keith Packard presentation relating to HPE’s initial offering for The Machine:
- “Each node has a small amount of memory in it (4Tbytes), and some local memory to run a Linux kernel out of.”
- “First system is an aggregate of 80 of these nodes.”
- “… for a grand total of 320 Terabytes of Fabric-Attached memory.”
Still, this is remarkable.
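The capacity figures above, both our guess and the quoted ones, reduce to simple multiplication; here it is spelled out. The 10X density and the roughly 3X board-space ratios are our assumptions read off the mockup, not HPE specifications.

```python
# Quick arithmetic behind the capacity figures above. The density and
# board-space ratios are our assumptions, not HPE specifications.

TB = 1  # work in units of terabytes

dram_per_node = 1 * TB           # assumed DRAM complement per node
density_ratio = 10               # claimed persistent-vs-DRAM density
board_space_ratio = 3            # our guess from counting DIMM slots
persistent_per_node = dram_per_node * density_ratio * board_space_ratio

print(persistent_per_node)       # 30 TB of persistent memory per node
print(30 * persistent_per_node)  # 900 TB from 30 nodes, roughly 1 PB

# The quoted first-system figures: 80 nodes at 4 TB each.
print(80 * 4)                    # 320 TB of fabric-attached memory
```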
A Node’s Local / Private Volatile Memory
Let’s return to talking only about the topology of a single node for the moment. The DRAM of such a node is being called local and private memory, as we saw in the following topology diagram. As we spoke of in the previous section, part of the reason for its existence is its speed; it remains considerably faster than the persistent memory.
Before going on to consider the notion of DRAM as private memory, let’s momentarily segue from the previous section.
As with data in a processor’s cache, performance on today’s systems is best when we repeatedly access data residing in memory as opposed to repeatedly reading it from disk. Because of the huge performance difference, since practically the beginning of time we have read data off of the disks and into the more rapidly accessible DRAM and then attempted to keep it there. No processor proper reads directly from – and then waits on – a spinning disk.
So what is the corresponding analog for The Machine? There may be a couple of them.
- Cache is to memory as memory is to disk: We had earlier spoken about preloading the processor’s cache from persistent memory; optimized correctly, data so loaded can remain in the cache for quite a while. Here, although the software drove such cache fills, subsequent software continues to access the data buffers as though they were still in the persistent memory. Recall that the cache lines are tagged with the real address of the original persistent memory location; cached results are later transparently returned there.
- DRAM is a cache for persistent data: The other analog suggests that the DRAM remains a functional cache for persistent storage, here with, on The Machine, the “persistent storage” being the not-really-so-slow persistent memory. If what your program is working with for an extended period of time is transient and liable to not remain in the processor’s cache, it may be best to copy it to the DRAM. Here, though, it is possible to copy what is needed, not complete pages as is typically done with disk or even Flash drives.
Focusing on the latter approach: Have you ever really thought about what it means to copy data from one location in memory – here persistent memory – to another – here the DRAM? On many processors, it is done by way of the cache; we would be doing cache fills again. To do a data copy operation, cache fills of the source-side blocks are followed by copying that data into different cache lines representing the target-side blocks, which are later flushed from the cache into the DRAM. The processor is still waiting on the cache fills, often from both the source AND target sides. So, let’s not? We don’t want the processor to wait; we just want the data copied – streamed there as data blocks, if you like – reporting back when complete.
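The contrast between copying through the cache and streaming the copy can be made concrete with one more toy cost model. Every cycle count here is invented for illustration; the point is only the shape of the comparison, with the processor stalling per block in one case and merely initiating the transfer in the other.

```python
# Contrast a copy routed through the processor's cache with a streamed
# copy the processor merely initiates. All cycle counts are made up.

def copy_through_cache(blocks, fill_cycles=2000, store_cycles=400):
    """Processor stall cycles: fill each source block into the cache,
    then push it out through target-side cache lines to DRAM."""
    return blocks * (fill_cycles + store_cycles)

def streamed_copy(blocks, setup_cycles=500):
    """Processor stall cycles when the data is streamed as blocks and
    the processor only initiates the copy and later checks completion."""
    return setup_cycles  # the transfer proceeds without the CPU waiting

blocks = 64
print(copy_through_cache(blocks))  # 153600 processor stall cycles
print(streamed_copy(blocks))       # 500 processor stall cycles
```

Under these made-up numbers, the streamed copy costs the processor orders of magnitude fewer stall cycles, and the gap grows linearly with the amount of data moved.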
Is The Machine doing this? The need for such a capability falls out naturally from the basic topology being put forth. The advantage of doing this becomes even more clear as we consider accessing persistent memory – we will start calling it fabric memory shortly – off other nodes. Still, here is what we have been told by Paolo Faraboschi at Hewlett Packard Labs:
“Longer term, we are also contemplating creating some form of bypass from the fabric memory to the local memory so that we can effectively achieve a fast transfer for short messages.”
And this from the Keith Packard presentation:
“LPMEM library aids in assuring that the cached data is actually in persistent memory.” “It also has accelerated APIs for moving data. One of the problems with moving large amounts of data in a caching environment is that when you do a large memory copy or clear, you end up using a lot of your cache bandwidth (and contents) for that. The LPMEM library provides APIs to accelerate that by bypassing the cache when you are doing data operations that are larger than the cache. ”
Impressive, and if we may opine, certainly on the right track.
With all that as background, we return to the notion of private DRAM memory. We have intentionally not yet dug into the fact that The Machine really is a multi-node system, each new node incrementing the amount of both compute capacity and persistent memory of the much larger system. But where persistent (fabric) memory is accessible across nodal boundaries, the private (local) DRAM memory is not. The DRAM may be faster, but only the processor cores residing on the same node as the DRAM can access this private memory.
So a few resulting implications are:
- The threads of processes (of OSes) do not easily migrate off of a node. Threads execute only on the processor cores of a single node, the private memory of which contains much of the thread’s state. If you are familiar with NUMA-based topologies, you know that a thread could execute on any core of any node; it might not be preferred if that means having to frequently access remote memory, but it is possible. But, with The Machine, a thread cannot access its own memory in a node’s DRAM if that thread had been assigned to a core of another node.
- Similarly, OSes, as typically implemented today, assume both full cache coherency and that they can access all of their own memory. Having such an OS reside on multiple nodes would be a problem. In effect, the scope of an OS tends to be a single node’s processors and volatile memory. (Nonetheless, as will be seen soon, an OS can potentially access anywhere within The Machine’s persistent memory.)
- An OS instance – a partition – can be moved from one node to another, but it needs to take the contents of all of its DRAM-based memory with it.
- Processes and their threads on different nodes do require a means of communicating. Persistent memory can be used, as we will see, but more traditional inter-node communications capabilities, such as Ethernet, are also planned for The Machine.
In effect, sans persistent memory, The Machine is topologically like a distributed memory cluster. And the folks at HPE would agree with this perception. As a case in point, consider this quote from the Keith Packard presentation:
“.. and then we have a huge number of configuration and management utilities that are within the cluster that manage the cluster as a single administrative domain. So you can take your usual HPC or your distributed/cloud computing context and move that into this ecosystem ”
At some level, The Machine is a set of compute nodes with near equal, rapid access to a lot of persistent memory. But, as you will see in the next article, it is also a tightly coupled set of nodes, capable of efficiently sharing data as well.