In the previous article, we left off with the basic storage model having its objects first existing as changed in the processor’s cache, then being aged into volatile DRAM memory, often with changes first logged synchronously into I/O-based persistent storage, and with the object’s changes proper later copied from volatile memory into persistent storage. That has been the model for what seems like forever.
With variations, that can be the storage model for Hewlett-Packard Enterprise’s The Machine as well. Since The Machine has a separate class of volatile DRAM memory along with rapidly-accessible, byte-addressable persistent memory accessible globally, the program model could also arrange for doing much of its processing in cache and DRAM, only later forcing the changes into persistent memory.
So now consider how long it takes to copy log records and then object data from DRAM into persistent memory as opposed to copying the same out to I/O-based disk drives. Simply saying that it is faster is a gross understatement. Still, given this speed difference, is this previously outlined storage model the right model for a system having such persistent memory? Transaction-synchronous log record writes followed by asynchronous data writes are done that way partly because of the speed of that persistent storage.
That need not be the storage model for The Machine; DRAM need not be involved in holding the object or the log records at any time. The volatile DRAM space should be used to handle all of the normally transient data, but the persistent objects proper and associated log records, both normally residing in persistent memory, can be byte-addressed and worked on directly. Well, that is, directly by way of the processor’s volatile cache.
Persistent memory, though, is not some kind of a magic bullet as it relates to the ACID property requirements of transactions. It is typically not enough just to dump changes into persistent memory from the cache. The basic issue with ACID is that any transaction can require multiple changes to an object to fully execute, and a power failure can occur between any of them. Even if the transaction’s changes had all successfully been made in the cache and most of them had succeeded in being written to persistent memory, a failure at that point still leaves the object in an inconsistent state. Keep in mind that today’s volatile-memory-based program model would have considered the object’s changes complete at the point where the changes were still residing in the cache. It follows that at least the essentials of object recovery based on log records described in the previous article are still required.
Assuming such a recovery log with records describing for each transaction the object’s state and changes, does the processing of a persistent memory object’s transaction need to look just like that of an object on disk? Maybe not. Consider, is it reasonable (read that as would transactions be sufficiently fast) for a transaction’s processing AND persistence to be completely synchronous, as in the following?
- An exclusive lock is applied to the object.
- A set of log entries is built describing each of the changes to be made. The transaction is finally marked in the log with a record saying “committed.” With these transaction descriptions logged and forced to reside in persistent memory, a post-“commit” power failure would then allow recovery code to (re)build the object. If, though, a failure had occurred prior to successfully forcing this “committed” entry into persistent memory, recovery after restarting would act as though the transaction had never started.
- The object’s changes are subsequently made, first in the processor’s cache, then each forced from there out to the object’s location(s) in persistent memory. (This might mean forcing out as few as one cache line.) Once all of the object’s data blocks reside in persistent memory, the changed object is – at least physically – now visible as being in a consistent state.
- Once all of these data blocks are known to reside in persistent memory, and still under the Exclusive Lock, there is now no need for the transaction’s log entries. The transaction is truly complete. These log entries can be trimmed. (Upon starting the next transaction, it can be quickly known that no recovery is needed.)
- The lock is freed. After this point, the changed object is both physically and logically accessible to other transactions.
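The five steps above can be sketched in a few lines. This is only a toy model under stated assumptions: plain Python dictionaries and lists stand in for persistent memory and the recovery log, a dictionary assignment stands in for forcing a cache line out, and the function name `run_sync_transaction` and the record layout are hypothetical, not part of any HPE interface.

```python
import threading

def run_sync_transaction(pmem, log, lock, txn_id, changes):
    """Apply `changes` to `pmem` fully synchronously, under an exclusive lock."""
    with lock:                                            # step 1: exclusive lock
        for addr, new_value in changes.items():           # step 2: describe each change
            log.append(("redo", txn_id, addr, new_value))
        log.append(("commit", txn_id))                    # commit record is logged first
        cache = dict(changes)                             # step 3: change made in "cache"
        for addr, new_value in cache.items():
            pmem[addr] = new_value                        # each block forced to pmem
        # step 4: all blocks are durable, so this transaction's records can be trimmed
        log[:] = [rec for rec in log if rec[1] != txn_id]
    # step 5: the lock is freed on leaving the with-block

pmem = {"balance_a": 100, "balance_b": 0}
log = []
run_sync_transaction(pmem, log, threading.Lock(), 1,
                     {"balance_a": 70, "balance_b": 30})
```

Everything, including the log trim, happens before the lock is released, which is what makes this variant simple and also what makes one wonder whether it is fast enough.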
Seems straightforward enough, certainly much faster – and much more reasonable – than attempting the same approach with disk drives as the persistent storage. But is even this fast enough? Can the process of forcing the object out of the cache and into persistent memory be deferred (that is, be handled asynchronously), perhaps allowing such persistent object methods to complete faster?
A Quirk in The Machine?
Before we go there, I need to (re)observe a bit of a quirk in this context about The Machine. Up to now, in this article, I’ve tried to speak generically about persistent memory, allowing the Program Model to apply elsewhere (i.e. on HPE’s near-future persistent memory-based competitors) as well.
The Machine’s Quirk. Cache coherency. You will recall in the disk-based storage model that, even if the changed object’s pages are not immediately forced out to disk, the changed object is still visible in volatile DRAM – once the locks are freed – to other transactions, even if those transactions execute on other processors. This is true even if all or parts of that changed object happen to still reside in the processor cache; within a single cache-coherent SMP, any processor (or attached I/O device) can still see the changes. Said differently, the changed object is globally visible by any processor within a cache-coherent SMP even if that change is still only in volatile memory (i.e., cache or DRAM).
On The Machine, though, this cache coherency is scoped to the processors of individual nodes; a change in a Processor A’s cache is visible to a Processor B if Processor B resides in the same node, but a Processor C, residing in another node, is incapable of seeing that cached change. Processor C can see any persistent memory proper, but not necessarily all of the cached data blocks holding changed data from – and so destined to – that persistent memory.
In The Machine, a changed object still residing in cache is visible to processors on the same node. For processors throughout The Machine (that is, on other nodes) to see the changed object, that change must reside in persistent memory. Again, the changed object becomes visible to off-node processors after the change successfully makes its way into persistent memory. Even though a transaction on an object is considered to be complete, in The Machine no other node’s processors can see the transaction’s results until the changes are in persistent memory.
So, yes, on The Machine, the actual write-back of changed objects to persistent memory can be handled asynchronously (allowing the object to stay in cache indefinitely) – say by later having another thread force the object out of the cache – if that object is only scoped to threads limited to a single node. It follows that only synchronous write-backs would be allowed if a persistent object is to be shared by processors residing across The Machine’s node boundaries.
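To make the visibility rule concrete, here is a minimal sketch, assuming a deliberately simplified model: each node has one coherent cache shared by its processors, and persistent memory is a single dictionary reachable from every node. The `Node` class and its methods are hypothetical names for illustration, and the model ignores the possibility of another node holding its own stale cached copy.

```python
class Node:
    """One node whose processors share a single coherent cache (toy model)."""

    def __init__(self, pmem):
        self.cache = {}      # changes visible only to processors on this node
        self.pmem = pmem     # persistent memory, shared by every node

    def store(self, addr, value):
        self.cache[addr] = value                   # the change lands in the local cache

    def load(self, addr):
        # A same-node reader sees the cached change; any other reader sees pmem
        return self.cache.get(addr, self.pmem.get(addr))

    def flush(self, addr):
        if addr in self.cache:
            self.pmem[addr] = self.cache.pop(addr)  # write-back: now visible everywhere

pmem = {"x": 1}
node_a, node_b = Node(pmem), Node(pmem)
node_a.store("x", 2)
before_flush = (node_a.load("x"), node_b.load("x"))
node_a.flush("x")
after_flush = (node_a.load("x"), node_b.load("x"))
```

Before the flush only the storing node observes the new value; after it, both do.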
This would seem to create an inconsistency in a persistent memory program model; or, more to the point, perhaps what it creates is two views of persistent memory.
For example, consider some existing multi-node system which today is also a fully cache-coherent NUMA-based system. In this system, within each node, let’s replace some of the DRAM with persistent memory. All cache, all DRAM, and all persistent memory are accessible from any processor. Being fully cache-coherent, object changes still residing in any cache are visible from any processor on any node. In such a globally cache-coherent system, the two views of persistent memory need not exist.
But, please, don’t get me wrong; we are here talking about a subtlety with the programming model. Even without a fully cache-coherent system, all that it takes to allow a persistent object to become globally visible throughout The Machine is to be additionally aware that the changed object does need to be explicitly pushed out of the cache and into persistent memory. It’s a difference, yes, but still a subtlety. For comparison, ask yourself what it would take to do the same thing in a truly distributed-memory cluster (i.e., one where the nodes are connected via even a high performance Ethernet or InfiniBand link); big bucks and a lot of smarts go into minimizing that effect.
Rather than add this extra level of confusion to the remainder of this article, the program model we will be discussing in the next section assumes full cache coherency.
Still, let’s momentarily take a look at post-failure recovery given we instead (additionally?) used asynchronous writes of object changes into persistent memory.
Recovery Based On Asynchronous Writes To Persistent Memory
Recall in the disk drive-based persistent storage, the transaction was allowed to complete before the actual changes to the object had made their way to disk. The log writes to disk were synchronous with the transaction (i.e., they were done before the transaction completed), but the actual object changes in memory made their way out sooner or later – asynchronously. So, let’s suggest doing the same thing with persistent memory and see what that means.
In this model we are requiring synchronously writing log entries to persistent memory before the transaction completes, but we will allow the actual persistent object changes to continue to reside in cache even after the transaction completes and be visible from there.
Again, in what follows, we are talking about persistent memory, not disk drives.
Unlike the synchronous model, we are not forcing the actual changed object into persistent memory before the transaction’s lock is freed. The beauty of this is that, simply based on normal aging of data blocks in each cache, the changed object tends to make its way back out to persistent memory sooner or later. Tends to, yes, but any data block can also stay in some processor’s cache for who knows how long. So, yes, the object sooner or later becomes persistent; the problem is that you just don’t know when.
Recall, though, that you need the changes described in the recovery log to remain there until you do know. In short, the log can’t be trimmed until you know for sure. Post-failure recovery needs it.
But, just as with asynchronous page writes to disk drives, you can arrange to know when the object has been flushed from the cache by doing the asynchronous write-backs yourself.
Perhaps more to the point, you can know that the cached object write-backs are done prior to some point in time, not necessarily exactly when. You can have the write-backs be done later by another thread of execution, one separate from the thread(s) doing the program processing. All that that support thread needs is a list of all of the data blocks – data blocks which may or may not still reside in cache – for which that support thread is responsible for forcing back out into persistent memory. Once successfully written by this thread, this thread can then also know that the persistent object’s changed state is now again consistent in persistent memory, at least from the point of view of the transaction for which these changes were made. Knowing this, it can trim the log. Interestingly, the log itself describes the location of those very data blocks for those transactions which are already known to be complete.
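A support thread of this kind might look roughly like the following sketch. It is an illustration only: dictionaries stand in for the cache and persistent memory, the queue of `(txn_id, block_list)` work items and the name `writeback_worker` are assumptions, and a block that has already aged out on its own is modeled simply as being absent from the cache dictionary.

```python
import queue
import threading

def writeback_worker(pending, cache, pmem, log):
    """Force each listed block out of the cache, then trim that transaction's log."""
    while True:
        item = pending.get()
        if item is None:                  # sentinel value: shut the helper down
            break
        txn_id, addrs = item
        for addr in addrs:
            if addr in cache:             # absent means it already aged out by itself
                pmem[addr] = cache[addr]
        # All blocks of txn_id are now durable, so its log records can go
        log[:] = [rec for rec in log if rec[1] != txn_id]

cache = {"a": 70, "b": 30}                # committed changes still sitting in cache
pmem = {"a": 100, "b": 0}
log = [("redo", 1, "a", 70), ("redo", 1, "b", 30), ("commit", 1)]

pending = queue.Queue()
helper = threading.Thread(target=writeback_worker, args=(pending, cache, pmem, log))
helper.start()
pending.put((1, ["a", "b"]))              # hand the helper its list of blocks
pending.put(None)
helper.join()
```

The transaction's own thread only enqueues the block list and moves on; the log is trimmed later, once the helper knows the blocks are in persistent memory.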
Changing concepts, we observe that recovery after restarting post-failure is also a bit more complex. Since you can’t know which objects might be in a damaged and inconsistent state at recovery time, all of the logs need to be processed prior to restarting most anything to ensure that persistent objects really are in a consistent state prior to any access.
So we’ve just said that ensuring that an object’s changes have been forced from the cache and into persistent memory – making it durable – is a prerequisite for cleaning up the recovery log. Let’s turn it around and assume that the hardware had quickly aged some transaction’s object’s changed data block out into persistent memory, doing this prior to the point that the transaction log had recorded a “commit” for that transaction in persistent memory. That is, the changed object now resides in persistent memory before the log also says that the transaction had reached a “committed” state. Yes, this can easily happen. Let’s next also say that, prior to that “commit” event, and so also before the lock(s) are freed, a power failure occurs. This object is in a consistent state in persistent memory, but are the results of this transaction durable (in the sense of ACID requiring Durability)? Upon recovery and processing the log, the recovery code will find the transaction in the log, but recovery will not find that it had been committed. As far as this recovery code is concerned, sans commit record, there might have been more changes required to that object to make it consistent. So the recovery code is responsible for restoring the original – pre-transaction – state of the object, backing out the object’s changes that just happen to be already residing in persistent memory.
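The back-out case just described can be sketched as follows, assuming each log record for a store carries the pre-transaction value of the location it changed. The record layout and the `recover` function are hypothetical, chosen only to illustrate the idea.

```python
def recover(pmem, log):
    """Undo every logged store whose transaction has no commit record."""
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    # Walk the log backwards so the oldest saved value is the one restored last
    for rec in reversed(log):
        if rec[0] == "store" and rec[1] not in committed:
            _, _txn_id, addr, old_value = rec
            pmem[addr] = old_value
    return pmem

# Transaction 7 changed "a" and "b"; the new "a" happened to age out of the
# cache into persistent memory, then power failed before the commit record.
pmem_after_crash = {"a": 70, "b": 0}
crash_log = [("store", 7, "a", 100), ("store", 7, "b", 0)]
restored = recover(dict(pmem_after_crash), crash_log)

# A committed transaction, by contrast, is left alone.
kept = recover({"a": 70}, [("store", 8, "a", 100), ("commit", 8)])
```

Here `restored` ends up holding the pre-transaction values, while `kept` retains the committed change.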
More On The Program Model
All of the preceding many words, though, have largely been background into the Program Model. Of course, the folks working on attempting to abstract all of this know this stuff to the extreme, and likely have recurring bad dreams involving it.
We are going to attempt to outline next some of the work being done by Dhruva Chakrabarti and his team at Hewlett-Packard Labs in support of such a persistent-memory programming model.
Let’s start with their notion of a persistent region (PR) in persistent memory. Although there are OSes for which virtual addresses are also persistent (IBM i’s Single-Level Store comes to mind), this program model starts – much like a named file – by having your program first access an object in even persistent memory by way of some type of name; a file name in a directory or global object handle are examples. The name effectively represents a Persistent Region in persistent memory in which your object resides. In their program model, you
- Provide an object name to get an object handle.
- Provide an object handle to get a Process-local virtual address representing a root into the object.
At the bottom of it all, though, this region is really a portion of persistent memory (where every byte is addressable), so this persistent region also represents a contiguous real address space, a physical portion of this persistent memory. Just like tracks on a disk drive used for a file, it exists as something physical, but you don’t really need to worry much about where; you are provided a virtual address to allow you to find and work directly within it.
Perhaps you could think, for example, of a persistent region as being used for a named persistent heap. At any moment in time, the storage of this persistent heap could be in use by a large set of persistent objects. Once these objects become freed, their storage returns to the persistent heap for subsequent reallocation. The persistent heap object roots the Persistent Region such that memory backing both the objects residing in the heap and the freed storage managed by the heap are addressable within this persistent region. Indeed, part of this programming model has, just like today’s volatile memory heap, the ability to have un-referenced objects in the persistent heap be automatically garbage collected.
This program model also provides the means of creating a persistent region after having detected that it does not yet exist.
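As a loose analogy only, the name-to-handle-to-address sequence resembles opening and memory-mapping a file. The sketch below uses an ordinary file and Python's standard `mmap` module as a stand-in for a persistent region; it is not the Atlas interface, the function name `find_or_create_region` is hypothetical, and it assumes an existing region already has the requested size.

```python
import mmap
import os
import tempfile

def find_or_create_region(path, size):
    """Name -> handle (file descriptor) -> process-local mapping of the region."""
    created = not os.path.exists(path)
    handle = os.open(path, os.O_RDWR | os.O_CREAT)   # step 1: name to handle
    if created:
        os.ftruncate(handle, size)                   # reserve the region's storage
    region = mmap.mmap(handle, size)                 # step 2: handle to mapped bytes
    os.close(handle)                                 # the mapping outlives the handle
    return region, created

path = os.path.join(tempfile.mkdtemp(), "my_heap.region")

region, created = find_or_create_region(path, 4096)  # first use: region is created
region[0:5] = b"hello"                               # byte-addressed update in place
region.flush()
region.close()

region2, created_again = find_or_create_region(path, 4096)  # later use: found by name
survived = bytes(region2[0:5])
region2.close()
```

The second call finds the region by name rather than creating it, and the bytes written earlier are still there.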
The program model also works to provide the expected transaction semantics (i.e., ACID). If a program does not happen to be executing within the bounds of a transaction, you can assume that the object’s state is consistent. As each transaction executes, executing as though one transaction follows the next, the object’s state effectively transitions from one consistent state to the next.
We saw earlier that transactions are actually assumed to be concurrently executing. Such potential concurrent sharing often requires locks. An exclusive lock helps ensure that one and only one thread is updating an object at any moment in time. Such locking is required today even for volatile memory; the locking is protecting the fact that the object is shared, not necessarily that it is persistent. Ok, let’s now turn it around. Let’s also now put that object into persistent memory, say on a persistent heap. You have further gone to the trouble of putting a lock around that object, again because it is shared. So, can the program model assume that – with an object in persistent memory and protected via a lock – what you really intended is for that object to be protected as though it were part of a transaction?
So, quoting from a paper by members of HPE’s Program Model team – Atlas: Leveraging Locks for Non-volatile Memory Consistency:
“For lock-based programs, our goal is to guarantee that durable program state appears to transition from one consistent snapshot to another. We preserve all existing properties of lock-based code. There is no change to the threads memory model and memory visibility rules with respect other threads. Isolation is provided by holding locks and it remains the responsibility of the programmer to follow proper synchronization disciplines. The only semantics we add is failure-atomicity or durability, only for memory locations that are persistent, i.e. those within a PR.”
Within this paper, they go on to say that such consistency is guaranteed only between transactions; while executing a transaction, the state is often inconsistent. Failure, though, can occur at any time. If the failure occurs at a time when no transactions are executing, the object’s persistent state is, upon restart, in a consistent state. If not, these are the cases where object recovery is needed.
So part of the trick is for the program model to provide the needed, but minimal, hints needed to allow your programs to describe the scope of transactions and the recovery needed upon failure. So:
“We thus assume that data structures are inconsistent only in critical sections, and hence treat lock and unlock operations as indicators of consistent program points. We will call program points at which the executing thread holds no lock-objects thread-consistent. If no lock-objects are held by any thread, all data structures should be in a consistent state.”
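The notion of a thread-consistent point can be illustrated with a small lock wrapper that counts how many locks the current thread holds. This is only a sketch of the idea in the quote; `TrackedLock` and `thread_consistent` are hypothetical names and not part of the Atlas runtime.

```python
import threading

class TrackedLock:
    """Wraps a lock and counts, per thread, how many tracked locks are held."""
    _held = threading.local()

    def __init__(self):
        self._lock = threading.Lock()

    @classmethod
    def held_count(cls):
        return getattr(cls._held, "count", 0)

    def __enter__(self):
        self._lock.acquire()
        TrackedLock._held.count = self.held_count() + 1
        return self

    def __exit__(self, *exc_info):
        TrackedLock._held.count = self.held_count() - 1
        self._lock.release()
        return False

def thread_consistent():
    """True when the calling thread holds no tracked lock."""
    return TrackedLock.held_count() == 0

a, b = TrackedLock(), TrackedLock()
points = [thread_consistent()]             # before any lock is taken
with a:
    points.append(thread_consistent())     # inside the outer critical section
    with b:
        points.append(thread_consistent()) # inside the nested critical section
    points.append(thread_consistent())     # inner lock released, outer still held
points.append(thread_consistent())         # all locks released
```

Only the first and last sampled points are thread-consistent; releasing the inner lock alone is not enough, because the outer critical section is still open.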
So there is part of it. Recall that the program model also knows at least one more thing about your objects; they reside in either volatile or non-volatile memory and the program model knows the difference. After all, your program told it which one when it constructed the object. So, at compile time, the program model can know what you consider in need of protection from failure by looking for both what is to be saved in persistent memory AND what you are protecting as a transaction via locks.
According to that same Atlas paper:
“A compilation pass instruments synchronization operations and store operations that appear directed to persistent memory. This results in calls to the Atlas runtime library, whereby synchronization operations and stores to persistent memory are tracked in a persistent log.”
It would seem to follow that this same instrumentation would include the code necessary for ensuring that the persistent objects, as well as log entries, really have been forced out of the processor cache and into persistent memory. From the paper:
- “Log entries become visible in [persistent memory] before the actual stores to the corresponding locations, so that all stores not part of the consistent state can be properly undone.
- Log entries become visible in [persistent memory] in order.
- Every log modification step is an atomic operation and its effect is visible in [persistent memory] atomically.”
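The first two of these ordering rules can be pictured with a toy region whose every store is preceded by a log append recording the old value. This is a sketch under assumptions, not the Atlas runtime: `LoggedRegion` is a hypothetical class, a list append stands in for an atomic log update, and the `events` list exists only to expose the order in which things reach persistent memory.

```python
class LoggedRegion:
    """Toy persistent region: each store is preceded by an undo-log append."""

    def __init__(self, initial):
        self.pmem = dict(initial)
        self.log = []        # append-only, so entries become durable in order
        self.events = []     # order in which log entries and data reach pmem

    def store(self, addr, value):
        entry = (addr, self.pmem.get(addr))   # old value: enough to undo the store
        self.log.append(entry)                # the log entry goes first...
        self.events.append(("log", addr))
        self.pmem[addr] = value               # ...and only then the actual store
        self.events.append(("data", addr))

region = LoggedRegion({"x": 1})
region.store("x", 2)
region.store("y", 5)
```

After the two stores, `region.events` alternates log and data entries, and `region.log` holds the old values (`None` for a location that did not previously exist), which is what lets stores outside a consistent state be undone.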
You will recall the earlier overview on trimming the log and maintaining a consistent state. Apparently Atlas support is intended to manage it asynchronously …
“After a failure, a recovery phase, that is initiated in a programmer-oblivious manner, performs the reconstruction. A helper thread examines [the log] asynchronously and computes a globally consistent state periodically. … Computation of a consistent state renders some entries of [the log] unnecessary, allowing them to be pruned out.”
As implied, the log is maintained in persistent memory:
“The initialization phase must be called at program start-up. This phase performs two main tasks:
- Creation of a process-private persistent region … to hold [the log].
- A helper thread is created to perform consistent state identification and log pruning.”
[Note: A process-private log suggests that this program model is limited to only the threads of individual processes and to cache-coherent environments. You may recall that The Machine maintains cache coherency only amongst the processors of its individual nodes. Persistent memory resides on many nodes and all of it is accessible from any processor on any node, so such objects could reside anywhere and be accessed from anywhere in The Machine, but this program model is not yet at a point where concurrent cross-process or cross-node sharing is supported. This is not to say that cross-process sharing via persistent memory is not possible; it is just handled differently.]
So, What Is To Follow?
Very fast persistent storage. Some of us will look at that and understand that we can save and load our files a lot faster. True. Others of us will look at that and realize that this rather does change things up quite a bit. Anything that we could have created in volatile memory can now – given some constraints on addressing – be much more rapidly created and maintained in non-volatile memory, and we can be sure it is as we left it when the power comes back on. We are going to see this as something rather revolutionary. Indeed, quoting Dhruva Chakrabarti, “The aim is to embed persistence in programming languages so that it is readily available for potentially all data. The approach described in Atlas tries to provide automatic support for most of the additional tasks required, hoping to ease the transition of existing code into the world of persistent memory.”
There is, though, still a bit of a gap between the program model being produced to abstract away the warts on such hardware and the world most of us programmers live in. When programming, all we really want to say is that we want an object – name your favorite object – to be constructed, do all the things that we do to it today in volatile memory, and, oh by the way, we want that object to reside in persistent memory and be failure atomic. Easy to say, right?
Fortunately, with some work from the open-source community, it can be relatively easy. I had earlier referred to a dictionary object, an object class today supported in a number of different languages. Many of us have used dictionary objects, and have been glad that others have done the enabling of such an object for us. The semantics of a dictionary really are easy to use; we don’t much care how it is really done, as long as it is fast. By now many of you can picture what it would take to enable even a dictionary object for persistence. Most of you reading this also know that enabling that one object for persistence is really just the tip of the iceberg. There are scads of objects out there.
That is why the folks at HPE have been asking the open-source community for their involvement. For as much as I honor the HPE folks for investing big time in The Machine and for getting out ahead of this technology, the folks at HPE know something else as well; they will not be the only company developing systems like The Machine. When they ask for assistance, they know that the open-source community is not particularly interested in developing single-vendor software. But HPE, and all of the companies that will be following soon in their wake, know that an entire software stack, a complete solution, is needed before such systems really take off. When the hardware hits the market – and soon it will be hardware from a number of vendors – they all know that the software needs to be available as well.