We have published a number of stories lately that talk about the innovative uses of Intel’s 3D XPoint Optane persistent memory modules, which are a key component of the company’s “Cascade Lake” Xeon SP systems and which are also becoming a foundational technology in clustered storage based on NVM-Express over Fabrics interconnects from a number of storage upstarts.
For one of those storage upstarts, called Memverge, the key goal was to leverage the additional capacity that Optane DIMMs afford, but to do so without having to alter the applications running on the systems that support it. This is a worthy intent. To pull it off, Memverge’s product consigns the Optane memory to an I/O subsystem, where it acts as a cache for slower forms of persistent storage out in I/O space. In a way, their product does for SSDs what SSDs did for hard drives. Along with their enhanced I/O driver, it speeds accesses to persistent storage in I/O space. Certainly faster, and measurably so.
This is good, of course, but it got me wondering whether requirement A – no rewriting of applications – automatically implies solution B – that the likes of Optane persistent memory must reside in I/O space. After all, such persistent memory was created to be directly attached to the processor chips and to be byte addressable, just like RAM. Think paradigm-busting, outrageously fast commits of data to persistent storage. Said differently, can a processor complex be created with directly attached persistent memory where the typical use of that system does not require changes to the applications?
Before going there, let’s start by observing that under the normal rules of virtualization and the cloud, Memverge’s solution fits in very nicely. Within those rules, your application – and indeed the OS instance within which it exists – is intended to run on almost any processor, and so resides within the RAM attached to those processors. It is virtual; the OS gets to float from one processor-and-RAM complex to another. All that is additionally required is that, when it executes, it is able to find its data and code somewhere within I/O space. This is great, but now let’s add persistent memory to each of those processor complexes, right next to the volatile RAM.
In such an environment, with the persistent memory attached to the processor chips, let’s arrange for a database application to be running in one. You may know that a database manager supporting that application has some pretty strong rules about transactions and the use of persistent memory – something we’ll be touching on shortly – but here we will just say that the application cannot be told that the database operation has ended until the database changes are in persistent storage. So let’s arrange here for the database changes to first be committed as persistent into one system’s persistent memory, perhaps with those changes later paged out to slower storage out in I/O space. Life is good; transactions – as we’ll see more on later – are completing unbelievably fast. OK, let’s add another wrinkle. You are happily – and speedily – updating your database when suddenly … power failure. As with any good database, if your transactions were known to have completed, enough data has been committed to persistent memory to be sure you can start up later as though the power failure had never happened. All is good, right? Starting up again means you’ll find your data. But virtualization allows you to wake up most anywhere. And where is your data? It’s persistent all right, but it’s in the byte-addressable persistent memory attached to what is now another, separate set of processors. And you can’t get to it. Oops.
Don’t get me wrong. This observation is not a fatal weakness of the likes of Optane. If you want its real benefit, where the persistent memory is directly attached to the processors, we are now talking about dedicated systems. And, again, this article is about enabling its benefits without application changes. The remainder of this article is going to focus on such dedicated systems, perhaps with virtualized OSes, but ones that continue to reside within the same physical processor complex.
Because I am familiar with it, I’m going to use the IBM i operating system (formerly known as OS/400 and leveraging the single-level storage architecture of the System/38 that dates from the late 1970s) as a bit of a touch point for this thought process. In IBM i, the “i” stands for integrated. In it, much of its database and file system support and practically all of its integrity and security are embedded in the operating system kernel; read that as being very close to that system’s hardware. Indeed, because of this integration, one of the selling points of this OS is that major changes can be introduced to the hardware with no observable change to applications.
Another key – and here very applicable – concept basic to the IBM i operating system is that single-level storage (SLS). Even decades ago with the System/38, SLS meant that when your application used a secure token as an address to access data, it did not matter whether that data was first found on disk or in RAM. Even after a system restart – say, one occurring due to a power failure – you restarted using exactly the same address token to address the same data. In a very real sense, the system’s relatively small RAM was managed as a cache of the contents of all data on disk, where disk, even way back when, was the only form of persistent storage. And, aside from performance, the applications did not have a clue where the data actually came from. As you will see, I am going to claim that persistent memory – memory attached to the processors, just like volatile RAM – can be introduced and then managed as a persistent-data cache of the contents of data also residing on very much slower persistent storage out in I/O space. And because of that, applications are not at all aware of its existence, aside from the undeniably outrageously fast transactions.
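To make the single-level storage idea concrete, here is a toy Python sketch – emphatically not IBM i’s actual implementation – in which an address token is simply a stable offset into one large persistent address space, with a small RAM-resident cache in front of it. The class name, page counts, and sizes are invented for illustration; the point is only that the same token resolves to the same data before and after a restart.

```python
import os

PAGE_SIZE = 4096
BACKING_PAGES = 16

class SingleLevelStore:
    """Toy model of single-level storage: an address token is a stable
    offset into one large persistent address space, and a small RAM
    'cache' sits in front of the backing (disk) file. The same token
    resolves to the same data before and after a restart."""

    def __init__(self, backing_path, cache_pages=2):
        self.path = backing_path
        self.cache = {}              # page number -> bytearray (the RAM cache)
        self.cache_pages = cache_pages
        if not os.path.exists(backing_path):
            with open(backing_path, "wb") as f:
                f.truncate(PAGE_SIZE * BACKING_PAGES)

    def _write_back(self, page, data):
        # Push one cached page out to the persistent backing store.
        with open(self.path, "r+b") as f:
            f.seek(page * PAGE_SIZE)
            f.write(data)
            f.flush()
            os.fsync(f.fileno())

    def _load(self, page):
        # Page in on demand, evicting an older cached page if full.
        if page not in self.cache:
            if len(self.cache) >= self.cache_pages:
                victim = next(iter(self.cache))
                self._write_back(victim, self.cache.pop(victim))
            with open(self.path, "rb") as f:
                f.seek(page * PAGE_SIZE)
                self.cache[page] = bytearray(f.read(PAGE_SIZE))
        return self.cache[page]

    def write(self, token, payload):
        # Toy simplification: payloads do not cross a page boundary.
        data = self._load(token // PAGE_SIZE)
        off = token % PAGE_SIZE
        data[off:off + len(payload)] = payload

    def read(self, token, length):
        data = self._load(token // PAGE_SIZE)
        off = token % PAGE_SIZE
        return bytes(data[off:off + length])

    def shutdown(self):
        # Flush every cached page so tokens survive a 'power failure'.
        for page, data in self.cache.items():
            self._write_back(page, data)
```

A token handed out before a shutdown still reads back the same bytes after constructing a fresh `SingleLevelStore` over the same file, which is precisely the property that lets applications remain oblivious to where their data physically lives.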
Let’s start by looking at a simple database update transaction. And, by the way, the key word here is transaction, a word formally defined in database theory. Another pertinent term is ACID, standing for Atomicity, Consistency, Isolation, and Durability. In short, what that all means is that when your application tells you your database operation is done, it is completely done (or not done at all), no matter how complex it was, no matter how many others were trying to do the same thing at the same moment, and no matter any system failures along the way. When you believe it succeeded, it really did succeed.
OK, let’s next consider the following database update transaction in a system without persistent memory …
- The database manager interprets your query, starts your transaction, finds the pertinent record(s) – having potentially read whatever is required from hard disk into RAM – and applies locks. The locks ensure that the state of the database at that moment is the state upon which your updates will be applied. No other concurrently executing transaction can alter that state; others effectively must wait until you are done. (Similarly, your transaction waits until the locks of preceding transactions are first freed. More on this later.)
- Along with actually making the update, a description of that change is built, stored into what IBM i calls a journal (and what others call a commit log), and that description is then written to disk. This disk write must be complete – read that as committed to persistent storage – and the database manager must become aware of it, before the transaction is allowed to continue. Think tens of milliseconds. Using the journal in this manner avoids needing to write all of the changed database records to disk before continuing with the transaction.
- With the completion of the journal writes and the updates of the database records in RAM, the locks can be freed, allowing waiting transactions to proceed. The changed database records are then written to disk in the fullness of time, but the sooner the better.
- The transaction is allowed to be perceived as having been completed.
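The four steps above can be sketched in miniature. The following Python is a hypothetical, heavily simplified commit path – a dict standing in for the RAM-resident database records, a text file standing in for the journal – but it shows the essential ordering: lock, journal the change, force the journal to disk, update, unlock, and only then report completion.

```python
import json
import os

def commit_transaction(journal_path, db, record_id, new_value, lock):
    """Hypothetical, heavily simplified disk-journal commit path.
    db is a dict standing in for the RAM-resident database records;
    journal_path is a text file standing in for the journal."""
    with lock:                                   # step 1: apply locks
        entry = {"id": record_id,
                 "old": db.get(record_id),
                 "new": new_value}
        with open(journal_path, "a") as j:       # step 2: journal the change
            j.write(json.dumps(entry) + "\n")
            j.flush()
            os.fsync(j.fileno())                 # wait until on persistent media
        db[record_id] = new_value                # step 3: update records in RAM
    # Locks freed on exiting the with-block; only now (step 4) may the
    # transaction be reported to the application as complete.
    return True
```

The `os.fsync` call is the synchronous, tens-of-milliseconds wait the article describes; everything else in the function is measured in microseconds, which is why that one line dominates transaction latency.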
In isolation, this – even years ago, and even for some much more complex transactions – is perceived as fast, very fast. We’re just not wired to perceive time down to anything much faster than about 100 milliseconds. But now picture thousands of concurrently executing transactions, some touching – and so locking up – the same data, all delivered as incoming requests to your database system via your web site. No matter the number of processors in your system, each transaction accessing and maybe updating the same data finds the locks set by preceding transactions. These transactions wait their turn, waiting for preceding transactions to free those locks. As each waiting transaction gets its turn, it sets its own locks, later synchronously forcing its journal records to disk, with subsequent transactions waiting on these. Think in terms of a train with cars representing related transactions, and your transaction is, for now, the caboose. And occasionally, the latter cars of that train get detached, get attached to another train, and so get to start their complex transactions all over again. That kind of time you can perceive. And you had wanted your stock query and related trade to occur now. Sorry, no. Did SSDs, and now Memverge’s product, improve upon this? Yes, but we are still using I/O-based protocols to support it.
Let’s now install some persistent memory into the same slots of the processor complex used for volatile RAM and then repeat.
- As before, the database manager still finds and locks up the pertinent records based on the query.
- As before, journal records are created, these describing the intended change to the database. As before, although unstated, these journal records are first created within the processor cache – as is the case with practically all data – and are then flushed into memory. Unlike before, rather than that cached data being written out to RAM, the cache lines holding the journal records are flushed into the correct locations in persistent memory. Once the code managing the journal knows that these synchronous cache flushes are complete, it also knows that the records are persistent. There is no requirement to first write this data to disk or SSD or via whatever other I/O-based operation. It’s done, it took only a few hundred processor cycles to do it – with processor cycles executing in fractions of nanoseconds – and the database code never needed to give up the processor to do it.
- With the persistence of the journal records a fait accompli, the remainder of the database updates are completed, the locks are freed, and the transaction ends, very likely without this code needing to give up the processor to do it.
- In the fullness of time, as with the database records, the journal records are paged out to slower persistent storage in I/O space, just to free up that persistent memory for subsequent use.
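The persistent-memory variant of the same commit path can be sketched as follows. Real hardware would use cache-line flush instructions (CLWB or CLFLUSHOPT plus a fence) against a DAX-mapped Optane region; in this portable Python sketch, a memory-mapped file stands in for the persistent memory and `mmap.flush()` (msync under the covers) stands in for the cache-line flushes. The function names and record layout are invented for illustration.

```python
import mmap
import struct

JOURNAL_BYTES = 4096

def open_pmem_journal(path):
    """Map a file as a stand-in for processor-attached persistent
    memory; on real hardware this would be a DAX mapping of an
    Optane region, not an ordinary file."""
    with open(path, "wb") as f:
        f.truncate(JOURNAL_BYTES)
    backing = open(path, "r+b")
    return backing, mmap.mmap(backing.fileno(), JOURNAL_BYTES)

def persist_journal_record(pmem, offset, payload):
    """Store one length-prefixed journal record, then flush it.
    mmap.flush() stands in for CLWB plus a fence; once it returns,
    the record is known to be durable and the transaction may
    proceed. Returns the offset just past the record, where the
    next record would go."""
    pmem[offset:offset + 4] = struct.pack("<I", len(payload))
    pmem[offset + 4:offset + 4 + len(payload)] = payload
    pmem.flush()   # flush the whole mapping; real code would flush
                   # only the cache lines the record touched
    return offset + 4 + len(payload)
```

Contrast this with the disk-journal path: there is no I/O request, no interrupt, and no context switch – just ordinary stores followed by a flush, which is why the database code never needs to give up the processor.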
See the difference, along with the impact on throughput as a result? The locks, required in any case since time is passing, are held for a minimum of time. The probability of any subsequent transaction seeing these locks decreases significantly. Subsequent transactions don’t need to wait as often, and when they do, their wait time is far less. In the train metaphor used earlier, a train doesn’t even get built anywhere near as often. Life is good.
So, back to our initial intent. We want to install such persistent memory, use it when appropriate, and have no impact on our applications. Let’s again assume – as with IBM i – that the database manager and its journal component are in the operating system kernel. Adding support for persistent memory to these, and also to the operating system’s main storage manager, although not trivial, is not all that tough either. Everything needed is embedded in the OS. A new release of the kernel and you have it. And you will notice that at no point was an application even aware that any of this had occurred.
Get the picture? Remember also that I had described how, in IBM i, main storage (historically RAM) acts as a cache for all of its single-level store, which addresses practically all of what resides in persistent storage (HDDs and SSDs, for example). All we did was make some of that main storage consist of persistent memory. “Main storage” is still partially RAM, certainly, because RAM remains faster to access than persistent memory. But where data had paged into RAM as needed, and most of it paged out when sufficiently aged, the same model for main storage management can be used for the persistent memory as well. Perhaps the OS wants the whole of the database, as well as the journal, to reside in persistent memory. No big deal. The OS already knows what constitutes the database. And the reason for doing so is faster recovery after a system failure.
The additional fact that persistent memory has a higher density than RAM – read that as more memory per memory socket – meaning that more of the database can be held longer close to the processor (rather than being paged in as needed from I/O space), is no small matter as well. Again, the OS knows all of this, with no change to the application required. “Blue skies smilin’ at me. Nothin’ but blue skies do I see.” Of course, that assumes that IBM will again see a benefit to investing in such systems. Well, if not Big Blue, perhaps HPE will do so. Whoever it is, we are in for quite a change – and a largely transparent one at that – in the computer industry when it arrives.