Taking The Locks Off Transactional Memory
November 19, 2015 Mark Funk
The earliest paper on Transactional Memory that can be found – Transactional Memory: Architectural Support for Lock-Free Data Structures, by Maurice Herlihy and J. Eliot B. Moss – was written about twenty years ago. It’s hard to say exactly what led to the realization that transactional memory could be built and could improve system performance, but it is useful to picture a few database gurus and processor designers sharing nachos and beer and slowly realizing the similarities between the sharing of and exclusive access to data in both database systems and processor caches supporting cache coherency.
With those similarities and one monster of a cache, what would it take to execute complete multi-step – as opposed to single atomic – transactions, detect conflicts without locks, and quickly restart transactions upon conflicts? Of such is genius born.
In part one of this series on transactional memory, which focused on atomicity and cache coherency, we saw that an attempt by a processor P1 to change a data block held in the cache of processor P0 soon results in the data block being stolen from P0. P1 needed exclusive access to that block to store the change into it and got that access by stealing the block from P0’s cache. Usefully, P0 knows when and how that happened. Said differently, an attempt by P1 to change data previously read and/or changed by P0 is easily detected by P0.
Similarly, if P0 had changed a data block and P1 had subsequently requested to read it, again P0 would be aware that a change in state of its affected cache line is required. What had been changed and exclusively held is either going to be stolen or shared in both P0’s and P1’s cache. As a result, P0 would again be aware of P1’s access attempt.
It happens that that same knowledge is what is needed to determine that a transaction’s isolation is violated.
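As a concrete illustration, that detection mechanism can be modeled in a few lines of code. The sketch below is a toy simulation of simplified MESI-style coherence states, not a real hardware design; all class and state names are invented for illustration.

```python
# Toy model of how a processor detects another core's access to one of
# its cached lines via coherence state changes. States follow a
# simplified MESI protocol: M(odified), E(xclusive), S(hared), I(nvalid).

class CacheLine:
    def __init__(self):
        self.state = "I"                 # starts Invalid
        self.observed_remote_access = False

class Core:
    def __init__(self, name):
        self.name = name
        self.lines = {}                  # address -> CacheLine

    def line(self, addr):
        return self.lines.setdefault(addr, CacheLine())

def remote_write(owner, addr):
    """Another core writes addr: the owner's copy is invalidated
    ("stolen"), and the owner sees that happen."""
    ln = owner.line(addr)
    if ln.state in ("M", "E", "S"):
        ln.state = "I"
        ln.observed_remote_access = True

def remote_read(owner, addr):
    """Another core reads addr: a Modified line must be demoted to
    Shared (and written back), so again the owner notices the access."""
    ln = owner.line(addr)
    if ln.state == "M":
        ln.state = "S"
        ln.observed_remote_access = True

p0 = Core("P0")
p0.line(0x100).state = "M"               # P0 has changed block 0x100
remote_read(p0, 0x100)                   # P1 now reads it
print(p0.line(0x100).state)                   # -> S
print(p0.line(0x100).observed_remote_access)  # -> True
```

The point of the toy is simply that the state transition itself is the notification: P0 cannot service P1's request without updating its own line state, so it necessarily knows the access occurred.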
Great, that’s one data block, but we want to have transactions consist of multiple storage operations accessing potentially many data blocks (and so many cache lines). Over that full set of operations on those data blocks, we want the previously described semantics of a transaction supported. A one-operation transaction might be interesting, but we want to be able to expand that into many more. So how?
For the single operation case, we were able to tell the hardware our intent by having software call the operation “atomic.” Across the entire read-modify-write operation, we said that we wanted it to always appear as a single atomic operation. So with our multi-operation transaction, we need a way of describing to the hardware when the transaction starts and when it ends. Over that entire period, we want the hardware to tell us that we have either succeeded in doing the entire operation as a transaction – and so are done – or that the transaction’s atomicity and isolation were somehow violated – and so it needs to be restarted/recovered in some way.
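In code, that begin/end protocol typically takes the shape of a retry loop. The sketch below models it purely in software with hypothetical `Aborted` and `run_transaction` names; real hardware exposes the begin and end points as instructions (Intel’s TSX, for example, has XBEGIN and XEND), but the control flow is the same: reach the end and commit, or abort and start over from the beginning.

```python
# Software sketch of the transactional retry loop. The hardware's role
# (detecting conflicts and discarding uncommitted changes) is simulated
# here by an exception; names are stand-ins, not a real API.

class Aborted(Exception):
    """Raised where hardware would detect a mid-transaction conflict."""

def run_transaction(body, max_retries=10):
    """Run body() as a transaction: either it completes entirely and
    commits, or it is aborted and restarted from the beginning.
    Returns the number of aborts before success."""
    for attempt in range(max_retries):
        try:
            body()             # all loads/stores here are transactional
            return attempt     # reached the end: commit succeeded
        except Aborted:
            continue           # isolation violated: discard and retry
    raise RuntimeError("too many aborts; a real system falls back to a lock")

# Usage: a body that conflicts once, then succeeds on the retry.
state = {"x": 0, "failed_once": False}

def body():
    if not state["failed_once"]:
        state["failed_once"] = True
        raise Aborted()        # simulate a conflict on the first attempt
    state["x"] += 1

print(run_transaction(body))   # -> 1 (one abort, success on 2nd attempt)
print(state["x"])              # -> 1
```

Note the fallback at the bottom: because success is never guaranteed, production uses of hardware transactional memory pair the retry loop with a lock-based path after some abort budget is exhausted.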
Cache coherency state, this time tracked across multiple data blocks, is what determines transaction success or failure. Let’s next take a look at some cases in point.
First, recall again the notion of a data commit in a database transaction. It is not until the commit at the end of a transaction that the changes made by the transaction are allowed to be seen by another transaction. So let’s have a transaction T1 begin – marking its beginning for the hardware – and then make a set of changes with those changes residing in multiple cache lines of processor P0.
At any time prior to T1 indicating its end to P0, rules of transaction isolation say that those changed data blocks are not allowed to be seen by any other transaction; they cannot escape to another processor or to memory prior to their being committed at transaction’s end.
Recall also that for some types of transaction isolation, data blocks seen by transaction T1 – and held in its processor’s cache – cannot be changed by another transaction before T1’s end because T1 might want to re-access that data and expect to find the contents unchanged.
Of course, since there are no locks, without transactional memory there is nothing to stop transaction T2 – executing on another processor – from reaching in and reading one or more of T1’s changed data blocks or changing data blocks previously read into T1’s processor’s cache. In doing so T2 has obviously violated T1’s isolation intent. So now what?
We could have T1 fail; T1 does know what T2 had done. But it is T2 that was not supposed to see T1’s changed data, at least until T1 ends and commits that data. So it would seem that it is T2 that needs to fail.
And how would that transaction know to do so? T1’s processor knows that those cache lines are associated with transaction T1; that data had been accessed after the processor knew that T1 had begun. Since T2 is reaching into T1’s processor’s cache, it is that same processor that would respond with the data block (and change the source cache line’s state) – if responding would not violate T1’s isolation. But here it would, so T2’s processor needs to be told as much. In being told, T2’s processor is effectively informed that T2 must be forced to subsequently fail.
T1 continues, hopefully reaching the successful end of its transaction. Once T2 actually fails and restarts, this time it will hopefully be accessing T1’s now committed data blocks. Those data blocks are no longer associated with a current transaction and so can be accessed by T2; they now become associated with T2 and, with luck, are committed (i.e., made visible) to subsequent transactions when T2 ends.
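This “requester loses” resolution can be sketched as a toy model: cache lines touched by an active transaction are tagged with their owner, a conflicting access dooms the requester rather than the owner, and a commit clears the tags. All names here are illustrative, not an actual hardware interface.

```python
# Toy "requester loses" conflict resolution. A single shared structure
# stands in for the coherence fabric; real hardware tracks this per
# cache line, per core.

class TxMemory:
    def __init__(self):
        self.tx_lines = {}       # addr -> id of the owning transaction
        self.doomed = set()      # transactions forced to abort

    def tx_access(self, tx, addr):
        """tx reads or writes addr transactionally. Returns True on
        success; False if tx conflicted and is now doomed."""
        owner = self.tx_lines.get(addr)
        if owner is not None and owner != tx:
            self.doomed.add(tx)  # tx violated owner's isolation: tx loses
            return False
        self.tx_lines[addr] = tx
        return True

    def commit(self, tx):
        """At tx's end its lines stop being transactional; the data is
        now committed and visible to everyone."""
        self.tx_lines = {a: o for a, o in self.tx_lines.items() if o != tx}
        self.doomed.discard(tx)

mem = TxMemory()
mem.tx_access("T1", 0x200)            # T1 changes block 0x200
ok = mem.tx_access("T2", 0x200)       # T2 reaches in before T1 commits
print(ok, mem.doomed)                 # -> False {'T2'}: T2 must restart
mem.commit("T1")                      # T1 ends; 0x200 is committed
mem.doomed.discard("T2")              # T2 restarts fresh
print(mem.tx_access("T2", 0x200))     # -> True: the retry now succeeds
```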
And this was all done without any locks.
Transactional Memory Observations
As you have seen, transactional memory is tightly coupled to the normal way that the processor hardware manages its cache. A task on a processor tells the hardware that it is starting a transaction, and the hardware must from then on assume that all storage accesses – which are, after all, cache accesses – are part of that transaction. Later that task tells the hardware that the transaction is done, allowing the changes made there to be committed and made generally available.
All that assumed that the transaction’s task stays on that processor for the entire period of the transaction. But what if, part way through the transaction, the task finds that the needed data is not even in memory; it page faults. The task takes an interrupt to bring that page into memory, leaving the processor and waiting until the I/O read is complete. So what happens to the transaction and all of the transaction’s changed data blocks in the meantime? It aborts and the changes are lost.
We bring this up to point out one big difference between database transactions and the transactions supported by transactional memory. A database transaction can be a long running operation, spanning many such page faults. Indeed, a database transaction can be supported by multiple tasks within the database management system. A transaction supported by transactional memory, because the entire operation must take place within a single processor’s cache and succeed in all of its memory accesses, is therefore a considerably more limited piece of work.
Notice also that there is a bit of a trade-off involved in using transactional memory. One of the reasons to use it is to avoid locks; instead, the transaction takes the risk that it can run from beginning to end without conflict and so without the need to restart. In an environment where conflict on shared data is improbable, the risk seems reasonable. But suppose we have both a set of long running transactions and a moderate probability that the shared data will be concurrently accessed. How often will the transactions fail? And, when one fails, does the benefit of not spending the time to take locks justify the extra cost of starting over?
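A back-of-envelope model makes the trade-off concrete. Assuming (purely for illustration) that each attempt costs a fixed amount of work, that aborts happen independently with probability p, and that a lock adds a fixed overhead, the expected costs compare as follows:

```python
# Illustrative cost comparison with assumed, made-up cycle counts.
# With abort probability p and independent attempts, the expected
# number of attempts is 1 / (1 - p), so the optimistic path's expected
# cost is work / (1 - p).

def optimistic_cost(work, p):
    """Expected cycles when each attempt redoes `work` and aborts
    independently with probability p."""
    return work / (1.0 - p)

def locked_cost(work, lock_overhead):
    """Cycles when the lock is always taken: fixed overhead plus work."""
    return work + lock_overhead

work, lock_overhead = 1000, 200       # hypothetical cycle counts

for p in (0.01, 0.1, 0.3):
    wins = optimistic_cost(work, p) < locked_cost(work, lock_overhead)
    print(p, wins)
# At low conflict rates the transaction wins; as p grows, the redone
# work costs more than the lock overhead ever did.
```

With these particular numbers the break-even abort probability sits around p = 1/6; longer transactions both raise p (more data held for longer) and raise the cost of each restart, which is exactly why long running transactions are a poor fit.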
And for those of you who are really into this sort of stuff, I can hear you advising “Yes, but what about False Sharing?” Good observation. It happens that the data blocks I have been referring to need to be nicely aligned in order to fit into a processor’s cache line; I like to picture such data blocks as being 128 bytes in size and 128-byte aligned. The relevance to False Sharing is that most software – and so the very data objects being accessed in these transactions – is not much concerned with the alignment and location of such data. So picture a bunch of objects accessed within a transaction, and suppose that one of the data blocks holding them also happens to hold some highly shared data which is completely independent of the transaction(s). The transaction is executing nicely when, out of the blue and for completely independent reasons, that independent data item gets accessed and then changed. The transaction fails and needs to restart, perhaps with this happening all over again. Not nice.
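The effect is easy to see arithmetically. In this toy sketch (the addresses and the 128-byte block size are assumptions for illustration, not measurements), two logically independent data items land in the same block, so a write to one shares a coherence granule with a transaction that only touched the other; padding one item out to its own block removes the collision:

```python
# False sharing in miniature: coherence (and so transactional conflict
# detection) operates on whole blocks, not individual objects.

BLOCK = 128                       # assumed cache line size, in bytes

def block_of(addr):
    """Which cache line an address falls in."""
    return addr // BLOCK

tx_data_addr = 0x1010             # object accessed inside the transaction
hot_counter_addr = 0x1050         # unrelated, highly shared datum

# Both addresses fall in the same 128-byte block, so a store to the hot
# counter aborts a transaction that never logically touched it:
print(block_of(tx_data_addr) == block_of(hot_counter_addr))   # -> True

# Padding/aligning the transactional object to its own block avoids the
# false conflict:
padded_addr = (block_of(tx_data_addr) + 1) * BLOCK
print(block_of(padded_addr) == block_of(hot_counter_addr))    # -> False
```

This is why performance-sensitive code often aligns and pads hot shared structures to cache-line boundaries, transactional memory or not.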
Wrapping It All Up
Those previous observations are, of course, warnings about best practices. We have seen that transactional memory is better if it is not being used in support of long running database transactions, even though the concepts we have looked at certainly stem from there. We’ve also seen that there is no particular value in using transactional memory for updating some single primitive data item; such atomic updates already use part of the underlying mechanism.
But, those of you who do multi-threaded programming know that there is a large realm of data sharing that fits between these two extremes. For example, I know that many of you have jumped through hoops to make otherwise independent data items look as though they were atomic primitives, just so that you could avoid the use of locks. Now you don’t have to. Clearly, transactional memory can be used to avoid locking on still more complex shared objects as well.
The trick, though, in deciding to use transactional memory is to determine whether your transaction can usually get from beginning to end without having that same data be accessed at the same time by another processor. When it typically can, transactional memory seems like a win – which is why IBM’s Power8 and Intel’s “Haswell” Xeon processors support it.