What Is SMP Without Shared Memory?

This is the second article in the series on the essentials of multiprocessor programming. This time around we are going to look at some of the seldom-considered effects of having memory shared by a lot of processors and by the work concurrently executing on them.

We start, though, by observing that there has been quite a market for parallelizing applications even when they do not share data. There has been remarkable growth in applications capable of using multiple distributed-memory systems for parallelism, and that growth very nicely demonstrates the opportunity for tapping massive compute capacity when applications find that data sharing is minimal.

So why bring that up? When such applications are fortunate enough to find no real sharing, an application just kicks off work on each of the systems, points the work at each system’s portion of the data, and off they go. In fact, in such a serendipitous environment, it hardly matters whether the processors are tightly coupled within a shared-memory SMP, spread across distributed-memory systems, or some variation in between, like various forms of NUMA-based systems. Each processor accesses data loaded into some portion of its own memory, or its own address space, and it happily grinds upon it. Sure, you could use the processors of a shared-memory SMP, or the single already huge address space of a process, but if you are not sharing, is there any advantage to doing so?

But this time around, we are not really talking about such embarrassingly parallel applications. We are talking about sharing of data – perhaps frequently and concurrently – between the units of work, and doing so rapidly. Sure, the units of work could still be spread across distributed-memory systems, and sharing is possible and does occur (and a lot of research has gone into its enablement), but then we are talking about the highly nontrivial effort and performance cost of moving data between these I/O-attached memories to enable such sharing. And the more sharing that occurs, the higher the overhead and complexity to support it. If the memory containing an application’s data can’t be accessed by a processor running part of that application, then there is no choice but to move or copy that data to where the “work” is executing (or move the “work” to the data).

So here we are going to talk about the far simpler programming model enabled by units of work – here represented by threads within processes, with processes each supporting a single address space – all able to access and rapidly share data amongst the processors to which these threads are assigned.

Far simpler, yes, but it happens that there are a few quirks stemming from the concurrent execution of that work accessing and changing the same data, even within the same sub-nanosecond moment. When you share, these quirks matter, simply because of the physical nature of the hardware. Where you don’t share, they do not.

Even though we have these nice abstractions like processes and threads, and even though we are encouraged to think – if we think about it at all – that an SMP is just a bunch of processors sharing a common memory, when we get around to having an application execute on a number of these processors, the actual physical characteristics of that shared memory start to show through into our otherwise nice, clean programming abstractions. We begin to see that there really are separate processors hung off the common memory we attempt to access and that they change the same data at the same time. We might prefer not to know about the effects we will be looking at in a moment, but they really are there because, well, the hardware is the hardware.

Even so, let’s again observe that even an application based on threads within a process on a single SMP can also be embarrassingly parallel. Picture, for example, the processing of a single large array, or perhaps separate lines of a single file. Let’s say that each array entry or file line represents some longer-running “something” independent of every other “something.” Your application creates a set of threads, points each at some number of entries, and then says GO to each. Each thread takes off, processing its share of the data, all independent of each other. Life is good. And then . . .

And then they need to all provide their results. The threads, though, are all asynchronously executing units of work; who knows when or how long they really took to execute or when they actually ended. How does a master thread, the one kicking off this whole massive enterprise, know when to pick up all of the otherwise independently produced results?

As one answer, many programming languages support the notion of a join; from Java, for example: “The join method allows one thread to wait for the completion of another.” The master thread joins with all of the worker threads it created, and when the OS reports in that they are dead, the master knows that the result data is available. Rather morbid perhaps, and in some applications actually inefficient.
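To make that concrete, here is a minimal sketch of the same join idea using C++’s std::thread (the worker function, its name, and the thread count are our own inventions for illustration):

#include <thread>
#include <vector>

// Hypothetical worker: processes this thread's share of the entries.
void processChunk(int chunkId) { (void)chunkId; /* ... crunch chunk number chunkId ... */ }

int main() {
    const int numWorkers = 4;                    // assumption: four worker threads
    std::vector<std::thread> workers;
    for (int i = 0; i < numWorkers; ++i)
        workers.emplace_back(processChunk, i);   // the master says GO to each worker

    for (auto &t : workers)
        t.join();                                // wait here until that worker has ended

    // Only now does the master pick up the independently produced results.
    return 0;
}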

So how might you do it yourself via shared memory?

First recall that your thread creating all of the others, call it Thread M (for Master), had provided a set of parameters – via shared memory – to each of the created threads; call one of them Thread A. Again, both Thread M and Thread A have a common buffer, a common structure in memory by which they can communicate, using the same address for that common memory as a further bonus. At its simplest, all that Thread A needs to do upon its completion is to set a Boolean (True/False) in that structure saying to Thread M “I am done. Your data is ready.” And all that Thread M needs to do is periodically check this flag (actually these flags, one for each of the other threads as well). Once all the flags report completion, Thread M knows that the worker threads are done. There are still better ways which avoid the polling, which we’ll get to later, but you get the idea.
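As a hedged sketch of that flag protocol in C++ (the structure, names, and sleep interval are our own; we also use std::atomic<bool> for the flag, which quietly deals with the ordering surprise described in the next section):

#include <atomic>
#include <chrono>
#include <thread>

struct WorkItem {                     // the common structure shared by Thread M and Thread A
    int result = 0;                   // where the worker leaves its answer
    std::atomic<bool> done{false};    // the "I am done. Your data is ready." flag
};

void threadA(WorkItem &item) {        // the worker
    item.result = 42;                 // produce the results in shared memory first
    item.done.store(true);            // then tell Thread M "I am done"
}

int main() {                          // Thread M, the master
    WorkItem item;
    std::thread a(threadA, std::ref(item));
    while (!item.done.load())         // periodically check the flag
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    // item.result is now available to the master
    a.join();
    return item.result == 42 ? 0 : 1;
}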

But now we get to look at what the hardware knows about setting this Boolean.

Shared Memory Ordering

Simple enough, right? Well, there is a gotcha. Note that your program executing in Thread A had intended for the “I am done” flag to be set after it had produced its results in memory. That was your intent, but not necessarily reality. As odd as it may seem, when Thread M sees that “I am done” flag, the hardware cannot necessarily guarantee to Thread M that the also sharable result data produced by Thread A is actually in memory at that moment. (And for those of you staying ahead of me, no, I don’t mean the data is still in the processor’s cache and so not in physical DRAM memory. I really do mean that programmatically the master thread really can perceive that the Boolean flag is saying “I am done” before the result data is perceivable as having changed.) Not nice.

To explain, from the point of view of Thread A’s program code alone, if Thread A stores two different items to memory, say X followed by Y, Thread A itself is guaranteed that it can perceive later that X really was followed by Y; from Thread A’s point of view, there is no case where, if Thread A sees a changed Y, it did not also see a changed X. This perceived storage order is guaranteed within the bounds of a single thread. It happens, though, that this perceived storage ordering is not similarly guaranteed for any other thread executing on another processor; it might hold, it’s just not guaranteed. Said differently, Thread M might see the change to Y – the “I am done” Boolean – before it is able to see the shared data X stored by Thread A. Pretty weird, right? (See Wikipedia’s Memory Ordering for more.)

Fortunately, there is a straightforward tool to get this right. It is called a memory barrier, with more detail available here and even more here. A set of memory order operations for C++ can be found here. In short, though, this is a special call that you stick in your code which says, “Make absolutely sure that the variables referenced prior to this point are accessed before any variables referenced after this point.” To use it, Thread A executes this barrier between its store of the results and its store of the flag, and Thread M executes this barrier between its read of the flag and its read of the results. Using this, if Thread M sees the “I am done” flag, it also knows that Thread A had first stored its results.
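As a hedged illustration of that rule in C++ (entirely our own sketch; it pairs std::atomic_thread_fence with a relaxed atomic flag, one legal way to place the barriers just described):

#include <atomic>
#include <thread>

int X = 0;                          // the shared result data
std::atomic<bool> Y{false};         // the "I am done" flag

void threadA() {                    // the worker
    X = 42;                                                  // store the results first...
    std::atomic_thread_fence(std::memory_order_release);     // barrier between the two stores
    Y.store(true, std::memory_order_relaxed);                // ...then raise the flag
}

void threadM() {                    // the master
    while (!Y.load(std::memory_order_relaxed)) { }           // spin until the flag is seen
    std::atomic_thread_fence(std::memory_order_acquire);     // barrier between the two reads
    int result = X;                 // guaranteed to see the 42 stored before the flag
    (void)result;
}

int main() {
    std::thread a(threadA), m(threadM);
    a.join(); m.join();
}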

Yes, weird, but fortunately you are not likely to be working at this level. Most of the operations used for synchronizing access to shared memory, operations which we’ll touch on soon, handle this odd ordering notion for you.  Still, we bring this to your attention because in some cases – even in OpenMP – you do need to be aware of this effect. There is nothing harder to debug than such timing-related effects.

Volatile Shared Memory

An interesting variation on correctly accessing shared memory relates to compiler optimization. In the same example as above, where Thread M is periodically accessing the “I am done” flag Y, Thread M wants to be sure that its program really is reading from the storage location Y. It might not be. Consider the following C code snippet for example:

// Y is a flag in shared memory, set by another thread; NotDone is its "not yet" value.
void isItDone ()
{
    while (Y == NotDone) sleep(1);   // Keep looping here until Y says we are done.
}

(And, no, you don’t necessarily need to sleep while in this loop.)

Naturally, you assume that Y, in the shared data structure, is being repeatedly read, read from memory. Maybe. Maybe not. The compiler generating this code might note that Y is not changing in anything that it controls. As such, the compiler might decide that Y, once read into a register, can remain in a register – not memory – and be repeatedly tested there. The compiler does not know what you know: a whole different thread will be changing Y, the Y in memory. In effect, this source code needs to tell the compiler “Hey, every time you access Y, go get it from memory.”

That is done with a variable qualifier called volatile. In C and Java, at least, it is used like a type keyword, as in

volatile bool Y;   // C (in Java: volatile boolean Y;)

Nice descriptions of this can also be found here and here.

Volatile is all about suppressing compiler optimization, ensuring that you always go to memory when accessing such a variable. In the loop above, that would mean that each and every time through the loop, Y would be re-read from memory.

When actually changing a shared-memory variable, though, identifying it as an Atomic variable, as described in the next section, might be what you want.

Atomic Shared Memory

The basic picture most of us have is that if we – say – increment the contents of a variable, that is done as though by a single operation in memory. For example, if we say:

A = A + 1;

we think that that one line of code is done as one operation before we go on. Nope. Even if your processor executes just one instruction to do so, which is not at all guaranteed, the memory location A is read into the processor, incremented there, and then stored back to the memory location A. During this multi-step operation – one that could actually take a while – there is nothing to stop another thread, on another processor and sharing A, from doing this same thing in the meantime.

Where you had intended for A to be incremented twice, once on each processor executing this increment, the result of both attempts could just as well be a single increment by 1. Um, what? It is possible for both processors to read the same value of A (the state prior to any increment), both increment that same value, and so both store the same now-incremented value to A. Remember, this entire increment process takes time and so is not done as a single atomic operation. Interestingly, most of the time you really will find A incremented as you would expect, but clearly now, not always.
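A hedged demonstration of that lost update in C++ (our own sketch; the load, add, and store are deliberately written as separate steps to mirror the read/increment/store sequence described above, and the loop count is arbitrary):

#include <atomic>
#include <iostream>
#include <thread>

std::atomic<int> A{0};               // shared; atomic only so each individual step is well defined

void incrementManyTimes() {
    for (int i = 0; i < 100000; ++i) {
        int temp = A.load();         // read A into the processor...
        temp = temp + 1;             // ...increment it there...
        A.store(temp);               // ...and store it back: three steps, not one atomic operation
    }
}

int main() {
    std::thread t1(incrementManyTimes), t2(incrementManyTimes);
    t1.join(); t2.join();
    // You intended 200000, but increments get lost whenever both threads read the same value.
    std::cout << "A = " << A.load() << '\n';
}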

But you can guarantee that it be so, with some help from the processor architecture; all you need to do is to tell it what you really intend. To get that, though, you need to tell the compiler of your intent, and that comes from defining primitive variables as atomic. (We imagine that you could define all variables as atomic, but the result would be quite a drop in the speed of your program.)

We are talking here about individual, typically very primitive variables like integers or Booleans. More complex data types and structures (e.g., C++/Java objects) require another slower form of atomicity based on locks which we will be getting to shortly.

So suppose you want to tell the compiler, and from there the hardware, that you wanted to atomically add something to a variable. In C++ you might say something like the following:

std::atomic<int> counter{0};

counter.fetch_add(1);   // atomically "fetch and add" one to counter

More on this can be found here. You are telling the compiler that counter is an integer and that you intend for the hardware to atomically add one to it. A far more complete set of these data types and operations can be found here or here in a C++ Reference.
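Put together as a small, hedged example of our own (the thread and loop counts are arbitrary), the same counting exercise now comes out right every time:

#include <atomic>
#include <iostream>
#include <thread>

std::atomic<int> counter{0};

void worker() {
    for (int i = 0; i < 100000; ++i)
        counter.fetch_add(1);        // the whole read-modify-write is now one atomic operation
}

int main() {
    std::thread t1(worker), t2(worker);
    t1.join(); t2.join();
    std::cout << counter.load() << '\n';   // always 200000; no increments are lost
}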

Java plays the game a little bit differently. These references from Oracle, one in the tutorials and the other on the atomic classes, will do a better job than we can.

For those of you who are really into this type of thing, I like to point at the way that the Power architecture implements atomic operations. There, the compiler

  1. Generates a Load instruction which also marks the storage location as “reserved”.
  2. Generates code to work with this reserved variable.
  3. Generates a conditional store, one that only succeeds if the reserved variable had not been changed by another processor.
  4. Generates a conditional branch instruction which loops back to the top of this code and restarts the entire operation if the conditional store had failed.

It might take a few retries to succeed, but success ultimately occurs.
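You can see the same retry shape from portable C++, as in this hedged sketch of ours using compare_exchange_weak (which, on Power, compilers typically lower to exactly that load-reserved / store-conditional loop):

#include <atomic>

std::atomic<int> value{0};

void atomicAddOne() {
    int observed = value.load();     // 1. load the current value
    int desired;
    do {
        desired = observed + 1;      // 2. work with the value we loaded
        // 3./4. the exchange succeeds only if 'value' still holds 'observed';
        //       on failure, 'observed' is refreshed and we loop back and retry
    } while (!value.compare_exchange_weak(observed, desired));
}

int main() { atomicAddOne(); }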

Before we go on into a little more complexity, do you see what I’m getting at about being willing to look a bit under-the-hood of the higher level abstractions? Before you read this short section on atomic variables, you might have thought it was a dangerous thing to get involved with. Nope. It is nothing more than managing the concurrent memory accesses of multiple processors. And, increasing the level of complexity just a bit, that’s all that we are doing in the next section as well.

Managing Complex Shared-Memory Structures Between Threads

Thank goodness that sharing memory between threads is as easy as it is. For threads sharing the same process, not only is the data in the same physical memory, but the address used to access this shared data is the same amongst all of them. And it is one and the same data; no data copying is needed, as in applications on distributed-memory systems, unless, of course, you really want to. Easy peasy.

The trick, though, is to manage the asynchronous execution of the threads. You really have no idea when the threads will be executing. And that includes those moments in time during which two threads are accessing – say one reading, the other changing – that same complex data, the same object. It is during these moments that access needs to be controlled. If data is not being shared, though, or if the shared data is only being read, just let the threads execute when they do; you don’t need any more control. But when any two or more threads intend to change the same object, their access needs to be coordinated.

All that is required is a way for threads to hint, “Hey, other thread, if you want a consistent view of this bunch of data I am about to change, you might want to hold off for a moment.” From another point of view that hint might say, “Hey, other threads, I’d rather like this bunch of data that I am about to read to stay consistent while I’m reading it. If you intend to change it, please hold off for a moment until I am done reading it.” During the moment while your thread is accessing some object, you want to be sure that the consistent object you are accessing isn’t going to change out from under you. Once you are done with it, does it really matter if some other thread changes it?

It’s generically called a lock. Before accessing that shared data, a thread acquires – or attempts to acquire – the lock. Once done, the thread frees up the lock. You want unfettered access to an object, you lock it down; once done with it, you unlock it, you free up the lock. In those, hopefully, rare moments where a Thread A holds the lock and a Thread B attempts to acquire it, the OS arranges for Thread B to wait a moment until Thread A frees it; Thread B, having waited, is then provided the lock, giving Thread B the right to access the object. Beauty.

Using locking is sort of like everyone agreeing to abide by the same rules. If you have code that wants to access an object and this code (or some other) is potentially changing the object, you all agree that you first get a Lock prior to the access. Again, when done, you all agree that as soon as possible afterward the Lock should be freed up. Oh, and as to the storage ordering mentioned earlier, the lock code itself handles the storage ordering for you.

A simple view of a lock’s use can be seen in C# synchronization, as in the following example:

lock (myObject)
{
    // ... Do stuff using myObject ...
}   // <-- Leaving this scope frees the lock over myObject.

Any code surrounded in this way by a lock allows only one thread at a time to access myObject. Once a Thread A leaves this code (or any code similarly protecting myObject), freeing the lock, a Thread B waiting on the lock request at the top then gets the lock and can enter this code. Notice in C# that the freeing of the lock is implicit in simply leaving the scope defined by the braces. Notice that it doesn’t take much to express your intent.
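The same pattern, as a hedged C++ sketch of ours (a std::mutex plus a scope-bound std::lock_guard plays the role of C#’s lock statement; the object and mutex names are made up):

#include <mutex>

struct MyObject { int value = 0; };

MyObject myObject;
std::mutex myObjectLock;     // everyone agrees: acquire this lock before touching myObject

void updateSafely() {
    std::lock_guard<std::mutex> guard(myObjectLock);   // acquire the lock (waits if another thread holds it)
    myObject.value += 1;                               // ... do stuff using myObject ...
}                                                      // leaving the scope frees the lock

int main() { updateSafely(); }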

Java is similar, but it does have a few variations, one being the use of a lock object which represents – but is not actually – the data being protected; it is a handle, something standing in for the object being protected. One advantage of this approach is its support for the notion of a TryLock. With a TryLock, if the lock can be successfully acquired, TryLock gets the lock. If not, the Java method just returns a Boolean indicating that there would have been a lock conflict; some other thread has the lock, which – if a plain lock had been used – would have meant that your thread would need to wait. TryLock lets you first check for a conflict. A cool variation on this allows a TryLock to wait for a short while on a failed attempt to acquire the lock; in the event of a lock conflict, your thread will wait, but only for a short period of your choosing. If, during this short period of time, the lock has not been freed and passed to your thread, the TryLock then returns the failure indication, allowing your code to continue execution. To see code examples go here.
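A hedged C++ analogue of that timed TryLock, using std::timed_mutex (the ten-millisecond wait is an arbitrary choice of ours):

#include <chrono>
#include <mutex>

std::timed_mutex resourceLock;

bool tryToUseResource() {
    // Try to get the lock, but wait no longer than 10 milliseconds for it.
    if (resourceLock.try_lock_for(std::chrono::milliseconds(10))) {
        // ... access the protected data ...
        resourceLock.unlock();       // done; free the lock
        return true;
    }
    return false;                    // lock conflict: some other thread still holds it; carry on anyway
}

int main() { return tryToUseResource() ? 0 : 1; }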

So ends the basics of managing concurrent access to shared memory. In the next article in this series, we show some of the basics of putting these pieces together to create multi-threaded applications.

Related Story

The Essentials Of Multiprocessor Programming
