The Essentials Of Multiprocessor Programming

One near constant that you have been seeing in the pages of The Next Platform is that the downside of having a slowing rate at which the speed of new processors is increasing is offset by the upside of having a lot more processing elements in a device. While the performance of programs running on individual processors might not be speeding up like we might like, we instead get huge capacity increases via today’s systems by having the tens through many thousands of processors.

You have also seen in these pages that our usual vanilla view of a processor or a system is nowadays being challenged in a lot of strange and ingenious ways.

The Next Platform brings these to you almost religiously. While most – but by no means all – of these processors share a common concept in that they are driven by software and access data and instructions from some form of memory, it is also true that something in the system is assigning work for them to do.

Practically every programmer knows how to write a program – representing a single unit of work – to run on an individual processor. Many are familiar with how to write web-based and/or database applications that are inherently running on multiple systems. Still, it seems like an accepted given that breaking your needs up into tens to thousands of units of work to use the capacity available in today’s – and certainly tomorrow’s – systems is considerably more difficult.

It has been like searching for a holy grail to find ways to do that automatically, so results have been mixed. More recently, though, we have seen ways for programmers to provide parallelism hints via the likes of

Hadoop, to spread work over many distributed memory systems, each with sets of processors accessing their own memory.
OpenMP, to allow for the parallel execution of loops within a process associated with a shared-memory multiprocessor.

These abstractions and productivity enhancements are intended to hide much of the detail of having a programmer managing the parallelization of the work. These are all great, all worthy efforts. Who would be against better performance with not much extra effort?

Also true, though, is that occasionally, perhaps for reasons of better performance, you want to be in control of what constitutes the “work” and when and where it is supposed to execute. For that matter, don’t you really want to have a better idea of just what it is that these productivity-based abstractions are doing on your behalf. And for that matter, is knowing and then using what these tools are doing really all that tough? In the next few articles, we are going take a look under the hood, hopefully showing you that there is not all that much to fear in making use of these ever-increasing number of processors.

So What The Heck Is A Processor?

For most of you folks reading this, the answer to that is easy enough, right? It’s an Intel, Power, or ARM processor, which are general-purpose processors with relatively primitive instructions, some supporting virtual addressing, driving small units of data through simple functional hardware units. The series of these instructions together – a “program” if you like – accomplish some higher-level purpose.

For reasons of parallelism, and since the chip real estate exists, there might be multiple of such processors on a single chip. Each and all access a shared memory from which both instructions and data are fed and ultimately reside. It’s an SMP, short for symmetric multiprocessor. Standard stuff.

And since the instructions can execute far faster than they can be fed with instructions and data, we know that there are various levels of faster intermediate – largely software transparent – memory called cache between the processors and shared memory. And, even with as much capacity as these on-chip processors can drive, shared-memory multiprocessors get built from multiples of these chips, typically all keeping the cached data coherent with the expected program state. Making all of that happen is pretty impressive stuff actually.

Each such processor does its work fast, and it has been doing so for decades (we just keep finding more types of work to use its improved speed):

For a long time now, these processors are so fast that units of work have had to be interleaved even within individual available processor(s) while other units of work waited to get data out of slower I/O devices.
These processors are so fast that it appears to humans, which move at much slower speeds, that our work was making progress even while the work was really just taking turns, sharing the available processor(s).

Still, the notion that all computing can be done by rapidly executing strings of primitive instructions, as screamingly fast as it had become over the decades, is not making the progress at improving as in the past. As much as can be done on today’s systems, even that seems to be not enough. We talk now of the need for accelerators to make the next jumps in performance. Even that, though, is not really all that new. For example, units of work driving encryption/decryption were early on offloaded to crypto coprocessors residing in IO space, outside of the traditional processor.

The rate of development of all sorts of strange and wonderful accelerators and associated architectures has really increased of late. This is certainly not a complete list, nor even approaching one, but these may have been built based on:

Coprocessors sharing the same chip as the processors
Graphics Processing Units (GPUs), previously encompassing systems in their own right
Field Programmable Gate Arrays (FPGAs), approaching actual circuitry driving specialized high-level function
Application Specific Integrated Circuit (ASIC), actual circuitry specialized for a complex operation

The latter two are means of implementing operations in hardware more complex than the relatively primitive operations driven by the instructions supported by the traditional processors.

We are not going to get in deep as to the organization and purpose of these accelerators, maybe that being for another day, but that is not really the point of this article. What is the point is that work can be asynchronously offloaded to any of these, hopefully resulting in faster execution than is possible by offloading to another of an SMP’s processors. In either case, the results are processed some short time after completion of the offloaded function. Although they might not be processors as you have come to think of them for, well, like forever, we still define work for them, perhaps – in doing so – invoking some strange and wonderful form of a program to execute there. The point is that we don’t care here how the work is accomplished, perhaps by millions, perhaps by one instruction, but either way we need to tell these executable units that we have some work for them to do.

Further, and although not strictly an accelerator, for the last decade or so individual processor cores, previously executing one instruction stream at a time, now are capable of concurrently executing the instruction streams of multiple units of work. IBM calls it simultaneous multi-threading (SMT) and Intel calls it Hyper-Threading (HT). Their purpose is to squeeze considerably more compute capacity out of the processor cores both by allowing other instruction streams to execute during delays such as cache misses and also by allowing fuller use of the massive fine-grained parallelism available in a modern processor core. From the point of view of individual units of work, those single cores look and act like a multiplier of the number of processors.

It is worth additionally pointing out here that not only can we have access to a lot of processor-based compute parallelism, but each of these processors has become increasingly cheaper. Maybe it is because we know that there is a price for each, and so it would seem that we need to make good use of each and every one of them, it is also increasingly becoming true that we can leave some of them essentially unused. But, if unused, perhaps instead we should have some processors doing work which most of the time we simply throw away. Instead of grinding our way through to just the right answer, it increasingly becomes reasonable to generate all possible answers in parallel, only later choosing the answer which is correct or seems to be the most probable.

Of Multi-Threading And Tasks

All of those, whether vanilla cores in a symmetric multiprocessor or accelerators of all sorts and stripes, that’s just the hardware. It just sits there consuming power until we ask parts of it to do some work. And the work, generally speaking, executes when it executes; there is no direct relationship between any of the concurrently executing work assigned to these processors. The work is largely just sets of independently executing instruction streams. Every one of them started executing on some processor at some one instruction, with each processor having to be loaded with an initial set of data parameters. From there, individual processing of the work just continues until complete or until some external resource is needed.

That’s the essence of a task. When you have some work to do, you point at where the processing is to begin, pass in a few parameters, and say GO. The work of the task, as defined by your program just takes off doing its processing from there.

On an SMP, you typically do not even say which of the processors will execute your task. You provide the task to the operating system and that system software takes it from there, making the decision of when and where your task will execute. You can concurrently provide hundreds or thousands of tasks and the task dispatcher in the operating system still decides when and where each of these tasks will be executing alongside other tasks. This is true no matter the number of processors.

Getting a bit more detailed, the operating system manages all of this by having any given task be in one of the following states:

Executing: The task’s instruction stream is at this moment executing on a processor.
Dispatchable: The task is in all ways available to be executing on a processor, but – typically – no processor is at that moment available to execute that task’s instruction stream. (It soon will be, if for no other reason than the OS task dispatcher wants to provide the illusion that all dispatchable tasks are making progress.)
Waiting: The task is waiting for the availability of some resource and won’t enter a dispatchable state until it becomes available. Examples, among scads of such, include waiting for data

To be read from disk.
To be received over a communications link
To be made available by another task.

When such data does become available, the OS transitions the waiting task to be dispatchable. Conversely, when the executing task finds it needs some unavailable data, the OS transitions the task to a waiting state.

You can see each of these states in this animation using MIT’s Scratch. Tasks on the wait queues are Waiting, in the task dispatch object, Dispatchable, and on a core, Executing.

The basic point is that all that is really needed is to provide work to the system is to create tasks and making them dispatchable. From that the OS will make the decision, among all of the currently dispatchable tasks, just when and where those tasks will be dispatched. Keep in mind, though, that the maximum parallelism possible – the maximum number of concurrently executing tasks – is the same as the number of processors. Realize also that, because of the SMT effect mentioned earlier, the number of “processors” might be a few multiples more than the number of cores.

Now go back and replay that animation. How much of the available processor capacity – the CPU utilization – is being consumed by these tasks? If a task is not on a processor, an available processor is available capacity. Why are they available? Rather than spending all their time on the processors, these tasks are waiting for something, reads from I/O-based storage like disk drives for example. If that is what your application will be doing, more capacity is available for multi-task parallelism, so still more Tasks – well over that of the processor count – could be created to use that capacity.

Of Creating Threads

Catch the name change there? For most programmers, the notion of a task is really that of a thread, a thread – potentially one a very many – within a process. When applications are first started, they appear first as a process with a single thread, with the thread representing all that a task represents. Such initial threads start their execution at the beginning of the program and things progress from there, including, perhaps, the creation of more threads. These additional threads are scoped to the same process and so are capable of easily sharing the same data; they are, after all, sharing the same address space represented by the process.

Again, with all due respect to OpenMP – which we will get to in a later article in this series – the creation of a new thread on your own is relatively straightforward. All a new thread really needs is the place to start and some initial data.

Let’s start with some examples from C++, C# and Java.

With C++, you can create and start a thread using PTHREAD_CREATE. At its most simple, you just provide it with the addresses of:

The code where the new thread starts execution.
A set of data parameters.
A pthread_t structure used for subsequent control of that new thread.

I had intended to provide an example here, but a far better one than what I had in mind can be found here at TutorialsPoint. The key concepts to catch are that once the PTHREAD_CREATE executes, both the new thread and the pre-existing one executing this command concurrently execute. The new thread starts where it is told and the current thread just continues it processing after the PTHREAD_CREATE.

The parameter data structure created for the new thread is actually accessible to both threads, largely because both threads belong to the same process. With appropriate control over shared memory, which we’ll get into in the next article, either thread can change and read the contents of this structure, using that for communicating if they desire.

C# builds off of its object-based orientation as in the following:

Define the starting point – the starting method – of the new thread via construction of a ThreadStart
Construct a Thread object, using the ThreadStart object as a parameter.
Using that Thread object, execute its Start method.

Recalling that most C# methods are associated with an object – one addressed with the THIS pointer – the object associated with the Start method can hold the data parameters common to bother threads. Both the original thread and the new thread can access this same object and are both capable of understanding the contents of – and of executing the methods associated with – such an object. Again, this object is available for communication between the threads. Since both threads could be concurrently executing, they might also be concurrently executing the same methods – the same code – and accessing the same objects. Pretty cool, but also quite straightforward once you get the basics out of the way. More can be found on Microsoft’s Developer Network and TutorialsPoint.

Java, also being object-based, has two ways of creating and starting new threads. In one, you are doing the following:

Construct an object whose class is inherited from the Thread class; there you will execute the code at its object’s constructor under the scope of the original thread.
Execute that object’s start() method, again within the scope of the original thread.
Executing start() results in the new thread beginning its execution at the beginning of the object’s run() method, here under the scope of the new

Once again, the original thread invokes the object’s start() method, which automatically results in the JVM starting the new thread at the run() method. (Odd, perhaps, but that’s the architecture and it works fine once you get your head around it.)

The other means of starting a new Java thread is similar:

Construct an object whose class is inherited from the Runnable This constructor executes under the scope of the original thread.
Execute that object’s start() method, again within the scope of the original thread.
The newly created thread begins execution at the object’s run()

Either way, the constructed object is known to both of the threads and can act to hold shared data. Please see TutorialsPoint or Oracle Java Documentation for more.

Thread creation is easy, right?

If you have glanced at the documentation at some of the provided links, you may have seen some examples with thread creation within a loop. With each loop iteration, a thread was created. So, suppose you have a list, maybe an array, and you want separate portions of it concurrently processed by separate threads. Making this happen is simple. All that is needed to get this started is to tell each thread where in the list it should begin, including a count of the number of items each should process. Then just tell the thread(s) to start.

Done and – almost – done. At least you have the list processing started. The actual difficult part is knowing for certain when the list processing was totally complete. You can easily arrange for each thread to kill itself off when it completes its part, but how do you know when any one or all of them really have completed, allowing the initial thread to continue?

One way, but we’ll get into more interesting ones shortly in subsequent articles, is through the notion of a JOIN. The purpose of a threading Join is to have the thread executing the Join wait until a particular other thread has ended. You can find here documentation of this for C#, Java, C/C++ . In all cases, the join function requires some notion of a handle to the thread being monitored; this is easily available to the thread which did the thread creation. In short, the creating thread can ask to wait on the completion of any particular thread it had created.

Others means of knowing the same thing are available via shared memory, a subject which we cover in the next article. As a bit of a preview, know again that every thread of a process shares the same address space; one can pass a given address to any other in the process and it means exactly the same thing. Additionally, all of these threads can use any of the processors of an SMP, wherein all of the processors can access the same physical memory; their shared data resides in this common memory. But if all or any of these threads are executing at the same time and at the same moment accessing and changing the same data, what keeps them from damaging their common data?

Continue to Part 2 in the series, “The Essentials of Multiprocessor Programming” –>>

OnesHistoryIsLacking says:

January 5, 2017 at 7:11 pm

“IBM calls it simultaneous multi-threading”. No the entire academic community calls it simultaneous multi-threading(SMT) and Intel’s marketing department is the one that came up with the marketing driven branding/term Hyper-Threading(TM), HT for short, for Intel’s version of SMT! Let’s also dispense with the Flash driven graphics and take some time to consider what the Register’s users mostly think of having that Flash attack vector installed on their systems in order do view your graphics in this article! It will never be on my systems thank you very much!

You are very much like that NPR CUB reporter that attributed the Phrase Military Industrial complex to “What Sen. Sen. John McCain calls the Military Industrial complex”, when in fact the phrase Military Industrial complex was coined by Dwight D Eisenhower!

Intel did not invent SMT, if you care to research SMT’s beginnings!

- Mark Funk says:
  
  January 10, 2017 at 11:21 am
  
  My goodness, such emotion. I certainly didn’t think I was entering the political realm while writing this educational article. Might I ask that you provide the readers a deeper history on SMT?
  
  Mark
  
MD says:

January 9, 2017 at 12:24 pm

Hi!

A note on C++ — you can simply use the standard library for thread support (there’s no need for the POSIX Threads library — which also have more of a C interface, and thus may not play well with some C++ constructs out of the box):
http://en.cppreference.com/w/cpp/thread

Basic multithreading (threads, mutexes, atomics, condition variables, etc.) has already been available for a considerable time (more than half a decade ago), so the compiler support is farily widespread and stable; notably, this includes standard support for task-based parallelism using “std::async”: http://en.cppreference.com/w/cpp/thread/async

(Strictly speaking, one could even use the same interface–like the one adopted in the standard–more than a decade ago with Boost.Thread: The version from 2002 was already very close to the current one in the standard. However, that’s mostly of historical interest now since, again, the compiler support hasn’t been a problem for a while now.)

The aforementioned higher-level task-based programming (supported in standard C++ in form of “std::async”, with upcoming extensions in the works, including futures with continuations — cf. Concurrency TS, mentioned later) is often simpler than manual synchronization using locks and mutexes, which are nowadays not always the best choice in C++. Perhaps it’s interesting and useful enough to incorporate in an upcoming article? Another consideration: Atomics offer support for lock-free programming and data structures — again, a choice worth considering, allowing to potentially avoid synchronization overheads associated with lock-based data structures (although it’s worth noting that it’s never black and white and sometimes lock-free data structures can be slower — this choice really has to be carefully evaluated each time on the basis of one’s specific needs and requirements, there are no royal roads here).

Here’s where one can learn more:
https://www.manning.com/books/c-plus-plus-concurrency-in-action
http://www.boost.org/doc/libs/release/doc/html/thread.html
https://theboostcpplibraries.com/boost.thread

I think it’s worth noting that for making the best use of modern hardware (especially in the HPC environment), it’s a very good idea to use very recent versions of the compiler toolchains (so that it’s possible to tune and optimize for the particular microarchitecture current CPUs support; with compiler support and optimizations usually lagging CPU release schedules you have to be pretty aggressive about continuous, rolling updates of your compilation toolchain — doing otherwise is just leaving performance on the table, and also likely less stable — with an extra risk of encountering older compiler bugs).

Given the above, the upcoming extensions in Parallelism TS and Concurrency TS are worth considering:
http://en.cppreference.com/w/cpp/experimental/parallelism
http://en.cppreference.com/w/cpp/experimental/concurrency

The former (i.e., parallel algorithms) have been already merged into the C++17 standard: http://en.cppreference.com/w/cpp/algorithm
Some compilers already ship Parallelism TS and Concurrency TS — it’s also worth noting that the important features of the latter (like features with continuations in Concurrency TS) are already available in Boost.Thread (which is permissively licensed and will work with even old compilers).

- Mark Funk says:
  
  January 10, 2017 at 11:12 am
  
  MD – Thank you very much for the additional information. Mark

The Essentials Of Multiprocessor Programming

So What The Heck Is A Processor?

Of Multi-Threading And Tasks

Of Creating Threads

Continue to Part 2 in the series, “The Essentials of Multiprocessor Programming” –>>

Sign up to our Newsletter

4 Comments

Leave a Reply Cancel reply

So What The Heck Is A Processor?

Of Multi-Threading And Tasks

Of Creating Threads

Continue to Part 2 in the series, “The Essentials of Multiprocessor Programming” –>>

Sign up to our Newsletter

Related Articles

Can SYCL Slice into Broader Supercomputing?

Programming In The Parallel Universe

Burying The OpenMP Versus OpenACC Hatchet

4 Comments

Leave a Reply Cancel reply