Multi-core systems are quickly becoming the most common environment for all types of programming, from embedded systems to the most complex servers, which makes learning multi-core development practices extremely important. The days of automatic gains from processor evolution are over; we now have to put in extra work to deliver a faster, more responsive product to the user. We must design our systems around our target environment by breaking our programs down into multiple threads or tasks that all work towards the goal of delivering a useful, responsive service to our end user. I will show examples of how to do this in C++ as well as examples of programs that assist us with multi-core development.
The evolution towards development platforms and environments with more and more cores did not come lightly, and it did not come without cost. In the beginning of software development, programs ran on single-core, single-execution-thread systems that ran one program sequentially until it completed its task, at which point the computer would either wait for a new program to be loaded or be shut down. In modern times we all have computers, we all run dozens of programs simultaneously, and we often do not shut down our computers for days (or even months!). This progression towards doing a large amount of work in a short amount of time has led to a greater demand for performance from every part of our computers, and most specifically from their heart and soul: the processor.
This enormous demand was initially satiated by increasing processor frequencies; however, physical limitations have since been encountered. There is a limit to the amount of heat the materials can take, even with the development of better cooling methods (and different materials). This initially led to placing two separate processors side by side in a single machine so it could get more work done at once. Eventually engineers were able to move the two processors onto a single chip, creating the first true dual-core processor. Since then we have been moving towards more and more physical (and virtual) cores on a single chip.
Unfortunately, this movement towards multiple cores in a single machine has brought many unforeseen problems. When gains came from increased frequency, no extra work was needed to get a faster, more responsive program, and the target environment changed very little; a bump of 0.2 GHz does not demand much extra planning (it may upset timing, depending on how things are handled). Adding a whole extra core, however, means that other programs can come in and completely change the environment the program is working in.
Multi-Threaded Vs Multi-Core
The two practices of multi-threaded development and multi-core development are often thought of as the same thing, but they can lead to very different results. Multi-threaded applications are programs written with no knowledge of the target environment. The program is broken down into multiple threads in the hope of making it run faster, but if the program ends up running on a single-core system, this approach can lead to significant slow-down in the form of processor waste. This waste can result from scheduling the different threads, the time it takes to swap threads in and out, and waiting on locked resources or on other threads to finish their work. And if the majority of end users have single-core systems, time and money were wasted planning for scenarios that do not matter to most customers.
Multi-core development, on the other hand, means that you know (or require) that your release environment has two or more cores. This offers a theoretical maximum performance boost of base performance times the number of cores; that maximum is rarely reached, again because of swap time and locked resources. Multi-core development also means you have to make your program scale from 2 cores to 32 (and beyond). That does not mean just adding a large number of threads, but making sure your program keeps seeing gains as more cores are added while staying efficient at low core counts. This can be done by looking at the number of cores on the platform at runtime and adjusting the thread count to optimize the amount of work being done, by designing the program around a handful of threads, or by maintaining the program to keep up with evolving processor trends.
Atomic statement: A single statement that the processor executes as one indivisible operation; the thread cannot be swapped out partway through its execution.
Processor Waste: Time the processor spends sitting idle or not performing operations that will produce something valuable.
Thread: A thread is a single path of commands executed in sequence. Note: this can mean the physical construct or the concept.
Hyper-threading: Intel's focus for their new generation of CPUs. Hyper-threading is the concept where one physical core hosts two virtual cores ("virtual" may be misleading, as there are two threads loaded in the processor but only one execution unit). It clones the state and interrupt hardware logic to reduce CPU waste and allow more efficient switching between processes on a single core and between cores. It also allows sharing of resources between the two virtual threads and lets two threads seemingly execute at the same time (with one thread taking over the wasted time of another to get work done). This is an example of hardware helping software; Intel claims a thirty percent performance improvement over normal multi-core platforms, but that is mostly application dependent.
Synchronous/Asynchronous programming: Synchronous programming is the act of programming threads that will meet up to share work or other information. This can induce CPU waste as idle time if one thread waits for another, or resource waste if a mailbox is implemented to hold the shared data. Asynchronous programming means the threads work autonomously until they are done with their work or the program ends.
Parallel: If two (or more) threads are executed at the same time, they are in parallel. This requires more than one core (virtual or physical) on the execution platform.
Concurrent: If the order in which two (or more) threads are executed does not matter, they are concurrent. Note that this does not require more than one core on the execution platform.
Critical Path: The longest chain of interdependent calculations; it is the thread that will take the most time to execute.
Nondeterminism: A program is nondeterministic when it has the same initial state and input but exhibits different behavior from one run to another. In multi-core programming this is usually due to other threads or programs changing resources at runtime, and it is something programmers should seek to limit at all costs.
Types of threading: There are three main types of threading: kernel-level, user-level, and a hybrid of the two.
Kernel-level threading is considered the most basic of the three and allows direct mapping to the hardware (which increases scalability across multiple cores); however, it does not allow a great amount of creator control, since most actions are handled by the kernel. The kernel sees a process and a thread as separate types, and since the kernel manages threads, the programmer does not have to worry about time-slicing or certain types of blocking.
User-level threading means that the kernel knows about the process but not about the threads within it. The application manages all of its own threads, and the kernel cannot separate those threads onto different cores. The advantage of user-level threading is that switching between those threads is often more efficient, as not all data needs to be swapped out and no information needs to be passed down to the kernel, so this approach is more efficient on single-core systems. The threads also share only the space reserved for the process, instead of each thread reserving its own.
Hybrid threading is a combination of both, meaning developers can decide whether they want to manage the threads themselves or hand control to the kernel. This gives more choice, but can lead to scheduling issues, as it is hard for the kernel to know how much time to schedule to each process and thread. Those conflicts can be fixed, but it often requires expensive coordination between the user processes, the threads, and the kernel.
POSIX (Portable Operating System Interface for Unix): Sets a standard for shells and interfaces. Focus: POSIX Threads (pThreads), the IEEE standard for implementing threads. Most threading operating systems start with this as a base and change or add features as needed.
The evolution towards multi-core platforms is very much hardware-driven. The hardware could not keep up with our increasing demands for performance and speed, so the hardware engineers threw two processors at us and we have not gone back since. The reason they could not keep up was that they had hit the limit of how much heat the processor components could take. The formula for dynamic power is P = C × V² × F, where P is the power consumed, C is the capacitance being switched, V is the supply voltage, and F is the frequency of the processor (it is a little misleading since all the numbers are tied together, but it does show the looming problem). Because halving the frequency also allows the voltage to be lowered, two cores running at half the frequency can do the same amount of work while consuming roughly a fourth of the power, and thus generating less heat.
Another reason for the transition towards multi-core systems is that computers are designed to replace us. Since the beginning, they have been used to compute and keep track of complex algorithms and numbers that humans would have a difficult time computing. They need to evolve to match our thinking power, and since our minds are not single-minded and sequential, computers must also be able to process many things at once.
Not all multi-core processors are equal, however; AMD, Intel, and ARM all have different outlooks on how the processor should evolve. AMD and Intel focus on desktop, server, and laptop processors, while ARM focuses almost exclusively on embedded and mobile processors. Since ARM processors are mainly used on mobile platforms, ARM has concentrated on power-consumption reduction and performance-per-watt. Their processors are relatively simple, but because their performance-per-watt is good, they have managed to dominate the mobile and embedded market (which is very lucrative at this time). Only recently have mobile phones become sophisticated enough to need multi-core systems, so ARM has been slow to develop them (and even slower on 64-bit processors, the first of which is being developed now). Due to the relative recency of their multi-core processors and the intrinsic simplicity of their processor design, ARM has not done anything special with them. Intel, on the other hand, has changed the basic structure of its processors rather dramatically with the hyper-threading technology in its latest multi-core parts. This drastic transition has come with some growing pains, but as OS developers have come to understand how best to utilize hyper-threading, Intel processors have come out on top. As a result, a person looking for a top performance-oriented workstation or server usually chooses Intel (on a list of the best-performing processors, Intel takes the top 19 spots, http://www.cpubenchmark.net/high_end_cpus.html, and the current reported market share for Intel processors is 72%, http://www.cpubenchmark.net/market_share.html). AMD has tried to make a cross between the two approaches and has recently come out with a high-performance, low-power solution.
They have kept a lot of the simplicity of the original processor designs while adding some of the shared resources and integrated graphics that Intel has in its processors. This has set them up to take over the very high-end mobile platforms and mid-range desktops. It has also been thought that since AMD stuck with adding as many cores to each processor as it could, it will be accepted by server operators; but as the newest architecture is only several weeks old, that has yet to be seen (it has also been said that Windows 8 improves thread management and compatibility with AMD, so a performance gain will be seen there as well). Another notable mention is the IBM/Sony Cell processor used in the PS3. This is a heterogeneous processor: it has a single PowerPC Processing Element (PPE) and seven Synergistic Processor Elements (SPEs). The PPE is a dual-core heavy-duty processor that acts as the main processor; when basic multi-threaded code is written, it is assigned exclusively to this part of the chip. The SPEs are simple Single Instruction Multiple Data "cores" (some do not even consider them cores, they are so simple), which means they all perform the same operation on different data (or parts of data). The SPEs have to be explicitly coded for, as they behave very differently from a normal processor and access data differently than the PPE. This makes the Cell processor very difficult to code for and has confined it mostly to large scientific applications and game development.
Differences in Operating Systems
Luckily, most of the work of managing threads, and the specifics of how threads are treated, is abstracted away by the operating system to the point where the differences are almost negligible to the end programmer. Some differences do exist, however, and a little basic understanding of the target environment can help in designing around, and tracking down, defects that occur.
Microsoft originally went with a kernel-level implementation of threading in Win32, which is used in Windows 95, NT, and XP (interesting side note: the original "Hello world" program in WinAPI took 150 lines). With Windows 7 (and 8), Microsoft has moved towards a hybrid approach. Windows has its own counterpart to pThreads called (unsurprisingly) Windows Threads. Windows threads are designed to be more programmer friendly and attempt to remove what Microsoft believes are excess methods and attributes that essentially do the same things; they also allow more interaction and behavior. pThreads are considered (by some) to be low-level constructs, while Windows threads are considered high-level.
OS X is considered fully POSIX compliant in its implementation of threading, which is at the kernel level. After reading a large amount of feedback from OS X developers, I have noticed that most come to the conclusion that threading in that environment is very restrictive and that the OS does not like it when the developer does not play by the rules (which is almost the Mac mantra); however, those rules (or which ones are being offended) are not always apparent, which can lead to frustration for a developer who wants to control his or her own program without the OS intruding. Also of note is that OS X reserves separate stack sizes for what it considers "main threads" and "secondary threads": for main threads it reserves 8 MB, eight times what Windows reserves (but only 1 MB on iOS), while for secondary threads it reserves one sixteenth of that, 512 KB. As with Windows, though, these stacks are growable and can be configured to a different value. Note that iOS (Apple's mobile platform) is based largely on OS X, so threading is mostly the same on both platforms.
Since most uses of the Linux platform revolve around servers and hugely multi-core environments (even in the early days), Linux (or more specifically the different flavors of Linux) has been very supportive of the multi-threaded environment. Support for specific functionality varies with the flavor of Linux chosen, but all are based around the pThreads API and almost all are fully POSIX compliant. Note that the Android platform is built on the Linux kernel (with much of the platform released under the Apache license); app coding is done in Java and hardware-level coding in C.
Fortran (surprisingly) has support for threading and specifically supports pThreads as well as OpenMP. The latest update (called Fortran 2008, but released in September 2010) added several new parallel-execution data types and improved on existing multi-threading support. However, Fortran is slow to add new functionality, and new updates mostly exist for maintenance purposes. It is interesting to note that Intel says it borrowed its idea and implementation for concurrent arrays from Fortran, which in turn is being borrowed by Microsoft to add to C++. (Interesting side note: during a 1976 standards committee meeting, a proposal was brought forth to eliminate the character 'O' due to confusion between it and the number 0, as well as to stop developers from using "GO TO" statements.)
Several implementations of Ruby have built-in threading; however, some still implement what are called "green threads," which are scheduled by a virtual machine at the user level and are thus tied to the single core the main program is running on. The "main" implementation of Ruby, called MRI, has built-in threading and is used in Ruby on Rails projects; however, Ruby is often criticized for issues with scalability (Twitter had several outages attributed to this) and performance, which is the opposite of what most people doing multi-threading want.
Since Java strives to be platform independent, it executes on a Java virtual machine (JVM), which translates the code into something the hardware can understand and execute. Because of this, Java has its own implementation of threading that works separately from the OS. Since Java wants to remain platform independent, it has to do a balancing act between what it handles for threading itself and what it passes along to the OS, which leads to more overhead than is seen in other languages and can mean a performance hit compared to C++ (though it is not as prevalent as it once was). Java does have some very nice atomic classes and is constantly being optimized for each OS, but, as with Ruby, the performance issues are counterproductive (though the hit is not as bad as in Ruby).
A lot of the threading in C/C++ is left to APIs included and used on a specific operating system, but there are third-party APIs that bridge that gap and allow cross-platform execution (like Intel's Threading Building Blocks and the pThreads-w32 project). Because of this, C++ is often optimized for the platform it is running on and usually shows the best performance (along with easy readability and writability) of these programming languages.
Behind every good multi-core application, there is a good design. Designing is the most important part of building a multi-core application that will last and deliver good performance, so I have broken down the design aspect into the usual software development process.
Remember, a "code-and-fix" laissez-faire mentality WILL NOT WORK, so a great deal of time has to be spent in this phase making sure that all contingencies are planned out. The reason it will not work is that there are too many things that can go wrong, and the behavior may be erratic, which makes pinpointing the problem very difficult. The planning phase is the single most important step, as problems here cascade into other phases and quickly snowball into much, much worse problems that take longer to fix. A clear vision of what service the program is going to provide is also needed, because the behavior must be well defined before it can be decomposed into the type of threading the program will execute. Another thing that needs to be decided early on is how deep into threading and thread management the code should go. Some developers liken direct thread creation and management to writing assembly and say that it goes against the practice of abstraction; however, allowing the language or OS to take care of threading for you puts you at the mercy of updates to the OS and language, which may lead to unintended behavior or even break your program (an example of this is Microsoft updating .NET to be more aggressive with threading, which has left a few of Dr. Clifton's computer graphics examples no longer functioning).
The opportunity to break a program into threads comes during the decomposition phase. The two main ways to decompose a program's behavior into threads are by tasks (the actions being performed) or by the "persons" performing those tasks (which can also be thought of as where a task is being performed). If modeled by "persons," less sharing of data is needed, as each person is a separate entity and handles its own data; it is also how most problems are naturally thought out (we are performing these actions). However, this approach only scales to as many people as the program has, which is usually fewer than the number of tasks needing to be completed. Tasks are often less heavy-duty and easier to design, but they can be harder to track when many of them are going at once.
Modeling is also a very important part of this phase. Since programs are usually built on real-life concepts, the characteristics that are naturally separate have to be defined and explored to effectively make a multi-threaded application. Modeling the thread states and thread interactions is also very important: changes in one thread can drastically change another, and that behavior needs to be mapped so that when a defect comes along, the root cause can be easily identified. That interaction includes variable manipulation and access, so dataflow diagrams are important as well. You also want to make sure thread interaction is kept to a minimum, as it produces overhead in terms of resources (mailboxes) or CPU waste (synchronization). Above all, remember: it is a programmer's job to find a need, design a model, and apply the right solution.
Since there are a lot of C++ libraries and APIs that help ease the workload of multi-core programming, C++ often comes highly recommended by those who want performance and readability in an object-oriented language. Also, since most programming is done on a Windows platform, going with a Microsoft product (Visual Studio .NET) is often preferred, as Microsoft knows its own OS better than anyone and the IDE has a lot of nice built-in features that help you see what your program is doing. Accordingly, support usually goes to Windows first, then the different flavors of Linux, then Mac. A programmer also has to keep a lookout for CPU-specific commands that can improve performance but are obviously not very portable (except within the same processor family).
During the implementation phase, one must ensure that resources are being managed correctly and that the design documents are updated as features are added, because those documents are most valuable when a defect is encountered and the cause needs to be pinpointed. Using existing libraries is often desirable, as it eliminates extra work, and they are often much more efficient than hand-rolled code. A close lookout needs to be kept for evolving trends: if two threads need to communicate a lot, maybe things can be merged or swapped to eliminate slow-down and overhead. Tuning can also be important as threads become bloated resource hogs or take a long time to accomplish their duties. Remember, responsiveness is key in this phase, and it is helpful to think about the simplest case and then expand into the more complex. Always keep the future in mind: more cores will always be added, and more threading features will always be added.
Testing can be very difficult depending on the tools chosen in the planning phase. There are many tools that can help test multi-core applications and visualize what is going on (and going wrong) in your program. Adding features and testing incrementally is the best way to make sure that threads are interacting as expected, but if something cannot be pinpointed, break the program down into the simplest interacting parts and see what goes on. Race conditions are usually the most prevalent defects and result from two threads racing to get at the same resource. Tweaking for performance is important in this phase. Critical paths also need to be identified and, hopefully, split apart, but you must remember that the bearing of a child takes nine months no matter how many women are assigned: some paths simply cannot be split down any further. Nondeterminism can cause a great deal of headache in this phase, but if enough planning was done, it hopefully will not be much of a problem.
Deployment in a multi-core environment is mostly the same as with other environments, but you want to pay specific attention to what platforms the program is being released on and how the program can be tuned to better fit the consumer's needs.
Maintenance can also be very difficult without very specific design documents, as new features must be added in a way that does not affect the performance of the whole too much or intrude on other threads' resources. The easiest way to do this is to add new features to existing threads; however, that does not always make sense and can lead to a lot of resource sharing and CPU waste. Also remember that much more testing is needed to make sure new features do not infringe on old ones.
What about adding threading support to existing systems? This can be very difficult depending on how decomposed the current program is. The main focus when adding threading should be the largest time wasters: user IO, disk access, and complex algorithms.
Intel's Threading Building Blocks (TBB) is an API that adds a lot of features to C++ while trying to be like the Standard Template Library (in that it promotes ease of use and generality, though it tries to be more aggressive in terms of timing and resource management). It adds algorithms, containers, mutexes, atomic statements, timing, and scheduling to the existing framework while staying OS independent (it works on Windows, OS X, and Linux). It also implements "task stealing," which allows another thread to take over a core if the current thread on that core is deemed idle (much like the hyper-threading built into Intel's current processors). It automatically creates threads for the user and is fashioned much like the pipeline in a graphics application, in that it allows work to be done independently and then assembled at the end. TBB is a bit more memory and cache oriented than the current STL and tries to really take advantage of Intel's own processor architecture (or to implement something similar on others'). Intel also adds atomic operations (much like Java's) and a fair number of concurrent data types for easy scalability.
OpenMP is an API that assists with shared-memory multi-processing in C, C++, and Fortran. Its focus is a main thread that dishes out work as the master; the other threads join back in when they are done. Since the master thread controls the others, it can be used to manage their specifics, allowing more control for the programmer than APIs that take care of everything themselves (but you do not need to take that control if you do not want to).
Microsoft Visual Studio
Microsoft is constantly updating Visual Studio with new ways to view and manage a program. It has a view (like the output window) that lets the user watch thread activity, thread interaction, and CPU usage. Most of this functionality is only available on Windows 7, however, as it uses specific call-stack records to keep track of where execution is in the code. It also breaks down the threads and visually shows how much of each thread's time is spent on different things like execution, synchronization, IO, and sleeping. It allows for in-depth thread debugging, letting you see where all the threads are and pause and run specific threads at a time. It can also break the thread interactions down into a nice viewable diagram so you can see if any interaction was missed in your modeling.
MULTI - Green Hills
MULTI by Green Hills is an IDE built specifically around debugging multi-threaded applications. It shows how threads stack, interact, and call methods. It also keeps a record of execution history so that when an error occurs (and you happen to know when), you can see what thread was accessing what data and step back in time (a "time machine"). It gives a very nice visual representation of what is going on on the machine and in the program.
Total View - Rogue Wave
Total View by Rogue Wave is a "GUI based source code defect analysis tool," which means it is similar to Visual Studio and MULTI but mainly focused on debugging and resource management. It allows the user to monitor threads and shows memory allocation and where bottlenecks are. It breaks the threads down into viewable chunks and identifies parts of the code that can be optimized, or sharing between threads that is not needed. It also takes features from MULTI, like its thread recording and tracking, but is more specific and explicitly points out resource mismanagement (it is also expensive).
While multi-core development is still fairly new and many languages are still struggling to keep up with evolving trends in hardware, multi-core is the future and with enough planning and technical know-how, a fast, responsive program can be made that will scale and continue to please users well into the future. It may take more work now to keep up with evolving trends, but that hard work pays off; especially when designing a new system to compete with an existing one that scales poorly across multiple cores.