The last chapter introduced synchronization operations and demonstrated their usage in some relatively simple examples. The next chapter provides more complex but realistic and useful examples and describes a general synchronization model that solves many practical problems and enhances program reliability. This short chapter is concerned with the impact that synchronization can have on application performance and techniques to minimize the impact.
While thread synchronization is essential, there are some significant performance pitfalls, and we describe some of the major performance issues, both on single-processor and multiple-processor (SMP) systems. There are also trade-offs among alternative solutions. For example, CRITICAL_SECTIONs (CSs) and mutexes are nearly identical and solve the same fundamental problem. CSs are generally, but not always, the most efficient locking mechanism. CSs are also not as convenient to use as mutexes, as Chapter 10 demonstrates. In other cases, interlocked operations are sufficient, and it may even be possible to avoid synchronization altogether with careful design and implementation.
CS-mutex trade-offs are discussed first, along with SMP implications. CS spin counts, semaphore throttles, and processor affinity are other topics. The chapter ends with a set of performance guidelines.
Note: NT 5.0 made some significant performance improvements. Some of the issues identified here were much more severe with earlier NT versions and 9x.
Synchronization Performance Impact
Synchronization can and will impact the performance of your program, and you need to be especially careful when running on SMP systems. This may seem counterintuitive, as we might expect that SMP would generally improve performance and certainly never slow down a program. However, it turns out that internal implementation mechanisms and processor contention for memory access can produce unexpected effects, including dramatic performance degradation.
CRITICAL_SECTION-Mutex Trade-offs
The first step is to assess the performance impact of synchronization and compare CRITICAL_SECTIONs to mutexes. Program 9-1 shows statsMX.c, which uses a mutex to synchronize access to a thread-specific data structure. statsCS.c, not shown but included on the book's Web site, does exactly the same thing using a CRITICAL_SECTION, and statsIN.c uses interlocked functions. Finally, statsNS.c, also not shown, uses no synchronization at all; it turns out, in this example, that synchronization is not required because each worker accesses its own unique storage. See the cautionary note after the bulleted list following the program. The actual programs allow any number of worker threads, although, for simplicity, Program 9-1 as listed can support only 64.
This set of examples not only illustrates the relative performance impact of three types of synchronization but also shows the following concepts.
- Synchronization can sometimes be avoided with careful program design.
- The interlocked functions can be used in simple situations, such as incrementing a shared variable (see the sketch after this list).
- CSs are measurably faster than mutexes in most situations.
- A common technique is to specify the thread argument data structure so that it contains state data to be maintained by the thread along with a pointer to a mutex or other synchronization object.
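For example, here is a minimal sketch of how the statsIN variation might update the counter in its worker loop (compare the worker function in Program 9-1 below; the actual statsIN.c on the book's Web site may differ). The mutex wait and release pair is replaced by a single atomic increment; the cast is needed because InterlockedIncrement operates on a LONG.
/* Interlocked worker loop, statsIN style (sketch). */
while (*thread_arg->tasks_complete < thread_arg->tasks_to_complete) {
   delay_cpu (DELAY_COUNT);
   /* Atomic increment; no mutex, CS, or kernel call is required. */
   InterlockedIncrement ((LONG *) thread_arg->tasks_complete);
}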
Program 9-1. statsMX: Maintaining Thread Statistics
/* Chapter 9. statsMX.c */
/* Simple boss/worker system, where each worker reports */
/* its work output back to the boss for display. */
/* MUTEX VERSION. */
#include "EvryThng.h"
#define DELAY_COUNT 20
/* Usage: statsMX nthread ntasks */
/* Start up nthread worker threads, each assigned to perform */
/* "ntasks" work units. Each thread reports its progress */
/* in its own unshared slot in a work-performed array. */
DWORD WINAPI worker (void *);
typedef struct _THARG {
int thread_number;
HANDLE *phMutex;
unsigned int tasks_to_complete;
unsigned int *tasks_complete;
} THARG;
int _tmain (int argc, LPTSTR argv [])
{
INT nthread, ithread;
DWORD ThId; /* Receives the thread ID from _beginthreadex. */
HANDLE *worker_t, hMutex;
unsigned int * task_count, tasks_per_thread;
THARG * thread_arg;
/* Create the mutex. */
hMutex = CreateMutex (NULL, FALSE, NULL);
nthread = _ttoi (argv [1]);
tasks_per_thread = _ttoi (argv [2]);
worker_t = malloc (nthread * sizeof (HANDLE));
task_count = calloc (nthread, sizeof (unsigned int));
thread_arg = calloc (nthread, sizeof (THARG));
for (ithread = 0; ithread < nthread; ithread++) {
/* Fill in the thread arg. */
thread_arg [ithread].thread_number = ithread;
thread_arg [ithread].tasks_to_complete = tasks_per_thread;
thread_arg [ithread].tasks_complete = &task_count [ithread];
thread_arg [ithread].phMutex = &hMutex;
worker_t [ithread] = (HANDLE)_beginthreadex (NULL, 0, worker,
&thread_arg [ithread], 0, &ThId);
}
/* Wait for the threads to complete. */
WaitForMultipleObjects (nthread, worker_t, TRUE, INFINITE);
free (worker_t);
printf ("Worker threads have terminated\n");
for (ithread = 0; ithread < nthread; ithread++) {
_tprintf (_T ("Tasks completed by thread %5d: %6d\n"),
ithread, task_count [ithread]);
}
free (task_count);
free (thread_arg);
CloseHandle (hMutex);
return 0;
}
DWORD WINAPI worker (void *arg)
{
THARG * thread_arg;
int ithread;
thread_arg = (THARG *) arg;
ithread = thread_arg->thread_number;
while (*thread_arg->tasks_complete <
thread_arg->tasks_to_complete) {
delay_cpu (DELAY_COUNT);
WaitForSingleObject (*(thread_arg->phMutex), INFINITE);
(*thread_arg->tasks_complete)++;
ReleaseMutex (*(thread_arg->phMutex));
}
return 0;
}
You can use the timep program from Chapter 6 (Program 6-2) to examine the behavior of the different implementations. Tests performed on otherwise idle systems with 256,000 work units and 1, 2, 4, 8, 16, 32, 64, and 128 worker threads show the following results.
- For a small number of threads (4 or fewer), the NS (no synchronization), IN (interlocked functions), and CS (CRITICAL_SECTION) programs all require about the same amount of time. The CS version can be marginally (10-20 percent) slower, showing a typical synchronization slowdown. The MX (mutex) version, however, requires two to three times as long to execute.
- CS performance does not always scale with the number of threads on a single-processor system when the number of threads is 5 or more. Results vary from one NT5 system to another but appear to be consistent on a given system. On some systems, elapsed times double at each step using 1, 2, 4, . . . threads, but in one case (Windows 2000, 1GHz Pentium laptop) the times, in seconds, were 0.5, 1.0, 2.0, 4.0, 14.9, 16.0, 32.1, and 363.4; in another (Windows 2000, 50MHz Pentium desktop), they were 1.2, 2.3, 4.7, 9.3, 42.7, 101.3, 207.8, and 1212.5. The breaks usually occur after 4 or 8 threads, but performance is acceptable until 128 threads.
- MX performance is slower than CS on a single processor, with ratios varying from 2:1 to 10:1, depending on the system.
- Performance on an SMP system can be very poor, by a factor of 10:1 or even 100:1. Intuitively, we would expect better performance with multiple processors, but the internal implementation means that the processors are contending for locks and memory access, which explains why the MX and CS results are nearly equivalent. Tuning the CS spin count can help, as described in a later section.
- A semaphore can be used to limit the number of ready worker threads without changing the basic programming model. This technique is the subject of a later section in this chapter.
Cautionary note: The task_count array deliberately uses 32-bit integers to allow for a large task count and to avoid the potential for "word tearing" or a "cache line conflict" on SMP systems. Two separate processors, running adjacent worker threads, could concurrently modify adjacent task counts, making the modification in their caches (32 bits on Intel x86 systems). Only one cache, however, would actually be written back to memory, producing invalid results. Prevention requires that each thread's working storage be properly separated and aligned according to cache size. In this example, the task count could be grouped with the thread argument, and there is no good reason not to use a 32-bit count. Exercise 9-6 explores this subject.
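To make the separation concrete, here is a minimal sketch of one way to pad each worker's counter so that it occupies a full cache line. The PADDED_COUNT name and the 64-byte line size are illustrative assumptions; actual cache line sizes vary by processor.
/* Hypothetical padded counter: adjacent workers' counts land on
   different cache lines, so they can never conflict. */
#define CACHE_LINE 64 /* Assumed line size; varies by processor. */
typedef struct _PADDED_COUNT {
   unsigned int count;
   char pad [CACHE_LINE - sizeof (unsigned int)];
} PADDED_COUNT;
/* In place of: task_count = calloc (nthread, sizeof (unsigned int)); */
PADDED_COUNT *task_count = calloc (nthread, sizeof (PADDED_COUNT));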
A Model Program for Performance Experimentation
The book's Web site includes a project, TimedMutualExclusion, that can be used to experiment with different boss/worker models and application program characteristics. Program features, controlled from the command line, include the following.
- The use of either a CS or a mutex lock.
- The depth, or recursion, count.
- The lock holding time, or delay, which models the amount of work done in the critical code section.
- The number of worker threads, limited only by system resources.
- The number of sleep points where a worker yields the processor, using Sleep(0), while owning the lock. Sleep points model a worker thread that waits for I/O or an event, while the delay models CPU activity.
- The number of active threads, as explained in the later section on semaphore throttles.
The delay and sleep point parameters significantly affect performance because they affect the amount of time that a worker holds a lock, preventing other workers from running.
The program listing contains extensive comments explaining how to run the program and set the parameters. Exercise 9-1 suggests some experiments to perform on as wide a variety of systems as you can access. A variation, TimedMutualExclusionSC, supports spin counts, as explained in the next section.
Note: TimedMutualExclusion is a simple model that captures many worker thread features. It can often be tuned to represent a real application, and if the model shows performance problems, the application is at risk for similar problems. On the other hand, good performance in the model does not necessarily indicate good performance in the real application, even though the model may assist you in application performance tuning.
Tuning SMP Performance with CS Spin Counts
CRITICAL_SECTION locking (enter) and unlocking (leave) are efficient because CS testing is performed in user space without making the kernel system call required by a mutex. Unlocking is performed entirely in user space, whereas ReleaseMutex requires a system call. CSs operate as follows.
- A thread executing EnterCriticalSection (ECS) tests the CS's lock bit. If the bit is off (unlocked), then ECS sets it atomically as part of the test and proceeds without ever waiting. Thus, locking an unlocked CS is extremely efficient, normally taking just one or two machine instructions. The owning thread identity is maintained in the CS data structure, as is a recursion count.
- If the CS is locked, ECS enters a tight loop on an SMP system, repetitively testing the lock bit without yielding the processor (of course, the thread could be preempted). The CS spin count determines the number of times ECS repeats the loop before giving up. A single-processor system gives up immediately; spin counts are useful only on an SMP system.
- Once ECS gives up testing the lock bit (immediately on a single-processor system), ECS enters the kernel and the thread goes into a wait state, using a semaphore wait. Hence, CS locking is efficient only when contention is low or when the spin count gives another processor time to unlock the CS.
- LeaveCriticalSection is implemented by turning off the lock bit, after checking that the thread actually owns the CS. The kernel must also be notified if there are any waiting threads, using ReleaseSemaphore.
Consequently, CSs are efficient on single-processor systems if the CS is likely to be unlocked, as shown by the CS version of Program 9-1. The SMP advantage comes from the fact that the CS can be unlocked by a thread running on a different processor while the waiting thread spins.
The next steps are to show how to set spin counts and how to tune an application by determining the best spin count value. Again, spin counts are useful only on SMP systems; they are ignored on single-processor systems.
Setting the Spin Count
CS spin counts can be set at CS initialization or dynamically. In the first case, replace InitializeCriticalSection with InitializeCriticalSectionAndSpinCount, which adds a count parameter. There is no way to read a CS's spin count directly, although SetCriticalSectionSpinCount does return the previous value.
BOOL InitializeCriticalSectionAndSpinCount (
LPCRITICAL_SECTION lpCriticalSection,
DWORD dwCount)
You can change a spin count at any time.
DWORD SetCriticalSectionSpinCount (
LPCRITICAL_SECTION lpCriticalSection,
DWORD dwCount)
The Microsoft documentation mentions that 4,000 is a good spin count for heap management. The best value is, however, application specific, so spin counts should be adjusted with the application running in a realistic SMP environment. The best values will vary according to the number of processors, the nature of the application, and so on.
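Here is a minimal usage sketch; the count values are illustrative assumptions rather than recommendations.
CRITICAL_SECTION cs;
/* Initialize with an initial spin count. */
InitializeCriticalSectionAndSpinCount (&cs, 4000);
/* ... Later, retune; the return value is the previous count. */
DWORD previousCount = SetCriticalSectionSpinCount (&cs, 8000);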
TimedMutualExclusionSC is on the book's Web site. It is a variation of the familiar TimedMutualExclusion program, and it includes a spin count argument on the command line. You can run it to find a good value for this particular test program on your SMP systems, as suggested in Exercise 9-2.
Semaphore Throttles to Reduce Thread Contention
A large number of threads contending for a single resource, such as a mutex or CS, can degrade performance on both single-processor and multiple-processor systems. Frequently, the negative performance impact can be minimized by spin counts, careful use of the right synchronization objects, or restructuring of your program to make locking more granular and to shorten lock holding times.
If all those techniques fail, it might seem as if there is no recourse but to reduce the number of threads, but this could force a single thread to multiplex operations that naturally should be performed by individual threads. Semaphores, however, provide a natural way to retain a simple threading model while still minimizing the number of active, contending threads. The solution, called a semaphore throttle, is simple conceptually and can be added to an existing boss/worker application program, such as the TimedMutualExclusion example, very quickly. It uses the following techniques.
- The boss thread creates a semaphore with a small maximum value, such as 4, representing the maximum number of active threads compatible with good performance, and sets the initial count to the maximum as well. This number can be a parameter and tuned to the best value after experimentation, just as spin counts can be tuned. Another possible value is the number of processors, which can be obtained at run time (see the section on finding the number of processors, later in this chapter).
- Each worker thread waits on the semaphore before entering its critical code section. The semaphore wait can immediately precede the mutex or CS wait.
- The worker thread should then release the semaphore (release count of 1) immediately after leaving the critical code section.
- If the semaphore maximum is 1, the mutex is redundant. This is often the best SMP solution.
- Overall CS or mutex contention is reduced because thread execution is serialized with only a few threads waiting on the mutex or CS.
The semaphore count simply represents the number of threads that can be active at any one time, limiting the number of threads contending for the mutex, CS, or other resource. The boss thread can even throttle the workers and dynamically tune the application by waiting on the semaphore to reduce the count if the boss determines that there is too much contention, and it can release semaphore units to allow more workers to run. Note, however, that the maximum semaphore count is set at create time and cannot be changed.
The following code fragment illustrates a modified worker loop with two semaphore operations.
while (TRUE) { // Worker loop
WaitForSingleObject (hThrottleSem, INFINITE);
WaitForSingleObject (hMutex, INFINITE);
... Critical code section ...
ReleaseMutex (hMutex);
ReleaseSemaphore (hThrottleSem, 1, NULL);
} // End of worker loop
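The boss-side setup is a single call. In this sketch, the throttle allows at most four active workers; the handle name and the count of 4 are illustrative.
/* Create the throttle with initial and maximum counts equal to
   the number of workers allowed to be active at one time. */
HANDLE hThrottleSem = CreateSemaphore (NULL, 4, 4, NULL);
The boss can later reduce the number of active workers by waiting on this semaphore itself, or raise it again, up to the maximum, with ReleaseSemaphore, as described above.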
There is one more variation. If a worker is considered to be "expensive" in some sense, it can be made to wait for several semaphore units. As noted in the previous chapter, however, two successive waits can create a deadlock. An exercise in the next chapter (Exercise 10-11) shows how to build an atomic multiple-wait compound semaphore object.
TimedMutualExclusion, the familiar example, adds a sixth parameter that is the initial throttle semaphore count for the number of active threads. You can experiment with this count as suggested in one of the exercises. Figure 9-1 shows the time required for six worker threads synchronizing with a single CS, with 1, 2, ..., 6 active threads. In all cases, the amount of work is the same, but the elapsed time rises sharply with more than 4 active threads.
Figure 9-1. Thread Performance with a Semaphore Throttle
These times were originally obtained on an older, slower system. On a much faster Windows 2000 586 (Pentium) system, the corresponding time values for 1 through 6 active threads, in seconds, are 0.8, 0.8, 2.3, 21.2, 28.4, and 29.0, and these results can be reproduced consistently. In this case, performance started to degrade with only 3 active threads. However, on a random collection of similar systems, the times were approximately constant, regardless of the number of active threads. After some experimentation, it appears that the following conclusions hold.
- NT5 made substantial improvements over NT4, where results such as Figure 9-1 were seen consistently.
- Foreground and background operation, determined by whether or not the application window is in focus, can have some impact, as does the amount of other activity on the system.
- Mutexes are generally slower than CSs, but elapsed time is fairly constant on NT5, regardless of the number of active threads.
- SMP systems benefit from a throttle with a count of 1. The semaphore then makes the mutex redundant. For example, on a two-processor, 1.8GHz Xeon system, the CS timings for 1, 2, and 4 active threads were 1.8, 33.0, and 31.9; the corresponding mutex timings were 34.0, 66.5, and 65.0.
Summary: A semaphore throttle can maintain good performance for both foreground and background operation, even on a busy system. On an SMP system, a throttle can be essential, and the active thread count should be 1. Semaphore waiting appears to be much more efficient than mutex waiting.
Processor Affinity
All the preceding discussion has assumed that all processors of an SMP system are available to all threads, with the kernel making scheduling decisions and allocating processors to threads. This approach is simple, natural, and consistent with SMP. It is possible, however, to assign threads to specific processors by setting processor affinity. Processor affinity can be used in several situations.
- A processor can be dedicated to a small set of one or more high-priority threads.
- Worker threads that contend for a single resource can be allocated to a single processor, avoiding the SMP performance issues illustrated earlier.
- Alternatively, the worker threads can be distributed over the available processors.
- Different classes of worker threads can be allocated to different processors.
System, Process, and Thread Affinity Masks
Each process has its own process affinity mask, which is a bit vector. There is also a system affinity mask.
- The system mask indicates the processors configured on this system.
- The process mask indicates the processors that can be used by the process's threads. By default, its value is the same as the system mask.
- Each individual thread has a thread affinity mask, which must be a subset of the process affinity mask. Initially, a thread's affinity mask is the same as the process mask.
There are functions to get and set the masks, with two restrictions: the system mask can only be read, and thread masks can only be set (there is no function to read a thread's affinity mask). The set functions use thread and process handles, so one process or thread can set the affinity mask of another, assuming access rights, or of itself. Setting a mask has no effect on a thread that might already be running on a processor that is masked out; only future scheduling is affected.
A single function, GetProcessAffinityMask, reads both the system and process affinity masks. On a single-processor system, including any Windows 9x system, the mask values will be 1.
BOOL GetProcessAffinityMask (
HANDLE hProcess,
LPDWORD lpProcessAffinityMask,
LPDWORD lpSystemAffinityMask)
The process affinity mask, which will be inherited by any child process, is set with SetProcessAffinityMask.
BOOL SetProcessAffinityMask (
HANDLE hProcess,
DWORD dwProcessAffinityMask)
The Microsoft documentation says that the new mask must be a proper subset of the values obtained from GetProcessAffinityMask. A quick experiment, included in the TimedMutualExclusion code, shows that the new mask can be the same as the system or preceding process mask. Such a limitation would not make sense, because you would not be able to restore a process mask to a previous value.
The new value affects all the threads belonging to this process. Windows 9x does not support SMP and does not support the process affinity mask functions.
Thread masks are set with a similar function.
DWORD SetThreadAffinityMask (
HANDLE hThread,
DWORD dwThreadAffinityMask)
These functions are not designed consistently. SetThreadAffinityMask returns a DWORD rather than a BOOL; the return value is the thread's previous affinity mask, with 0 indicating failure. SetThreadAffinityMask also works on Windows 9x, but the mask must be 1, which has no utility whatsoever. Also, despite the documentation, the new mask does not need to be a proper subset of the system mask.
SetThreadIdealProcessor is a variation of SetThreadAffinityMask. You specify the preferred ("ideal") processor number (not a mask), and the scheduler will assign that processor to the thread if possible, but it will use a different processor if the preferred processor is not available. The return value gives the previous preferred processor number, if any.
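As an illustration, the following sketch distributes Program 9-1's worker threads round-robin over the processors permitted by the process affinity mask. The variable names follow Program 9-1, and 32-bit masks are assumed, consistent with the DWORD parameters shown above.
/* Assign each worker to one allowed processor in turn (sketch). */
DWORD procMask, sysMask, bit = 1;
GetProcessAffinityMask (GetCurrentProcess (), &procMask, &sysMask);
for (ithread = 0; ithread < nthread; ithread++) {
   while ((bit & procMask) == 0) /* Skip processors not in the mask. */
      bit = (bit << 1) != 0 ? (bit << 1) : 1;
   SetThreadAffinityMask (worker_t [ithread], bit);
   bit = (bit << 1) != 0 ? (bit << 1) : 1; /* Advance, wrapping at bit 31. */
}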
Finding the Number of Processors
The system affinity mask does indicate the number of processors on the system; all that is necessary is to count the number of bits that are set. It is easier, however, to call GetSystemInfo, which returns a SYSTEM_INFO structure whose fields include the number of processors and the active processor mask, which is the same as the system mask. A simple program and project, Version, on the book's Web site, displays this information along with the Windows version.
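A minimal sketch of the GetSystemInfo approach follows; the output formatting is illustrative.
/* Report the processor count and active processor mask (sketch). */
SYSTEM_INFO si;
GetSystemInfo (&si);
_tprintf (_T ("Number of processors: %d, active mask: 0x%x\n"),
   si.dwNumberOfProcessors, si.dwActiveProcessorMask);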
Hyperthreading and the Processor Count
The Intel Pentium 4 and Xeon processors have a feature called hyperthreading, whereby wait states during a thread's execution are used to run a different thread. A second register file supports this feature, which is feasible because the x86 architecture has a relatively small number of registers. A Xeon or other hyperthreading processor is regarded as a single processor by GetSystemInfo and GetProcessAffinityMask.
I/O Completion Ports
Chapter 14 describes I/O completion ports, which provide another mechanism to avoid thread contention by limiting the number of threads. I/O completion ports allow a small number of threads to manage a large number of concurrent I/O operations. Individual I/O operations are started asynchronously so that the operation is, in general, not complete when the read or write call returns. However, as outstanding operations complete, data processing is handed off to one of a small number of worker threads. Chapter 14 has an example using a server communicating with remote clients (Program 14-4).
Performance Guidelines and Pitfalls
Multiple threads can provide significant programming advantages, including simpler programming models and performance improvement. However, there are several performance pitfalls that can have drastic and unexpected negative performance impact, and the impact is not always consistent on different computers, even when they are running the same Windows version. Some simple guidelines, summarizing the experience in this chapter, will help to avoid these pitfalls. Some of these guidelines are adapted from Butenhof's Programming with POSIX Threads, as are many of the designing, debugging, and testing hints in the next chapter.
- Beware of conjecture and theoretical arguments about performance, which often sound convincing but can be wrong in practice. Test the conjecture with a simple prototype, such as TimedMutualExclusion, or with alternative implementations of your application.
- Test application performance on as wide a variety of systems as are available to you. It is helpful to run with different memory configurations, processor types, Windows versions, and number of processors. An application may perform very well on one system and then have extremely poor performance on a similar one; see the discussion of Program 9-1.
- Locking is expensive; use it only as required. Hold (own) a mutex or CS only as long as required and no longer. Setting the TimedMutualExclusion delay or sleep point parameters shows the impact of holding a lock for a long period of time.
- Use distinct mutexes for distinct resources so that locking is as granular as possible. In particular, avoid global locks.
- High lock contention hinders good performance. The greater the frequency of thread locking and unlocking, and the larger the number of threads, the greater the performance impact. Performance degradation can be drastic and is not just linear in the number of threads.
- CSs provide an efficient, lightweight locking mechanism when the number of contending threads is small, but mutexes sometimes provide better performance. When using CSs in a performance-critical SMP application, tune performance with the CS spin counts.
- Semaphores can reduce the number of active contending threads without forcing you to change your programming model.
- SMP can cause severe, often unexpected, performance impacts in cases where you might expect improved performance. Reducing contention and using thread affinity are techniques to maintain good performance.
- Choices such as whether to use a signal or broadcast model, explained in Chapter 10, also affect performance significantly.
- Investigate using commercially available profiling and performance analysis tools, which can help clarify the behavior of the threads in your program and locate time-consuming code segments.
Summary
Synchronization can impact program performance on both single-processor and SMP systems; in some cases, the impact can be severe. Careful program design and selection of the appropriate synchronization objects can help assure good performance. This chapter has discussed a number of useful techniques and guidelines and has illustrated performance issues with a simple test program that captures the essential characteristics of many real programming situations.
Looking Ahead
Chapter 10 shows how to use Windows synchronization in more general ways, and it discusses several programming models that help ensure correctness and maintainability, as well as good performance. Chapter 10 also creates several compound synchronization objects that are useful for solving a number of important problems. Subsequent chapters use threads and synchronization as required for applications, such as servers. There are also a few more basic threading topics; for example, Chapter 12 illustrates and discusses thread safety and reentrancy in DLLs.
Additional Reading
Chapter 10 provides information sources that apply to this chapter as well.
Exercises
9-1. Experiment with statsMX on your own system and on as many different systems (both hardware and Windows versions) as are available to you. Do you obtain similar results as those reported in this chapter? What happens on an SMP system?
9-2. Use TimedMutualExclusionSC to experiment with CRITICAL_SECTION spin counts to see whether adjusting the count can improve and tune SMP performance when you have a large number of threads. Results will vary from system to system, and I have found approximately optimal points ranging from 2,000 to 10,000.
9-3. Use TimedMutualExclusion, included on the book's Web site, to experiment with delay and sleep point counts.
9-4. TimedMutualExclusion also uses a semaphore throttle to limit the number of running threads. Experiment with the count on both single-processor and SMP systems. Can you reproduce the result reported earlier in this chapter?
9-5. Apply the semaphore throttle technique to statsMX (statsCS.c, statsMX.c).
9-6. Advanced exercise: Do the four variations of statsMX all operate correctly, ignoring performance, on SMP systems? Experiment with a large number of worker threads. If at all possible, run on an SMP Windows 2000 or 2003 server. Can you reproduce the "word tearing" or "cache line conflict" problem described earlier and also in Butenhof's Programming with POSIX Threads? You may need to use a 16-bit (short integer) count to reproduce the problem.
9-7. Use processor affinity as a performance enhancement technique by modifying this chapter's programs.
9-8. Determine the effect of hyperthreading on application performance. The Intel Xeon processor, for example, provides hyperthreading.