
C++11 threads, affinity and hyperthreading

Background and introduction.

For decades, the C and C++ standards treated multi-threading and concurrency as something existing outside the standard sphere - in that "target-dependent" world of shades which the "abstract machine" targeted by the standards doesn't cover. The immediate, cold-blooded replies of "C++ doesn't know what a thread is" in mountains of mailing list and newsgroup questions dealing with parallelism will forever serve as a reminder of this past.

But all of that came to an end with C++11. The C++ standards committee realized that the language wouldn't be able to stay relevant for much longer unless it aligned itself with the times and finally recognized the existence of threads, synchronization mechanisms, atomic operations and memory models - right there in the standard, forcing C++ compiler and library vendors to implement these for all supported platforms. This is, IMHO, one of the biggest positive changes in the avalanche of improvements delivered by the C++11 edition of the language.

This post is not a tutorial on C++11 threads, but it uses them as the main threading mechanism to demonstrate its points. It starts with a basic example but then quickly veers off into the specialized area of thread affinities, hardware topologies and performance implications of hyperthreading. It does as much as feasible in portable C++, clearly marking the deviations into platform-specific calls for the really specialized stuff.

Logical CPUs, cores and threads

Most modern machines are multi-CPU. Whether these CPUs are divided into sockets and hardware cores depends on the machine, of course, but the OS sees a number of "logical" CPUs that can execute tasks concurrently.

The easiest way to get this information on Linux is to cat /proc/cpuinfo, which lists the system's CPUs in order, providing some information about each (such as current frequency, cache size, etc.). On my (8-CPU) machine it lists eight entries, numbered 0 through 7.

A summary output can be obtained from lscpu.

The lscpu output also makes it very easy to see that the machine has 4 cores, each having two HW threads (see hyperthreading). And yet the OS sees them as 8 "CPUs" numbered 0-7.

Launching a thread per CPU

The C++11 threading library gracefully made available a utility function that we can use to find out how many CPUs the machine has, so that we can plan our parallelism strategy. The function is called hardware_concurrency, and here is a complete example that uses it to launch an appropriate number of threads. The following is just a code snippet; full code samples for this post, along with a Makefile for Linux, can be found in this repository.
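A minimal sketch of such a program (the repository version differs in details, but the shape is the same):

```cpp
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

int main() {
  // Ask the implementation how many concurrent threads the hardware supports.
  unsigned num_cpus = std::thread::hardware_concurrency();
  std::cout << "Launching " << num_cpus << " threads\n";

  std::mutex iomutex;
  std::vector<std::thread> threads(num_cpus);
  for (unsigned i = 0; i < num_cpus; ++i) {
    threads[i] = std::thread([&iomutex, i] {
      // Lock cout so output from different threads doesn't interleave.
      std::lock_guard<std::mutex> lk(iomutex);
      std::cout << "Hello from thread " << i << "\n";
    });
  }

  for (auto& t : threads) {
    t.join();
  }
  return 0;
}
```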

A std::thread is a thin wrapper around a platform-specific thread object; this is something we'll use to our advantage shortly. So when we launch a std::thread, an actual OS thread is launched. This is fairly low-level thread control, but in this article I won't detour into higher-level constructs like task-based parallelism, leaving that to some future post.

Thread affinity

So we know how to query the system for the number of CPUs it has, and how to launch any number of threads. Now let's do something a bit more advanced.

All modern OSes support setting CPU affinity per thread. Affinity means that instead of being free to run the thread on any CPU it feels like, the OS scheduler is asked to only schedule a given thread to a single CPU or a pre-defined set of CPUs. By default, the affinity covers all logical CPUs in the system, so the OS can pick any of them for any thread, based on its scheduling considerations. In addition, the OS will sometimes migrate threads between CPUs if it makes sense to the scheduler (though it should try to minimize migrations because of the loss of warm caches on the core from which the thread was migrated). Let's observe this in action with another code sample:
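A sketch of the sample, assuming glibc's sched_getcpu (the full version lives in the repository):

```cpp
#include <chrono>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>
#include <sched.h>  // sched_getcpu (glibc)

int main() {
  constexpr unsigned num_threads = 4;
  std::mutex iomutex;
  std::vector<std::thread> threads(num_threads);
  for (unsigned i = 0; i < num_threads; ++i) {
    threads[i] = std::thread([&iomutex, i] {
      while (true) {
        {
          // Report which logical CPU this thread is currently running on.
          std::lock_guard<std::mutex> lk(iomutex);
          std::cout << "Thread #" << i << ": on CPU " << sched_getcpu() << "\n";
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(900));
      }
    });
  }
  for (auto& t : threads) {
    t.join();
  }
  return 0;
}
```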

This sample launches four threads that loop infinitely, sleeping and reporting which CPU they run on. The reporting is done via the sched_getcpu function (glibc-specific - other platforms will have other APIs with similar functionality). A sample run (output not reproduced here) shows the scheduler's behavior clearly.

Some observations: the threads are sometimes scheduled onto the same CPU, and sometimes onto different CPUs. Also, there's quite a bit of migration going on. Eventually, the scheduler managed to place each thread onto a different CPU, and keep it there. Different constraints (such as system load) could result in a different scheduling, of course.

Now let's rerun the same sample, but this time using taskset to restrict the affinity of the process to only two CPUs - 5 and 6:
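With the sample compiled to, say, a binary named report-cpu (a hypothetical name), the invocation is:

```
$ taskset -c 5,6 ./report-cpu
```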

As expected, though there's some migration happening here, all threads remain faithfully locked to CPUs 5 and 6, as instructed.

Detour - thread IDs and native handles

Even though the C++11 standard added a thread library, it can't standardize everything. OSes differ in how they implement and manage threads, and exposing every possible thread implementation detail in the C++ standard could be overly restrictive. Instead, in addition to defining many threading concepts in a standard way, the thread library also lets us interact with platform-specific threading APIs by exposing native handles. These handles can then be passed into low-level platform-specific APIs (such as POSIX threads on Linux or the Windows API on Windows) to exert finer-grained control over the program.

Here's an example program that launches a single thread, and then queries its thread ID along with the native handle:
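A sketch of such a program (assuming Linux, where pthread_t is a printable integral type):

```cpp
#include <iostream>
#include <mutex>
#include <thread>
#include <pthread.h>

int main() {
  std::mutex iomutex;
  std::thread t([&iomutex] {
    std::lock_guard<std::mutex> lk(iomutex);
    // The thread queries its own standard ID and its pthread ID.
    std::cout << "Thread: my id = " << std::this_thread::get_id() << "\n"
              << "        my pthread id = " << pthread_self() << "\n";
  });

  {
    std::lock_guard<std::mutex> lk(iomutex);
    // The launcher sees the same thread's ID and its native handle.
    std::cout << "Launched t: id = " << t.get_id() << "\n"
              << "            native_handle = " << t.native_handle() << "\n";
  }

  t.join();
  return 0;
}
```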

The output of one particular run on my machine is not reproduced here, but the values it showed drive the observations below.

Both the main thread (the default thread running main on entry) and the spawned thread obtain the thread's ID - a standard-defined concept for an opaque type that we can print, hold in a container (for example, mapping it to something in a hash_map), but not much else. Moreover, the thread object has the native_handle method that returns an "implementation defined type" for a handle that will be recognized by platform-specific APIs. In that output, two things were notable:

  • The thread ID is actually equal to the native handle.
  • Moreover, both are equal to the numeric pthread ID returned by pthread_self .

While the equality of native_handle to the pthread ID is something the standard definitely implies [1], the first point is surprising. It looks like an implementation artifact one definitely shouldn't rely upon. I examined the source code of a recent libc++ and found that a pthread_t id is used as both the "native" handle and the actual "id" of a thread object [2].

All of this is taking us pretty far off the main topic of this article, so let's recap. The most important take-away from this detour section is that the underlying platform-specific thread handle is available by means of the native_handle method of a std::thread . This native handle on POSIX platforms is, in fact, the pthread_t ID of the thread, so a call to pthread_self within the thread itself is a perfectly valid way to obtain the same handle.

Setting CPU affinity programmatically

As we've seen earlier, command-line tools like taskset let us control the CPU affinity of a whole process. Sometimes, however, we'd like to do something more fine-grained and set the affinities of specific threads from within the program. How do we do that?

On Linux, we can use the pthread-specific pthread_setaffinity_np function. Here's an example that reproduces what we did before, but this time from inside the program. In fact, let's get a bit fancier and pin each thread to a single known CPU by setting its affinity:
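A sketch of the pinning code (Linux-specific; compile with -pthread; g++ defines _GNU_SOURCE by default, which pthread_setaffinity_np needs):

```cpp
#include <chrono>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>
#include <pthread.h>
#include <sched.h>

int main() {
  constexpr unsigned num_threads = 4;
  std::mutex iomutex;
  std::vector<std::thread> threads(num_threads);
  for (unsigned i = 0; i < num_threads; ++i) {
    threads[i] = std::thread([&iomutex, i] {
      std::this_thread::sleep_for(std::chrono::milliseconds(20));
      std::lock_guard<std::mutex> lk(iomutex);
      std::cout << "Thread #" << i << ": on CPU " << sched_getcpu() << "\n";
    });

    // Create a cpu_set_t with only CPU i set, and pin thread i to it,
    // passing the std::thread's native handle straight to the pthread call.
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(i, &cpuset);
    int rc = pthread_setaffinity_np(threads[i].native_handle(),
                                    sizeof(cpu_set_t), &cpuset);
    if (rc != 0) {
      std::cerr << "Error calling pthread_setaffinity_np: " << rc << "\n";
    }
  }

  for (auto& t : threads) {
    t.join();
  }
  return 0;
}
```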

Note how we use the native_handle method discussed earlier to pass the underlying native handle to the pthread call (it takes a pthread_t ID as its first argument). Running this program on my machine confirms the result:

The threads get pinned to single CPUs exactly as requested.

Sharing a core with hyperthreading

Now's time for the really fun stuff. We've learned about CPU topologies a bit, and then developed progressively more complex programs using the C++ threading library and POSIX calls to fine-tune our use of the CPUs in a given machine, up to selecting exactly which thread runs on which CPU.

But why does any of this matter? Why would you want to pin threads to certain CPUs? Doesn't it make more sense to let the OS do what it's good at and manage the threads for you? Well, in most cases yes, but not always.

See, not all CPUs are alike. If you have a modern processor in your machine, it most likely has multiple cores, each with multiple hardware threads - usually 2. For example, as I've shown in the beginning of the article, my (Haswell) processor has 4 cores, each with 2 threads, for a total of 8 HW threads - 8 logical CPUs for the OS. I can use the excellent lstopo tool to display the topology of my processor:

lstopo topology of my home CPU

An alternative non-graphical way to see which threads share the same core is to look at a special system file that exists per logical CPU. For example, for CPU 0:
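The file to read, per the Linux sysfs layout:

```
$ cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
```

This prints the list of logical CPUs sharing CPU 0's core; the exact pairing (e.g. whether CPU 0's sibling is CPU 1 or CPU 4) depends on the processor.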

More powerful (server-class) processors will have multiple sockets, each with a multi-core CPU. For example, at work I have a machine with 2 sockets, each of which is an 8-core CPU with hyper-threading enabled: a total of 32 hardware threads. An even more general case is usually brought under the umbrella of NUMA, where the OS can take charge of multiple very-loosely connected CPUs that don't even share the same system memory and bus.

The important question to ask is - what do hardware threads share, and how does it affect the programs we write? Take another look at the lstopo diagram shown above. It's easy to see that the L1 and L2 caches are shared between the two threads in every core. L3 is shared among all cores. For multi-socket machines, cores on the same socket share an L3 cache, but each socket usually has its own L3. In NUMA, each processor usually has access to its own DRAM, and some communication mechanism is used for one processor to access the DRAM of another.

Caches aren't the only thing threads within a core share, however. They also share many of the core's execution resources, like the execution engine, system bus interface, instruction fetch and decode units, branch predictors and so on [3].

So if you've wondered why hyper-threading is sometimes considered a trick played by CPU vendors, now you know. Since the two threads on a core share so much, they are not fully independent CPUs in the general sense. True, for some workloads this arrangement is beneficial, but for some it's not. Sometimes it can even be harmful, as the hordes of "how to disable hyper-threading to improve app X's performance" threads online imply.

Performance demos of core sharing vs. separate cores

I've implemented a benchmark that lets me run different floating-point "workloads" on different logical CPUs in parallel threads, and compare how long these workloads take to finish. Each workload gets its own large float array, and has to compute a single float result. The benchmark figures out which workloads to run and on which CPUs from the user's input, prepares the inputs and then unleashes all the workloads in parallel in separate threads, using the APIs we've seen earlier to set the precise CPU affinity of each thread as requested. If you're interested, the full benchmark along with a Makefile for Linux is available here ; in the rest of the post I'll just paste short code snippets and results.

I'll be focusing on two workloads. The first is a simple accumulator:
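A sketch matching that description (the benchmark's real signature may differ in details):

```cpp
#include <cstddef>
#include <vector>

// Sum all elements of the input array into a single float result.
float workload_accum(const std::vector<float>& data) {
  float rt = 0;
  for (std::size_t i = 0; i < data.size(); ++i) {
    rt += data[i];
  }
  return rt;
}
```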

It adds up all the floats in the input array together. This is akin to what std::accumulate would do.

Now I'll run three tests:

  • Run accum on a single CPU, to get a baseline performance number. Measure how long it takes.
  • Run two accum instances on different cores. Measure how long each instance takes.
  • Run two accum instances on two threads of the same core [4] . Measure how long each instance takes.

The reported numbers (here and in what follows) are execution times for an array of 100 million floats as the input of a single workload. I'll average them over a few runs:

accum runtime chart

This clearly shows that when a thread running accum shares a core with another thread running accum, its runtime doesn't change at all. There's good news and bad news here. The good news is that this particular workload is well suited to hyper-threading, because apparently two threads running on the same core manage not to disturb each other. The bad news is that, precisely for the same reason, it's not a great single-thread implementation, since quite obviously it doesn't use the processor's resources optimally.

To give a bit more details, let's look at the disassembly of the inner loop of workload_accum :
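The exact listing isn't reproduced here, but based on the description that follows, the hot loop looks roughly like this (an illustrative reconstruction; labels and registers vary by compiler):

```
.L3:
    addss   xmm0, DWORD PTR [rdi+rax*4]   ; rt += data[i], scalar SSE add
    add     rax, 1
    cmp     rax, rsi
    jne     .L3
```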

Pretty straightforward. The compiler uses the addss SSE instruction to add floats together in the low 32 bits of an SSE (128-bit) register. On Haswell, the latency of this instruction is 3 cycles. The latency, and not the throughput, matters here because we keep adding into xmm0, so one addition has to finish entirely before the next one begins [5]. Moreover, while Haswell has 8 execution units, addss uses only one of them. This is fairly low utilization of the hardware. Therefore, it makes sense that two threads running on the same core manage not to trample over each other.

As a different example, consider a slightly more complex workload:
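A sketch of this workload, mirroring workload_accum above (the name workload_sin matches the surrounding text):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Sum the sines of all elements; far more computation per element than accum.
float workload_sin(const std::vector<float>& data) {
  float rt = 0;
  for (std::size_t i = 0; i < data.size(); ++i) {
    rt += std::sin(data[i]);
  }
  return rt;
}
```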

Here, instead of just adding the numbers up, we add their sines up. Now, std::sin is a pretty convoluted function that runs a reduced Taylor series polynomial approximation, and has a lot of number crunching inside it (along with a lookup table, usually). This should keep the execution units of a core much busier than simple addition. Let's check the three different modes of running again:

sin runtime chart

This is more interesting. While running on different cores didn't harm the performance of a single thread (so the computation is nicely parallelizable), running on the same core did hurt it - a lot (by more than 75%).

Again, there's good news here and bad news here. The good news is that even on the same core, if you want to crunch as many numbers as possible, two threads put together will be faster than a single thread (945 ms to crunch two input arrays, while a single thread would take 540 * 2 = 1080 ms to achieve the same). The bad news is that if you care about latency, running multiple threads on the same core actually hurts it - the threads compete over the execution units of the core and slow each other down.

A note on portability

So far, the examples in this article have been Linux-specific. However, everything we went through here is available on multiple platforms, and there are portable libraries one can use to leverage it. They will be a bit more cumbersome and verbose to use than the native APIs, but if you need cross-platform portability, that's not a big price to pay. A good portable library I found useful is hwloc, which is part of the Open MPI project. It's highly portable - running on Linux, Solaris, *BSD, Windows, you name it. In fact, the lstopo tool I mentioned earlier is built on hwloc.

hwloc is a generic C API that enables one to query the topology of the system (including sockets, cores, caches, NUMA nodes, etc.) as well as setting and querying affinities. I won't spend much time on it, but I did include a simple example with the source repository for this article. It shows the system's topology and binds the calling thread to a certain logical processor. It also shows how to build a program using hwloc . If you care about portability, I hope you will find the example useful. And if you know of any other cool uses for hwloc , or about other portable libraries for this purpose - drop me a line!

Closing words

So, what have we learned? We've seen how to examine and set thread affinity. We've also learned how to control placement of threads on logical CPUs by using the C++ standard threading library in conjunction with POSIX calls, and the bridging native handles exposed by the C++ threading library for this purpose. Next we've seen how we can figure out the exact hardware topology of the processor and select which threads share a core, and which threads run on different cores, and why this really matters.

The conclusion, as it always is with performance-critical code, is that measurement is the single most important thing. There are so many variables to control in modern performance tuning that it's very hard to predict in advance what will be faster, and why. Different workloads have very different CPU utilization characteristics, which makes them more or less suitable for sharing a CPU core, sharing a socket or sharing a NUMA node. Yes, the OS sees 8 CPUs on my machine, and the standard threading library even lets me query this number in a portable way; but not all of these CPUs are alike - and this is important to understand in order to squeeze the best performance out of the machine.

I haven't gone very deep into analyzing the micro-op-level performance of the two presented workloads, because that's really not the focus of this article. That said, I hope this article provides another angle from which to figure out what matters in multi-threaded performance. Physical resource sharing is not always taken into account when figuring out how to parallelize an algorithm - but as we've seen here, it really should be.

For comments, please send me an email.

Creating Thread-to-Core and Task-to-Thread Affinity


Michael Voss, Rafael Asenjo & James Reinders


When developing parallel applications with the Threading Building Blocks library, we create tasks by using the high-level execution interfaces or the low-level APIs. These tasks are scheduled by the TBB library onto software threads using work stealing. These software threads are scheduled by the Operating System (OS) onto the platform’s cores (hardware threads). In this chapter, we discuss the features in TBB that let us influence the scheduling choices made by the OS and by TBB. Thread-to-core affinity is used when we want to influence the OS so that it schedules the software threads onto particular core(s). Task-to-thread affinity is used when we want to influence the TBB scheduler so that it schedules tasks onto particular software threads. Depending on what we are trying to achieve, we may be interested in one kind of affinity or the other, or a combination of both.


There can be different motivations for creating affinity. One of the most common motivations is to take advantage of data locality. As we have repeatedly noted in this book, data locality can have a huge impact on the performance of a parallel application. The TBB library, its high-level execution interfaces, its work-stealing scheduler, and its concurrent containers have all been designed with locality in mind. For many applications, using these features will lead to good performance without any manual tuning. Sometimes though, we will need to provide hints or take matters completely into our own hands so that the schedulers, in TBB and the OS, more optimally schedule work near its data. In addition to data locality, we might also be interested in affinity when using heterogeneous systems, where the capabilities of cores differ, or when software threads have different properties, such as higher or lower priorities.

In Chapter 16 , the high-level features for data locality that are exposed by the TBB parallel algorithms are presented. In Chapter 17 , the features for tuning cache and memory use in TBB flow graphs are discussed. In Chapter 20 , we showed how to use features of the TBB library to tune for Non-Uniform Memory Access (NUMA) architectures. For many readers, the information in those chapters will be sufficient to accomplish the specific tasks they need to perform to tune their applications. In this chapter, we focus on the lower-level, fundamental support provided by the TBB’s scheduler and tasks that are sometimes abstracted by the high-level features described in those chapters or sometimes used directly in those chapters to create affinity.

Creating Thread-to-Core Affinity

All of the major operating systems provide interfaces that allow users to set the affinity of software threads, including pthread_setaffinity_np or sched_setaffinity on Linux and SetThreadAffinityMask on Windows. In Chapter 20 , we use the Portable Hardware Locality (hwloc) package as a portable way to set affinity across platforms. In this chapter, we do not focus on the mechanics of setting affinity – since these mechanics will vary from system to system – instead we focus on the hooks provided by the TBB library that allow us to use these interfaces to set affinity for TBB master and worker threads.

The TBB library by default creates enough worker threads to match the number of available cores. In Chapter 11 , we discussed how we can change those defaults. Whether we use the defaults or not, the TBB library does not automatically affinitize these threads to specific cores. TBB allows the OS to schedule and migrate the threads as it sees fit. Giving the OS flexibility in where it places TBB threads is an intentional design choice in the library. In a multiprogrammed environment, an environment in which TBB excels, the OS has visibility of all of the applications and threads. If we make decisions about where threads should execute from within our limited view inside of a single application, we might make choices that lead to poor overall system resource utilization. Therefore, it is often better to not affinitize threads to cores and instead allow the OS to choose where the TBB master and worker threads execute, including allowing it to dynamically migrate threads during a program’s execution.

However, like we will see in many chapters of this book, the TBB library provides features that let us change this behavior if we wish. If we want to force TBB threads to have affinity for cores, we can use the task_scheduler_observer class to do so (see Observing the scheduler with the task_scheduler_observer class ). This class lets an application define callbacks that are invoked whenever a thread enters and leaves the TBB scheduler, or a specific task arena, and use these callbacks to assign affinity. The TBB library does not provide an abstraction to assist with making the OS-specific calls required to set thread affinity, so we have to handle these low-level details ourselves using one of the OS-specific or portable interfaces we mentioned earlier.

Observing the Scheduler with the task_scheduler_observer Class

The task_scheduler_observer class provides a way to observe when a thread starts or stops participating in task scheduling. The interface of this class is shown as follows:

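A sketch of the interface, based on the classic TBB headers (minor details vary across TBB versions):

```cpp
namespace tbb {

class task_scheduler_observer {
public:
  task_scheduler_observer();           // observation is off until observe(true)
  virtual ~task_scheduler_observer();

  void observe(bool state = true);     // enable or disable the callbacks
  bool is_observing() const;

  // Invoked in the context of a thread entering or leaving the scheduler.
  virtual void on_scheduler_entry(bool is_worker) {}
  virtual void on_scheduler_exit(bool is_worker) {}
};

}  // namespace tbb
```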

To use the class, we create our own class that inherits from task_scheduler_observer and implements the on_scheduler_entry and on_scheduler_exit callbacks. When an instance of this class is constructed and its observe state is set to true, the entry and exit functions will be called whenever a master or worker thread enters or exits the global TBB task scheduler.

A recent extension to the class now allows us to pass a task_arena to the constructor. This extension was a preview feature prior to TBB 2019 Update 4 but is now fully supported. When a task_arena reference is passed, the observer will only receive callbacks for threads that enter and exit that specific arena:

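Sketched, the extended constructor looks like this (signature approximated from the TBB 2019 documentation):

```cpp
// Observe only threads entering/exiting the given arena, rather than
// the global scheduler.
explicit task_scheduler_observer(task_arena &a);
```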

Figure 13-1 shows a simple example of how to use a task_scheduler_observer object to pin threads to cores on Linux. In this example, we use the sched_setaffinity function to set the CPU mask for each thread as it joins the default arena. In Chapter 20 , we show an example that assigns affinity using the hwloc software package. In the example in Figure 13-1 , we use tbb::this_task_arena::max_concurrency() to find the number of slots in the arena and tbb::this_task_arena::current_thread_index() to find the slot that the calling thread is assigned to. Since we know there will be the same number of slots in the default arena as the number of logical cores, we pin each thread to the logical core that matches its slot number.


Figure 13-1. Using a task_scheduler_observer to pin threads to cores on a Linux platform
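In the spirit of Figure 13-1, here is a minimal sketch; the class name pinning_observer is ours, and it assumes the default arena has one slot per logical core:

```cpp
#include <sched.h>
#include <tbb/task_arena.h>
#include <tbb/task_scheduler_observer.h>

class pinning_observer : public tbb::task_scheduler_observer {
public:
  pinning_observer() { observe(true); }

  void on_scheduler_entry(bool /*is_worker*/) override {
    // Pin the entering thread to the logical core that matches its arena slot.
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(tbb::this_task_arena::current_thread_index(), &cpuset);
    sched_setaffinity(0, sizeof(cpu_set_t), &cpuset);  // 0 == calling thread
  }
};
```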

We can of course create more complicated schemes for assigning logical cores to threads. And, although we don’t do this in Figure 13-1 , we can also store the original CPU mask for each thread so that we can restore it when the thread leaves the arena.

As we discuss in Chapter 20 , we can use the task_scheduler_observer class , combined with explicit task_arena instances, to create isolated groups of threads that are restricted to the cores that share the same local memory banks in a Non-Uniform-Memory Access (NUMA) system, a NUMA node. If we also control data placement, we can greatly improve performance by spawning the work into the arena of the NUMA node on which its data resides. See Chapter 20 for more details.

We should always remember that if we use thread-to-core affinity, we are preventing the OS from migrating threads away from oversubscribed cores to less-used cores as it attempts to optimize system utilization. If we do this in production applications, we need to be sure that we will not degrade multiprogrammed performance! As we’ll mention several more times, only systems dedicated to running a single application (at a time) are likely to have an environment in which limiting dynamic migration can be of benefit.

Creating Task-to-Thread Affinity

Since we express our parallel work in TBB using tasks, creating thread-to-core affinity, as we described in the previous section, is only one part of the puzzle. We may not get much benefit if we pin our threads to cores, but then let our tasks get randomly moved around by work stealing!

When using the low-level TBB tasking interfaces introduced in Chapter 10, we can provide hints that tell the TBB scheduler that it should execute a task on the thread in a particular arena slot. However, since we will likely use the higher-level algorithms and tasking interfaces whenever possible, such as parallel_for, task_group and flow graphs, we will rarely use these low-level interfaces directly. Chapter 16 shows how the affinity_partitioner and static_partitioner classes can be used with the TBB loop algorithms to create affinity without resorting to these low-level interfaces. Similarly, Chapter 17 discusses the features of TBB flow graphs that affect affinity.

So while task-to-thread affinity is exposed in the low-level task class, we will almost exclusively use this feature through high-level abstractions. Using the interfaces described in this section is therefore best reserved for TBB experts who are writing their own algorithms on top of the lowest-level tasking interfaces. If you're such an expert, or want a deeper understanding of how the higher-level interfaces achieve affinity, keep reading this section.

Figure 13-2 shows the functions and types provided by the TBB task class that we use to provide affinity hints.


Figure 13-2. The functions in tbb::task that are used for task-to-thread affinity
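A sketch of the relevant members (from the classic, pre-oneTBB task class; the underlying type of affinity_id is an implementation detail):

```cpp
class task {
public:
  // Identifier of an arena slot; zero means "no affinity".
  using affinity_id = unsigned short;  // implementation-defined in real TBB

  // Set or query the task's affinity hint before it is spawned.
  void set_affinity(affinity_id id);
  affinity_id affinity() const;

  // Scheduler callback; override it to observe where the task actually ran.
  virtual void note_affinity(affinity_id id);

  // ... the rest of the task interface ...
};
```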

The type affinity_id is used to represent the slot in an arena that a task has affinity for. A value of zero means the task has no affinity. A nonzero value has an implementation-defined meaning that maps to an arena slot. We can set the affinity of a task to an arena slot before spawning it by passing an affinity_id to its set_affinity function. But since the meaning of affinity_id is implementation defined, we don't pass a specific value, for example 2 to mean slot 2. Instead, we capture an affinity_id from a previous task execution by overriding the note_affinity callback function.

The function note_affinity is called by the TBB library before it invokes a task’s execute function when (1) the task has no affinity but will execute on a thread other than the one that spawned it or (2) the task has affinity but it will execute on a thread different than the one specified by its affinity. By overriding this callback, we can track TBB stealing behavior so we can provide hints to the library to recreate this same stealing behavior in a subsequent execution of the algorithm, as we will see in the next example.

Finally, the affinity function lets us query a task’s current affinity setting.

Figure 13-3 shows a class that inherits from tbb::task and uses the task affinity functions to record affinity_id values into a global array a . It only records the value when its doMakeNotes variable is set to true. The execute function prints the task id, the slot of the thread it is executing on, and the value that was recorded in the array for this task id. It prefixes its reporting with “hmm” if the task’s doMakeNotes is true (it will then record the value), “yay!” if the task is executing in the arena slot that was recorded in array a (it was scheduled onto the same thread again), and “boo!” if it is executing in a different arena slot. The details of the printing are contained in the function printExclaim .


Figure 13-3. Using the task affinity functions
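A condensed sketch of such a class (the class name AffinityTask is ours, and the printing helper printExclaim is elided):

```cpp
#include <tbb/task.h>
#include <tbb/task_arena.h>

extern int a[8];  // global record of noted affinity_ids, initialized to -1
void printExclaim(int id, int slot, int noted);  // prints "hmm."/"yay!"/"boo!"

class AffinityTask : public tbb::task {
public:
  int id;
  bool doMakeNotes;
  AffinityTask(int i, bool notes) : id(i), doMakeNotes(notes) {}

  void note_affinity(affinity_id aid) override {
    if (doMakeNotes) a[id] = aid;  // remember where the scheduler put us
  }

  tbb::task* execute() override {
    printExclaim(id, tbb::this_task_arena::current_thread_index(), a[id]);
    return nullptr;
  }
};
```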

While the meaning of affinity_id is implementation defined, TBB is open source, and so we peeked at the implementation. We therefore know that the affinity_id is 0 if there is no affinity, and otherwise it is the slot index plus 1. We should not depend on this knowledge in production uses of TBB, but we depend on it in our example's execute function so we can assign the correct exclamation, "yay!" or "boo!".

The function fig_13_3 in Figure 13-3 builds and executes three task trees, each with eight tasks, and assigns them ids from 0 to 7. This sample uses the low-level tasking interfaces we introduced in Chapter 10 . The first task tree uses note_affinity to track when a task has been stolen to execute on some other thread than the master. The second task tree executes without noting or setting affinities. Finally, the last task tree uses set_affinity to recreate the scheduling recorded during the first run.

When we executed this example on a platform with eight threads, we recorded the following output:

note_affinity        id:slot:a[i]
hmm. 7:0:-1
hmm. 0:1:1
hmm. 1:6:6
hmm. 2:3:3
hmm. 3:2:2
hmm. 4:4:4
hmm. 5:7:7
hmm. 6:5:5

without set_affinity  id:slot:a[i]
yay! 7:0:-1
boo! 0:4:1
boo! 1:3:6
boo! 4:5:4
boo! 3:7:2
boo! 2:2:3
boo! 5:6:7
boo! 6:1:5

with set_affinity     id:slot:a[i]
yay! 7:0:-1
yay! 0:1:1
yay! 4:4:4
yay! 5:7:7
yay! 2:3:3
yay! 3:2:2
yay! 6:5:5
yay! 1:6:6

From this output, we see that the tasks in the first tree are distributed over the eight available threads, and the affinity_id for each task is recorded in array a . When the next set of tasks is executed, the recorded affinity_id for each task is not used to set affinity, and the tasks are randomly stolen by different threads. This is what random stealing does! But, when we execute the final task tree and use set_affinity , the thread assignments from the first run are repeated. Great, this worked out exactly as we wanted!

However, set_affinity only provides an affinity hint and the TBB library is actually free to ignore our request. When we set affinity using these interfaces, a reference to the task-with-affinity is placed in the targeted thread’s affinity mailbox (see Figure 13-4 ). But the actual task remains in the local deque of the thread that spawned it. The task dispatcher only checks the affinity mailbox when it runs out of work in its local deque, as shown in the task dispatch loop in Chapter 9 . So, if a thread does not check its affinity mailbox quickly enough, another thread may steal or execute its tasks first.


Figure 13-4. The affinity mailbox holds a reference to a task that remains in the local deque of the thread that spawned it

To demonstrate this, we can change how task affinities are assigned in our small example, as shown in Figure 13-5. Now, foolishly, we set all of the affinities to the same slot, the one recorded in a[2].


Figure 13-5. A function that first runs different groups of tasks, sometimes noting affinities and sometimes setting affinities. An example output is also shown.
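The change itself is a one-liner per task, sketched here (where t stands for each task about to be spawned; the variable name is hypothetical):

```cpp
// Foolishly mail every task to the same slot: the one noted in a[2].
t.set_affinity(static_cast<tbb::task::affinity_id>(a[2]));
```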

If the TBB scheduler honors our affinity requests, there will be a large load imbalance since we have asked it to mail all of the work to the same worker thread. But if we execute this new version of the example, we see:

(The example output, omitted here, again shows the tasks spread across all eight threads.)

Because affinity is only a hint, the other idle threads still find tasks, stealing them from the master thread's local deque before the thread in the slot recorded in a[2] is able to drain its affinity mailbox. In fact, only the first task spawned, id==0, is executed by the thread in the slot previously recorded in a[2]. So, we still see our tasks distributed across all eight of the threads.

The TBB library has ignored our request and instead avoided the load imbalance that would have been created by sending all of these tasks to the same thread. This weak affinity is useful in practice because it lets us communicate affinities that should improve performance, but it still allows the library to adjust so that we don’t inadvertently create a large load imbalance.

While we can use these task interfaces directly, we will see in Chapter 16 that the loop algorithms provide a simplified abstraction, affinity_partitioner, that hides most of these low-level details from us.

When and How Should We Use the TBB Affinity Features?

We should use task_scheduler_observer objects to create thread-to-core affinity only if we are tuning for absolute best performance on a dedicated system. Otherwise, we should let the OS do its job and schedule threads as it sees fit from its global viewpoint. If we do choose to pin threads to cores, we should carefully weigh the potential impact of taking this flexibility away from the OS, especially if our application runs in a multiprogrammed environment.

For task-to-thread affinity, we typically want to use the high-level interfaces, like affinity_partitioner described in Chapter 16 . The affinity_partitioner uses the features described in this chapter to track where tasks are executed and provide hints to the TBB scheduler to replay the partitioning on subsequent executions of the loop. It also tracks changes to keep the hints up to date.

Because TBB task affinities are just scheduler hints, the potential impact of misusing these interfaces is far less – so we don’t need to be as careful when we use task affinities. In fact, we should be encouraged to experiment with task affinity, especially through the higher-level interfaces, as a normal part of tuning our applications.

In this chapter, we discussed how we can create thread-to-core and task-to-thread affinity from within our TBB applications. While TBB does not provide an interface for handling the mechanics of setting thread-to-core affinity, its task_scheduler_observer class provides a callback mechanism that allows us to insert the necessary calls to our own OS-specific or portable libraries that assign affinities. Because the TBB work-stealing scheduler randomly assigns tasks to software threads, thread-to-core affinity is not always sufficient on its own. We therefore also discussed the interfaces in TBB's task class that let us provide affinity hints to the TBB scheduler about which software thread we want a task to be scheduled onto. We noted that we will most likely not use these interfaces directly, but instead use the higher-level interfaces described in Chapters 16 and 17. For readers interested in learning more about these low-level interfaces, we provided examples that showed how we can use the note_affinity and set_affinity functions to implement task-to-thread affinity for code that uses the low-level TBB tasking interface.

As with many of the optimization features of the TBB library, affinities need to be used carefully. Using thread-to-core affinity incorrectly can degrade performance significantly by restricting the Operating System's ability to balance load. Task-to-thread affinity hints, being just hints that the TBB scheduler can ignore, might also negatively impact performance if used unwisely, but much less so.

For More Information

Posix set/get CPU affinity of a thread, http://man7.org/linux/man-pages/man3/pthread_setaffinity_np.3.html

SetThreadAffinityMask function, https://docs.microsoft.com/en-us/windows/desktop/api/winbase/nf-winbase-setthreadaffinitymask

Portable Hardware Locality (hwloc), www.open-mpi.org/projects/hwloc/



Boost C++ Libraries

...one of the most highly regarded and expertly designed C++ library projects in the world. — Herb Sutter and Andrei Alexandrescu , C++ Coding Standards

Introduction

The header <boost/thread.hpp> defines the classes thread and thread_group which are used to create, observe and manage threads and groups of threads.

Class thread

The thread class represents threads of execution, and provides the functionality to create and manage threads within the Boost.Threads library. See Definitions for a precise description of "thread of execution", and for definitions of threading related terms and of thread states such as "blocked".

A thread of execution has an initial function. For the program's initial thread, the initial function is main() . For other threads, the initial function is operator() of the function object passed to the class thread constructor.

A thread of execution is said to be "finished" or "finished execution" when its initial function returns or is terminated. This includes completion of all thread cleanup handlers, and completion of the normal C++ function return behaviors, such as destruction of automatic storage (stack) objects and releasing any associated implementation resources.

A thread object has an associated state which is either "joinable" or "non-joinable".

Except as described below, the policy used by an implementation of Boost.Threads to schedule transitions between thread states is unspecified.

Note: Just as the lifetime of a file may be different from the lifetime of an iostream object which represents the file, the lifetime of a thread of execution may be different from the thread object which represents the thread of execution. In particular, after a call to join() , the thread of execution will no longer exist even though the thread object continues to exist until the end of its normal lifetime. The converse is also possible; if a thread object is destroyed without join() having first been called, the thread of execution continues until its initial function completes.

Class thread synopsis

The detailed reference covers:

  • Class thread constructors and destructor
  • Class thread comparison functions
  • Class thread modifier functions
  • Class thread static functions

Class thread_group

The thread_group class provides a container for easy grouping of threads to simplify several common thread creation and management idioms.

All thread_group member functions are thread-safe, except destruction.

Class thread_group synopsis

The detailed reference covers:

  • Class thread_group constructors and destructor
  • Class thread_group modifier functions


Simple usage of boost::thread

libs/thread/example/thread.cpp
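The example file isn't reproduced here; a sketch of what it contains, in the spirit of the classic Boost.Threads examples:

```cpp
#include <boost/thread/thread.hpp>
#include <iostream>

void hello() {
  std::cout << "Hello world, I'm a thread!" << std::endl;
}

int main() {
  boost::thread thrd(&hello);  // launch a thread with hello as its initial function
  thrd.join();                 // wait for the thread of execution to finish
  return 0;
}
```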

The output is:
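For the sketch above, that would be:

```
Hello world, I'm a thread!
```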

Simple usage of boost::thread_group

libs/thread/example/thread_group.cpp
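Again a sketch in the spirit of the original example file, showing create_thread and join_all:

```cpp
#include <boost/thread/mutex.hpp>
#include <boost/thread/thread.hpp>
#include <iostream>

int count = 0;
boost::mutex mutex;

void increment_count() {
  // scoped_lock serializes access to count and to std::cout.
  boost::mutex::scoped_lock lock(mutex);
  std::cout << "count = " << ++count << std::endl;
}

int main() {
  boost::thread_group threads;
  for (int i = 0; i < 10; ++i)
    threads.create_thread(&increment_count);  // add a new thread to the group
  threads.join_all();                         // wait for every thread in the group
  return 0;
}
```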


