TL;DR: This patches ensures that, on Windows, all CPU sockets and all NUMA nodes are used by the ThreadPool. The goal was to have LLD/ThinLTO use all hardware threads in the system, which isn't the case currently on multi-socket or large CPU count systems.
(this could possibly be split into a few patches, but I just wanted an overall opinion)
Windows doesn't have a flat cpu_set_t like Linux. Instead, it projects hardware CPUs (or NUMA nodes) to applications through a concept of "processor groups". A "processor" is the smallest unit of execution on a CPU, that is, an hyper-thread if SMT is active; a core otherwise. There's a limit of 32-bit processors on older 32-bit versions of Windows, which later was raised to 64-processors with 64-bit versions of Windows. This limit comes from the affinity mask, which historically was represented by the sizeof(void*) (still is that way). Consequently, the concept of "processor groups" was introduced for dealing with systems with more than 64 hyper-threads.
By default, the Windows OS assigns only one "processor group" to each starting application, in a round-robin manner. If the application wants to use more processors, it needs to programmatically enable it, by assigning threads to other "processor groups". This also means that affinity cannot cross "processor group" boundaries; one can only specify a "preferred" group on startup, but the application is free to allocate more groups if it wants to.
This creates a peculiar situation, where newer CPUs like the AMD EPYC 7702P (64-cores, 128-hyperthreads) are projected by the OS as two (2) "processor groups". This means that by default, an application can only use half of the cores. This situation will only get worse in the years to come, as dies with more cores will appear on the market.
The changes in this patch
Previously, the problem was that heavyweight_hardware_concurrency() API was introduced so that only one hardware thread per core was used. Once that API returns, that original intention is lost. Consider a situation, on Windows, where the system has 2 CPU sockets, 18 cores each, each core having 2 hyper-threads, for a total of 72 hyper-threads. Both heavyweight_hardware_concurrency() and hardware_concurrency() currently return 36, because on Windows they are simply wrappers over std::thread::hardware_concurrency() -- which returns only processors from the current "processor group".
What if we wanted to use all "processor groups" ? Even if we implemented properly heavyweight_hardware_concurrency(), what should it then return ? 18 or 36 ?
What if the user specified /opt:lldltojobs=36 ? Should we assign 36 threads on the current "processor group" or should we dispatch extra threads on the second "processor groups" ?
To solve this situation, we capture (and retain) the initial intention until the point of usage, through a new ThreadPoolStrategy class. The number of threads to use is deferred as late as possible, until the moment where the std::threads are created (ThreadPool in the case of ThinLTO).
Ideally, we should consider all "processors" (on Windows) or all "CPUs" (Linux) as all equal, in which case heavyweight_hardware_concurrency() wouldn't be needed. I'm not sure how micro-managing threads, cores and NUMA nodes will scale in the years to come (probably not well). Will it make sense to say "I don't want hyper-threads" ? Or saying /opt:lldltojobs=whatever when you have a thousand(s)-core system ? How would that work with NUMA affinity ? For example the Fujitsu A64FX has 4x "12-core tiles" on the same die, each tile being connected to an internal 8-GB HBM2 memory (each located internally on the CPU die). How would we dispatch threads in that case ? The AMD EPYC uses the same concept of "tiles", however it doesn't have internal memory yet, but most likely the EPYC v3 will use the same architecture.
@tejohnson : Teresa, since you added heavyweight_hardware_concurrency(), do you have a benchmark which compares ThinLTO running with heavyweight_hardware_concurrency() or with hardware_concurrency() ? (I haven't done that test yet)
It would make things a lot simpler if we didn't have that API, and in general considered that we could use all hardware threads in the system; and that they can perform equally.