
[ThreadPool] On Windows, extend usage to all CPU sockets and all NUMA groups
ClosedPublic

Authored by aganea on Dec 20 2019, 9:20 AM.

Details

Summary

TL;DR: This patch ensures that, on Windows, all CPU sockets and all NUMA nodes are used by the ThreadPool. The goal is to have LLD/ThinLTO use all hardware threads in the system, which currently isn't the case on multi-socket or large-CPU-count systems.

(this could possibly be split into a few patches, but I just wanted an overall opinion)

Background

Windows doesn't have a flat cpu_set_t like Linux. Instead, it projects hardware CPUs (or NUMA nodes) to applications through a concept of "processor groups". A "processor" is the smallest unit of execution on a CPU: a hyper-thread if SMT is active, a core otherwise. There was a limit of 32 processors on older 32-bit versions of Windows, later raised to 64 processors on 64-bit versions. This limit comes from the affinity mask, which historically was (and still is) pointer-sized (sizeof(void*)). Consequently, the concept of "processor groups" was introduced for dealing with systems with more than 64 hyper-threads.
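As a back-of-the-envelope check of the above (plain arithmetic, not a Windows API; processorGroupCount is a hypothetical helper invented for illustration), the group count follows directly from the 64-processor affinity-mask limit:

```cpp
// Hypothetical helper, not a Windows API: how many processor groups Windows
// needs for a machine with the given number of logical processors, given
// that one group holds at most 64 processors (the width of a 64-bit
// affinity mask).
constexpr unsigned GroupSize = 64;

constexpr unsigned processorGroupCount(unsigned LogicalProcessors) {
  return (LogicalProcessors + GroupSize - 1) / GroupSize; // ceiling division
}
```

For instance, a 128-hyper-thread machine such as the EPYC 7702P mentioned below maps to two groups, while a 64-thread machine still fits in one.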

By default, the Windows OS assigns only one "processor group" to each starting application, in a round-robin manner. If the application wants to use more processors, it needs to enable that programmatically, by assigning threads to other "processor groups". This also means that affinity cannot cross "processor group" boundaries: one can only specify a "preferred" group on startup, but the application is free to allocate more groups if it wants to.
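A minimal sketch of that dispatch (illustrative only; groupForThread is a made-up helper, and real Windows code would then pin each thread to its group with SetThreadGroupAffinity):

```cpp
// Hypothetical helper: choose a processor group for the I-th pool thread by
// cycling round-robin through the available groups. On Windows the thread
// would then be bound to that group via SetThreadGroupAffinity.
unsigned groupForThread(unsigned ThreadIndex, unsigned GroupCount) {
  return GroupCount ? ThreadIndex % GroupCount : 0;
}
```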

This creates a peculiar situation, where newer CPUs like the AMD EPYC 7702P (64-cores, 128-hyperthreads) are projected by the OS as two (2) "processor groups". This means that by default, an application can only use half of the cores. This situation will only get worse in the years to come, as dies with more cores will appear on the market.

The changes in this patch

The problem is that the heavyweight_hardware_concurrency() API was introduced so that only one hardware thread per core would be used; once that API returns, that original intention is lost. Consider a situation, on Windows, where the system has 2 CPU sockets with 18 cores each, each core having 2 hyper-threads, for a total of 72 hyper-threads. Both heavyweight_hardware_concurrency() and hardware_concurrency() currently return 36, because on Windows they are simply wrappers over std::thread::hardware_concurrency(), which returns only processors from the current "processor group".
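The arithmetic of that example can be spelled out as a sketch (the socket/core counts are the ones from the paragraph above; nothing here is LLVM code or queries a real machine):

```cpp
// 2 sockets x 18 cores x 2 SMT threads = 72 hardware threads total.
unsigned totalHardwareThreads(unsigned Sockets, unsigned CoresPerSocket,
                              unsigned ThreadsPerCore) {
  return Sockets * CoresPerSocket * ThreadsPerCore;
}

// Pre-patch, on Windows, both concurrency functions reported only the
// current processor group, i.e. half of this machine.
unsigned perGroup(unsigned TotalThreads, unsigned GroupCount) {
  return TotalThreads / GroupCount;
}
```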

What if we wanted to use all "processor groups"? Even if we implemented heavyweight_hardware_concurrency() properly, what should it then return? 18 or 36?
What if the user specified /opt:lldltojobs=36? Should we assign 36 threads to the current "processor group", or should we dispatch the extra threads to the second "processor group"?

To solve this situation, we capture (and retain) the initial intention until the point of usage, through a new ThreadPoolStrategy class. The number of threads to use is deferred as late as possible, until the moment where the std::threads are created (ThreadPool in the case of ThinLTO).
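A much-simplified sketch of that idea (the field names follow the review discussion below, but this is not the actual LLVM class):

```cpp
#include <algorithm>

// Sketch of a strategy object that captures the caller's intent and defers
// computing the final thread count until the pool actually spawns its
// std::threads.
struct ThreadPoolStrategy {
  unsigned ThreadsRequested = 0; // 0 means "use the default"
  bool UseHyperThreads = true;   // false: one thread per physical core

  // HardwareThreads would be queried at pool-creation time, across all
  // processor groups; it is passed in here to keep the sketch portable.
  unsigned computeThreadCount(unsigned HardwareThreads) const {
    unsigned Max = UseHyperThreads ? HardwareThreads : HardwareThreads / 2;
    if (ThreadsRequested == 0)
      return Max;
    return std::min(ThreadsRequested, Max); // capped, as in this patch
  }
};
```

On the 72-thread example above, the defaults give 72 threads, UseHyperThreads = false gives 36, and an explicit request is capped at the hardware limit (a design choice debated further down in this review).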

Discussion

Ideally, we should consider all "processors" (on Windows) or all "CPUs" (Linux) as equal, in which case heavyweight_hardware_concurrency() wouldn't be needed. I'm not sure how micro-managing threads, cores and NUMA nodes will scale in the years to come (probably not well). Will it make sense to say "I don't want hyper-threads"? Or to say /opt:lldltojobs=whatever when you have a thousand-core system? How would that work with NUMA affinity? For example, the Fujitsu A64FX has 4x "12-core tiles" on the same die, each tile being connected to an internal 8-GB HBM2 memory (each located internally on the CPU die). How would we dispatch threads in that case? The AMD EPYC uses the same concept of "tiles"; it doesn't have internal memory yet, but most likely the EPYC v3 will use the same architecture.

@tejohnson : Teresa, since you added heavyweight_hardware_concurrency(), do you have a benchmark which compares ThinLTO running with heavyweight_hardware_concurrency() versus hardware_concurrency()? (I haven't done that test yet)
It would make things a lot simpler if we didn't have that API and, in general, considered that we could use all hardware threads in the system, and that they all perform equally.

NOTE: For full effect, this patch needs rpmalloc (D71786)

Diff Detail

Event Timeline

aganea created this revision.Dec 20 2019, 9:20 AM
Herald added projects: Restricted Project, Restricted Project.
aganea edited the summary of this revision. (Show Details)Dec 20 2019, 9:28 AM
aganea added a reviewer: jlpeyton.

Unit tests: pass. 61057 tests passed, 0 failed and 728 were skipped.

clang-tidy: fail. Please fix clang-tidy findings.

clang-format: fail. Please format your changes with clang-format by running git-clang-format HEAD^ or applying this patch.

Build artifacts: diff.json, clang-tidy.txt, clang-format.patch, CMakeCache.txt, console-log.txt, test-results.xml

Will it make sense to say "I don't want hyper-threads" ?

Not sure I remember correctly, but I believe one motivation for avoiding "hyper-threads" and other virtual cores was that while they slightly improve performance, they also increase the peak memory requirements: using heavyweight_hardware_concurrency() seemed like a good default tradeoff for most end-users.

Also: using heavyweight_hardware_concurrency() in the linker while having multiple linker jobs scheduled by the build system was another reason (I think LLVM's CMake defaults to 2 parallel link jobs when using ThinLTO, for instance).

ychen added a subscriber: ychen.Dec 20 2019, 11:22 PM
aganea added a comment.EditedDec 21 2019, 8:24 AM

Will it make sense to say "I don't want hyper-threads" ?

Not sure I remember correctly, but I believe one motivation for avoiding "hyper-threads" and other virtual cores was that while they slightly improve performance, they also increase the peak memory requirements: using heavyweight_hardware_concurrency() seemed like a good default tradeoff for most end-users.

It all makes sense. After this patch, memory consumption is doubled when using both CPU sockets. Evidently there's also a question of memory bandwidth, which doesn't scale in my case when using both sockets (possibly it could be better on an AMD EPYC, because it has more memory channels). This is also why enabling the second socket only marginally decreases the timings.

In Ubisoft's case, time (both compute and human) is immensely more valuable than memory sticks. Historically, we didn't really use LTO on game productions because it was really slow and often introduced undesirable bugs or side-effects. The graphs in D71786 are for Rainbow 6: Siege, which is a "smaller" codebase. For larger games, LTO link time is more in the range of 1 h 20 min, both for MSVC and previous versions of Clang. If there's an LTO-specific bug in a final build, it is very hard to iterate with link times like that. In addition, there are hundreds of builds every day on the build system, and we want to keep all its cores busy. This is why both build and link times are important to us.

Also: using heavyweight_hardware_concurrency() in the linker while having multiple linker jobs scheduled by the build system was another reason (I think LLVM's CMake defaults to 2 parallel link jobs when using ThinLTO, for instance).

Understood. If one sets the CPU affinity when starting the application, i.e. start /affinity XXX lld-link.exe ..., then this patch disables dispatching on other "processor groups", even if they are available. However, there doesn't seem to be a way to restrict the application to one "processor group".

It remains to be seen whether we want to keep the current behavior, where only one socket is used, and add an option to enable the other CPU sockets; or whether we want the multi-socket behavior by default, as in this patch.

+@gbiv @inglorion, other ThinLTO users.

To solve this situation, we capture (and retain) the initial intention until the point of usage, through a new ThreadPoolStrategy class. The number of threads to use is deferred as late as possible, until the moment where the std::threads are created (ThreadPool in the case of ThinLTO).

That seems reasonable to me.

llvm/include/llvm/ADT/SmallBitVector.h
487 ↗(On Diff #234910)

I guess we don't need these changes, since it looks like you use BitVector only below.

500 ↗(On Diff #234910)

You return A - B but then compare for equality to -1 and 1. I guess it works because you are doing it bit by bit, but it's exciting.

llvm/include/llvm/Support/Threading.h
152

Let's name the fields in a way that indicates that these numbers are the requested number of threads, not the final number. So, ThreadsRequested or something like that.

157

This could be UseHyperThreads. The first time I read it, I guessed that it indicated if the system has hyperthreads.

llvm/lib/Support/Host.cpp
1277

Another -1 case.

1334

Note: the -1 case.

llvm/lib/Support/Threading.cpp
84–85

We already have a public API, getHostNumPhysicalCores. Can this use that?

87

I see these computeHostNum* methods return int, and some versions seem to return -1 to indicate errors. I think you'll want to use int here and check if MaxThreadCount <= 0. Otherwise below we may do a signed/unsigned comparison mismatch, and return ~0U for MaxThreadCount.
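The pitfall can be shown in a couple of lines (the helper and its fallback value are made up purely for illustration, not taken from the patch):

```cpp
// An int error code of -1, converted to unsigned, becomes ~0U (4294967295),
// so the sign must be checked while the value is still signed.
unsigned threadCountFrom(int ComputedCores) {
  if (ComputedCores <= 0) // still signed: catches the -1 error case
    return 1;             // hypothetical fallback, for illustration only
  return static_cast<unsigned>(ComputedCores);
}
```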

llvm/lib/Support/Unix/Threading.inc
273

I'm not sure it makes sense to say "physical threads". I think "physical cores" was meant to distinguish between hardware threads and cores. Describing hardware execution resources as physical or non-physical isn't very precise or meaningful in the first place, but I don't think it should apply to hyper threads.

llvm/lib/Support/Windows/Threading.inc
160

This is cached, so maybe getProcessorGroups to indicate that it is not intended to be expensive.

212

Seems like static can be used here.

llvm/unittests/Support/ThreadPool.cpp
31

"Temporary" >_>

180

I guess this is why you added comparison operators. In any case, let's remove the commented out code in the final version.

aganea updated this revision to Diff 237197.Jan 9 2020, 2:12 PM
aganea marked 13 inline comments as done.

Updated as suggested by @rnk.
I've also removed ThreadPoolStrategy::PinThreads because it wasn't conclusive. In the best case, pinning threads to a core/hyper-thread was similar in performance to using a full CPU-socket affinity; in the worst case, it degraded performance. The NT scheduler seems to be doing a pretty good job here, so we'll stick with it :-)

aganea added inline comments.Jan 9 2020, 2:12 PM
llvm/lib/Support/Unix/Threading.inc
273

computeHostNumHardwareThreads() ?

ormris added a subscriber: ormris.Jan 9 2020, 2:45 PM

Ping! Any further comments?

rnk accepted this revision.Feb 11 2020, 5:52 PM

lgtm

llvm/lib/Support/Unix/Threading.inc
273

lgtm

This revision is now accepted and ready to land.Feb 11 2020, 5:52 PM
This revision now requires review to proceed.Feb 11 2020, 5:52 PM
rriddle accepted this revision.Feb 13 2020, 11:28 AM

(Appeasing herald for MLIR changes).

This revision is now accepted and ready to land.Feb 13 2020, 11:28 AM
This revision was automatically updated to reflect the committed changes.

https://bugs.chromium.org/p/chromium/issues/detail?id=1051578#c12 :

"""
FYI for those building on AMD Bulldozer family of processors and its various iterations after this commit:

https://reviews.llvm.org/D71775

Building with ThinLTO on Bulldozer and similar appears to now be capped to how Windows reports cores versus logical processors, thus now halving the number of LTO threads available when building. Manually setting /opt:lldltojobs= for LLD does not override it, as that only sets an upper limit.

Found out as I locally build on a 32-core Opteron system. Windows treats it as 16 cores and 32 logical processors, but it is not an SMT setup like Intel HyperThreading. In particular:

"A module consists of a coupling of two "conventional" x86 out of order processing cores. The processing core shares the early pipeline stages (e.g. L1i, fetch, decode), the FPUs, and the L2 cache with the rest of the module."

https://en.wikipedia.org/wiki/Bulldozer_(microarchitecture)

Naturally, build times have increased dramatically. YMMV.
"""

Sounds like this patch might have some drawbacks.

aganea added a comment.EditedFeb 25 2020, 12:14 PM

@thakis I think this is a side-effect of implementing computeHostNumPhysicalCores() for Windows: it previously returned -1, which in turn made llvm::heavyweight_hardware_concurrency() behave like llvm::hardware_concurrency() (on Windows only).
At first sight, Linux seems to report the same thing as Windows (2 cores per "module"). If I read this correctly, the behavior on Linux for an AMD Opteron is/was the same as it currently is on Windows after this patch: only one core out of two is used by llvm::heavyweight_hardware_concurrency().

I think there is a wider problem outside the scope of this patch: some users want to use 100% of the cores/hyper-threads when using ThinLTO, and currently there's no way to do that. An algorithm using llvm::heavyweight_hardware_concurrency() explicitly states "I only want one core out of two". It'd be nice if we had a cmd-line flag to override this, to say "I want to use all hyper-threads/cores/modules". /opt:lldltojobs=all? The impact is small on a modern Intel Skylake CPU, but it could matter more on other CPUs (AMD Bulldozer).

CORRECTION: The impact is small(er) with SMT enabled because only half of that link time is truly multi-threaded (in this example). The other half is essentially the falloff plus the regular .EXE section merging & debug-info merging (none of which is currently multi-threaded in the COFF driver).
Graph of the CPU usage when linking clang.exe (here with rpmalloc):

I too can see how SMT might not afford much performance difference for LTO codegen.

CMT appears to be more significant. I do not have exact numbers right now as my build box is busy, but the change added about an hour to locally building the Chromium browser with ThinLTO optimizations enabled, on Win10 running on a 32-core Opteron Piledriver (bdver2) system.

Definitely agree something like "/opt:lldltojobs=all" in a separate patch would be a good solution if possible for this particular (corner) case.

There is something puzzling to me in this patch, in the way the ThreadPool was modified: the client of the thread pool cannot request a number of threads anymore, only a *maximum* number of threads. I don't really understand why. Using the OS-reported concurrency was always just a "default"; isn't this change orthogonal/unrelated to the stated goal of this patch?

aganea marked an inline comment as done.Feb 26 2020, 8:29 AM
aganea added inline comments.
llvm/lib/Support/Threading.cpp
94

@mehdi_amini You mean this? Testing showed degraded performance if ThreadsRequested > MaxThreadCount, so I thought it'd be better to prevent that situation. More software threads than hardware threads means you pay for context switching, cache eviction and extra memory pressure (even more so if the allocator has per-thread pools).
Do you see cases where not having this check would be valuable? Providing --threads=50 and creating a 50-thread ThreadPool when your CPU only supports 12 hardware threads? The only case I can think of is the use of async operations in threads, but then your ThreadPool sounds more like a queue, and maybe it's not the right hammer for the nail. Should we support that case and explicitly tag the ThreadPool constructor in client code with something like ThreadPool(50, AcknowledgeDegradedPerformance)?
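The two policies under discussion can be contrasted in a small sketch (neither function is LLVM code; the names are made up for this comparison):

```cpp
#include <algorithm>

// Policy in this patch: an explicit request is capped at the hardware limit;
// a request of 0 means "use the default".
unsigned clampPolicy(unsigned Requested, unsigned MaxThreadCount) {
  return Requested ? std::min(Requested, MaxThreadCount) : MaxThreadCount;
}

// Policy argued for in the reply below: honor an explicit request even if it
// oversubscribes the machine; only the default falls back to the hardware
// limit.
unsigned honorPolicy(unsigned Requested, unsigned MaxThreadCount) {
  return Requested ? Requested : MaxThreadCount;
}
```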

mehdi_amini added inline comments.Feb 27 2020, 10:42 PM
llvm/lib/Support/Threading.cpp
94
Testing showed degraded performance

I don't understand why this is relevant to anything other than the default. This is a library, and if the caller overrides the default then IMO this layer should just let it be!

The current issue with Bulldozer is a perfect example of this: the user is *requesting* more threads, so why wouldn't you honor that request? If they want to shoot themselves in the foot, I'd let them (in most cases the user knows something you don't, as here).

I don't think we need the extra AcknowledgeDegradedPerformance flag: if the client overrides the default, they can be assumed to know what they're doing (or they shouldn't use such an API in the first place). This seems more in line with the rest of the system (and C++ APIs in general, I believe).

With this patch, the Threading.PhysicalConcurrency unit test fails when run with an affinity smaller than the number of physical CPUs. I've raised https://bugs.llvm.org/show_bug.cgi?id=45556.