This is an archive of the discontinued LLVM Phabricator instance.

[ThinLTO] Allow usage of all SMT threads in the system
Closed, Public

Authored by aganea on Feb 25 2020, 8:20 PM.

Details

Summary

Before this patch, it wasn't possible to extend the ThinLTO threads to all SMT/CMT threads in the system. Only one thread per core was allowed, enforced by the use of llvm::heavyweight_hardware_concurrency() in the ThinLTO code. Any number passed to the LLD flag /opt:lldltojobs=..., or to any other ThinLTO-specific flag, was previously interpreted in the context of llvm::heavyweight_hardware_concurrency(), which means SMT was effectively disabled.

After this patch, one can say in LLD:
/opt:lldltojobs=0 -- Use one std::thread per hardware core in the system (no SMT). This is the default when the flag is not specified.
/opt:lldltojobs=N -- Limit usage to N threads, regardless of heavyweight_hardware_concurrency().
/opt:lldltojobs=all -- Use all hardware threads in the system. Equivalent to /opt:lldltojobs=$(nproc) on Linux and /opt:lldltojobs=%NUMBER_OF_PROCESSORS% on Windows.

When N > number-of-hardware-threads-in-the-system, the std::threads will be dispatched equally on all CPU sockets (tested only on Windows).
When N <= number-of-hardware-threads-on-a-CPU-socket, the std::threads will remain on the CPU socket where the process started (only on Windows).
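To illustrate the mapping above, here is a small hypothetical sketch in C++; parseJobsValue and its return convention are made up for this illustration and are not the actual LLD/LLVM code (error handling for malformed values is elided):

#include <optional>
#include <string>
#include <thread>

// Hypothetical illustration of the /opt:lldltojobs= values described above.
// Returns the number of std::threads to use, or std::nullopt to mean
// "one thread per physical core" (the default, SMT excluded).
static std::optional<unsigned> parseJobsValue(const std::string &Value) {
  if (Value.empty() || Value == "0")
    return std::nullopt;                           // Default: no SMT.
  if (Value == "all")
    return std::thread::hardware_concurrency();    // All hardware threads.
  return static_cast<unsigned>(std::stoul(Value)); // Exactly N threads.
}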

All cmd-line flags and code paths that lead to ThinLTO have been modified by this patch:
-flto-jobs=...
--thinlto-jobs=...
-thinlto-threads=...
--plugin-opt=jobs=...

This is a follow-up for: https://reviews.llvm.org/D71775#1891709 and: https://reviews.llvm.org/D71775#1892013

Diff Detail

Event Timeline

aganea created this revision.Feb 25 2020, 8:20 PM
Herald added projects: Restricted Project, Restricted Project.
aganea edited the summary of this revision. Feb 25 2020, 8:22 PM
ormris added a subscriber: ormris.Feb 26 2020, 1:38 PM
aganea edited the summary of this revision. Feb 27 2020, 6:24 AM

If a simpler (more yucky?) patch is needed to fix what @thakis was suggesting in https://reviews.llvm.org/D71775#1891709, and if we don't want this extra new flag, we can also check the CPU brand for "AMD Opteron", and keep the old behavior in that case.

Based upon the description, I think this patch is more applicable than just targeting a specific AMD proc family since it allows the end-user a choice for maximizing threading with both CMT and SMT on all supported platforms.

BTW, until this patch lands (if it does), I just set a static value in the source as a local workaround for now.

llvm/lib/Support/Host.cpp

#elif defined(_WIN32)
// Defined in llvm/lib/Support/Windows/Threading.inc
// Local workaround: hard-code the physical core count instead of querying it.
static int computeHostNumPhysicalCores() { return 32; }
#else

/opt:lldltojobs=N -- limit usage to N threads, but constrained by usage of heavyweight_hardware_concurrency().

I really dislike this behavior: this seems user hostile to me. I would either:

  • honor the user request (possibly displaying a warning); this is in line with the behavior of other tools, like ninja -j N, for instance.
  • reject the user request

If you want such a behavior, then it should be another flag which expresses it in the name, like /opt:lldltomaxnumjobs.

aganea added a comment.EditedMar 1 2020, 10:02 AM

@mehdi_amini Agreed. In that case, I just need to slightly change this function to ensure threads are equally dispatched on all CPU sockets regardless of the thread strategy.
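For reference, a minimal sketch of how threads could be spread across Windows processor groups (one group roughly corresponds to a CPU socket), using documented Win32 APIs; the round-robin policy and the helper name assignToProcessorGroup are assumptions for illustration, not the code in this patch:

#ifdef _WIN32
#include <windows.h>

// Pin the calling thread to a processor group chosen round-robin from its
// index, so that threads end up spread across all groups (sockets).
static void assignToProcessorGroup(unsigned ThreadIndex) {
  WORD GroupCount = GetActiveProcessorGroupCount();
  if (GroupCount <= 1)
    return; // Single group: nothing to do.
  WORD Group = static_cast<WORD>(ThreadIndex % GroupCount);
  DWORD ProcsInGroup = GetActiveProcessorCount(Group);
  GROUP_AFFINITY Affinity = {};
  Affinity.Group = Group;
  // Allow every processor in the selected group.
  Affinity.Mask = (ProcsInGroup >= sizeof(KAFFINITY) * 8)
                      ? ~KAFFINITY(0)
                      : (KAFFINITY(1) << ProcsInGroup) - 1;
  SetThreadGroupAffinity(GetCurrentThread(), &Affinity, nullptr);
}
#endif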

aganea updated this revision to Diff 248254.Mar 4 2020, 11:24 AM
aganea edited the summary of this revision.
aganea edited reviewers, added: dexonsmith, ruiu; removed: RobRich999, espindola.
aganea added subscribers: ruiu, RobRich999.

Simplify. Revert to previous behavior, where any number of threads can be specified on the cmd-line, as suggested by @mehdi_amini. See the updated Summary for cmd-line usage.

+ @dexonsmith for the Darwin part.
+ @ruiu for the LLD part.

rnk accepted this revision.Mar 4 2020, 1:09 PM

I'm busy and I haven't looked at the code in detail, but I'm OK with going back to the old way of doing things. I think honoring user requests to use more threads than cores is an important use case. We had plans at one point to add indirection points to allow us to distribute these backend actions to other machines. Essentially, this would be done by having LLD invoke $mydiscc clang -c foo.o -fuse-thin-lto-index=foo.thinlto.bc. We've gone a different direction, so that change is not likely to come soon, but it seems like a reasonable use case where one would want to pass -j LARGE and have it be honored.

This revision is now accepted and ready to land.Mar 4 2020, 1:09 PM
aganea added a comment.Mar 4 2020, 2:18 PM

Thanks Reid! I will leave this open for a few days, in case there is any other feedback.

As for -fthinlto-index, we have someone looking at distributing it remotely with Fastbuild. I think it is reasonable in the short term to let the build system handle that (I assume that's what you did). One adjacent question is how to address the size of the local cache folder (i.e. lto.cache when compiling LLVM), which grows very big very quickly. Pruning periodically is fine, but I wonder if we could keep it compressed on disk. Maybe call DeviceIoControl(..., FSCTL_SET_COMPRESSION, ...) and let Windows handle that? I'll take a look in the weeks to come.
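For the compression idea, a minimal sketch of what such a DeviceIoControl call could look like on a directory handle; the helper name enableNtfsCompression is a placeholder, and this is only an assumption of the approach, not code from this patch:

#ifdef _WIN32
#include <windows.h>

// Enable NTFS compression on a directory (e.g. the ThinLTO cache folder).
// Files created inside it afterwards inherit the compressed attribute.
static bool enableNtfsCompression(const wchar_t *Dir) {
  HANDLE H = CreateFileW(Dir, GENERIC_READ | GENERIC_WRITE,
                         FILE_SHARE_READ | FILE_SHARE_WRITE, nullptr,
                         OPEN_EXISTING, FILE_FLAG_BACKUP_SEMANTICS, nullptr);
  if (H == INVALID_HANDLE_VALUE)
    return false;
  USHORT Format = COMPRESSION_FORMAT_DEFAULT;
  DWORD BytesReturned = 0;
  BOOL Ok = DeviceIoControl(H, FSCTL_SET_COMPRESSION, &Format, sizeof(Format),
                            nullptr, 0, &BytesReturned, nullptr);
  CloseHandle(H);
  return Ok != FALSE;
}
#endif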

rnk added a comment.Mar 4 2020, 3:12 PM

I think making the build system responsible for running the backend actions is the ideal solution, actually. The main reason we have all this threading logic in the linker is to make it easy for users of traditional build systems to use ThinLTO with a few flag flips. We haven't actually made build system changes for Chrome yet. Instead, we replaced the linker with a script that wraps the thin link + backend jobs + native link steps. https://crbug.com/877722 has a partial record of ideas we had and stuff we tried for Chrome.

Does taskset -c 0-3 lld -flavor ... restrict the number of cores?

cpu_set_t cpu;
sched_getaffinity(0, sizeof(cpu), &cpu);
CPU_COUNT(&cpu)
abrachet added inline comments.
aganea marked 3 inline comments as done.Mar 25 2020, 3:11 PM
aganea added a subscriber: respindola.

Does taskset -c 0-3 lld -flavor ... restrict the number of cores?

cpu_set_t cpu;
sched_getaffinity(0, sizeof(cpu), &cpu);
CPU_COUNT(&cpu)

Thanks for raising this! This does not seem to work (I currently only have WSL at hand, no real Linux machine). I don't think it worked before my patch either. The current code in LLVM is written as follows (note the "if" statement):

#if defined(HAVE_SCHED_GETAFFINITY) && defined(HAVE_CPU_COUNT)
  cpu_set_t Set;
  if (sched_getaffinity(0, sizeof(Set), &Set))
    return CPU_COUNT(&Set);
#endif

The doc for sched_getaffinity says:

On success, sched_setaffinity() and sched_getaffinity() return 0. On error, -1 is returned, and errno is set appropriately.

So it would always fall back to std::thread::hardware_concurrency, which apparently does not always take affinity into account, according to @respindola (please see rG8c0ff9508da5f02e8ce6580a126a2018c9bf702a).

I'll write a follow-up patch to test affinity on Linux and Windows.
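For reference, the corrected check would compare the return value against 0, along these lines (a sketch of the fix direction, not necessarily the exact code that later landed):

#if defined(HAVE_SCHED_GETAFFINITY) && defined(HAVE_CPU_COUNT)
  cpu_set_t Set;
  // sched_getaffinity() returns 0 on success, so only trust CPU_COUNT()
  // when the call actually succeeded.
  if (sched_getaffinity(0, sizeof(Set), &Set) == 0)
    return CPU_COUNT(&Set);
#endif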

llvm/include/llvm/Support/Threading.h
201

After discussing offline with @abrachet, I'll leave the inline for now. It makes the symbol weak; removing inline would otherwise fail linking. I can move the function(s) to the .cpp after this patch to save on link time.

This revision was automatically updated to reflect the committed changes.
aganea marked an inline comment as done.
MaskRay added inline comments.Mar 27 2020, 12:01 PM
lld/test/ELF/basic.s
252

This change is not needed. lto/thinlto.ll has already tested the functionality.

basic.s should also be split. I did this in 34bdddf9a13cfdbbb5506dc89cf8e781be53105f

253

-verbose is not needed, because verbose just prints input filenames, which has nothing to do with --thinlto-jobs=0.

This patch seems to break when disabling threading (aka LLVM_ENABLE_THREADS == 0), as get_threadpool_strategy is undefined, therefore creating linker failures in clang. (Tested on Windows)

This patch seems to break when disabling threading (aka LLVM_ENABLE_THREADS == 0), as get_threadpool_strategy is undefined, therefore creating linker failures in clang. (Tested on Windows)

Taking a look now.

This patch seems to break when disabling threading (aka LLVM_ENABLE_THREADS == 0), as get_threadpool_strategy is undefined, therefore creating linker failures in clang. (Tested on Windows)

Should be fixed after rG3ab3f3c5d5825476dc1be15992f7c964629de688.

phosek added a subscriber: phosek.Apr 16 2020, 11:50 AM

We've started seeing llvm-cov fail on our Linux bots with this error:

terminating with uncaught exception of type std::__2::system_error: thread constructor failed: Resource temporarily unavailable

Specifically, we're running llvm-cov export, which uses heavyweight_hardware_concurrency (we use the default number of threads, i.e. 0): https://github.com/llvm/llvm-project/blob/master/llvm/tools/llvm-cov/CoverageExporterJson.cpp#L169

I'm not yet sure what's the problem, but bisecting is pointing at this change.

We've started seeing llvm-cov fail on our Linux bots with this error:

terminating with uncaught exception of type std::__2::system_error: thread constructor failed: Resource temporarily unavailable

Specifically, we're running llvm-cov export, which uses heavyweight_hardware_concurrency (we use the default number of threads, i.e. 0): https://github.com/llvm/llvm-project/blob/master/llvm/tools/llvm-cov/CoverageExporterJson.cpp#L169

I'm not yet sure what's the problem, but bisecting is pointing at this change.

Also on runs that succeeded, we see the execution times more than doubled.

We've started seeing llvm-cov fail on our Linux bots with this error:

terminating with uncaught exception of type std::__2::system_error: thread constructor failed: Resource temporarily unavailable

Specifically, we're running llvm-cov export, which uses heavyweight_hardware_concurrency (we use the default number of threads, i.e. 0): https://github.com/llvm/llvm-project/blob/master/llvm/tools/llvm-cov/CoverageExporterJson.cpp#L169

I'm not yet sure what's the problem, but bisecting is pointing at this change.

Also on runs that succeeded, we see the execution times more than doubled.

This is caused by a mix of the previous change (rG8404aeb5, see https://github.com/llvm/llvm-project/commit/8404aeb56a73ab24f9b295111de3b37a37f0b841#diff-9c7f49c15e22d38241ccb9d294f880f4L950), which restricted the ThreadPool to the number of hardware threads, and this patch, which makes the number of threads in the ThreadPool unbounded.

I will revert the behavior of llvm-cov to its previous state, and will check whether there are other cases like this.
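For context, the kind of cap that restores the previous behavior can be sketched in plain C++ as follows; this only illustrates the idea (bound the pool by both the hardware and the number of tasks) and is not the actual llvm-cov fix:

#include <algorithm>
#include <thread>

// Illustration only: when a tool creates one task per input (e.g. one per
// coverage object in llvm-cov), cap the thread count by what the hardware
// offers so the pool never tries to create thousands of threads at once.
static unsigned cappedThreadCount(unsigned Requested, unsigned NumTasks) {
  unsigned Hardware = std::max(1u, std::thread::hardware_concurrency());
  if (Requested == 0) // 0 means "pick a sensible default".
    Requested = Hardware;
  return std::min({Requested, Hardware, std::max(1u, NumTasks)});
}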

So it would always fall back to std::thread::hardware_concurrency, which apparently does not always take affinity into account, according to @respindola (please see rG8c0ff9508da5f02e8ce6580a126a2018c9bf702a).

I'll write a follow-up patch to test affinity on Linux and Windows.

Created D78324 to fix this pre-existing issue. I remember I saw a related bugs.llvm.org report yesterday but I can't find it now...

I remember I saw a related bugs.llvm.org report yesterday but I can't find it now...

This? https://bugs.llvm.org/show_bug.cgi?id=45556