This is an archive of the discontinued LLVM Phabricator instance.

[ELF] Cap parallel::strategy to 16 threads when --threads= is unspecified
ClosedPublic

Authored by MaskRay on Apr 3 2023, 8:43 PM.

Details

Summary

When --threads= is unspecified, we set it to
parallel::strategy.compute_thread_count(), which uses
sched_getaffinity (Linux)/cpuset_getaffinity (FreeBSD)/std::thread::hardware_concurrency (others).
With extensive testing on many machines (many configurations from
{aarch64,x86-64} x {Linux,FreeBSD,Windows} x allocators(native,mimalloc,rpmalloc) combinations)
with varying workloads, we discovered that when the concurrency is larger than
16, linking is slower than with --threads=16 because the parallelism overhead
outweighs the gains. This is particularly harmful on machines with many cores
or when the link job competes with other jobs.

Cap parallel::strategy when --threads= is unspecified.
For some workloads changing the concurrency from 8 to 16 has nearly no improvement.

--thinlto-jobs= is unchanged since ThinLTO backend compiles are embarrassingly
parallel.

Link: https://discourse.llvm.org/t/avoidable-overhead-from-threading-by-default/69160

Diff Detail

Event Timeline

MaskRay created this revision.Apr 3 2023, 8:43 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 3 2023, 8:43 PM
MaskRay requested review of this revision.Apr 3 2023, 8:43 PM

How about turning it into a warning instead:
The parallelism is higher than 8 and --threads= was not specified.

mjguzik added a comment.EditedApr 4 2023, 2:03 AM

I have to repeat the note I made on Discourse: even for something embarrassingly parallel, there is only so much work to do, and it may very well be that the majority of the threads you spawned remain unused.

Absent an informed way to spawn threads, a sanitizing cap is definitely prudent for thinlto.

I don't know what kind of binaries you use the opt with -- would chrome count? ;) I mean, for something really big, how many threads can really end up being used at the same time? Boxes with around 100 hardware threads are not hard to come by; if you can't get one yourself, you can prod me in private and I'll be happy to arrange access to the 96-thread Cascade Lake with Ubuntu, so you can do whatever testing you want.

That said, for my use case the above patch already introduces the damage control so I'm happy to bugger off.

How about turning it into a warning instead:
The parallelism is higher than 8 and --threads= was not specified.

I'd prefer something in the verbose output, such as "Setting maximum parallelism to 8, use --threads=<threads> to increase". I think a warning could be too noisy in this case and would force most people to add an extra parameter to the link to silence it.

Code changes LGTM. I'd like to get some data on a non-x86 machine to back the decision up, so I will see if I can get some time on a high-core-count AArch64 server. As I'm expecting a similar result, no objections if I can't get it done in time.

andrewng added a subscriber: andrewng.EditedApr 4 2023, 6:13 AM

Could you please wait before committing as I'm testing on Windows x86-64. In my initial experiments, I'm seeing reasonable gains up to 16 threads on a 12C/24T system.

Also, perhaps another solution to the issue would be to allow an environment variable to override the "core" parallel strategy thread count? This would provide an easier way to control threading across an entire build.

Could you please wait before committing as I'm testing on Windows x86-64. In my initial experiments, I'm seeing reasonable gains up to 16 threads on a 12C/24T system.

do you have any way to profile what's going on in your case?

These are the link times for a UE4 based link with a LLVM 16 libc++ self-hosted release build with the rpmalloc allocator on Windows 10, AMD Ryzen 3900X 12C/24T, Samsung 970 EVO NVMe SSD, 64GB RAM:

"Regular" link:

Threads | Time (s) | % gain
-j4     | 3.508    | -
-j8     | 2.761    | 21.29
-j12    | 2.647    | 24.53
-j16    | 2.602    | 25.82
-j20    | 2.630    | 25.04
-j24    | 2.663    | 24.08

"GC + ICF" link:

Threads | Time (s) | % gain
-j4     | 4.033    | -
-j8     | 3.273    | 18.84
-j12    | 3.128    | 22.45
-j16    | 3.067    | 23.96
-j20    | 3.092    | 23.33
-j24    | 3.121    | 22.63

So there is improvement all the way to -j16, although the benefit at -j16 is quite small. However, the gain for -j12 is not insignificant. So this example already shows that limiting the thread count to 8 could be detrimental to performance, although in absolute terms it might not be considered significant. It also shows that -j20 and above reduces performance. These results are also likely to vary depending on the toolchain, runtime and allocator in use as well as the test system itself.

I was kind of hoping that the "GC + ICF" link would show better scaling but it's quite similar (perhaps some scope for improvement?). At the end of the day, it's tricky to make threading scale well and tricky to determine the right balance too, particularly when there are so many factors that can affect the outcome. However, it's also very clear that too many threads can be very bad. I think the key thing is to allow the user easy and sufficient control to be able to get "good" performance. However, what the default behaviour should be is not so obvious. Although from the evidence so far, it seems some "cap" would make sense.

do you have any way to profile what's going on in your case?

I could but I suspect that it's just a case of slightly better scaling in this particular link and/or setup.

Thanks for the results. Does not look like the win is necessarily there, and even if so, it is unclear if it is justified by the computing power thrown at it.

That is to say, can you redo these while collecting cycle counts?

I don't know how to do it on Windows in a general manner. If you have an Intel CPU you can use https://github.com/intel/pcm; AMD has a similar tool somewhere. Running pcm-core should do it.

See my opening comment in https://discourse.llvm.org/t/avoidable-overhead-from-threading-by-default/69160 for an example collected on Linux.

Thanks for the results. Does not look like the win is necessarily there, and even if so, it is unclear if it is justified by the computing power thrown at it.

I guess that somewhat depends on what is important to you, i.e. performance per cycle or wall time duration to complete task.

That is to say, can you redo these while collecting cycle counts?

I'm not set up to do that, and IIRC (like most things) it's not so trivial to do on Windows. But as I mentioned above, it partly comes down to what is considered "good" performance.

Thanks for the results. Does not look like the win is necessarily there, and even if so, it is unclear if it is justified by the computing power thrown at it.

I guess that somewhat depends on what is important to you, i.e. performance per cycle or wall time duration to complete task.

I'm trying to say the reported difference is very easily within measurement error.

What most likely happens apart from it is that the cycle count spent on it skyrockets, making the entire thing into a net *loss*.

Unfortunately I have negative Windows clue, so if you can't take a closer look maybe someone else will volunteer.

That is to say, can you redo these while collecting cycle counts?

I'm not set up to do that, and IIRC (like most things) it's not so trivial to do on Windows. But as I mentioned above, it partly comes down to what is considered "good" performance.

I have run a limited experiment linking clang on a 64-core AArch64 machine. Results seem to be similar to x86-64: no benefit seen linking clang past 8 threads. On the machine I had access to I was very limited in disk space, so I couldn't check the debug build. From prior experience, debug builds seem to benefit more from more threads, as there is quite a lot of data to write out in the most parallel part of LLD.

mjguzik added a comment.EditedApr 5 2023, 6:13 AM

I have run a limited experiment linking clang on a 64-core AArch64 machine. Results seem to be similar to x86-64: no benefit seen linking clang past 8 threads. On the machine I had access to I was very limited in disk space, so I couldn't check the debug build. From prior experience, debug builds seem to benefit more from more threads, as there is quite a lot of data to write out in the most parallel part of LLD.

bare metal aarch64 with sufficient(tm) disk space is not hard to come by.

if you think the debug thingy is worth checking out and you can't get sufficient hw as is, you can contact me in private and I'll arrange temporary access to something.

perhaps while the machine is disk starved, it has a lot of ram? as in perhaps you could use a tmpfs mount for the output?

I have run a limited experiment linking clang on a 64-core AArch64 machine. Results seem to be similar to x86-64: no benefit seen linking clang past 8 threads. On the machine I had access to I was very limited in disk space, so I couldn't check the debug build. From prior experience, debug builds seem to benefit more from more threads, as there is quite a lot of data to write out in the most parallel part of LLD.

bare metal aarch64 with sufficient(tm) disk space is not hard to come by.

if you think the debug thingy is worth checking out and you can't get sufficient hw as is, you can contact me in private and I'll arrange temporary access to something.

perhaps while the machine is disk starved, it has a lot of ram? as in perhaps you could use a tmpfs mount for the output?

Thanks for the offer and for the suggestion of tmpfs. I'm more personal time limited than anything else. I really just wanted to check if there was any significant difference between AArch64 and X86_64 on higher core counts and so far there doesn't look to be anything significant. It looks like Windows (or Unreal) may favour a higher count though.

FWIW, timings to link clang without debug:

CPUs | user | system | elapsed
1    | 0.57 | 0.34   | 1.06
8    | 0.66 | 0.65   | 0.55
16   | 0.65 | 1.08   | 0.56
32   | 0.65 | 2.02   | 0.56
64   | 0.73 | 1.94   | 0.58

I've re-run my example, but this time I'm using the Windows process times (instead of a Python script timer) and cycle counts, and removing the output file between each link. As before, these are mean values for 16 iterations of a link. I've included 1 thread, and that's now the "baseline".

Threads | Time (s) | Time gain (%) | Cycles (M) | Cycles overhead (%)
1       | 6.903    | -             | 26,947     | -
4       | 3.028    | 56.14         | 35,170     | 30.51
8       | 2.405    | 65.16         | 46,298     | 71.81
12      | 2.289    | 66.84         | 57,053     | 111.72
16      | 2.240    | 67.54         | 72,185     | 167.87
20      | 2.270    | 67.13         | 83,375     | 209.40
24      | 2.278    | 67.01         | 93,264     | 246.10

The results are pretty similar, with diminishing wall-time improvements up to 16 threads (although at this point the system will be leaning more on SMT threads). As expected, these wall-time improvements come at a significant cost in cycles. So as I've already mentioned, it really comes down to what's most important, e.g. minimum time, efficiency, or some balance between the two. This is also somewhat dependent on the nature of the build, i.e. a build that performs many links compared to a build that does just one. It also depends on the parallelism of the build system itself. This problem has been around for some time, but it's becoming more prominent due to the higher core/thread counts that are now available. I don't think there's really a simple solution that "fits all".

So out of curiosity, I've done another run of the above example but this time I've built LLVM 16 entirely with MSVC 2019 (MSVC STL & allocator).

Threads | Time (s) | Time gain (%) | Cycles (M) | Cycles overhead (%)
1       | 7.695    | -             | 30,036     | -
4       | 3.330    | 57.18         | 39,005     | 28.55
8       | 2.611    | 66.42         | 50,462     | 66.31
12      | 2.452    | 68.47         | 60,219     | 98.46
16      | 2.388    | 69.29         | 74,422     | 145.27
20      | 2.407    | 69.05         | 84,579     | 178.75
24      | 2.426    | 68.80         | 93,242     | 207.30

It's a similar pattern again but overall slower. However, the threading cycles overhead is noticeably less. I suspect that this might be due to the "lock-free" nature of the rpmalloc allocator.

MaskRay updated this revision to Diff 512637.Apr 11 2023, 6:27 PM
MaskRay retitled this revision from [ELF] Cap parallel::strategy to 8 cores when --threads= is unspecified to [ELF] Cap parallel::strategy to 16 cores when --threads= is unspecified.
MaskRay edited the summary of this revision. (Show Details)

8 => 16
update description

peter.smith accepted this revision.Apr 12 2023, 1:12 AM

I think 16 seems to be the consensus. No objections from me.

This revision is now accepted and ready to land.Apr 12 2023, 1:12 AM

LGTM too. I think 16 would be a reasonable cap for now. I'll see if I can test Windows on a system with >16 "real" cores, although not entirely sure I have access to such a PC.

Personally I think that's really pushing it, given the diminishing returns as is, the lack of integration with build systems, and the lack of a mechanism to find out how many threads even make sense for a given input. That's my $0.03, you do you.

MaskRay edited the summary of this revision. (Show Details)Apr 12 2023, 1:09 PM
MaskRay retitled this revision from [ELF] Cap parallel::strategy to 16 cores when --threads= is unspecified to [ELF] Cap parallel::strategy to 16 threads when --threads= is unspecified.

Sorry, I was so focused on the value of the limit that I didn't really look at the details of the code change. Was it deliberate to have config->threadCount not match the number of threads in the parallel::strategy? IIUC, this probably doesn't affect correctness but might have an impact on efficiency.

Sorry, I was so focused on the value of the limit that I didn't really look at the details of the code change. Was it deliberate to have config->threadCount not match the number of threads in the parallel::strategy? IIUC, this probably doesn't affect correctness but might have an impact on efficiency.

Good catch. The code should set config->threadCount. config->threadCount is currently only used for parallel relocation scanning, where an over-sized value still works.
I think not setting config->threadCount is correct but weird. How about:

} else if (config->threadCount > 16) {
  log("set maximum concurrency to 16, specify --threads= to change");
  config->threadCount = 16;
  parallel::strategy = hardware_concurrency(config->threadCount);
}

Good catch. The code should set config->threadCount. config->threadCount is currently only used for parallel relocation scanning, where an over-sized value still works.
I think not setting config->threadCount is correct but weird. How about:

} else if (config->threadCount > 16) {
  log("set maximum concurrency to 16, specify --threads= to change");
  config->threadCount = 16;
  parallel::strategy = hardware_concurrency(config->threadCount);
}

Yes, this LGTM for this case, but I think something also needs to be done for the explicit setting of the --threads option just above. I think it needs: config->threadCount = parallel::strategy.compute_thread_count() after the setting of parallel::strategy. IIUC, compute_thread_count() will take into account any CPU affinity settings. Having now looked more closely at that code, I do wonder if the default value for config->thinLTOJobs should be the same as config->threadCount, i.e. take into account any CPU affinity settings.

MaskRay reopened this revision.Apr 20 2023, 9:49 AM
This revision is now accepted and ready to land.Apr 20 2023, 9:49 AM
MaskRay updated this revision to Diff 515369.Apr 20 2023, 9:54 AM

Keep the config->threadCount assignment below to avoid the pitfall that config->threadCount may be larger than the available concurrency.

Good catch. The code should set config->threadCount. config->threadCount is currently only used for parallel relocation scanning, where an over-sized value still works.
I think not setting config->threadCount is correct but weird. How about:

} else if (config->threadCount > 16) {
  log("set maximum concurrency to 16, specify --threads= to change");
  config->threadCount = 16;
  parallel::strategy = hardware_concurrency(config->threadCount);
}

Yes, this LGTM for this case, but I think something also needs to be done for the explicit setting of the --threads option just above. I think it needs: config->threadCount = parallel::strategy.compute_thread_count() after the setting of parallel::strategy. IIUC, compute_thread_count() will take into account any CPU affinity settings. Having now looked more closely at that code, I do wonder if the default value for config->thinLTOJobs should be the same as config->threadCount, i.e. take into account any CPU affinity settings.

Thanks. For clarity I reverted the commit and reused the differential for a proper fix.

For in-process ThinLTO, the summary says

--thinlto-jobs= is unchanged since ThinLTO backend compiles are embarrassingly parallel.

Any measurement is welcome :) llvm::heavyweight_hardware_concurrency(config->thinLTOJobs) is used for this option, along with several other uses across llvm-project.
It would be a good experiment to measure whether using heavyweight_hardware_concurrency brings improvement over hardware_concurrency.

andrewng accepted this revision.Apr 20 2023, 10:49 AM

LGTM.

lld/ELF/Driver.cpp
1434–1437

This is the "default" value that I was referring to. I wasn't suggesting that this should be capped. But assuming that this is just passed on to llvm::heavyweight_hardware_concurrency(), then it should be fine as is because llvm::heavyweight_hardware_concurrency() should sort out any CPU affinity restrictions.

AFAIK, the "heavyweight" version only counts "real" cores and ignores any SMT ones. So on most X86 CPUs it will be half as many threads. Now that some X86 CPUs have a mix of fast and low-power cores, I'm not too sure what "heavyweight" is or should be in these scenarios.

MaskRay added inline comments.Apr 20 2023, 12:17 PM
lld/ELF/Driver.cpp
1434–1437

Indeed. I think it's worth making a https://discourse.llvm.org/ post to seek feedback on whether we should take hyperthreading into account.

In Bazel, we use distributed ThinLTO, and this value is specific to in-process ThinLTO, so this change will not affect Bazel.
I personally very rarely use -DLLVM_ENABLE_LTO=Thin (in-process ThinLTO), so I do not have experience about a sensible default...

This revision was landed with ongoing or failed builds.Apr 20 2023, 12:17 PM
This revision was automatically updated to reflect the committed changes.