This is an archive of the discontinued LLVM Phabricator instance.

[LinkerWrapper] Perform device linking steps in parallel
ClosedPublic

Authored by jhuber6 on Oct 25 2022, 10:32 AM.

Details

Summary

This patch changes the device linking steps to be performed in parallel
when multiple offloading architectures are used. We use LLVM's
parallelism support to accomplish this by running each individual
device linking job in its own thread. This change required re-parsing
the input arguments, as they carry internal state that would otherwise
not be shared properly between the threads.

By default, the parallelism uses all available threads, but this can be
controlled with the --wrapper-jobs= option. This option was required in a few
tests to keep the output ordering deterministic.
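
For illustration, here is a minimal sketch of how the parallel dispatch could be structured with LLVM's Parallel.h support. The DeviceLinkJob struct and linkDevice helper are hypothetical stand-ins, not the actual ClangLinkerWrapper code:

  #include "llvm/ADT/ArrayRef.h"
  #include "llvm/Support/Error.h"
  #include "llvm/Support/Parallel.h"
  #include "llvm/Support/Threading.h"
  #include <atomic>
  #include <string>
  #include <vector>

  // Hypothetical description of one per-architecture link job; not the actual
  // ClangLinkerWrapper data structure.
  struct DeviceLinkJob {
    std::string Triple;                  // e.g. "amdgcn-amd-amdhsa"
    std::string Arch;                    // e.g. "gfx90a"
    std::vector<std::string> InputFiles; // extracted device objects/bitcode
  };

  // Placeholder for the real per-architecture link step; it would re-parse its
  // own copy of the arguments so option state is not shared across threads.
  static llvm::Error linkDevice(const DeviceLinkJob &Job) {
    (void)Job;
    return llvm::Error::success();
  }

  static llvm::Error linkAllDevices(llvm::ArrayRef<DeviceLinkJob> Jobs,
                                    unsigned Threads) {
    // Threads would come from --wrapper-jobs=N; 0 asks LLVM for all hardware
    // threads, matching the "all available threads" default described above.
    llvm::parallel::strategy = llvm::hardware_concurrency(Threads);

    std::atomic<bool> Failed(false);
    llvm::parallelFor(0, Jobs.size(), [&](size_t I) {
      if (llvm::Error Err = linkDevice(Jobs[I])) {
        llvm::consumeError(std::move(Err)); // Real code would report this.
        Failed = true;
      }
    });
    return Failed ? llvm::createStringError(llvm::inconvertibleErrorCode(),
                                            "device linking failed")
                  : llvm::Error::success();
  }

Re-parsing the arguments per job sidesteps the shared option state mentioned above at the cost of a little redundant work.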

Diff Detail

Event Timeline

jhuber6 created this revision. Oct 25 2022, 10:32 AM
Herald added a project: Restricted Project. Oct 25 2022, 10:32 AM
jhuber6 requested review of this revision. Oct 25 2022, 10:32 AM
Herald added a project: Restricted Project. Oct 25 2022, 10:32 AM
Herald added a subscriber: cfe-commits.
tra added a comment. Oct 25 2022, 11:02 AM

I would argue that parallel compilation and linking may need to be disabled by default. I believe similar patches were discussed in the past regarding sub-compilations, but they are relevant for parallel linking, too.
Google search shows D52193, but I believe there were other attempts in the past.
@yaxunl - I vaguely recall that we did discuss parallel HIP/CUDA compilation in the past, but I can't find the details.

These days most of the builds are parallel already and it's very likely that the build system already launches as many jobs as there are CPUs available. Making each compilation launch multiple parallel subcompilations would likely result in way too many simultaneously running processes.
Granted, linking is done less often than compilation, so a parallel link may be lucky enough to be the last remaining process in the parallel build, but it's not unusual to have multiple linker processes running simultaneously during the build either. Linking is often the most resource-heavy part of the build, so I would not be surprised if even a few linker instances caused problems once they start spawning parallel sub-linking jobs.

Having parallel subcompilations may be useful in some cases -- e.g. distributed compilation with one compilation per remote worker w/ multiple CPUs available on the worker, but that's unlikely to be a common scenario.

Having deterministic output is also very important, both for the build repeatability/provenance tracking and for the build system's cache hit rates. Reliably cached slow repeatable compilation will be a net win over fast, but unstable compilation that causes cache churn and triggers more things to be rebuilt.

I would argue that parallel compilation and linking may need to be disabled by default. I believe similar patches were discussed in the past regarding sub-compilations, but they are relevant for parallel linking, too.
Google search shows D52193, but I believe there were other attempts in the past.
@yaxunl - I vaguely recall that we did discuss parallel HIP/CUDA compilation in the past, but I can't find the details.

I think parallel compilation might be desirable as well, but it's a harder sell than parallel linking in my opinion. However, as an opt-in feature it would be very helpful in some cases. For example, consider someone creating a static library that supports every GPU architecture LLVM supports; it would be nice to be able to optionally turn on parallelism in the driver.

clang lib.c -fopenmp -O3 -fvisibility=hidden -foffload-lto -nostdlib --offload-arch=gfx700,gfx701,gfx801,gfx803,gfx900,gfx902,gfx906,gfx908,gfx90a,gfx90c,gfx940,gfx1010,gfx1030,gfx1031,gfx1032,gfx1033,gfx1034,gfx1035,gfx1036,gfx1100,gfx1101,gfx1102,gfx1103,sm_35,sm_37,sm_50,sm_52,sm_53,sm_60,sm_61,sm_62,sm_70,sm_72,sm_75,sm_80,sm_86

This is something we might be doing more often as we start trying to provide standard library features on the GPU via static libraries. It might be wasteful to compile for every architecture but I think it's the soundest approach if we want compatibility.

These days most of the builds are parallel already and it's very likely that the build system already launches as many jobs as there are CPUs available. Making each compilation launch multiple parallel subcompilations would likely result in way too many simultaneously running processes.
Granted, linking is done less often than compilation, so a parallel link may be lucky enough to be the last remaining process in the parallel build, but it's not unusual to have multiple linker processes running simultaneously during the build either. Linking is often the most resource-heavy part of the build, so I would not be surprised if even a few linker instances caused problems once they start spawning parallel sub-linking jobs.

lld already uses all available threads for its parallel linking, and the linker wrapper runs before the host linker invocation, so it shouldn't interfere either. My only concern is that in the future we may try to support faster LTO linking via ThinLTO or some other parallel implementation. I think there's a reasonable precedent for parallel linking already.

Having parallel subcompilations may be useful in some cases -- e.g. distributed compilation with one compilation per remote worker w/ multiple CPUs available on the worker, but that's unlikely to be a common scenario.
Having deterministic output is also very important, both for the build repeatability/provenance tracking and for the build system's cache hit rates. Reliably cached slow repeatable compilation will be a net win over fast, but unstable compilation that causes cache churn and triggers more things to be rebuilt.

This is only non-deterministic for the order of linking jobs between several targets and architectures. If the user only links a single architecture it should behave as before. The average case is still probably going to be one or two architectures at once, in which case this change won't make much of a difference.

tra added a comment. Oct 25 2022, 11:54 AM

However, as an opt-in feature it would be very helpful in some cases.

I'm OK with the explicit opt-in.

For example, consider someone creating a static library that supports every GPU architecture LLVM supports; it would be nice to be able to optionally turn on parallelism in the driver.

Yes, but the implicit assumption here is that you have sufficient resources. If you create N libraries, each for M architectures, your build machine may not have enough memory for N*M linkers.
Having N*M processes may or may not be an issue, but if each of those linkers is an lld that may want to run its own K parallel threads, it would not help anything.

In other words, I agree that it may be helpful in some cases, but I can also see how it may actually hurt the build, possibly catastrophically.

clang lib.c -fopenmp -O3 -fvisibility=hidden -foffload-lto -nostdlib --offload-arch=gfx700,gfx701,gfx801,gfx803,gfx900,gfx902,gfx906,gfx908,gfx90a,gfx90c,gfx940,gfx1010,gfx1030,gfx1031,gfx1032,gfx1033,gfx1034,gfx1035,gfx1036,gfx1100,gfx1101,gfx1102,gfx1103,sm_35,sm_37,sm_50,sm_52,sm_53,sm_60,sm_61,sm_62,sm_70,sm_72,sm_75,sm_80,sm_86

This is something we might be doing more often as we start trying to provide standard library features on the GPU via static libraries. It might be wasteful to compile for every architecture but I think it's the soundest approach if we want compatibility.

My point is that grabbing resources will likely break the build system's assumptions about their availability. How that would affect the build is anyone's guess. With infinite resources, parallel-everything would win, but in practice it's a big maybe. It would likely be a win for small builds and probably a wash or a regression for a larger build with multiple such targets.

Ideally, there would be a way to cooperate with the build system and let it manage the scheduling, but I don't think we have a good way of doing that.
E.g. for CUDA compilation I was thinking of exposing per-GPU sub-compilations (well, we already do with --cuda-host-only/--cuda-device-only) and providing a way to create a combined object from them, and then letting the build system manage how those per-GPU compilations are launched. The problem there is that the build system would need to know our under-the-hood implementation details, so such an approach would be very fragile. The way the new driver does things may be a bit more suitable for this, but I suspect it would still be hard to do.

lld already uses all available threads for its parallel linking, and the linker wrapper runs before the host linker invocation, so it shouldn't interfere either.

You do have a point here. As long as we don't end up with too many threads (e.g. we guarantee that each per-offload linker instance does not run its own parallel threads), offload linking may be similar to parallel lld.

This is only non-deterministic for the order of linking jobs between several targets and architectures. If the user only links a single architecture it should behave as before.

I'm not sure what you mean. Are you saying that linking with --offload-arch=gfx700 is repeatable, but with --offload-arch=gfx700,gfx701 it's not? That would still be a problem.

The average case is still probably going to be one or two architectures at once, in which case this change won't make much of a difference.

Any difference is a difference, as far as content-based caching and provenance tracking are concerned.

However, as an opt-in feature it would be very helpful in some cases.

I'm OK with the explicit opt-in.

It might be good to start with this as opt-in for this patch and discuss the defaults later; I'll make that change.

For example, consider someone creating a static library that supports every GPU architecture LLVM supports; it would be nice to be able to optionally turn on parallelism in the driver.

Yes, but the implicit assumption here is that you have sufficient resources. If you create N libraries, each for M architectures, your build machine may not have enough memory for N*M linkers.
Having N*M processes may or may not be an issue, but if each of those linkers is an lld that may want to run its own K parallel threads, it would not help anything.

That's true, AMDGPU uses lld as its linker, so we would be invoking a potentially parallel link step from multiple threads. I'm not sure how much of an impact this would have in practice.

In other words, I agree that it may be helpful in some cases, but I can also see how it may actually hurt the build, possibly catastrophically.

My point is that grabbing resources will likely break the build system's assumptions about their availability. How that would affect the build is anyone's guess. With infinite resources, parallel-everything would win, but in practice it's a big maybe. It would likely be a win for small builds and probably a wash or a regression for a larger build with multiple such targets.

Ideally, there would be a way to cooperate with the build system and let it manage the scheduling, but I don't think we have a good way of doing that.
E.g. for CUDA compilation I was thinking of exposing per-GPU sub-compilations (well, we already do with --cuda-host-only/--cuda-device-only) and providing a way to create a combined object from them, and then letting the build system manage how those per-GPU compilations are launched. The problem there is that the build system would need to know our under-the-hood implementation details, so such an approach would be very fragile. The way the new driver does things may be a bit more suitable for this, but I suspect it would still be hard to do.

lld already uses all available threads for its parallel linking, and the linker wrapper runs before the host linker invocation, so it shouldn't interfere either.

You do have a point here. As long as we don't end up with too many threads (e.g. we guarantee that each per-offload linker instance does not run its own parallel threads), offload linking may be similar to parallel lld.

AMDGPU calls lld for its device linking stage, and LTO has ThinLTO, which could potentially use more threads. But generally I think the number of threads required for device linking will probably be small, and running the jobs in parallel will always be beneficial compared to running them sequentially.

This is only non-deterministic for the order of linking jobs between several targets and architectures. If the user only links a single architecture it should behave as before.

I'm not sure what you mean. Are you saying that linking with --offload-arch=gfx700 is repeatable, but with --offload-arch=gfx700,gfx701 it's not? That would still be a problem.

The average case is still probably going to be one or two architectures at once, in which case this change won't make much of a difference.

Any difference is a difference, as far as content-based caching and provenance tracking are concerned.

This does bring up a good point. The linked outputs are going to be entered in an arbitrary order now. We should probably sort them by some metric at least; otherwise the same inputs could produce a binary with its images in a different order each time. I'll also make that change.

In general, I think parallelizing the linking workload for multiple GPUs in the linker wrapper is a useful feature. I am not sure whether the workload to be parallelized includes the LLVM passes and codegen, which are usually the bottleneck. Parallelizing this workload when there are many GPU architectures can significantly improve build time.

It would be preferable if the parallelization could be coordinated with GNU make through the jobserver it provides (https://www.gnu.org/software/make/manual/html_node/Job-Slots.html#Job-Slots). However, some effort is needed to implement that.
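
For reference, a rough sketch of the fd-based jobserver handshake that link describes, assuming the classic --jobserver-auth=R,W (or the older --jobserver-fds=) form in MAKEFLAGS; nothing like this exists in the linker wrapper today, and the helper names are illustrative:

  #include <cstdio>
  #include <cstdlib>
  #include <cstring>
  #include <unistd.h>

  // Parse the jobserver file descriptors GNU make advertises in MAKEFLAGS.
  static bool getJobserverFDs(int &ReadFD, int &WriteFD) {
    const char *Flags = std::getenv("MAKEFLAGS");
    if (!Flags)
      return false;
    const char *Auth = std::strstr(Flags, "--jobserver-auth=");
    if (!Auth)
      Auth = std::strstr(Flags, "--jobserver-fds="); // older GNU make spelling
    if (!Auth)
      return false;
    return std::sscanf(std::strchr(Auth, '=') + 1, "%d,%d", &ReadFD, &WriteFD) == 2;
  }

  // Each process gets one implicit job slot; every additional parallel link
  // would first take a one-byte token from the pipe (blocking read)...
  static bool acquireSlot(int ReadFD, char &Token) {
    return read(ReadFD, &Token, 1) == 1;
  }

  // ...and write the same token back once that link job has finished.
  static void releaseSlot(int WriteFD, char Token) {
    (void)write(WriteFD, &Token, 1);
  }

If I remember correctly, make only keeps those descriptors open for recipes it treats as recursive (marked with +), which is part of the extra effort mentioned above.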

For now, I think an option to enable parallelization (by default off) should be fine.

jhuber6 updated this revision to Diff 470859. Oct 26 2022, 10:30 AM

Make the default number of threads one and let users pass -Wl,--wrapper-jobs=N to opt into parallelism.
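
For example, an invocation opting into four parallel device link jobs might look like this (the architecture list is just an illustrative subset of the earlier one):

clang lib.c -fopenmp -O3 -foffload-lto --offload-arch=gfx90a,gfx1030,sm_70,sm_80 -Wl,--wrapper-jobs=4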

jhuber6 updated this revision to Diff 470867. Oct 26 2022, 10:52 AM

Adding a sort so the entries appear in a deterministic order. The sort is simply a lexicographic comparison.

jhuber6 updated this revision to Diff 471994. Oct 31 2022, 7:01 AM

Ping and fix test.

yaxunl added inline comments. Oct 31 2022, 7:12 AM
clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp
1211–1212

Should we also sort by offload kind? In the future, we may have both OpenMP and HIP binaries embedded.

jhuber6 added inline comments. Oct 31 2022, 7:22 AM
clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp
1211–1212

Sure, this is used to get a deterministic order so we should make sure that those types line up.

jhuber6 updated this revision to Diff 471998. Oct 31 2022, 7:24 AM

Sorting on offload kind as well.
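
To illustrate the ordering being settled on here: a minimal sketch of a lexicographic sort over (offload kind, triple, arch). The LinkedImage struct and its fields are hypothetical, not the actual ClangLinkerWrapper types:

  #include "llvm/ADT/STLExtras.h"
  #include <string>
  #include <tuple>
  #include <vector>

  // Hypothetical record for one linked device image.
  struct LinkedImage {
    unsigned Kind;      // offload kind, e.g. OpenMP vs. HIP
    std::string Triple; // e.g. "amdgcn-amd-amdhsa"
    std::string Arch;   // e.g. "gfx90a"
  };

  // Sort on (kind, triple, arch) so the embedded order does not depend on
  // which parallel link job happened to finish first.
  static void sortImages(std::vector<LinkedImage> &Images) {
    llvm::sort(Images, [](const LinkedImage &LHS, const LinkedImage &RHS) {
      return std::tie(LHS.Kind, LHS.Triple, LHS.Arch) <
             std::tie(RHS.Kind, RHS.Triple, RHS.Arch);
    });
  }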

tra accepted this revision. Nov 9 2022, 12:44 PM
This revision is now accepted and ready to land. Nov 9 2022, 12:44 PM
This revision was automatically updated to reflect the committed changes.