This is an archive of the discontinued LLVM Phabricator instance.

[LinkerWrapper] Perform device linking steps in parallel
ClosedPublic

Authored by jhuber6 on Oct 25 2022, 10:32 AM.

Details

Summary

This patch changes the device linking steps to be performed in parallel
when multiple offloading architectures are used. We use LLVM's
parallelism support to accomplish this by running each individual
device linking job in its own thread. This change required re-parsing
the input arguments, as they carry internal state that would otherwise
not be shared properly between the threads.

By default, the parallelism uses all available threads, but this can be
controlled with the --wrapper-jobs= option. This option was required in a few
tests to keep the output ordering deterministic.
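
For illustration, here is a minimal sketch of how the parallel dispatch could be structured with LLVM's Parallel.h support. The DeviceLinkJob struct and linkDevice helper are hypothetical stand-ins, not the actual ClangLinkerWrapper code:

  #include "llvm/ADT/ArrayRef.h"
  #include "llvm/Support/Error.h"
  #include "llvm/Support/Parallel.h"
  #include "llvm/Support/Threading.h"
  #include <atomic>
  #include <string>
  #include <vector>

  // Hypothetical description of one per-architecture link job; not the actual
  // ClangLinkerWrapper data structure.
  struct DeviceLinkJob {
    std::string Triple;                  // e.g. "amdgcn-amd-amdhsa"
    std::string Arch;                    // e.g. "gfx90a"
    std::vector<std::string> InputFiles; // extracted device objects/bitcode
  };

  // Placeholder for the real per-architecture link step; it would re-parse its
  // own copy of the arguments so option state is not shared across threads.
  static llvm::Error linkDevice(const DeviceLinkJob &Job) {
    (void)Job;
    return llvm::Error::success();
  }

  static llvm::Error linkAllDevices(llvm::ArrayRef<DeviceLinkJob> Jobs,
                                    unsigned Threads) {
    // Threads would come from --wrapper-jobs=N; 0 asks LLVM for all hardware
    // threads, matching the "all available threads" default described above.
    llvm::parallel::strategy = llvm::hardware_concurrency(Threads);

    std::atomic<bool> Failed(false);
    llvm::parallelFor(0, Jobs.size(), [&](size_t I) {
      if (llvm::Error Err = linkDevice(Jobs[I])) {
        llvm::consumeError(std::move(Err)); // Real code would report this.
        Failed = true;
      }
    });
    return Failed ? llvm::createStringError(llvm::inconvertibleErrorCode(),
                                            "device linking failed")
                  : llvm::Error::success();
  }

Re-parsing the arguments per job sidesteps the shared option state mentioned above at the cost of a little redundant work.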

Diff Detail

Event Timeline

jhuber6 created this revision. Oct 25 2022, 10:32 AM
Herald added a project: Restricted Project. Oct 25 2022, 10:32 AM
jhuber6 requested review of this revision. Oct 25 2022, 10:32 AM
Herald added a project: Restricted Project. Oct 25 2022, 10:32 AM
Herald added a subscriber: cfe-commits.
tra added a comment. Oct 25 2022, 11:02 AM

I would argue that parallel compilation and linking may need to be disabled by default. I believe similar patches were discussed in the past regarding sub-compilations, but they are relevant for parallel linking, too.
Google search shows D52193, but I believe there were other attempts in the past.
@yaxunl - I vaguely recall that we did discuss parallel HIP/CUDA compilation in the past, but I can't find the details.

These days most of the builds are parallel already and it's very likely that the build system already launches as many jobs as there are CPUs available. Making each compilation launch multiple parallel subcompilations would likely result in way too many simultaneously running processes.
Granted, linking is done less often than compilation, so a parallel link may be lucky enough to be the last remaining process in the parallel build, but it's not unusual to have multiple linker processes running simultaneously during the build either. Linking is often the most resource-heavy part of the build, so I would not be surprised if even a few linker instances caused problems once they start spawning parallel sub-linking jobs.

Having parallel subcompilations may be useful in some cases -- e.g. distributed compilation with one compilation per remote worker w/ multiple CPUs available on the worker, but that's unlikely to be a common scenario.

Having deterministic output is also very important, both for the build repeatability/provenance tracking and for the build system's cache hit rates. Reliably cached slow repeatable compilation will be a net win over fast, but unstable compilation that causes cache churn and triggers more things to be rebuilt.

I would argue that parallel compilation and linking may need to be disabled by default. I believe similar patches were discussed in the past regarding sub-compilations, but they are relevant for parallel linking, too.
Google search shows D52193, but I believe there were other attempts in the past.
@yaxunl - I vaguely recall that we did discuss parallel HIP/CUDA compilation in the past, but I can't find the details.

I think parallel compilation might be desirable as well, but it's a harder sell than parallel linking in my opinion. However, as an opt-in feature it would be very helpful in some cases. For example, consider someone creating a static library that supports every GPU architecture LLVM supports; it would be nice to be able to optionally turn on parallelism in the driver.

clang lib.c -fopenmp -O3 -fvisibility=hidden -foffload-lto -nostdlib --offload-arch=gfx700,gfx701,gfx801,gfx803,gfx900,gfx902,gfx906,gfx908,gfx90a,gfx90c,gfx940,gfx1010,gfx1030,gfx1031,gfx1032,gfx1033,gfx1034,gfx1035,gfx1036,gfx1100,gfx1101,gfx1102,gfx1103,sm_35,sm_37,sm_50,sm_52,sm_53,sm_60,sm_61,sm_62,sm_70,sm_72,sm_75,sm_80,sm_86

This is something we might be doing more often as we start trying to provide standard library features on the GPU via static libraries. It might be wasteful to compile for every architecture but I think it's the soundest approach if we want compatibility.

These days most of the builds are parallel already and it's very likely that the build system already launches as many jobs as there are CPUs available. Making each compilation launch multiple parallel subcompilations would likely result in way too many simultaneously running processes.
Granted, linking is done less often than compilation, so a parallel link may be lucky enough to be the last remaining process in the parallel build, but it's not unusual to have multiple linker processes running simultaneously during the build either. Linking is often the most resource-heavy part of the build, so I would not be surprised if even a few linker instances caused problems once they start spawning parallel sub-linking jobs.

lld already uses all available threads for its parallel linking, and the linker wrapper runs before the host linker invocation, so it shouldn't interfere either. My only concern is that in the future we may try to support faster LTO linking via ThinLTO or some other parallel implementation. I think there's a reasonable precedent for parallel linking already.

Having parallel subcompilations may be useful in some cases -- e.g. distributed compilation with one compilation per remote worker w/ multiple CPUs available on the worker, but that's unlikely to be a common scenario.
Having deterministic output is also very important, both for the build repeatability/provenance tracking and for the build system's cache hit rates. Reliably cached slow repeatable compilation will be a net win over fast, but unstable compilation that causes cache churn and triggers more things to be rebuilt.

This is only non-deterministic for the order of linking jobs between several targets and architectures. If the user only links a single architecture it should behave as before. The average case is still probably going to be one or two architectures at once, in which case this change won't make much of a difference.

tra added a comment. Oct 25 2022, 11:54 AM

However, as an opt-in feature it would be very helpful in some cases.

I'm OK with the explicit opt-in.

For example, consider someone creating a static library that supports every GPU architecture LLVM supports; it would be nice to be able to optionally turn on parallelism in the driver.

Yes, but the implicit assumption here is that you have sufficient resources. If you create N libraries, each for M architectures, your build machine may not have enough memory for N*M linkers.
Having N*M processes may or may not be an issue, but if each of those linkers is an lld that may want to run its own K parallel threads, it would not help anything.

In other words, I agree that it may be helpful in some cases, but I can also see how it may actually hurt the build, possibly catastrophically.

clang lib.c -fopenmp -O3 -fvisibility=hidden -foffload-lto -nostdlib --offload-arch=gfx700,gfx701,gfx801,gfx803,gfx900,gfx902,gfx906,gfx908,gfx90a,gfx90c,gfx940,gfx1010,gfx1030,gfx1031,gfx1032,gfx1033,gfx1034,gfx1035,gfx1036,gfx1100,gfx1101,gfx1102,gfx1103,sm_35,sm_37,sm_50,sm_52,sm_53,sm_60,sm_61,sm_62,sm_70,sm_72,sm_75,sm_80,sm_86

This is something we might be doing more often as we start trying to provide standard library features on the GPU via static libraries. It might be wasteful to compile for every architecture but I think it's the soundest approach if we want compatibility.

My point is that grabbing resources will likely break the build system's assumptions about their availability. How that would affect the build is anyone's guess. With infinite resources, parallel-everything would win, but in practice it's a big maybe. It would likely be a win for small builds and probably a wash or a regression for a larger build with multiple such targets.

Ideally, there would be a way to cooperate with the build system and let it manage the scheduling, but I don't think we have a good way of doing that.
E.g. for CUDA compilation I was thinking of exposing per-GPU sub-compilations (well, we already do with --cuda-host-only/--cuda-device-only) and providing a way to create a combined object from them, and then letting the build system manage how those per-GPU compilations are launched. The problem there is that the build system would need to know our under-the-hood implementation details, so such an approach would be very fragile. The way the new driver does things may be a bit more suitable for this, but I suspect it would still be hard to do.

lld already uses all available threads for its parallel linking, and the linker wrapper runs before the host linker invocation, so it shouldn't interfere either.

You do have a point here. As long as we don't end up with too many threads (e.g. we guarantee that each per-offload linker instance does not run its own parallel threads), offload linking may be similar to parallel lld.

This is only non-deterministic for the order of linking jobs between several targets and architectures. If the user only links a single architecture it should behave as before.

I'm not sure what you mean. Are you saying that linking with --offload-arch=gfx700 is repeatable, but with --offload-arch=gfx700,gfx701 it's not? That would still be a problem.

The average case is still probably going to be one or two architectures at once, in which case this change won't make much of a difference.

Any difference is a difference, as far as content-based caching and provenance tracking are concerned.

However, as an opt-in feature it would be very helpful in some cases.

I'm OK with the explicit opt-in.

It might be good to start with this as opt-in for this patch and discuss the defaults later; I'll make that change.

For example, consider someone creating a static library that supports every GPU architecture LLVM supports; it would be nice to be able to optionally turn on parallelism in the driver.

Yes, but the implicit assumption here is that you have sufficient resources. If you create N libraries, each for M architectures, your build machine may not have enough memory for N*M linkers.
Having N*M processes may or may not be an issue, but if each of those linkers is an lld that may want to run its own K parallel threads, it would not help anything.

That's true, AMDGPU uses lld as its linker, so we would be invoking a potentially parallel link step from multiple threads. I'm not sure how much of an impact this would have in practice.

In other words, I agree that it may be helpful in some cases, but I can also see how it may actually hurt the build, possibly catastrophically.

My point is that grabbing resources will likely break the build system's assumptions about their availability. How that would affect the build is anyone's guess. With infinite resources, parallel-everything would win, but in practice it's a big maybe. It would likely be a win for small builds and probably a wash or a regression for a larger build with multiple such targets.

Ideally, there would be a way to cooperate with the build system and let it manage the scheduling, but I don't think we have a good way of doing that.
E.g. for CUDA compilation I was thinking of exposing per-GPU sub-compilations (well, we already do with --cuda-host-only/--cuda-device-only) and providing a way to create a combined object from them, and then letting the build system manage how those per-GPU compilations are launched. The problem there is that the build system would need to know our under-the-hood implementation details, so such an approach would be very fragile. The way the new driver does things may be a bit more suitable for this, but I suspect it would still be hard to do.

lld already uses all available threads for its parallel linking, and the linker wrapper runs before the host linker invocation, so it shouldn't interfere either.

You do have a point here. As long as we don't end up with too many threads (e.g. we guarantee that each per-offload linker instance does not run its own parallel threads), offload linking may be similar to parallel lld.

AMDGPU calls lld for its device linking stage, and LTO has ThinLTO, which could potentially use more threads. But generally I think the number of threads required for device linking will probably be small, and running the jobs in parallel will always be beneficial compared to running them sequentially.

This is only non-deterministic for the order of linking jobs between several targets and architectures. If the user only links a single architecture it should behave as before.

I'm not sure what you mean. Are you saying that linking with --offload-arch=gfx700 is repeatable, but with --offload-arch=gfx700,gfx701 it's not? That would still be a problem.

The average case is still probably going to be one or two architectures at once, in which case this change won't make much of a difference.

Any difference is a difference, as far as content-based caching and provenance tracking are concerned.

This does bring up a good point. The linked outputs are going to be entered in an arbitrary order now. We should probably sort them by some metric at least; otherwise the same inputs could produce a binary with its images in a different order each time. I'll also make that change.

In general, I think parallelizing the linking workload for multiple GPUs in the linker wrapper is a useful feature. I am not sure whether the workload to be parallelized includes the LLVM passes and codegen, which are usually the bottleneck. Parallelizing this workload when there are many GPU architectures can significantly improve build time.

It would be preferable if the parallelization could be coordinated with GNU make through the jobserver it provides (https://www.gnu.org/software/make/manual/html_node/Job-Slots.html#Job-Slots). However, some effort is needed to implement that.
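
For reference, a rough sketch of the fd-based jobserver handshake that link describes, assuming the classic --jobserver-auth=R,W (or the older --jobserver-fds=) form in MAKEFLAGS; nothing like this exists in the linker wrapper today, and the helper names are illustrative:

  #include <cstdio>
  #include <cstdlib>
  #include <cstring>
  #include <unistd.h>

  // Parse the jobserver file descriptors GNU make advertises in MAKEFLAGS.
  static bool getJobserverFDs(int &ReadFD, int &WriteFD) {
    const char *Flags = std::getenv("MAKEFLAGS");
    if (!Flags)
      return false;
    const char *Auth = std::strstr(Flags, "--jobserver-auth=");
    if (!Auth)
      Auth = std::strstr(Flags, "--jobserver-fds="); // older GNU make spelling
    if (!Auth)
      return false;
    return std::sscanf(std::strchr(Auth, '=') + 1, "%d,%d", &ReadFD, &WriteFD) == 2;
  }

  // Each process gets one implicit job slot; every additional parallel link
  // would first take a one-byte token from the pipe (blocking read)...
  static bool acquireSlot(int ReadFD, char &Token) {
    return read(ReadFD, &Token, 1) == 1;
  }

  // ...and write the same token back once that link job has finished.
  static void releaseSlot(int WriteFD, char Token) {
    (void)write(WriteFD, &Token, 1);
  }

If I remember correctly, make only keeps those descriptors open for recipes it treats as recursive (marked with +), which is part of the extra effort mentioned above.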

For now, I think an option to enable parallelization (by default off) should be fine.

jhuber6 updated this revision to Diff 470859. Oct 26 2022, 10:30 AM

Make the default number of threads one and let users pass -Wl,--wrapper-jobs=N to opt into parallelism.
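
For example, an invocation opting into four parallel device link jobs might look like this (the architecture list is just an illustrative subset of the earlier one):

clang lib.c -fopenmp -O3 -foffload-lto --offload-arch=gfx90a,gfx1030,sm_70,sm_80 -Wl,--wrapper-jobs=4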

jhuber6 updated this revision to Diff 470867. Oct 26 2022, 10:52 AM

Adding a sort so the entries appear in a deterministic order. The sort is simply a lexicographic comparison.

jhuber6 updated this revision to Diff 471994. Oct 31 2022, 7:01 AM

Ping and fix test.

yaxunl added inline comments. Oct 31 2022, 7:12 AM
clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp
1211–1212

Should we also sort by offload kind? In the future, we may have both OpenMP and HIP binaries embedded.

jhuber6 added inline comments. Oct 31 2022, 7:22 AM
clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp
1211–1212

Sure, this is used to get a deterministic order so we should make sure that those types line up.

jhuber6 updated this revision to Diff 471998. Oct 31 2022, 7:24 AM

Sorting on offload kind as well.
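
To illustrate the ordering being settled on here: a minimal sketch of a lexicographic sort over (offload kind, triple, arch). The LinkedImage struct and its fields are hypothetical, not the actual ClangLinkerWrapper types:

  #include "llvm/ADT/STLExtras.h"
  #include <string>
  #include <tuple>
  #include <vector>

  // Hypothetical record for one linked device image.
  struct LinkedImage {
    unsigned Kind;      // offload kind, e.g. OpenMP vs. HIP
    std::string Triple; // e.g. "amdgcn-amd-amdhsa"
    std::string Arch;   // e.g. "gfx90a"
  };

  // Sort on (kind, triple, arch) so the embedded order does not depend on
  // which parallel link job happened to finish first.
  static void sortImages(std::vector<LinkedImage> &Images) {
    llvm::sort(Images, [](const LinkedImage &LHS, const LinkedImage &RHS) {
      return std::tie(LHS.Kind, LHS.Triple, LHS.Arch) <
             std::tie(RHS.Kind, RHS.Triple, RHS.Arch);
    });
  }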

tra accepted this revision. Nov 9 2022, 12:44 PM
This revision is now accepted and ready to land. Nov 9 2022, 12:44 PM
This revision was automatically updated to reflect the committed changes.