This is an archive of the discontinued LLVM Phabricator instance.

[LinkerWrapper] Add PTX output to CUDA fatbinary in LTO-mode
Needs Review · Public

Authored by jhuber6 on Jun 15 2022, 12:57 PM.

Details

Summary

One current downside of LLVM's support for CUDA in RDC-mode is that we
cannot JIT from the PTX image, which forces the user to specify the
exact architecture when offloading. CUDA's runtime uses a special
method to link the separate PTX files in RDC-mode, but LLVM cannot do
this with the approach it chose for RDC-mode compilation. However, if
we embed bitcode via LTO, we can use the single linked PTX image for
the whole module and include it in the fatbinary. This allows us to do
the following and have it execute even without the correct architecture
specified.

clang foo.cu -foffload-lto -fgpu-rdc --offload-new-driver -lcudart

It is also worth noting that in full-LTO mode, RDC-mode will behave
exactly like non-RDC mode after linking.
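
For instance, with a minimal foo.cu like the following (an illustrative test case, not from the patch), the embedded PTX lets the CUDA runtime JIT the kernel even when the machine's GPU does not match the cubin's target:

  // foo.cu -- minimal illustrative test case.
  // If the fatbinary only held a cubin for the wrong arch, the kernel would
  // silently never run; with PTX embedded, the driver can JIT it instead.
  #include <cstdio>

  __global__ void hello() { printf("Hello from the GPU\n"); }

  int main() {
    hello<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
  }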

Depends on D127246

Diff Detail

Event Timeline

jhuber6 created this revision. Jun 15 2022, 12:57 PM
jhuber6 requested review of this revision. Jun 15 2022, 12:57 PM
Herald added a project: Restricted Project. Jun 15 2022, 12:57 PM
Herald added a subscriber: cfe-commits.
tra added a comment. Jun 16 2022, 2:40 PM

Playing devil's advocate, I've got to ask -- do we even want to support JIT?

JIT brings more trouble than benefits.

  • substantial start-up time on nontrivial apps. The last time I tried launching a tensorflow app that needed to JIT its kernels, it took about half an hour until the JIT was done.
  • substantial increase in the size of the executable. Statically linked tensorflow apps are already pushing the limits of executables that use the small memory model (-mcmodel=small is the default for clang and gcc, AFAICT).
  • very easy to make a mistake, compile for the wrong GPU, and not notice it, because JIT will try to keep it running using PTX.
  • makes executables and tests non-hermetic -- the code that will run on the GPU (and thus the behavior) will depend on the particular driver version the app uses at runtime.

Benefits: It *may* allow us to run a miscompiled/outdated CUDA app. Whether it's actually a benefit is questionable. To me it looks like a way to paper over a problem.

We (google) have experienced all of the above and ended up disabling PTX JIT'ting altogether.

That said, we do embed PTX by default at the moment, so this patch does not really change the status quo. I'm not opposed to it, as long as we can disable PTX embedding if we need/want to.

Playing devil's advocate, I've got to ask -- do we even want to support JIT? [...]

I guess it's one of those situations where I figured that since we have it when we do LTO anyway, I may as well add it. I don't know much about its usage w.r.t. performance, but I figured this was a shortcoming of the RDC-mode support in Clang, considering that NVIDIA can JIT RDC-mode compilations. We could definitely have an argument that disables this; I'm assuming Clang already has an argument for that which we could overload to pass something to the linker wrapper. Or we could decide which behaviour we want to be the default.

The problem with LTO, however, is that many "compile-only" flags are suddenly relevant during linking. So let's say for a build someone did clang foo.cu -c -no-embed-ptx -foffload-lto and then clang foo.o; we won't have the argument at link time. I think regular LTO can embed the command line in the bitcode or something. We also have the option to embed the arguments in the binary format I made.
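
Spelled out as separate steps (keeping the hypothetical -no-embed-ptx flag from above), the issue is that the option only exists at the compile step:

  # hypothetical: -no-embed-ptx is only seen by the compile step
  clang foo.cu -c -no-embed-ptx -foffload-lto -fgpu-rdc --offload-new-driver -o foo.o
  # the LTO link, where PTX is actually generated, never sees the flag
  clang foo.o -lcudart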

Also, one problem with the RDC-mode support here is that we don't gracefully error if something was wrong with the image. So the following is really unhelpful:

clang app.cu --offload-arch=sm_<not correct> -fgpu-rdc --offload-new-driver
./a.out // Gives no output, kernel simply never executes.
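
For illustration, a graceful failure would look something like checking the launch result instead of ignoring it (a sketch, not part of the patch; with a wrong-arch image and no PTX to fall back on, the launch should report cudaErrorNoKernelImageForDevice or similar):

  // check.cu -- illustrative launch-error check
  #include <cstdio>

  __global__ void kernel() {}

  int main() {
    kernel<<<1, 1>>>();
    cudaError_t err = cudaGetLastError(); // reports the failed launch, if any
    if (err != cudaSuccess) {
      fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));
      return 1;
    }
    cudaDeviceSynchronize();
    return 0;
  }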

Do we want JIT -> YES, but specializing LLVM-IR JIT.
Do we want/need PTX? I do not, but I don't mind having it. Someone will ask for it eventually.

tra added a comment. Jun 22 2022, 2:38 PM

Do we want/need PTX, I do not, but I don't mind having it. Someone will ask for it eventually.

Fair enough.

However, if we embed bitcode via LTO, we can use the single linked PTX image for the whole module and include it in the fatbinary. [...]

Then we do need a knob controlling whether we want to embed PTX or not. The default should be "off", IMO.
We currently have --[no-]cuda-include-ptx=, which we may reuse for that purpose.

This brings another question -- which GPU variant will we generate PTX for? One? All (if more than one is specified)? The ones specified by --[no-]cuda-include-ptx= ?
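
For reference, the existing flag takes 'all' or a specific architecture, so the selection could look something like this (illustrative invocations):

  # embed PTX only for sm_70, drop it for sm_52
  clang foo.cu --offload-arch=sm_70 --offload-arch=sm_52 --no-cuda-include-ptx=sm_52
  # or drop PTX entirely
  clang foo.cu --offload-arch=sm_70 --no-cuda-include-ptx=all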

Then we do need a knob controlling whether we want to embed PTX or not. The default should be "off", IMO.
We currently have --[no-]cuda-include-ptx=, which we may reuse for that purpose.

We could definitely re-use that. It's another option that probably needs to go inside the binary itself, since normally those options aren't passed to the linker. We'll probably just use the same default as that flag (which is on, I think).

This brings another question -- which GPU variant will we generate PTX for? One? All (if more than one is specified)? The ones specified by --[no-]cuda-include-ptx= ?

Right now, it'll be the one that's attached to the LTO job. So if the user specified sm_70 they'll get PTX for sm_70.

tra added a comment. Jun 22 2022, 4:39 PM

Then we do need a knob controlling whether we want to embed PTX or not. The default should be "off", IMO.
We currently have --[no-]cuda-include-ptx=, which we may reuse for that purpose.

We could definitely re-use that. It's another option that probably needs to go inside the binary itself, since normally those options aren't passed to the linker.

I'm not sure I follow. WDYM by "go inside the binary itself"? I assume you mean the per-GPU offload binaries inside the per-TU .o, so that it could be used when that GPU object gets linked into the GPU executable?

What if different TUs that we're linking were compiled using different/contradictory options?

The problem is that conceptually the "--cuda-include-ptx" option ultimately affects the final GPU executable. If we're in RDC mode, then PTX is probably useless for JIT-ing purposes, as you can't link PTX and create the final executable. Well, I guess it might sort of be possible by concatenating the .s files, adding a bunch of forward declarations for the functions, merging debug info, removing duplicate weak functions... Basically, by writing a linker for a new "PTX" architecture. Doable, but so not worth it, IMO.

TUs are compiled to IR, then PTX generation shifts to the final link phase. I think we may need to rely on the user to supply PTX controls there explicitly. Or, at the very least, check that the cuda-include-ptx setting propagated from the TUs is consistent across all of them.
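
For example (hypothetical invocations), nothing currently stops the per-TU settings from contradicting each other:

  clang a.cu -c -fgpu-rdc --offload-new-driver -foffload-lto --no-cuda-include-ptx=all -o a.o
  clang b.cu -c -fgpu-rdc --offload-new-driver -foffload-lto --cuda-include-ptx=all -o b.o
  clang a.o b.o -lcudart   # which of the two settings should the LTO link honor?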

We'll probably just use the same default as that flag (which is on, I think).

This brings another question -- which GPU variant will we generate PTX for? One? All (if more than one is specified)? The ones specified by --[no-]cuda-include-ptx= ?

Right now, it'll be the one that's attached to the LTO job. So if the user specified sm_70 they'll get PTX for sm_70.

I mean, when the user specifies more than one GPU variant to target.
E.g. both sm_70 and sm_50.
PTX for the former would probably provide better performance if we run on a newer GPU (e.g. sm_80).
On the other hand, it will likely fail if we were to attempt running from PTX on sm_60.
Both would probably fail if we were to run on sm_35. Including all PTX variants is wasteful (Tensorflow-using applications are already pushing the limits of the small memory model and sometimes fail to link due to the executable being too large).

The point is that there's no "one true choice" for the PTX architecture (as there's no safe/sensible choice for the offload target). Only the end user would know their intent. We do need explicit controls and a documented policy on what we produce by default.
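
Concretely, the ambiguity shows up as soon as the command line names two targets (illustrative invocation):

  clang foo.cu -fgpu-rdc --offload-new-driver -foffload-lto \
    --offload-arch=sm_50 --offload-arch=sm_70 -lcudart
  # embed sm_50 PTX? sm_70 PTX? both? neither?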

I'm not sure I follow. WDYM by "go inside the binary itself"? [...]

I just mean that right now --[no-]cuda-include-ptx is handled at the compilation phase, whereas this happens during LTO, so we'd need to make sure we have those arguments. It's true that we could just require the user to pass it to the linker instead, but conceptually PTX generation happens in the "compiler" and not the linker.

The point is that there's no "one true choice" for the PTX architecture (as there's no safe/sensible choice for the offload target). [...]

This is a good point I hadn't thought of. Right now this is basically just a by-product of the LTO pass: we run LTO for the target, and since we got a PTX output we might as well include it. This may be what we do in Clang as well; I think we just include the PTX output alongside the cubin for each offload job. Even if we went to LLVM-IR we'd still be restricted by some features, I think. As it stands, this patch just makes

clang++ cuda.cu --offload-new-driver -fgpu-rdc --offload-arch=sm_60 -foffload-lto

give a fatbinary with sm_60 PTX / cubins. I think that is controlled by the user, as it's only going to generate PTX for the architecture they specified via --offload-arch (or the default).
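
One way to check what actually landed in the fatbinary, assuming the output is in a format CUDA's cuobjdump understands (an illustrative sanity check, not part of the patch):

  cuobjdump --list-ptx a.out   # expect a PTX entry for sm_60
  cuobjdump --list-elf a.out   # and the matching sm_60 cubin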