This is an archive of the discontinued LLVM Phabricator instance.

[clang-nvlink-wrapper] Wrapper around nvlink for archive files
ClosedPublic

Authored by saiislam on Aug 18 2021, 4:50 AM.

Details

Summary

nvlink does not support linking of cubin files archived in an archive.
This tool extracts all the cubin files in the given device specific archive
and pass them to nvlink. It is required for linking static device libraries.

Diff Detail

Unit TestsFailed

Event Timeline

saiislam created this revision.Aug 18 2021, 4:50 AM
saiislam requested review of this revision.Aug 18 2021, 4:50 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 18 2021, 4:50 AM
Herald added a subscriber: cfe-commits. · View Herald Transcript
JonChesterfield added a comment.EditedAug 18 2021, 4:57 AM

Tests. The usual bug in this area is that an archive can contain multiple files with the same name which clobber each other if extracted to the same directory. It looks to me like this implementation will fail that test.

Also, motivation? Seems this can be worked around by not putting cubin files in an archive.

Edit: this will fail to implement the archive semantics of only pulling in used files, i.e. it is implicitly -whole-archive. Suggest error unless whole-archive is requested, as then it does the right thing today while leaving us the option to implement the correct linker semantics later.

Side point: really not a fan of incorrectly implementing bits of a linker as standalone tools called from clang. Can we raise a bug report against nvlink instead of doing this?

Tests. The usual bug in this area is that an archive can contain multiple files with the same name which clobber each other if extracted to the same directory. It looks to me like this implementation will fail that test.

While extracting cubin files from the archive each output file gets a new name using llvm::sys::fs::createTemporaryFile, so there won't be any clobbering.

Also, motivation? Seems this can be worked around by not putting cubin files in an archive.

It is required for linking static device libraries on nvptx (D105191). Given a fat heterogenous archive, clang-offload-bundler will create a device specific archive which will be passed to llvm-link for amdgpu and nvlink-wrapper for nvptx.

Edit: this will fail to implement the archive semantics of only pulling in used files, i.e. it is implicitly -whole-archive. Suggest error unless whole-archive is requested, as then it does the right thing today while leaving us the option to implement the correct linker semantics later.

llvm-link and nvlink-wrapper both work on whole archive semantics..

Side point: really not a fan of incorrectly implementing bits of a linker as standalone tools called from clang. Can we raise a bug report against nvlink instead of doing this?

I also don't like adding a new tool just for one additional wrapping around. But, couldn't find a feasible solution on my own. Another alternative is to extend llvm-ar to support "output directory" option. But, the suggestion in last week's multi-company meeting was to go with nvlink-wrapper instead of extending llvm-ar. Let's file a bug report for nvlink for supporting archive files and as soon as it is available we can drop this tool. Requires only one line change in Cuda.cpp.

ye-luo added a comment.EditedAug 18 2021, 6:46 AM

this is the working steps in the linking script.

$clang++ -fopenmp -fopenmp-targets=nvptx64 -Xopenmp-target=nvptx64 -march=sm_80 -O3 -DNDEBUG CMakeFiles/cxx.complex_reduction.x.dir/complex_reduction.cpp.o -o ../bin/cxx.complex_reduction.x -v

clang-offload-bundler (host,device)
in: complex_reduction.cpp.o
out: complex_reduction-494ba8.o, complex_reduction-5aba63.cubin

nvlink (device)
in: complex_reduction-5aba63.cubin
out: complex_reduction-b1898c.out

clang-offload-wrapper (device)
in: complex_reduction-b1898c.out
out: cxx-a8318a.bc

clang (device)
in: cxx-a8318a.bc cxx-e54e6f.o

ld (host, device)
in: complex_reduction-494ba8.o, cxx-e54e6f.o
out: executable

I'm not quite understand what this wrapper replaces and why.
"It is required for linking static device libraries on nvptx" is not explaining what is not working with existing steps and what the clang-nvlink-wrapper changes to make it work. Need elaboration.

clang/tools/clang-nvlink-wrapper/ClangNvlinkWrapper.cpp
42

arcive -> archive
<objects> is input already
"The wrapper extracts any arcive objects " what does it mean?
"call nvlink with the individual files" waht individual files.
What is the output?
Please make this documentation more clear.

Discussed at the multicompany meeting today. Consensus reached is that the whole-archive/no-whole-archive distinction is unimportant - we can ship a toolchain that has whole-archive semantics and later change the default when doing more work in the linker. It's not my idea of backwards compatibility but I'm sympathetic to the argument that there's enough churn in gpu offloading that this particular point is lost in the noise.

The code itself looks basically OK to me, it extract files & passes them unchanged into nvlink along with other files. Error handling is incomplete - we detect some things going wrong, but end up returning 0 anyway. Bunch of comments inline.

It would be nicer from a unix perspective if the tool read the archives and returned a list of files to pass to nvlink, in pipeline fashion, but that'll make temporary file cleanup hazardous. Forking nvlink seems better for that reason.

Might be a nice usability feature for --help to print some stuff then invoke nvlink with --help itself. If this is ~ a perfect filter, aside from expanding archives, it could conceivably be named nvlink and used wherever nvlink is.

clang/tools/clang-nvlink-wrapper/ClangNvlinkWrapper.cpp
28

what's unistd used for here? can we drop this?

124

should lose the commented out code

128

this seems a fairly likely failure mode - perhaps we should find nvlink first, and only start writing archives members into temporary files etc after establishing that it exists

137

This unconditionally writes an nvlink invocation to stderr. Not good post debugging the first implementation, programs should execute silently when successful. We could look for a debugging flag / environment variable if necessary for debugging this in the field

138

there's a risk that the large number of arguments that results here exceeds the platform limitations. I don't know offhand if executeandwait handles that (e.g. by creating response files), or if nvlink understands response files

157

there's a using namespace llvm above, can remove a lot of llvm:: prefixes

197

does this name the file it couldn't delete?

saiislam updated this revision to Diff 368202.Aug 23 2021, 1:28 PM
saiislam marked 7 inline comments as done.
  1. Added documentation.
  2. Improved error handling.
ye-luo added inline comments.Aug 23 2021, 2:11 PM
clang/tools/clang-nvlink-wrapper/ClangNvlinkWrapper.cpp
19

Right now clang-offload-bundler is only used to create an object file for the host and a cubin file for the device.
Then cubin files are passed to the nvlink.
This is different from what you described

clang-offload-bundler creates a device specific archive of cubin files.
Such an archive is then passed to this tool to extract cubin files before passing to nvlink.

Is this caused by changes in https://reviews.llvm.org/D105191?
Do you have any reading materials which documents the whole linking flow of D105191?

saiislam marked an inline comment as done.Aug 24 2021, 12:35 AM
saiislam added inline comments.
clang/tools/clang-nvlink-wrapper/ClangNvlinkWrapper.cpp
19

Yes, this patch is required for D105191 to work correctly on nvptx.
Once this patch lands, I will update D105191 to call "clang-nvlink-wrapper" instead of "nvlink" in clang/lib/Driver/ToolChains/Cuda.cpp::NVPTX::OpenMPLinker::ConstructJob().

Greg Rodgers presented about static device libraries in last year's LLVM-CTH Workshop.

In summary, following commands are generated by clang driver to deal with heterogenous device libraries:

  1. device-specific-archive.a <== clang-offload-bundler(heteregenous-device-archive.a, current-device)
  2. If (amdgpu) linked-output <== llvm-link(device-specific-archive.a)
  3. If (nvptx) extacted-cubins.cubin <== nvlink-wrapper(device-specific-archive.a) linked-output <== nvlink (extracted-cubins.cubin)
saiislam marked an inline comment as done.Aug 31 2021, 5:23 AM

Ping. Any more suggestions/questions?
I will update requested changes/documentation for D105191 once this gets through.

ye-luo accepted this revision.Aug 31 2021, 10:10 AM

Documentation is much improved. LGTM.

This revision is now accepted and ready to land.Aug 31 2021, 10:10 AM

Maybe add this to the release notes of clang 14?

aaron.ballman added inline comments.
clang/tools/clang-nvlink-wrapper/ClangNvlinkWrapper.cpp
63

The -1 is odd here -- this argument is a Boolean for whether the buffer is text or not, and this introduced a new MSVC warning ('argument': truncation from 'int' to 'bool').

Maybe add this to the release notes of clang 14?

@sylvestre.ledru and @aaron.ballman please have a look at D109225. I have made the suggested changes.
I would like to include this wrapper in llvm-13 also because it is actually a fix required for D105191, which is a part of a feature being worked on for over a year (D81109 and D93525).

ye-luo added a comment.Sep 5 2021, 9:31 PM

This patch has landed as 83f3782c6129e7a5df3faaf0ae576611d16a8d49 but not reflected on Phabricator