This is an archive of the discontinued LLVM Phabricator instance.

[Cuda] Add initial support for wrapping CUDA images in the new driver.
ClosedPublic

Authored by jhuber6 on Apr 14 2022, 12:26 PM.

Details

Summary

This patch adds the initial support for wrapping CUDA images. This
requires changing some of the logic for how we bundle images. We now
need to copy the image for all kinds that are active for the
architecture. Then we need to run a separate wrapping job if the Kind is
Cuda. For cuda wrapping we need to use the fatbinary program from the
CUDA SDK to bundle all the binaries together. This is then passed to a
new function to perfom the actual module code generation that will be
implemented in a later patch.

Depends on D120273 D123471

Diff Detail

Event Timeline

jhuber6 created this revision.Apr 14 2022, 12:26 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 14 2022, 12:26 PM
jhuber6 requested review of this revision.Apr 14 2022, 12:26 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 14 2022, 12:26 PM
jhuber6 updated this revision to Diff 426420.May 2 2022, 8:41 AM

Updating to use the OffloadKind enum rather than the string. I will also probably simplify some of the logic for handling multiple files here in a later patch.

tra added inline comments.May 10 2022, 2:52 PM
clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp
186

Why do we limit ourselves to uint16_t here? Can't we just use OffloadKind itself and get rid of these casts?

188

Extend enum OffloadKind to include these special kinds.

716

We should have a way to pass extra options to fatbinary, too. E.g. we may want to use --compress-all. Also, we may need to pass through -g for debug builds.

Oh. Debug builds. Makes me wonder if cuda-gdb will be able to find GPU binaries packaged by the new driver. If it does not, it will be a rather serious problem. It would likely affect various profiling tools the same way, too. Can you give it a try?

1185–1186

Nit. We should think of changing linkBitcodeFiles to return llvm::ErrorOr<bool> so we can return WholeArchive value, instead of modifying it as an argument.

1193

What's expected to happen if we have more than one input?
If only one is ever expected, I'd add an assert.

1196–1197

Should EmbedBitcode be mutually exclusive vs WholeArchive?

Can we ever end up with both unset?

1199

ditto about the assert on the number of inputs.

1330

The M contains generated registration glue at this point, right? It may be worth a comment to explain what is it that we're compiling here.

clang/tools/clang-linker-wrapper/OffloadWrapper.cpp
271

A "not-implemented-yet/TODO" comment would be appropriate here.

clang/tools/clang-linker-wrapper/OffloadWrapper.h
20

It should be either "Wraps/registers" or "Wrap/register".

jhuber6 added inline comments.May 10 2022, 3:09 PM
clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp
716

I was planning on implementing this stuff more generally when we get the new binary tool in D125165 landed. That will allow me to more generally put any number of command line arguments into the binary itself and fish it out here. We already support a janky version for the -Xcuda-ptxas option here, but it's a mess and I'm planning on getting rid of it. Is it okay to punt that into the future? Debug builds are another sore point I don't handle super well right now but will be addressed better with D125165.

I haven't tested cuda-gdb, but I embed the fatbinary the same way that we do in non-rdc mode. I can read them with cuobjdump in the final executable so I'm assuming it's compatible.

1185–1186

That's a good idea, I'm planning on cleaning a lot of this stuff up later.

1193

This isn't used right now since JIT hasn't made it in. It should probably be a proper error honestly.

1196–1197

EmbedBitcode might need to require WholeArchive considering that it's supposed to be a completely linked image that can be sent to a JIT engine.

1330

Sure.

jhuber6 updated this revision to Diff 428517.May 10 2022, 3:27 PM

Addressing comments

Herald added a project: Restricted Project. · View Herald TranscriptMay 10 2022, 3:27 PM
tra accepted this revision.May 10 2022, 3:44 PM

Please add a TODO around the items that need further work, so we don't forget about them. Review comments tend to fade from memory.

clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp
191

Is there a particular reason for multiplying by 37? Enum values by themselves should do the job just fine.

716

I can read them with cuobjdump in the final executable so I'm assuming it's compatible.

We should be OK then.

716

Debug builds are another sore point I don't handle super well right now but will be addressed better with D125165

OK.

This revision is now accepted and ready to land.May 10 2022, 3:44 PM

Thanks for the review.

clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp
191

That's what LLVM does for the regular DenseMapInfo<uint16_t>::getHashValue() so I just copied it here.

tra added inline comments.May 10 2022, 3:58 PM
clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp
191

It would make sense for mapping the full range of uint16_t into a much smaller set of entries. In this case we're already dealing with a very small densely packed set of values. For all practical purposes is a convenient overkill We could get by with just using a vector+direct indexing. We also don't care about hash collisions even if they happen.

Removing multiplication would not make much of a difference, but it would be one less question for the reader to ask, when they look at this code.

This revision was landed with ongoing or failed builds.May 11 2022, 4:30 AM
This revision was automatically updated to reflect the committed changes.