This patch adds the initial support for wrapping CUDA images. This
requires changing some of the logic for how we bundle images. We now
need to copy the image for all kinds that are active for the
architecture. Then we need to run a separate wrapping job if the Kind is
Cuda. For cuda wrapping we need to use the fatbinary program from the
CUDA SDK to bundle all the binaries together. This is then passed to a
new function to perfom the actual module code generation that will be
implemented in a later patch.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
Updating to use the OffloadKind enum rather than the string. I will also probably simplify some of the logic for handling multiple files here in a later patch.
clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp | ||
---|---|---|
186 | Why do we limit ourselves to uint16_t here? Can't we just use OffloadKind itself and get rid of these casts? | |
188 | Extend enum OffloadKind to include these special kinds. | |
716 | We should have a way to pass extra options to fatbinary, too. E.g. we may want to use --compress-all. Also, we may need to pass through -g for debug builds. Oh. Debug builds. Makes me wonder if cuda-gdb will be able to find GPU binaries packaged by the new driver. If it does not, it will be a rather serious problem. It would likely affect various profiling tools the same way, too. Can you give it a try? | |
1185–1186 | Nit. We should think of changing linkBitcodeFiles to return llvm::ErrorOr<bool> so we can return WholeArchive value, instead of modifying it as an argument. | |
1193 | What's expected to happen if we have more than one input? | |
1196–1197 | Should EmbedBitcode be mutually exclusive vs WholeArchive? Can we ever end up with both unset? | |
1199 | ditto about the assert on the number of inputs. | |
1330 | The M contains generated registration glue at this point, right? It may be worth a comment to explain what is it that we're compiling here. | |
clang/tools/clang-linker-wrapper/OffloadWrapper.cpp | ||
271 | A "not-implemented-yet/TODO" comment would be appropriate here. | |
clang/tools/clang-linker-wrapper/OffloadWrapper.h | ||
20 | It should be either "Wraps/registers" or "Wrap/register". |
clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp | ||
---|---|---|
716 | I was planning on implementing this stuff more generally when we get the new binary tool in D125165 landed. That will allow me to more generally put any number of command line arguments into the binary itself and fish it out here. We already support a janky version for the -Xcuda-ptxas option here, but it's a mess and I'm planning on getting rid of it. Is it okay to punt that into the future? Debug builds are another sore point I don't handle super well right now but will be addressed better with D125165. I haven't tested cuda-gdb, but I embed the fatbinary the same way that we do in non-rdc mode. I can read them with cuobjdump in the final executable so I'm assuming it's compatible. | |
1185–1186 | That's a good idea, I'm planning on cleaning a lot of this stuff up later. | |
1193 | This isn't used right now since JIT hasn't made it in. It should probably be a proper error honestly. | |
1196–1197 | EmbedBitcode might need to require WholeArchive considering that it's supposed to be a completely linked image that can be sent to a JIT engine. | |
1330 | Sure. |
Please add a TODO around the items that need further work, so we don't forget about them. Review comments tend to fade from memory.
clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp | ||
---|---|---|
191 | Is there a particular reason for multiplying by 37? Enum values by themselves should do the job just fine. | |
716 |
We should be OK then. | |
716 |
OK. |
Thanks for the review.
clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp | ||
---|---|---|
191 | That's what LLVM does for the regular DenseMapInfo<uint16_t>::getHashValue() so I just copied it here. |
clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp | ||
---|---|---|
191 | It would make sense for mapping the full range of uint16_t into a much smaller set of entries. In this case we're already dealing with a very small densely packed set of values. For all practical purposes is a convenient overkill We could get by with just using a vector+direct indexing. We also don't care about hash collisions even if they happen. Removing multiplication would not make much of a difference, but it would be one less question for the reader to ask, when they look at this code. |
Why do we limit ourselves to uint16_t here? Can't we just use OffloadKind itself and get rid of these casts?