This is an archive of the discontinued LLVM Phabricator instance.

[CUDA] Allow targeting NVPTX directly without a host toolchain
ClosedPublic

Authored by jhuber6 on Dec 15 2022, 1:00 PM.

Details

Summary

Currently, the NVPTX compilation toolchain can only be invoked either
through CUDA or OpenMP via --offload-device-only. This is because we
cannot build a CUDA toolchain without an accompanying host toolchain for
the offloading. As a result, using --target=nvptx64-nvidia-cuda directly
causes the driver to generate calls to the GNU assembler and linker,
leading to errors.

This patch abstracts the portions of the CUDA toolchain that are
independent of the host toolchain or offloading kind into a new base
class called NVPTXToolChain. We still need to read the host's triple to
set up the CUDA installation, so if it is not present we just assume it
matches the host system for now, or the user can provide the path
explicitly.

This should allow the compiler driver to create NVPTX device images
directly from C/C++ code.
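
For illustration, a direct NVPTX compilation enabled by this patch might look roughly like the following. This is a sketch only; the -march value and CUDA path are placeholders that depend on the local installation:

$ clang --target=nvptx64-nvidia-cuda -march=sm_70 \
    --cuda-path=/usr/local/cuda \
    foo.c -o foo.out

Here clang compiles foo.c for NVPTX, assembles the resulting PTX with ptxas, and links with nvlink, with no host compilation involved.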

Diff Detail

Event Timeline

jhuber6 created this revision.Dec 15 2022, 1:00 PM
jhuber6 requested review of this revision.Dec 15 2022, 1:00 PM
Herald added a project: Restricted Project. · View Herald Transcript
tra added a comment.Dec 15 2022, 1:22 PM

LGTM overall.

So, essentially the patch refactors what we've already been doing for OpenMP and makes it usable manually, which will be useful for things like GPU-side libc tests.

clang/lib/Driver/ToolChains/Cuda.cpp
632–633

Nit: I'd just do it within the make_unique below. It should end up on a line by itself and should not hurt readability.

jhuber6 marked an inline comment as done.Dec 15 2022, 3:08 PM

I just realized the method of copying the .o to a .cubin doesn't work if the link step is done in the same compilation, because the .o doesn't exist yet. To fix this I could either make the toolchain emit .cubin if we're going straight to nvlink, or use a symbolic link. The former is uglier; the latter will probably only work on Linux.

Also do you think I should include the CUDA headers in with this? We can always get rid of them with nogpuinc or similar if they're not needed. The AMDGPU compilation still links in the device libraries so I'm wondering if we should at least be consistent.

Also do you think I should include the CUDA headers in with this? We can always get rid of them with nogpuinc or similar if they're not needed. The AMDGPU compilation still links in the device libraries so I'm wondering if we should at least be consistent.

For OpenMP at least we want this mode to include the device runtime and the respective native runtimes as well (except if we do LTO, then both are included late).

tra added a comment.Dec 15 2022, 3:31 PM

I just realized the method of copying the .o to a .cubin doesn't work if the link step is done in the same compilation, because the .o doesn't exist yet. To fix this I could either make the toolchain emit .cubin if we're going straight to nvlink, or use a symbolic link. The former is uglier; the latter will probably only work on Linux.

Does it have to be .cubin ? Does nvlink require it?

Also do you think I should include the CUDA headers in with this? We can always get rid of them with nogpuinc or similar if they're not needed. The AMDGPU compilation still links in the device libraries so I'm wondering if we should at least be consistent.

Only some CUDA headers can be used from C++, and they tend to contain host-only declarations, and stubs or CPU implementations of GPU-side functions in that case.
The rest rely on CUDA extensions, attributes, etc. So, in this case, doing a C++ compilation will give you a host-side view of the headers in the best case, which is not going to be particularly useful.

If we want to make GPU-side functions available, we'll need to have a pre-included wrapper with more preprocessor magic to make CUDA headers usable. Doing that in a C++ compilation would be questionable as for a C++ compilation a user would normally assume that no headers are pre-included by the compiler.

I just realized the method of copying the .o to a .cubin doesn't work if the link step is done in the same compilation, because the .o doesn't exist yet. To fix this I could either make the toolchain emit .cubin if we're going straight to nvlink, or use a symbolic link. The former is uglier; the latter will probably only work on Linux.

Does it have to be .cubin ? Does nvlink require it?

Yes, it's one of the dumbest things in the CUDA toolchain. If the filename is a .o they assume it's a host file containing embedded RDC code that needs to be combined. If it's a .cubin they assume it's device code. I have no clue why they couldn't just check the ELF flags; it's trivial.
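
To illustrate the quirk being discussed (hypothetical file names, a sketch rather than anything in the patch): nvlink treats the same bits differently depending purely on the extension it sees.

$ clang --target=nvptx64-nvidia-cuda -march=sm_70 -c kernel.c -o kernel.o
$ cp kernel.o kernel.cubin
  # identical contents; nvlink would treat the .o as a host object with
  # embedded RDC code, but the .cubin as a device object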

Also do you think I should include the CUDA headers in with this? We can always get rid of them with nogpuinc or similar if they're not needed. The AMDGPU compilation still links in the device libraries so I'm wondering if we should at least be consistent.

Only some CUDA headers can be used from C++, and they tend to contain host-only declarations, and stubs or CPU implementations of GPU-side functions in that case.
The rest rely on CUDA extensions, attributes, etc. So, in this case, doing a C++ compilation will give you a host-side view of the headers in the best case, which is not going to be particularly useful.

If we want to make GPU-side functions available, we'll need to have a pre-included wrapper with more preprocessor magic to make CUDA headers usable. Doing that in a C++ compilation would be questionable as for a C++ compilation a user would normally assume that no headers are pre-included by the compiler.

We might want to at least include the libdevice files; most of our wrappers definitely won't work without CUDA or OpenMP language modes.

If we do magic header inclusion, we should check for the freestanding argument and not include them when that is set. I would prefer we not include CUDA headers in C++ source that isn't being compiled as CUDA, and also not link in misc CUDA library files.

Anyone who has decided to compile raw C or C++ to PTX is doing something off the beaten path; I don't think we should assume they want implicit behaviour from other programming models thrown in.

tra added a comment.Dec 15 2022, 3:48 PM

I don't think we should assume they want implicit behaviour from other programming models thrown in.

Agreed. Also, removing things is often surprisingly hard.
Let's keep things simple, get this compilation mode to the point where it's practically usable, see what we really want/need, and then make those common cases the default or easier to use.

As far as libdevice is concerned, it's only needed for sources that use CUDA headers, as those map the standard math functions to their implementations in libdevice. Stand-alone sources should not need it. We should also be able to compile libdevice.bc into a libdevice.o and then link with it, if needed, during the final executable link phase.
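
A rough sketch of what pre-compiling libdevice could look like (the path and the sm_70 architecture are placeholders, not something this patch sets up):

$ clang --target=nvptx64-nvidia-cuda -march=sm_70 -c \
    /usr/local/cuda/nvvm/libdevice/libdevice.10.bc -o libdevice.o
$ clang --target=nvptx64-nvidia-cuda -march=sm_70 main.c libdevice.o -o image.out
  # libdevice.o is only needed if main.c actually calls into libdevice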

jhuber6 updated this revision to Diff 483389.Dec 15 2022, 4:39 PM

Addressing comments. I went with the symbolic link method. It's a stupid hack that's only necessary because of nvlink's poor handling of file extensions, but I think this works around it well enough.

jhuber6 updated this revision to Diff 483507.Dec 16 2022, 5:59 AM

Fix format

jhuber6 updated this revision to Diff 483576.Dec 16 2022, 10:02 AM

Accidentally deleted the old getInputFilename routine, which we need. The
symlink worked fine but would break on Windows, so I ended up writing a hack
that only uses .cubin if the nvlink linker is active and the input comes
from the assembler. It's extremely ugly, so let me know what you think.

jhuber6 updated this revision to Diff 489554.Jan 16 2023, 7:26 AM

Updating. Used a different method to determine if we need to use .cubin or .o. It's a little ugly but I don't think there's a better way to do it.

Also, I just realized that if this goes through I could probably heavily simplify the linker wrapper code by just invoking clang --target=nvptx64 instead of calling the tools manually. So that's another benefit of this patch.

FWIW, creating a CUBIN from C/C++ directly would be really useful when debugging (and in combination with our soon-to-be-available JIT object loader).

@tra Can we get this in somehow?

tra added a comment.Jan 18 2023, 12:16 PM

LGTM overall, with a few nits.

clang/lib/Driver/ToolChains/Cuda.cpp
448–450

Can't say that I like this approach. It heavily relies on "happens to work".

Perhaps a better way to deal with this is to create the temporary GPU object file name with a ".cubin" extension to start with.

450

typo: getInputFilename

469

Nit: I'd add an else { Relocatable = false; // comment on what use cases use relocatable compilation by default. } and leave it uninitialized here. At the very least it may be worth a comment. Silently defaulting to true makes me ask "what other compilation modes might we have?", and stand-alone compilation targeting NVPTX is hard to infer here from the surrounding details.

clang/lib/Driver/ToolChains/Cuda.h
196–197

Nit: it's hard to tell whether the whitespace additions are spaces or tabs. They show up as ">" to me, which suggests they may be tabs. Just in case that is indeed the case, please make sure to un-tabify the changes.

clang/test/Driver/cuda-cross-compiling.c
37–38

This may fail on Windows, where ptxas/nvlink will be ptxas.exe and nvlink.exe. I think we typically use something like fatbinary{{.*}}" in other tests.

jhuber6 added inline comments.Jan 18 2023, 12:27 PM
clang/lib/Driver/ToolChains/Cuda.cpp
448–450

As far as I know, the files are created independently of the toolchains, so I'm not sure if we'd be able to check there. The current way is to use getInputFilename, but that doesn't have access to the compilation. The solutions I can come up with are the following:

  • Do this and check the temp files
  • Create a symbolic link if the file is empty (Doesn't work on Windows)
  • Make a random global that's true if the Linker tool was built at some point
469

Works for me.

clang/test/Driver/cuda-cross-compiling.c
37–38

Good point.

jhuber6 marked an inline comment as done.Jan 18 2023, 12:28 PM
jhuber6 added inline comments.
clang/lib/Driver/ToolChains/Cuda.h
196–197

I think it's because the original file was formatted incorrectly, so all my new changes, which are formatted correctly, show up as whitespace differences. I'm somewhat tempted to just clang-format it upstream and rebase.

jhuber6 updated this revision to Diff 490317.Jan 18 2023, 3:22 PM

Addressing some comments. I don't know if there's a cleaner way to mess around with the .cubin nonsense. I liked symbolic links but that doesn't work on Windows.

jhuber6 marked 3 inline comments as done.Jan 18 2023, 3:24 PM
tra accepted this revision.Jan 18 2023, 3:35 PM

LGTM with a few minor nits and questions.

Addressing some comments. I don't know if there's a cleaner way to mess around with the .cubin nonsense. I liked symbolic links but that doesn't work on Windows.

Can we do a copy or rename instead? That could be more portable.

clang/lib/Driver/ToolChains/Cuda.cpp
448–450

Naming is hard. :-(

OK, let's add a FIXME to this comment and hope to get it fixed when NVIDIA's tools become less obsessive about file extensions. I'll file a bug with them to get the ball rolling on their end.

515–518

Nit: Looks like there are tabs.

clang/lib/Driver/ToolChains/Cuda.h
196–197

Clang-formatting the file(s) and submitting that change separately before the patch SGTM.

This revision is now accepted and ready to land.Jan 18 2023, 3:35 PM
jhuber6 marked 3 inline comments as done.Jan 18 2023, 3:41 PM

LGTM with a few minor nits and questions.

Addressing some comments. I don't know if there's a cleaner way to mess around with the .cubin nonsense. I liked symbolic links but that doesn't work on Windows.

Can we do a copy or rename instead? That could be more portable.

I do a copy in this patch already. The problem is that if it's an internal link, e.g. clang --target=nvptx64-nvidia-cuda foo.c bar.c, then there will be no file to copy. The symbolic link got around this by linking the files, so once the .o was written the contents would automatically appear in the .cubin. The other approach is to detect this and use the correct names up front, which is what this patch does, albeit a little jankily.
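
For context, the two flows being distinguished are roughly the following (hypothetical file names and architecture):

  # Separate steps: the .o files exist on disk, so they can be copied to
  # .cubin before nvlink runs.
$ clang --target=nvptx64-nvidia-cuda -march=sm_70 -c foo.c bar.c
$ clang --target=nvptx64-nvidia-cuda -march=sm_70 foo.o bar.o -o image.out

  # Single step: compile and link in one invocation; the intermediate objects
  # are driver temporaries, so there is nothing to copy and the driver has to
  # pick the .cubin name up front.
$ clang --target=nvptx64-nvidia-cuda -march=sm_70 foo.c bar.c -o image.out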

clang/lib/Driver/ToolChains/Cuda.cpp
448–450

I would like that. It's a very simple check if you actually know how ELF headers work... Also, while you're at it, ask NVIDIA how they set their machine flags so we can display the arch via llvm-readelf.

I'll add some FIXME's.

515–518

I think that's just the diff showing that it was indented; there are no tabs in the file.

clang/lib/Driver/ToolChains/Cuda.h
196–197

Already did :)

This revision was landed with ongoing or failed builds.Jan 18 2023, 4:18 PM
This revision was automatically updated to reflect the committed changes.
jhuber6 marked 3 inline comments as done.
alexfh added a subscriber: alexfh.Jan 26 2023, 7:25 AM

This patch breaks our cuda compilations. The output file isn't created after it:

$ echo 'extern "C" __attribute__((global)) void q() {}' >q.cc
$ good-clang \
    -nocudainc -x cuda \
    --cuda-path=somepath/cuda/ \
    -Wno-unknown-cuda-version --cuda-device-only \
    -c q.cc -o qqq-good.o \
  && bad-clang \
    -nocudainc -x cuda \
    --cuda-path=somepath/cuda/ \
    -Wno-unknown-cuda-version --cuda-device-only \
    -c q.cc -o qqq-bad.o \
  && ls qqq*.o
qqq-good.o

This patch breaks our cuda compilations. The output file isn't created after it:

$ echo 'extern "C" __attribute__((global)) void q() {}' >q.cc
$ good-clang \
    -nocudainc -x cuda \
    --cuda-path=somepath/cuda/ \
    -Wno-unknown-cuda-version --cuda-device-only \
    -c q.cc -o qqq-good.o \
  && bad-clang \
    -nocudainc -x cuda \
    --cuda-path=somepath/cuda/ \
    -Wno-unknown-cuda-version --cuda-device-only \
    -c q.cc -o qqq-bad.o \
  && ls qqq*.o
qqq-good.o

https://github.com/llvm/llvm-project/issues/60301 Still broken after this fix?

This patch breaks our cuda compilations. The output file isn't created after it:

$ echo 'extern "C" __attribute__((global)) void q() {}' >q.cc
$ good-clang \
    -nocudainc -x cuda \
    --cuda-path=somepath/cuda/ \
    -Wno-unknown-cuda-version --cuda-device-only \
    -c q.cc -o qqq-good.o \
  && bad-clang \
    -nocudainc -x cuda \
    --cuda-path=somepath/cuda/ \
    -Wno-unknown-cuda-version --cuda-device-only \
    -c q.cc -o qqq-bad.o \
  && ls qqq*.o
qqq-good.o

https://github.com/llvm/llvm-project/issues/60301 Still broken after this fix?

That works. Thanks!