This is an archive of the discontinued LLVM Phabricator instance.

[Offload][OpenMP][CUDA] Allow fembed-bitcode for device offload
Needs Revision · Public

Authored by jdoerfert on Apr 15 2021, 4:53 PM.

Details

Reviewers
tra
bollu
Summary

This is a fix for the problem reported here:
https://lists.llvm.org/pipermail/llvm-dev/2021-March/149529.html

That is, the target information was missing when we embedded bitcode and
that caused the NVPTX backend to fail.
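A rough reproducer (a hedged sketch; the exact invocation and file name are assumptions, not copied verbatim from the report):

  clang++ -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -fembed-bitcode=all -c offload.cpp

The device-side compile step that consumes the embedded bitcode is missing the target information, and the NVPTX backend fails on it.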

Diff Detail

Event Timeline

jdoerfert created this revision. Apr 15 2021, 4:53 PM
jdoerfert requested review of this revision. Apr 15 2021, 4:53 PM
Herald added a project: Restricted Project. Apr 15 2021, 4:53 PM
Herald added a subscriber: sstefan1.

I'm not really sure about the test; my local setup didn't have CUDA attached properly, but this should work in principle ;)

tra added inline comments. Apr 16 2021, 9:58 AM
clang/lib/Driver/ToolChains/Clang.cpp
4442–4446

This duplicates code that appears a bit further down in the function. I think you should just set -target-cpu for everyone before diving into the if (embedBitcodeInObject) branch.

clang/test/Driver/embed-bitcode-nvptx.cu
1

This command line looks extremely odd to me.
If you are compiling with --cuda-device-only, then clang should've already set the right triple and the features.

Could you tell me more about what the intent of the compilation is and why you use this particular set of options?
I.e., why not just do clang -x cuda --offload-arch=sm_70 --cuda-device-only -nocudalib -nocudainc?
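For illustration, a driver test along those lines could look roughly like this (a sketch, not the final test; a stronger check would pin -target-cpu to the second, IR-to-PTX cc1 invocation rather than the first one):

  // RUN: %clang -### -x cuda --offload-arch=sm_70 --cuda-device-only \
  // RUN:   -nocudalib -nocudainc -fembed-bitcode=all -c %s 2>&1 | FileCheck %s
  // CHECK: "-target-cpu" "sm_70"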

jdoerfert added inline comments. Apr 16 2021, 10:42 AM
clang/lib/Driver/ToolChains/Clang.cpp
4442–4446

Fair. I'll update it.

clang/test/Driver/embed-bitcode-nvptx.cu
1

Could you tell me more about what the intent of the compilation is and why you use this particular set of options?

Because I never really compiled CUDA ;)

I'll go with your options.

tra requested changes to this revision. Apr 16 2021, 11:23 AM
tra added inline comments.
clang/test/Driver/embed-bitcode-nvptx.cu
1

Something still does not add up.

AFAICT, the real problem is not that we're not adding -target-cpu, but rather that -fembed-bitcode=all splits the -S compilation into two phases -- source-to-bitcode (this part gets all the right command-line options and compiles fine) and IR->PTX, which ends up with only a subset of the options and fails because the intrinsics are not enabled.

I think what we want to do in this case is to prevent splitting GPU-side compilation. Adding '-target-cpu' to the IR->PTX subcompilation may make things work in this case, but it does not really fix the root cause. E.g. we should also pass through the features set by the driver and, possibly, other options to keep both the source->IR and IR->PTX compilations in sync.
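To illustrate the split (simplified; not the literal -### output), the first phase is something like

  clang -cc1 -triple nvptx64-nvidia-cuda -target-cpu sm_70 -emit-llvm-bc foo.cu

while the embed-bitcode backend phase becomes roughly

  clang -cc1 -triple nvptx64-nvidia-cuda -x ir -S -fembed-bitcode=all foo.bc

i.e. without -target-cpu/-target-feature, which is where the intrinsics end up disabled.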

This revision now requires changes to proceed. Apr 16 2021, 11:23 AM
jdoerfert added inline comments. Apr 16 2021, 2:13 PM
clang/test/Driver/embed-bitcode-nvptx.cu
1

I think what we want to do in this case is to prevent splitting GPU-side compilation.

I doubt that is as easy as it sounds. Where would we take the IR from then? (I want the GPU IR embedded, after all.)

E.g. we should also pass through the features set by the driver and ..

I agree. What if I move the embedding handling to the end, keep the "blacklist" that removes arguments we don't want, and see where that leads us?

tra added inline comments. Apr 16 2021, 2:36 PM
clang/test/Driver/embed-bitcode-nvptx.cu
1

Ah, so you do grab the intermediate IR. I assume that the PTX does get used, too.

Another way to deal with this may be to do two independent compilations -- source-to-IR and source-to-PTX. Each would use the standard compilation flags. The downside is that parsing and optimization time will double, so split compilation combined with filtering args is probably more practical.
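Roughly (assuming plain driver invocations; file names are placeholders):

  clang -x cuda --offload-arch=sm_70 --cuda-device-only -nocudalib -nocudainc -S -emit-llvm foo.cu -o foo.ll
  clang -x cuda --offload-arch=sm_70 --cuda-device-only -nocudalib -nocudainc -S foo.cu -o foo.ptx

i.e. one full compilation to get the IR to embed and a second full compilation to get the PTX, each with the complete standard flag set.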