Move the -emit-pch option to clang driver flags, so that
HIP can emit pch files, if requested in driver. This is on both
device and host paths. Introducing a new TY_HIPHeader,
which is used for the Precompile phase list.
Details
Diff Detail
Event Timeline
need tests for --cuda-host-only and default host/device compilation, and tests for C/C++
| clang/include/clang/Driver/Types.def | ||
|---|---|---|
| 47 | We should not pollute the normal phases for HIP program with Precompile. BTW can you fix the extensions for HIP and HIP_DEVICE, which should be "hip" instead of "cu". | |
| 64 | I would suggest to add "hip-header" and "hip-header-cpp-output" like "c++-header" and "c++-header-cpp-output". Basically we will only allow -emit-pch on "hip-header". | |
| clang/lib/Driver/Driver.cpp | ||
| 2829 | typo: Actoms | |
| 3638 | we need to support -emit-pch in host only compilation too. And not just for HIP, for C and C++ too, since it is supposed to be a generic option. | |
| clang/test/Driver/hip-device-compile.hip | ||
| 31 ↗ | (On Diff #290572) | pls remove the --hip-device-lib options and add -nogpulib | 
Looking into the C/C++ tests.
| clang/include/clang/Driver/Types.def | ||
|---|---|---|
| 64 | Sounds good, I had tried this attempt earlier, and I have the new diff with this implementation. | |
Can you elaborate on the use case of PCH files for CUDA/HIP?
Is -emit-pch only expected to work with a single sub-compilation only, similarly to how we handle -S ? The tests appear to imply so. If that's the case, if would be great to add a test verifying that --emit-pch fails if it's used with more than one sub-coimpilation.
What will happen if I attempt to use PCH compiled for one GPU variant during compilation targeting a different variant? I know that clang does complain if the compilation uses different options (some of them?) compared to the options used during compilation that produced the PCH. I'm not sure whether it will be sufficient to prevent use of PCH for a wrong GPU. It would be great to have a test for that, too.
I believe one use-case for PCH is for common include headers such as hip_runtime.h which is being re-used in many application source files. To improve the performance, we can pre-compile the header and re-use it during online compilation.
Is -emit-pch only expected to work with a single sub-compilation only, similarly to how we handle -S ? The tests appear to imply so. If that's the case, if would be great to add a test verifying that --emit-pch fails if it's used with more than one sub-coimpilation.
This patch seems to only work for single sub-compilation. I've found that using -emit-pch for host+device will generate a PCH bitcode with clang-offload-bundle. The ASTReader cannot understand the clang_offload_bundle, and only looks for PCH's magic number (that could be future work). So for now, this will only work when PCH is generated under --cuda-host-only or --cuda-device-only options. I will update the tests to reflect this. We can however generate multiple pch for multiple device archs.
What will happen if I attempt to use PCH compiled for one GPU variant during compilation targeting a different variant? I know that clang does complain if the compilation uses different options (some of them?) compared to the options used during compilation that produced the PCH. I'm not sure whether it will be sufficient to prevent use of PCH for a wrong GPU. It would be great to have a test for that, too.
Right, clang will complain if the PCH used was compiled for a different GPU. I will add a test to check for this. We will see an error like:
error: PCH file was compiled for the target CPU 'gfx900' but the current translation unit is being compiled for target 'gfx803'
That would be potentially useful if it could be used from a normal compilation, but it's not. Single-sub-compilarion is a very very small niche.
I'm OK with making -emit-pch work for GPUs, but considering very limited use case and the fact that the generated PCH will be wrong more often than not (I.e. it will be usable for only 1 out of N subcompilations for particular TU), I would rather keep the -emit-pch a CC1 only option. Those who need it should be able to use it via -Xclang -emit-pch and for most of the regular users it does not matter.
@rsmith - WDYT?
Updated the tests to use --cuda-host-only or --cuda-device-only options when using -emit-pch. Added more tests for compilation when using -include-pch. Also, added a negative test when using different GPU variant PCH during compilation.
Having -emit-pch in the clang driver is useful because it doesn't require users to specify standard C++ include paths, clang include paths, and CUDA/HIP wrapper headers needed by CC1. That is error prone for the user. 
Also, this device compilation is not niche, it is needed for nvrtc/hiprtc and hip applications can perform device-only compilations at either compile-time or run-time.
I didn't meant o invoke -cc1 directly, but rather to pass -emit-pch via -Xclang -emit-pch. No need to provide *all* CC1 options manually.
Also, this device compilation is not niche, it is needed for nvrtc/hiprtc and hip applications can perform device-only compilations at either compile-time or run-time.
I'll leave it up to @rsmith. I'm not quite convinced that making -emit-pch a top-level option is the right thing to do yet.
Added a test checking for error when -o option is used for multi-device -emit-pch run.
Adding Diag when mixing device and host paths with -emit-pch. Currently, we don't support this path, since the generated pch will be a clang_offload_bundle (supporting that will require that the ASTReader understand clang offload bundles and that is outside the scope of this patch). Added tests to check error Diag is reported when running both paths with -emit-pch.
@tra , I tried to invoke clang with -Xclang -emit-pch, but the -x hip path doesn't know about the precompile phase. It does pass -emit-pch to the cc1 command, but is overridden by the -emit-obj default flag to cc1 in compiler phase. Also, from there it will go through backend, assembler, linker, which we will need to disable if we want this method to work. Here is the experiment:
root@e6915ef660c7:~/llvm-project/build_rel# ./bin/clang++ -x hip --cuda-device-only --cuda-gpu-arch=gfx803 -Xclang -emit-pch a.hip -o a.hip.pch -ccc-print-bindings
# "amdgcn-amd-amdhsa" - "clang", inputs: ["a.hip"], output: "/tmp/a-e15936.o"
# "amdgcn-amd-amdhsa" - "AMDGCN::Linker", inputs: ["/tmp/a-e15936.o"], output: "/tmp/a-b03f5d.out"
# "amdgcn-amd-amdhsa" - "AMDGCN::Linker", inputs: ["/tmp/a-b03f5d.out"], output: "a.hip.pch"
clang-12: warning: argument unused during compilation: '-Xclang -emit-pch' [-Wunused-command-line-argument]
root@e6915ef660c7:~/llvm-project/build_rel# ./bin/clang++ -x hip --cuda-device-only --cuda-gpu-arch=gfx803 -Xclang -emit-pch a.hip -o a.hip.pch -ccc-print-phases
                     +- 0: input, "a.hip", hip, (device-hip, gfx803)
                  +- 1: preprocessor, {0}, hip-cpp-output, (device-hip, gfx803)
               +- 2: compiler, {1}, ir, (device-hip, gfx803)
            +- 3: backend, {2}, assembler, (device-hip, gfx803)
         +- 4: assembler, {3}, object, (device-hip, gfx803)
      +- 5: linker, {4}, image, (device-hip, gfx803)
   +- 6: offload, "device-hip (amdgcn-amd-amdhsa:gfx803)" {5}, image
+- 7: linker, {6}, hip-fatbin, (device-hip, unknown)
8: offload, "device-hip (amdgcn-amd-amdhsa:unknown)" {7}, hip-fatbin
root@e6915ef660c7:~/llvm-project/build_rel# ./bin/clang++ -x hip --cuda-device-only --cuda-gpu-arch=gfx803 -Xclang -emit-pch a.hip -o a.hip.pch -###
clang version 12.0.0 (https://github.com/llvm/llvm-project.git 18698802f1075c3dbb8d51ed6c2e59c2108bf260)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /root/llvm-project/build_rel/./bin
 "/root/llvm-project/build_rel/bin/clang-12" "-cc1" "-triple" "amdgcn-amd-amdhsa" "-aux-triple" "x86_64-unknown-linux-gnu" "-emit-obj" "-mrelax-all" "--mrelax-relocations" "-disable-free" "-main-file-name" "a.hip" "-mrelocation-model" "pic" "-pic-level" "1" "-mframe-pointer=all" "-fdenormal-fp-math-f32=preserve-sign,preserve-sign" "-fno-rounding-math" "-mconstructor-aliases" "-aux-target-cpu" "x86-64" "-fcuda-is-device" "-mllvm" "-amdgpu-internalize-symbols" "-fcuda-allow-variadic-functions" "-fvisibility" "hidden" "-fapply-global-visibility-to-externs" "-mlink-builtin-bitcode" "/opt/rocm/lib/hip.amdgcn.bc" "-mlink-builtin-bitcode" "/opt/rocm/lib/ocml.amdgcn.bc" "-mlink-builtin-bitcode" "/opt/rocm/lib/ockl.amdgcn.bc" "-mlink-builtin-bitcode" "/opt/rocm/lib/oclc_daz_opt_on.amdgcn.bc" "-mlink-builtin-bitcode" "/opt/rocm/lib/oclc_unsafe_math_off.amdgcn.bc" "-mlink-builtin-bitcode" "/opt/rocm/lib/oclc_finite_only_off.amdgcn.bc" "-mlink-builtin-bitcode" "/opt/rocm/lib/oclc_correctly_rounded_sqrt_on.amdgcn.bc" "-mlink-builtin-bitcode" "/opt/rocm/lib/oclc_wavefrontsize64_on.amdgcn.bc" "-mlink-builtin-bitcode" "/opt/rocm/lib/oclc_isa_version_803.amdgcn.bc" "-target-cpu" "gfx803" "-fno-split-dwarf-inlining" "-debugger-tuning=gdb" "-resource-dir" "/root/llvm-project/build_rel/lib/clang/12.0.0" "-internal-isystem" "/root/llvm-project/build_rel/lib/clang/12.0.0/include/cuda_wrappers" "-internal-isystem" "/opt/rocm/include" "-include" "__clang_hip_runtime_wrapper.h" "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/7.5.0/../../../../include/c++/7.5.0" "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/7.5.0/../../../../include/x86_64-linux-gnu/c++/7.5.0" "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/7.5.0/../../../../include/x86_64-linux-gnu/c++/7.5.0" "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/7.5.0/../../../../include/c++/7.5.0/backward" "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/7.5.0/../../../../include/c++/7.5.0" "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/7.5.0/../../../../include/x86_64-linux-gnu/c++/7.5.0" "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/7.5.0/../../../../include/x86_64-linux-gnu/c++/7.5.0" "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/7.5.0/../../../../include/c++/7.5.0/backward" "-internal-isystem" "/usr/local/include" "-internal-isystem" "/root/llvm-project/build_rel/lib/clang/12.0.0/include" "-internal-externc-isystem" "/usr/include/x86_64-linux-gnu" "-internal-externc-isystem" "/include" "-internal-externc-isystem" "/usr/include" "-internal-isystem" "/usr/local/include" "-internal-isystem" "/root/llvm-project/build_rel/lib/clang/12.0.0/include" "-internal-externc-isystem" "/usr/include/x86_64-linux-gnu" "-internal-externc-isystem" "/include" "-internal-externc-isystem" "/usr/include" "-std=c++11" "-fdeprecated-macro" "-fno-autolink" "-fdebug-compilation-dir" "/root/llvm-project/build_rel" "-ferror-limit" "19" "-fhip-new-launch-api" "-fgnuc-version=4.2.1" "-fcxx-exceptions" "-fexceptions" "-fcolor-diagnostics" "-emit-pch" "-fcuda-allow-variadic-functions" "-faddrsig" "-o" "/tmp/a-fb5984.o" "-x" "hip" "a.hip"
 "/root/llvm-project/build_rel/./bin/lld" "-flavor" "gnu" "--no-undefined" "-shared" "-plugin-opt=-amdgpu-internalize-symbols" "-plugin-opt=mcpu=gfx803" "-o" "/tmp/a-f7f74d.out" "/tmp/a-fb5984.o"
 "/root/llvm-project/build_rel/./bin/clang-offload-bundler" "-type=o" "-targets=host-x86_64-unknown-linux,hip-amdgcn-amd-amdhsa-gfx803" "-inputs=/dev/null,/tmp/a-f7f74d.out" "-outputs=a.hip.pch"
root@e6915ef660c7:~/llvm-project/build_rel# ./bin/clang++ -x hip --cuda-device-only --cuda-gpu-arch=gfx803 -Xclang -emit-pch a.hip -o a.hip.pch
lld: error: /tmp/a-2d2a47.o:22693: unclosed quote
clang-12: error: amdgcn-link command failed with exit code 1 (use -v to see invocation)If you dd -S and remove -ccc-print-bindingsthis command does produce a PCH.
-Xclang -emit-pch does not change what the top-level driver does. You do need to tell it not to do too much. -S prevents additional bundling/linking steps. For regular C++ compilation -c would work, too.
Caveat -- while the command does produce the PCH, I have no idea whether that's the correct way to do it.
@tra , thanks, I've tried the options -S -Xclang -emit-pch and -c -emit-llvm -Xclang -emit-pch and they do generate a PCH. This workaround seems to work, and I am trying to test it.
But same as you, I'm not sure if this is the correct way to do it. It feels a bit hacky, and its uncertain whether this will continue to work in the future.
We should not pollute the normal phases for HIP program with Precompile.
BTW can you fix the extensions for HIP and HIP_DEVICE, which should be "hip" instead of "cu".