Page MenuHomePhabricator

[HIP] Add -emit-pch option to clang driver
Needs ReviewPublic

Authored by ashi1 on Tue, Sep 8, 1:12 PM.

Details

Reviewers
yaxunl
rsmith
tra
Summary

Move the -emit-pch option to clang driver flags, so that
HIP can emit pch files, if requested in driver. This is on both
device and host paths. Introducing a new TY_HIPHeader,
which is used for the Precompile phase list.

Diff Detail

Event Timeline

ashi1 created this revision.Tue, Sep 8, 1:12 PM
ashi1 requested review of this revision.Tue, Sep 8, 1:12 PM

need tests for --cuda-host-only and default host/device compilation, and tests for C/C++

clang/include/clang/Driver/Types.def
47

We should not pollute the normal phases for HIP program with Precompile.

BTW can you fix the extensions for HIP and HIP_DEVICE, which should be "hip" instead of "cu".

64

I would suggest to add "hip-header" and "hip-header-cpp-output" like "c++-header" and "c++-header-cpp-output". Basically we will only allow -emit-pch on "hip-header".

clang/lib/Driver/Driver.cpp
2829

typo: Actoms

3638

we need to support -emit-pch in host only compilation too. And not just for HIP, for C and C++ too, since it is supposed to be a generic option.

clang/test/Driver/hip-device-compile.hip
31 ↗(On Diff #290572)

pls remove the --hip-device-lib options and add -nogpulib

ashi1 marked 5 inline comments as done.Thu, Sep 10, 2:20 PM

Looking into the C/C++ tests.

clang/include/clang/Driver/Types.def
64

Sounds good, I had tried this attempt earlier, and I have the new diff with this implementation.

ashi1 updated this revision to Diff 291082.Thu, Sep 10, 2:23 PM
ashi1 marked an inline comment as done.
ashi1 edited the summary of this revision. (Show Details)

Fixed to use TY_HIPHeader instead of changing the phases in TY_HIP.

tra added a comment.Mon, Sep 14, 11:26 AM

Can you elaborate on the use case of PCH files for CUDA/HIP?

Is -emit-pch only expected to work with a single sub-compilation only, similarly to how we handle -S ? The tests appear to imply so. If that's the case, if would be great to add a test verifying that --emit-pch fails if it's used with more than one sub-coimpilation.

What will happen if I attempt to use PCH compiled for one GPU variant during compilation targeting a different variant? I know that clang does complain if the compilation uses different options (some of them?) compared to the options used during compilation that produced the PCH. I'm not sure whether it will be sufficient to prevent use of PCH for a wrong GPU. It would be great to have a test for that, too.

In D87325#2271676, @tra wrote:

Can you elaborate on the use case of PCH files for CUDA/HIP?

I believe one use-case for PCH is for common include headers such as hip_runtime.h which is being re-used in many application source files. To improve the performance, we can pre-compile the header and re-use it during online compilation.

Is -emit-pch only expected to work with a single sub-compilation only, similarly to how we handle -S ? The tests appear to imply so. If that's the case, if would be great to add a test verifying that --emit-pch fails if it's used with more than one sub-coimpilation.

This patch seems to only work for single sub-compilation. I've found that using -emit-pch for host+device will generate a PCH bitcode with clang-offload-bundle. The ASTReader cannot understand the clang_offload_bundle, and only looks for PCH's magic number (that could be future work). So for now, this will only work when PCH is generated under --cuda-host-only or --cuda-device-only options. I will update the tests to reflect this. We can however generate multiple pch for multiple device archs.

What will happen if I attempt to use PCH compiled for one GPU variant during compilation targeting a different variant? I know that clang does complain if the compilation uses different options (some of them?) compared to the options used during compilation that produced the PCH. I'm not sure whether it will be sufficient to prevent use of PCH for a wrong GPU. It would be great to have a test for that, too.

Right, clang will complain if the PCH used was compiled for a different GPU. I will add a test to check for this. We will see an error like:
error: PCH file was compiled for the target CPU 'gfx900' but the current translation unit is being compiled for target 'gfx803'

tra added a comment.Wed, Sep 16, 12:03 PM
In D87325#2271676, @tra wrote:

Can you elaborate on the use case of PCH files for CUDA/HIP?

I believe one use-case for PCH is for common include headers such as hip_runtime.h which is being re-used in many application source files. To improve the performance, we can pre-compile the header and re-use it during online compilation.

That would be potentially useful if it could be used from a normal compilation, but it's not. Single-sub-compilarion is a very very small niche.

I'm OK with making -emit-pch work for GPUs, but considering very limited use case and the fact that the generated PCH will be wrong more often than not (I.e. it will be usable for only 1 out of N subcompilations for particular TU), I would rather keep the -emit-pch a CC1 only option. Those who need it should be able to use it via -Xclang -emit-pch and for most of the regular users it does not matter.

@rsmith - WDYT?

ashi1 updated this revision to Diff 292315.EditedWed, Sep 16, 12:48 PM

Updated the tests to use --cuda-host-only or --cuda-device-only options when using -emit-pch. Added more tests for compilation when using -include-pch. Also, added a negative test when using different GPU variant PCH during compilation.

ashi1 added a comment.Wed, Sep 16, 1:45 PM
In D87325#2277467, @tra wrote:
In D87325#2271676, @tra wrote:

Can you elaborate on the use case of PCH files for CUDA/HIP?

I believe one use-case for PCH is for common include headers such as hip_runtime.h which is being re-used in many application source files. To improve the performance, we can pre-compile the header and re-use it during online compilation.

That would be potentially useful if it could be used from a normal compilation, but it's not. Single-sub-compilarion is a very very small niche.

I'm OK with making -emit-pch work for GPUs, but considering very limited use case and the fact that the generated PCH will be wrong more often than not (I.e. it will be usable for only 1 out of N subcompilations for particular TU), I would rather keep the -emit-pch a CC1 only option. Those who need it should be able to use it via -Xclang -emit-pch and for most of the regular users it does not matter.

Having -emit-pch in the clang driver is useful because it doesn't require users to specify standard C++ include paths, clang include paths, and CUDA/HIP wrapper headers needed by CC1. That is error prone for the user.
Also, this device compilation is not niche, it is needed for nvrtc/hiprtc and hip applications can perform device-only compilations at either compile-time or run-time.

tra added a comment.Wed, Sep 16, 2:17 PM

Having -emit-pch in the clang driver is useful because it doesn't require users to specify standard C++ include paths, clang include paths, and CUDA/HIP wrapper headers needed by CC1. That is error prone for the user.

I didn't meant o invoke -cc1 directly, but rather to pass -emit-pch via -Xclang -emit-pch. No need to provide *all* CC1 options manually.

Also, this device compilation is not niche, it is needed for nvrtc/hiprtc and hip applications can perform device-only compilations at either compile-time or run-time.

I'll leave it up to @rsmith. I'm not quite convinced that making -emit-pch a top-level option is the right thing to do yet.

ashi1 updated this revision to Diff 292344.Wed, Sep 16, 2:24 PM

Added a test checking for error when -o option is used for multi-device -emit-pch run.

ashi1 updated this revision to Diff 292570.Thu, Sep 17, 11:21 AM

Added a C++ header to .pch file test.

ashi1 updated this revision to Diff 292572.EditedThu, Sep 17, 11:43 AM

Adding Diag when mixing device and host paths with -emit-pch. Currently, we don't support this path, since the generated pch will be a clang_offload_bundle (supporting that will require that the ASTReader understand clang offload bundles and that is outside the scope of this patch). Added tests to check error Diag is reported when running both paths with -emit-pch.

ashi1 added a comment.EditedThu, Sep 17, 2:13 PM
In D87325#2277854, @tra wrote:

Having -emit-pch in the clang driver is useful because it doesn't require users to specify standard C++ include paths, clang include paths, and CUDA/HIP wrapper headers needed by CC1. That is error prone for the user.

I didn't meant o invoke -cc1 directly, but rather to pass -emit-pch via -Xclang -emit-pch. No need to provide *all* CC1 options manually.

@tra , I tried to invoke clang with -Xclang -emit-pch, but the -x hip path doesn't know about the precompile phase. It does pass -emit-pch to the cc1 command, but is overridden by the -emit-obj default flag to cc1 in compiler phase. Also, from there it will go through backend, assembler, linker, which we will need to disable if we want this method to work. Here is the experiment:

root@e6915ef660c7:~/llvm-project/build_rel# ./bin/clang++ -x hip --cuda-device-only --cuda-gpu-arch=gfx803 -Xclang -emit-pch a.hip -o a.hip.pch -ccc-print-bindings
# "amdgcn-amd-amdhsa" - "clang", inputs: ["a.hip"], output: "/tmp/a-e15936.o"
# "amdgcn-amd-amdhsa" - "AMDGCN::Linker", inputs: ["/tmp/a-e15936.o"], output: "/tmp/a-b03f5d.out"
# "amdgcn-amd-amdhsa" - "AMDGCN::Linker", inputs: ["/tmp/a-b03f5d.out"], output: "a.hip.pch"
clang-12: warning: argument unused during compilation: '-Xclang -emit-pch' [-Wunused-command-line-argument]
root@e6915ef660c7:~/llvm-project/build_rel# ./bin/clang++ -x hip --cuda-device-only --cuda-gpu-arch=gfx803 -Xclang -emit-pch a.hip -o a.hip.pch -ccc-print-phases
                     +- 0: input, "a.hip", hip, (device-hip, gfx803)
                  +- 1: preprocessor, {0}, hip-cpp-output, (device-hip, gfx803)
               +- 2: compiler, {1}, ir, (device-hip, gfx803)
            +- 3: backend, {2}, assembler, (device-hip, gfx803)
         +- 4: assembler, {3}, object, (device-hip, gfx803)
      +- 5: linker, {4}, image, (device-hip, gfx803)
   +- 6: offload, "device-hip (amdgcn-amd-amdhsa:gfx803)" {5}, image
+- 7: linker, {6}, hip-fatbin, (device-hip, unknown)
8: offload, "device-hip (amdgcn-amd-amdhsa:unknown)" {7}, hip-fatbin
root@e6915ef660c7:~/llvm-project/build_rel# ./bin/clang++ -x hip --cuda-device-only --cuda-gpu-arch=gfx803 -Xclang -emit-pch a.hip -o a.hip.pch -###
clang version 12.0.0 (https://github.com/llvm/llvm-project.git 18698802f1075c3dbb8d51ed6c2e59c2108bf260)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /root/llvm-project/build_rel/./bin
 "/root/llvm-project/build_rel/bin/clang-12" "-cc1" "-triple" "amdgcn-amd-amdhsa" "-aux-triple" "x86_64-unknown-linux-gnu" "-emit-obj" "-mrelax-all" "--mrelax-relocations" "-disable-free" "-main-file-name" "a.hip" "-mrelocation-model" "pic" "-pic-level" "1" "-mframe-pointer=all" "-fdenormal-fp-math-f32=preserve-sign,preserve-sign" "-fno-rounding-math" "-mconstructor-aliases" "-aux-target-cpu" "x86-64" "-fcuda-is-device" "-mllvm" "-amdgpu-internalize-symbols" "-fcuda-allow-variadic-functions" "-fvisibility" "hidden" "-fapply-global-visibility-to-externs" "-mlink-builtin-bitcode" "/opt/rocm/lib/hip.amdgcn.bc" "-mlink-builtin-bitcode" "/opt/rocm/lib/ocml.amdgcn.bc" "-mlink-builtin-bitcode" "/opt/rocm/lib/ockl.amdgcn.bc" "-mlink-builtin-bitcode" "/opt/rocm/lib/oclc_daz_opt_on.amdgcn.bc" "-mlink-builtin-bitcode" "/opt/rocm/lib/oclc_unsafe_math_off.amdgcn.bc" "-mlink-builtin-bitcode" "/opt/rocm/lib/oclc_finite_only_off.amdgcn.bc" "-mlink-builtin-bitcode" "/opt/rocm/lib/oclc_correctly_rounded_sqrt_on.amdgcn.bc" "-mlink-builtin-bitcode" "/opt/rocm/lib/oclc_wavefrontsize64_on.amdgcn.bc" "-mlink-builtin-bitcode" "/opt/rocm/lib/oclc_isa_version_803.amdgcn.bc" "-target-cpu" "gfx803" "-fno-split-dwarf-inlining" "-debugger-tuning=gdb" "-resource-dir" "/root/llvm-project/build_rel/lib/clang/12.0.0" "-internal-isystem" "/root/llvm-project/build_rel/lib/clang/12.0.0/include/cuda_wrappers" "-internal-isystem" "/opt/rocm/include" "-include" "__clang_hip_runtime_wrapper.h" "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/7.5.0/../../../../include/c++/7.5.0" "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/7.5.0/../../../../include/x86_64-linux-gnu/c++/7.5.0" "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/7.5.0/../../../../include/x86_64-linux-gnu/c++/7.5.0" "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/7.5.0/../../../../include/c++/7.5.0/backward" "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/7.5.0/../../../../include/c++/7.5.0" "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/7.5.0/../../../../include/x86_64-linux-gnu/c++/7.5.0" "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/7.5.0/../../../../include/x86_64-linux-gnu/c++/7.5.0" "-internal-isystem" "/usr/lib/gcc/x86_64-linux-gnu/7.5.0/../../../../include/c++/7.5.0/backward" "-internal-isystem" "/usr/local/include" "-internal-isystem" "/root/llvm-project/build_rel/lib/clang/12.0.0/include" "-internal-externc-isystem" "/usr/include/x86_64-linux-gnu" "-internal-externc-isystem" "/include" "-internal-externc-isystem" "/usr/include" "-internal-isystem" "/usr/local/include" "-internal-isystem" "/root/llvm-project/build_rel/lib/clang/12.0.0/include" "-internal-externc-isystem" "/usr/include/x86_64-linux-gnu" "-internal-externc-isystem" "/include" "-internal-externc-isystem" "/usr/include" "-std=c++11" "-fdeprecated-macro" "-fno-autolink" "-fdebug-compilation-dir" "/root/llvm-project/build_rel" "-ferror-limit" "19" "-fhip-new-launch-api" "-fgnuc-version=4.2.1" "-fcxx-exceptions" "-fexceptions" "-fcolor-diagnostics" "-emit-pch" "-fcuda-allow-variadic-functions" "-faddrsig" "-o" "/tmp/a-fb5984.o" "-x" "hip" "a.hip"
 "/root/llvm-project/build_rel/./bin/lld" "-flavor" "gnu" "--no-undefined" "-shared" "-plugin-opt=-amdgpu-internalize-symbols" "-plugin-opt=mcpu=gfx803" "-o" "/tmp/a-f7f74d.out" "/tmp/a-fb5984.o"
 "/root/llvm-project/build_rel/./bin/clang-offload-bundler" "-type=o" "-targets=host-x86_64-unknown-linux,hip-amdgcn-amd-amdhsa-gfx803" "-inputs=/dev/null,/tmp/a-f7f74d.out" "-outputs=a.hip.pch"
root@e6915ef660c7:~/llvm-project/build_rel# ./bin/clang++ -x hip --cuda-device-only --cuda-gpu-arch=gfx803 -Xclang -emit-pch a.hip -o a.hip.pch
lld: error: /tmp/a-2d2a47.o:22693: unclosed quote
clang-12: error: amdgcn-link command failed with exit code 1 (use -v to see invocation)
tra added a comment.Thu, Sep 17, 2:44 PM

@tra , I tried to invoke clang with -Xclang -emit-pch, but the -x hip path doesn't know about the precompile phase. It does pass -emit-pch to the cc1 command, but is overridden by the -emit-obj default flag to cc1 in compiler phase. Also, from there it will go through backend, assembler, linker, which we will need to disable if we want this method to work. Here is the experiment:

root@e6915ef660c7:~/llvm-project/build_rel# ./bin/clang++ -x hip --cuda-device-only --cuda-gpu-arch=gfx803 -Xclang -emit-pch a.hip -o a.hip.pch -ccc-print-bindings

If you dd -S and remove -ccc-print-bindingsthis command does produce a PCH.

-Xclang -emit-pch does not change what the top-level driver does. You do need to tell it not to do too much. -S prevents additional bundling/linking steps. For regular C++ compilation -c would work, too.

Caveat -- while the command does produce the PCH, I have no idea whether that's the correct way to do it.

In D87325#2280370, @tra wrote:

If you dd -S and remove -ccc-print-bindingsthis command does produce a PCH.

-Xclang -emit-pch does not change what the top-level driver does. You do need to tell it not to do too much. -S prevents additional bundling/linking steps. For regular C++ compilation -c would work, too.

Caveat -- while the command does produce the PCH, I have no idea whether that's the correct way to do it.

@tra , thanks, I've tried the options -S -Xclang -emit-pch and -c -emit-llvm -Xclang -emit-pch and they do generate a PCH. This workaround seems to work, and I am trying to test it.
But same as you, I'm not sure if this is the correct way to do it. It feels a bit hacky, and its uncertain whether this will continue to work in the future.