Page MenuHomePhabricator

[PGO][CUDA][HIP] Skip generating profile on the device stub and wrong-side functions.
ClosedPublic

Authored by hliao on Aug 5 2020, 12:00 AM.

Details

Summary
  • Skip generating profile data on __global__ function in the host compilation. It's a host-side stub function only and don't have profile instrumentation generated on the real function body. The extra profile data results in the malformed instrumentation profile data.
  • Skip generating region mapping on functions in the wrong-side, i.e., + For the device compilation, skip host-only functions; and, + For the host compilation, skip device-only functions (including __global__ functions.)
  • As the device-side profiling is not ready yet, only host-side profile code generation is checked.

Diff Detail

Event Timeline

hliao created this revision.Aug 5 2020, 12:00 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 5 2020, 12:00 AM
Herald added a subscriber: cfe-commits. · View Herald Transcript
hliao requested review of this revision.Aug 5 2020, 12:00 AM
tra added a comment.Aug 5 2020, 9:42 AM

LGTM for CUDA.

clang/lib/CodeGen/CodeGenPGO.cpp
839–840

We will still have around some functions that may never be used on the host side (HD functions referenced from device code only). I'm not sure if that's a problem for profiling, though. I wonder if we can somehow tie skipRegionMappingForDecl to whether we've actually codegen'ed the function.

hliao added inline comments.Aug 5 2020, 10:26 AM
clang/lib/CodeGen/CodeGenPGO.cpp
839–840

Skipping wrong-side functions here just makes the report not confusing as these functions are not emitted at all and are supposed never running on the host/device side. If we still create the mapping for them, e.g., we may report they have 0 runs instead of reporting nothing (just like comments between function.) That looks a little bit confusing.
It seems the current PGO adds everything for coverage mapping and late prune them based on checks here. Just try to follow that logic to skip wrong-side functions. If we need to revise the original logic and generate coverage mapping for emitted functions only, the change here is unnecessary.

tra added inline comments.Aug 5 2020, 10:50 AM
clang/lib/CodeGen/CodeGenPGO.cpp
839–840

I'd add a comment here that this 'filter' is just a rough best-effort approximation that still allows some effectively device-only Decls through.
The output should still be correct, even though the functions will never be used. Maybe add a TODO to deal with it if/when we know if the Decl was codegen'ed.

yaxunl added a comment.Aug 6 2020, 6:35 AM

Do we need to disable pgo and coverage mapping for device compilation? Or it is already disabled?

tra added a comment.Aug 6 2020, 9:28 AM

Do we need to disable pgo and coverage mapping for device compilation? Or it is already disabled?

We already disable profiling during device compilation for NVIDIA and AMD GPUs:
https://github.com/llvm/llvm-project/blob/394db2259575ef3cac8d3d37836b11eb2373c435/clang/lib/Driver/ToolChains/Clang.cpp#L4876

hliao added a comment.Aug 6 2020, 9:41 AM
In D85276#2200108, @tra wrote:

Do we need to disable pgo and coverage mapping for device compilation? Or it is already disabled?

We already disable profiling during device compilation for NVIDIA and AMD GPUs:
https://github.com/llvm/llvm-project/blob/394db2259575ef3cac8d3d37836b11eb2373c435/clang/lib/Driver/ToolChains/Clang.cpp#L4876

Anyway, this patch just fixes the caused by that device stub function. As it's "emitted" in the host compilation, we need to skip generating instrumentation on it explicitly.

yaxunl accepted this revision.Aug 6 2020, 10:34 AM

LGTM. thanks

This revision is now accepted and ready to land.Aug 6 2020, 10:34 AM
hliao updated this revision to Diff 283654.Aug 6 2020, 10:35 AM

Revise the comment.

hliao added inline comments.Aug 6 2020, 10:38 AM
clang/lib/CodeGen/CodeGenPGO.cpp
839–840

Add that comment. But, I tend to not deal that "effectively" host-only/device-only ones as that should be developers' responsibility to handle them. The additional zero coverage mapping may be useful as well. If a function is really device-only but is attributed with HD, the 0 coverage may help developers correcting them.

tra added inline comments.Aug 6 2020, 10:53 AM
clang/lib/CodeGen/CodeGenPGO.cpp
839–840

It will be rather noisy in practice. A lot of code has either has been written for NVCC or has to compile with it. NVCC does not have target overloads, so sticking HD everywhere is pretty much the only practical way to do it in complicated enough C++ code. Anything that uses Eigen or Thrust will have tons of HD functions that are actually used only on one side.

This revision was landed with ongoing or failed builds.Aug 10 2020, 8:02 AM
This revision was automatically updated to reflect the committed changes.
hliao added inline comments.Aug 10 2020, 8:03 AM
clang/lib/CodeGen/CodeGenPGO.cpp
839–840

Most HD interfaces in Eigen are designed to be used in both CPU and GPU. For GPU only ones, they are marked with __device__ only. Thrust has a similar situation.