- Skip generating profile data on __global__ function in the host compilation. It's a host-side stub function only and don't have profile instrumentation generated on the real function body. The extra profile data results in the malformed instrumentation profile data.
- Skip generating region mapping on functions in the wrong-side, i.e., + For the device compilation, skip host-only functions; and, + For the host compilation, skip device-only functions (including __global__ functions.)
- As the device-side profiling is not ready yet, only host-side profile code generation is checked.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
LGTM for CUDA.
clang/lib/CodeGen/CodeGenPGO.cpp | ||
---|---|---|
839–840 | We will still have around some functions that may never be used on the host side (HD functions referenced from device code only). I'm not sure if that's a problem for profiling, though. I wonder if we can somehow tie skipRegionMappingForDecl to whether we've actually codegen'ed the function. |
clang/lib/CodeGen/CodeGenPGO.cpp | ||
---|---|---|
839–840 | Skipping wrong-side functions here just makes the report not confusing as these functions are not emitted at all and are supposed never running on the host/device side. If we still create the mapping for them, e.g., we may report they have 0 runs instead of reporting nothing (just like comments between function.) That looks a little bit confusing. |
clang/lib/CodeGen/CodeGenPGO.cpp | ||
---|---|---|
839–840 | I'd add a comment here that this 'filter' is just a rough best-effort approximation that still allows some effectively device-only Decls through. |
Do we need to disable pgo and coverage mapping for device compilation? Or it is already disabled?
We already disable profiling during device compilation for NVIDIA and AMD GPUs:
https://github.com/llvm/llvm-project/blob/394db2259575ef3cac8d3d37836b11eb2373c435/clang/lib/Driver/ToolChains/Clang.cpp#L4876
Anyway, this patch just fixes the caused by that device stub function. As it's "emitted" in the host compilation, we need to skip generating instrumentation on it explicitly.
clang/lib/CodeGen/CodeGenPGO.cpp | ||
---|---|---|
839–840 | Add that comment. But, I tend to not deal that "effectively" host-only/device-only ones as that should be developers' responsibility to handle them. The additional zero coverage mapping may be useful as well. If a function is really device-only but is attributed with HD, the 0 coverage may help developers correcting them. |
clang/lib/CodeGen/CodeGenPGO.cpp | ||
---|---|---|
839–840 | It will be rather noisy in practice. A lot of code has either has been written for NVCC or has to compile with it. NVCC does not have target overloads, so sticking HD everywhere is pretty much the only practical way to do it in complicated enough C++ code. Anything that uses Eigen or Thrust will have tons of HD functions that are actually used only on one side. |
clang/lib/CodeGen/CodeGenPGO.cpp | ||
---|---|---|
839–840 | Most HD interfaces in Eigen are designed to be used in both CPU and GPU. For GPU only ones, they are marked with __device__ only. Thrust has a similar situation. |
We will still have around some functions that may never be used on the host side (HD functions referenced from device code only). I'm not sure if that's a problem for profiling, though. I wonder if we can somehow tie skipRegionMappingForDecl to whether we've actually codegen'ed the function.