This is an archive of the discontinued LLVM Phabricator instance.


Differential D85276

[PGO][CUDA][HIP] Skip generating profile on the device stub and wrong-side functions.
Closed · Public

Authored by hliao on Aug 5 2020, 12:00 AM.


Details

Reviewers

tra
yaxunl
bogner

Commits

rGc7b683c126b8: [PGO][CUDA][HIP] Skip generating profile on the device stub and wrong-side…

Summary

- Skip generating profile data for `__global__` functions in the host compilation. On the host, such a function is only a stub, and no profile instrumentation is generated for the real function body. The extra profile data results in malformed instrumentation profile data.
- Skip generating region mapping for wrong-side functions, i.e.,
  + For the device compilation, skip host-only functions; and,
  + For the host compilation, skip device-only functions (including `__global__` functions).
- As device-side profiling is not ready yet, only host-side profile code generation is checked.
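
The two checks described in the summary can be modeled outside Clang as plain boolean predicates. The following is a minimal sketch rather than the patch itself: `LangMode` and `DeclAttrs` are hypothetical stand-ins for `CGM.getLangOpts()` and the `D->hasAttr<...>()` queries, and both function names are invented for illustration.

```cpp
#include <cassert>

// Hypothetical stand-in for the language options the patch consults.
struct LangMode {
  bool CUDA;         // models CGM.getLangOpts().CUDA
  bool CUDAIsDevice; // models CGM.getLangOpts().CUDAIsDevice
};

// Hypothetical stand-in for the attributes queried on a declaration.
struct DeclAttrs {
  bool Host;   // models D->hasAttr<CUDAHostAttr>()
  bool Device; // models D->hasAttr<CUDADeviceAttr>()
  bool Global; // models D->hasAttr<CUDAGlobalAttr>()
};

// Models the check added to assignRegionCounters: in the host compilation a
// __global__ function is only a launch stub, so it gets no counters.
constexpr bool isHostSideKernelStub(LangMode L, DeclAttrs A) {
  return L.CUDA && !L.CUDAIsDevice && A.Global;
}

// Models the check added to skipRegionMappingForDecl: host-only functions are
// skipped in the device compilation, kernels and device-only functions are
// skipped in the host compilation.
constexpr bool isWrongSide(LangMode L, DeclAttrs A) {
  return L.CUDA &&
         ((L.CUDAIsDevice && !A.Device && !A.Global) ||
          (!L.CUDAIsDevice && (A.Global || (!A.Host && A.Device))));
}

// A __device__-only function is on the wrong side in the host compilation.
static_assert(isWrongSide(LangMode{true, false}, DeclAttrs{false, true, false}),
              "device-only function should be skipped on the host");
// A function marked both __host__ and __device__ is kept on either side.
static_assert(!isWrongSide(LangMode{true, false}, DeclAttrs{true, true, false}),
              "HD function should be kept on the host");
static_assert(!isWrongSide(LangMode{true, true}, DeclAttrs{true, true, false}),
              "HD function should be kept on the device");
```

An unannotated function has none of the three attributes set in this model, so it is kept in the host compilation and skipped in the device compilation, which matches the attribute test in the diff.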

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

hliao created this revision. Aug 5 2020, 12:00 AM

Herald added a project: Restricted Project. Aug 5 2020, 12:00 AM

Herald added a subscriber: cfe-commits.

hliao requested review of this revision. Aug 5 2020, 12:00 AM

Harbormaster completed remote builds in B67055: Diff 283137. Aug 5 2020, 12:40 AM

LGTM for CUDA.

clang/lib/CodeGen/CodeGenPGO.cpp
839–840	We will still have some functions around that may never be used on the host side (HD functions referenced from device code only). I'm not sure if that's a problem for profiling, though. I wonder if we can somehow tie `skipRegionMappingForDecl` to whether we've actually codegen'ed the function.

hliao added inline comments. Aug 5 2020, 10:26 AM

clang/lib/CodeGen/CodeGenPGO.cpp
839–840	Skipping wrong-side functions here just keeps the report from being confusing, as these functions are not emitted at all and are never supposed to run on that side. If we still create the mapping for them, we may, e.g., report that they have 0 runs instead of reporting nothing (as for comments between functions). That looks a little confusing. It seems the current PGO adds everything for coverage mapping and prunes entries late, based on the checks here. This change just follows that logic to skip wrong-side functions. If we revise the original logic to generate coverage mapping for emitted functions only, the change here becomes unnecessary.
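
The difference between "0 runs" and "nothing reported" can be illustrated with a toy model of the add-everything-then-prune flow described in this comment. This is only a sketch under stated assumptions: `FnModel` and `buildReport` are hypothetical names, and the real coverage tooling is far more involved than a map of counts.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical model of one function seen by the front end; not a Clang type.
struct FnModel {
  std::string Name;
  bool WrongSide; // e.g. a __device__-only function in the host compilation
};

// Builds a per-function execution-count report. Every function that keeps a
// coverage mapping gets a row; functions without recorded counts show 0.
std::map<std::string, unsigned>
buildReport(const std::vector<FnModel> &Fns,
            const std::map<std::string, unsigned> &RecordedCounts,
            bool PruneWrongSide) {
  std::map<std::string, unsigned> Report;
  for (const FnModel &F : Fns) {
    if (PruneWrongSide && F.WrongSide)
      continue; // no mapping is kept, so the function does not appear at all
    auto It = RecordedCounts.find(F.Name);
    Report[F.Name] = It == RecordedCounts.end() ? 0u : It->second;
  }
  return Report;
}
```

Without pruning, a wrong-side function such as `dfn` shows up in this model with a count of 0 even though it could never have run on that side; with pruning it is simply absent from the report.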

tra added inline comments. Aug 5 2020, 10:50 AM

clang/lib/CodeGen/CodeGenPGO.cpp
839–840	I'd add a comment here that this 'filter' is just a rough best-effort approximation that still allows some effectively device-only Decls through. The output should still be correct, even though the functions will never be used. Maybe add a TODO to deal with it if/when we know if the Decl was codegen'ed.

Do we need to disable PGO and coverage mapping for device compilation? Or is it already disabled?

In D85276#2199655, @yaxunl wrote:

Do we need to disable PGO and coverage mapping for device compilation? Or is it already disabled?

We already disable profiling during device compilation for NVIDIA and AMD GPUs:
https://github.com/llvm/llvm-project/blob/394db2259575ef3cac8d3d37836b11eb2373c435/clang/lib/Driver/ToolChains/Clang.cpp#L4876

In D85276#2200108, @tra wrote:

In D85276#2199655, @yaxunl wrote:

Do we need to disable PGO and coverage mapping for device compilation? Or is it already disabled?

We already disable profiling during device compilation for NVIDIA and AMD GPUs:
https://github.com/llvm/llvm-project/blob/394db2259575ef3cac8d3d37836b11eb2373c435/clang/lib/Driver/ToolChains/Clang.cpp#L4876

Anyway, this patch just fixes the issue caused by that device stub function. As it's "emitted" in the host compilation, we need to skip generating instrumentation on it explicitly.

LGTM. thanks

This revision is now accepted and ready to land. Aug 6 2020, 10:34 AM

Revise the comment.

hliao added inline comments. Aug 6 2020, 10:38 AM

clang/lib/CodeGen/CodeGenPGO.cpp
839–840	Added that comment. But I tend not to deal with those "effectively" host-only/device-only ones, as it should be the developers' responsibility to handle them. The additional zero coverage mapping may be useful as well: if a function is really device-only but is attributed with HD, the 0 coverage may help developers correct it.

tra added inline comments. Aug 6 2020, 10:53 AM

clang/lib/CodeGen/CodeGenPGO.cpp
839–840	It will be rather noisy in practice. A lot of code either has been written for NVCC or has to compile with it. NVCC does not have target overloads, so sticking HD everywhere is pretty much the only practical way to do it in complicated enough C++ code. Anything that uses Eigen or Thrust will have tons of HD functions that are actually used only on one side.
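
The gap discussed in this thread, an HD function that is only ever referenced from device code, can be sketched as follows for the host compilation. Everything here is hypothetical: `HostSideFn`, `skipByAttributes`, and `skipByEmission` are invented names, and `EmittedOnHost` assumes knowledge of whether the declaration was codegen'ed, which the thread notes is not available at this point.

```cpp
#include <cassert>

// Hypothetical model of a function in the host compilation; not a Clang type.
struct HostSideFn {
  bool Host;          // carries __host__
  bool Device;        // carries __device__
  bool Global;        // carries __global__
  bool EmittedOnHost; // assumed knowledge: did host codegen emit a body?
};

// Attribute-only filter, following the host-side half of the check in the
// patch: skip kernels and functions that are __device__ without __host__.
inline bool skipByAttributes(const HostSideFn &F) {
  return F.Global || (!F.Host && F.Device);
}

// Hypothetical stricter filter in the spirit of the suggested TODO: also skip
// anything the host compilation never emitted.
inline bool skipByEmission(const HostSideFn &F) {
  return skipByAttributes(F) || !F.EmittedOnHost;
}
```

In this model an HD function that the host never emits passes the attribute-only filter and would therefore still receive a zero-count mapping, while the emission-based variant would drop it.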

Harbormaster completed remote builds in B67347: Diff 283654. Aug 6 2020, 11:08 AM

This revision was landed with ongoing or failed builds. Aug 10 2020, 8:02 AM

Closed by commit rGc7b683c126b8: [PGO][CUDA][HIP] Skip generating profile on the device stub and wrong-side… (authored by hliao).

This revision was automatically updated to reflect the committed changes.

hliao added a commit: rGc7b683c126b8: [PGO][CUDA][HIP] Skip generating profile on the device stub and wrong-side….

hliao added inline comments. Aug 10 2020, 8:03 AM

clang/lib/CodeGen/CodeGenPGO.cpp
839–840	Most HD interfaces in Eigen are designed to be used on both the CPU and the GPU. The GPU-only ones are marked with `__device__` only. Thrust is in a similar situation.

Revision Contents

Path	Size
clang/lib/CodeGen/CodeGenPGO.cpp	17 lines
clang/test/CodeGenCUDA/profile-coverage-mapping.cu	20 lines

Diff 284374

clang/lib/CodeGen/CodeGenPGO.cpp

[767 lines not shown; enclosing context: uint64_t PGOHash::finalize() {]
   return Result.low();
 }

 void CodeGenPGO::assignRegionCounters(GlobalDecl GD, llvm::Function *Fn) {
   const Decl *D = GD.getDecl();
   if (!D->hasBody())
     return;

+  // Skip CUDA/HIP kernel launch stub functions.
+  if (CGM.getLangOpts().CUDA && !CGM.getLangOpts().CUDAIsDevice &&
+      D->hasAttr<CUDAGlobalAttr>())
+    return;
+
   bool InstrumentRegions = CGM.getCodeGenOpts().hasProfileClangInstr();
   llvm::IndexedInstrProfReader *PGOReader = CGM.getPGOReader();
   if (!InstrumentRegions && !PGOReader)
     return;
   if (D->isImplicit())
     return;
   // Constructors and destructors may be represented by several functions in IR.
   // If so, instrument only base variant, others are implemented by delegation
[42 lines not shown; enclosing context: void CodeGenPGO::mapRegionCounters(const Decl *D) {]
   NumRegionCounters = Walker.NextCounter;
   FunctionHash = Walker.Hash.finalize();
 }

 bool CodeGenPGO::skipRegionMappingForDecl(const Decl *D) {
   if (!D->getBody())
     return true;

+  // Skip host-only functions in the CUDA device compilation and device-only
+  // functions in the host compilation. Just roughly filter them out based on
+  // the function attributes. If there are effectively host-only or device-only
+  // ones, their coverage mapping may still be generated.
+  if (CGM.getLangOpts().CUDA &&
+      ((CGM.getLangOpts().CUDAIsDevice && !D->hasAttr<CUDADeviceAttr>() &&
+        !D->hasAttr<CUDAGlobalAttr>()) ||
+       (!CGM.getLangOpts().CUDAIsDevice &&
+        (D->hasAttr<CUDAGlobalAttr>() ||
+         (!D->hasAttr<CUDAHostAttr>() && D->hasAttr<CUDADeviceAttr>())))))
+    return true;
+
   // Don't map the functions in system headers.
   const auto &SM = CGM.getContext().getSourceManager();
   auto Loc = D->getBody()->getBeginLoc();
   return SM.isInSystemHeader(Loc);
 }

 void CodeGenPGO::emitCounterRegionMapping(const Decl *D) {
   if (skipRegionMappingForDecl(D))
[228 lines not shown]

clang/test/CodeGenCUDA/profile-coverage-mapping.cu

This file was added.

// RUN: echo "GPU binary would be here" > %t
// RUN: %clang_cc1 -fprofile-instrument=clang -triple x86_64-linux-gnu -target-sdk-version=8.0 -fcuda-include-gpubinary %t -emit-llvm -o - %s | FileCheck --check-prefix=PGOGEN %s
// RUN: %clang_cc1 -fprofile-instrument=clang -fcoverage-mapping -triple x86_64-linux-gnu -target-sdk-version=8.0 -fcuda-include-gpubinary %t -emit-llvm -o - %s | FileCheck --check-prefix=COVMAP %s
// RUN: %clang_cc1 -fprofile-instrument=clang -fcoverage-mapping -dump-coverage-mapping -triple x86_64-linux-gnu -target-sdk-version=8.0 -fcuda-include-gpubinary %t -emit-llvm-only -o - %s | FileCheck --check-prefix=MAPPING %s

#include "Inputs/cuda.h"

// PGOGEN-NOT: @__profn_{{.*kernel.*}} =
// COVMAP-COUNT-2: section "__llvm_covfun", comdat
// COVMAP-NOT: section "__llvm_covfun", comdat
// MAPPING-NOT: {{.*dfn.*}}:
// MAPPING-NOT: {{.*kernel.*}}:

__device__ void dfn(int i) {}

__global__ void kernel(int i) { dfn(i); }

void host(void) {
  kernel<<<1, 1>>>(1);
}