This is an archive of the discontinued LLVM Phabricator instance.

clang/test/CodeGenHIP/amdgpu_hostcall.cpp
2–6	Am I right that we don't actually need two runs here? The test could be executed with one run, with the `#ifdef`s removed and, possibly, more `CHECK:` lines. I would suggest using the llvm/utils/update_cc_test_checks.py script for such tests.

One drawback of this approach is that it does not work for LLVM modules generated from assembly or programmatically, e.g. by TensorFlow XLA.

Another drawback is that if __ockl_call_host_function or __ockl_fprintf_stderr_begin is eliminated by the optimizer, the module flag is still kept. This could happen if users use printf in assert.

Is there a way to detect the use of hostcall later in the LLVM IR, other than by calls to these functions?

In D115283#3179651, @yaxunl wrote:

One drawback of this approach is that it does not work for LLVM modules generated from assembly or programmatically, e.g. by TensorFlow XLA.

Another drawback is that if __ockl_call_host_function or __ockl_fprintf_stderr_begin is eliminated by the optimizer, the module flag is still kept. This could happen if users use printf in assert.

Is there a way to detect the use of hostcall later in the LLVM IR, other than by calls to these functions?

Two other possible solutions that come to my mind:

Make a separate pass that would check if there is an instance of the ockl_hostcall_internal() function in the module and set the module flag if so.

Add a new "amdgpu_hostcall" function attribute. The "ockl_hostcall_internal()" function should be declared with __attribute__(("amdgpu_hostcall")) in the device libs. Then, in the CodeGen pass, we just set the "hidden_hostcall" kernel arg attribute if any function with "amdgpu_hostcall" is present. I think this is the best solution, since we don't rely on a particular function name and it would work well with any optimizations.

Please let me know what you think.
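
The difference between the module flag and the proposed function attribute can be sketched with toy C++ types. This is only an illustration under stated assumptions: `ToyModule`, `ToyFunction`, and the helper names are hypothetical and are not LLVM classes, and the model covers only dead-function removal and renaming, not inlining.

```cpp
#include <algorithm>
#include <set>
#include <string>
#include <vector>

// Toy stand-ins for IR entities; these are NOT LLVM classes.
struct ToyFunction {
  std::string Name;
  std::set<std::string> Attrs; // e.g. {"amdgpu_hostcall"}
  bool Reachable;              // false once nothing references the function
};

struct ToyModule {
  std::vector<ToyFunction> Functions;
  std::set<std::string> Flags; // module-level flags set by the frontend
};

// Models dead-function elimination: unreachable functions are dropped,
// but module flags are left untouched (the drawback raised in the thread).
void removeDeadFunctions(ToyModule &M) {
  M.Functions.erase(
      std::remove_if(M.Functions.begin(), M.Functions.end(),
                     [](const ToyFunction &F) { return !F.Reachable; }),
      M.Functions.end());
}

// Attribute-based detection: hostcall is needed only if a surviving
// function still carries the "amdgpu_hostcall" attribute.
bool needsHostcallByAttr(const ToyModule &M) {
  return std::any_of(M.Functions.begin(), M.Functions.end(),
                     [](const ToyFunction &F) {
                       return F.Attrs.count("amdgpu_hostcall") != 0;
                     });
}

// Flag-based detection: hostcall is needed if the frontend ever set the flag.
bool needsHostcallByFlag(const ToyModule &M) {
  return M.Flags.count("amdgpu_hostcall") != 0;
}
```

In this model the attribute check does not depend on the function's name, and it stops reporting hostcall once the tagged function has been removed, whereas the flag check keeps reporting it.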

kpyzhov added inline comments. Dec 8 2021, 11:02 AM

clang/test/CodeGenHIP/amdgpu_hostcall.cpp
2–6	Well, it may be executed with one run, but in that case we won't be able to catch an error if one of the functions is broken, because the 2nd one will set the module flag. Why do you think I should use the script for this test?

If we only need to check whether __ockl_hostcall_internal exists in the final module in LLVM codegen to determine whether we need the hostcall metadata, probably we don't even need a function attribute or even module flag.

OpenMP defines a weak symbol in the hostcall_invoke function. This is friendly to optimisation and dead-stripping, and needs no compiler support.

In D115283#3180836, @yaxunl wrote:

If we only need to check whether __ockl_hostcall_internal exists in the final module in LLVM codegen to determine whether we need the hostcall metadata, probably we don't even need a function attribute or even module flag.

Right, we used to do exactly that (just check at the CodeGen phase if 'ockl_hostcall_internal()' is present in the module), but then it turned out that it does not work with -fgpu-rdc since IPO may rename the 'ockl_hostcall_internal()'.

Not exactly that. The weak symbol isn't the function name, as that gets renamed or inlined.

In D115283#3181128, @JonChesterfield wrote:

Not exactly that. The weak symbol isn't the function name, as that gets renamed or inlined.

We discussed this before. As the code object ABI uses runtime metadata to represent hostcall_buffer, we need to check whether hostcall is needed by the IR.

This approach will require checking asm instructions inside a function to determine whether this function requires hostcall. It is hacky for IR representation.

In D115283#3181109, @kpyzhov wrote:

In D115283#3180836, @yaxunl wrote:

If we only need to check whether __ockl_hostcall_internal exists in the final module in LLVM codegen to determine whether we need the hostcall metadata, probably we don't even need a function attribute or even module flag.

Right, we used to do exactly that (just check at the CodeGen phase if 'ockl_hostcall_internal()' is present in the module), but then it turned out that it does not work with -fgpu-rdc since IPO may rename the 'ockl_hostcall_internal()'.

Sorry I forgot that.

Then I agree that a function attribute seems a better way to represent the hostcall requirement in IR. It is needed in both source and IR. This avoids checking hostcall requirements by function names. It works for all frontends as long as they use device libs or mark their own hostcall function with the attribute. It can also result in a more efficient code object if useless hostcall functions are removed by optimizers. Overall it will result in a cleaner IR representation.

In D115283#3182879, @yaxunl wrote:

In D115283#3181128, @JonChesterfield wrote:

Not exactly that. The weak symbol isn't the function name, as that gets renamed or inlined.

We discussed this before. As the code object ABI uses runtime metadata to represent hostcall_buffer, we need to check whether hostcall is needed by the IR.

This approach will require checking asm instructions inside a function to determine whether this function requires hostcall. It is hacky for IR representation.

There are two approaches here:
1/ Tag the function using inline asm and totally ignore it in the compiler. HSA/etc tests per-code-object if the symbol is present
2/ Tag the function (in source or in compiler), propagate information to llc, embed it in msgpack data, HSA/etc tests per-function if the field is present

2/ is somewhat useful if we elide the 8 byte slot of kernarg memory for functions that don't use it, otherwise it just increases work done by the runtime. Instead of checking for presence of one symbol (a hashtable lookup), it's a linear scan through msgpack data. We don't currently elide those 8 bytes, so right now this is making the compiler more complicated in exchange for making the runtime slower.

1/ has the benefit of being dead simple and totally compiler agnostic, and the cost of passing the 8 byte hostcall thing to every function in a code object that asked for it.
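
The runtime-side cost comparison between the two approaches can be sketched as follows. This is a toy model only: `ToyCodeObject`, `ToyKernelMeta`, and both helper functions are hypothetical names, not HSA or ROCm APIs, and the `HasHostcallArg` field merely stands in for a per-kernel entry in the msgpack metadata.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Toy model of a loaded code object; NOT a real HSA/ROCm data structure.
struct ToyKernelMeta {
  std::string Name;
  bool HasHostcallArg; // stands in for a per-kernel metadata field
};

struct ToyCodeObject {
  std::unordered_set<std::string> Symbols; // symbol table
  std::vector<ToyKernelMeta> Kernels;      // decoded per-kernel metadata
};

// Approach 1: a single hash lookup per code object for a marker symbol.
bool codeObjectUsesHostcall(const ToyCodeObject &CO,
                            const std::string &Marker) {
  return CO.Symbols.count(Marker) != 0;
}

// Approach 2: a linear scan over the metadata, collecting the kernels
// that actually request the hostcall buffer.
std::vector<std::string> kernelsUsingHostcall(const ToyCodeObject &CO) {
  std::vector<std::string> Result;
  for (const ToyKernelMeta &K : CO.Kernels)
    if (K.HasHostcallArg)
      Result.push_back(K.Name);
  return Result;
}
```

The first function answers a yes/no question for the whole code object; the second gives a per-kernel answer at the price of walking every metadata entry, which only pays off if the 8-byte slot is actually elided for kernels that do not need it.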

dfukalov added inline comments. Dec 9 2021, 11:56 AM

clang/test/CodeGenHIP/amdgpu_hostcall.cpp
2–6	Oh, I see, that indeed should be run with two separate checks. Regarding the script - it generates CHECK-NEXT sequences so we can be assured that substring "amdgpu_hostcall" is not caught from any other place. Of course, you can make the test stronger with hand-written `-NEXT` checks.

In D115283#3183034, @JonChesterfield wrote:

In D115283#3182879, @yaxunl wrote:

In D115283#3181128, @JonChesterfield wrote:

Not exactly that. The weak symbol isn't the function name, as that gets renamed or inlined.

We discussed this before. As the code object ABI uses runtime metadata to represent hostcall_buffer, we need to check whether hostcall is needed by the IR.

This approach will require checking asm instructions inside a function to determine whether this function requires hostcall. It is hacky for IR representation.

There are two approaches here:
1/ Tag the function using inline asm and totally ignore it in the compiler. HSA/etc tests per-code-object if the symbol is present
2/ Tag the function (in source or in compiler), propagate information to llc, embed it in msgpack data, HSA/etc tests per-function if the field is present

2/ is somewhat useful if we elide the 8 byte slot of kernarg memory for functions that don't use it, otherwise it just increases work done by the runtime. Instead of checking for presence of one symbol (a hashtable lookup), it's a linear scan through msgpack data. We don't currently elide those 8 bytes, so right now this is making the compiler more complicated in exchange for making the runtime slower.

1/ has the benefit of being dead simple and totally compiler agnostic, and the cost of passing the 8 byte hostcall thing to every function in a code object that asked for it.

Option 1 needs a code object ABI change that does not work with older ROCm runtimes. We need to maintain a certain level of stability and backward compatibility with old ROCm, as we have customers who use trunk clang/llvm with an older ROCm runtime.

We could discuss option 1 for the next version of code object format. However, before that happens, we still need to fix the bug within the current ABI.

I don't see a clear explanation of the motivation for this change - can you confirm my understanding or provide clarification? It looks like the issue is that D110337 caused a regression for cases where user code directly calls a device library function that requires hostcall services, right? If so, I think this issue highlights a weakness in the module flag approach implemented in D110337 - i.e., now the compiler needs to know every library function that may require hostcall services.

We have this same issue with our proprietary compiler, where we have our own device runtime library that makes use of printf. The prior approach of detecting the ockl_hostcall_internal function definition handles this case just fine (with the caveat of the potential LTO/inlining issues mentioned in D110337). But with the new approach to use the amdgpu_hostcall module flag, we need to modify our compiler to emit that flag for all of our own library calls, too.

Another concern with using a module flag is that it isn't as easily eliminated once it has been inserted, even if the call that triggered insertion is ultimately eliminated through optimization. E.g., a printf call might be eliminated if it is under a condition that can be statically evaluated to false...but the amdgpu_hostcall module flag may already have been inserted.

Is there an approach that can avoid the LTO/inlining issues, like implementing the hostcall implementation with an intrinsic to access the hostcall buffer pointer? - then the AMDGPU backend could easily detect use of that intrinsic to trigger setup of the implicit kernel arg, and inlining could not eliminate that intrinsic.
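
Why an intrinsic would survive inlining while a name-based check would not can be sketched with a toy IR. Everything here is hypothetical: `ToyInst`, `ToyFunc`, the helpers, and the intrinsic name `toy.hostcall.buffer` are made up for illustration and do not correspond to any existing LLVM intrinsic or API.

```cpp
#include <string>
#include <vector>

// Toy IR: an instruction is either a call to a named function or a named
// intrinsic. These types are illustrative only, not LLVM's.
struct ToyInst {
  bool IsIntrinsic;
  std::string Name;
};

struct ToyFunc {
  std::string Name;
  std::vector<ToyInst> Body;
};

// Inlines every call to Callee inside Caller by splicing in Callee's body.
void inlineCalls(ToyFunc &Caller, const ToyFunc &Callee) {
  std::vector<ToyInst> NewBody;
  for (const ToyInst &I : Caller.Body) {
    if (!I.IsIntrinsic && I.Name == Callee.Name)
      NewBody.insert(NewBody.end(), Callee.Body.begin(), Callee.Body.end());
    else
      NewBody.push_back(I);
  }
  Caller.Body = NewBody;
}

// Backend-style check: does any instruction use the given intrinsic?
bool usesIntrinsic(const std::vector<ToyFunc> &Module,
                   const std::string &Intrinsic) {
  for (const ToyFunc &F : Module)
    for (const ToyInst &I : F.Body)
      if (I.IsIntrinsic && I.Name == Intrinsic)
        return true;
  return false;
}

// Name-based check: is a function with this exact name still defined?
bool definesFunction(const std::vector<ToyFunc> &Module,
                     const std::string &Name) {
  for (const ToyFunc &F : Module)
    if (F.Name == Name)
      return true;
  return false;
}
```

After the hostcall function is inlined into a kernel and its definition dropped, `definesFunction` no longer finds it, but `usesIntrinsic` still sees the intrinsic inside the kernel body.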

The asm variable used by ROCm OpenMP is zero overhead, needs no compiler support, and works exactly as one would wish under inlining or code elimination. The main arguments against that approach seem to be that it's an ABI break, much like this patch was, and that it is per-code-object instead of per-function, which I still think is a benefit.

sameerds added a subscriber: sameerds. Jan 31 2022, 8:20 PM

sameerds added inline comments.

clang/lib/CodeGen/TargetInfo.cpp
9434	Just to confirm what others have probably discovered, the only function whose presence should be checked is `__ockl_hostcall_internal`. All others are wrappers that are free to disappear during optimization.

Revision Contents

Path	Size
clang/lib/CodeGen/TargetInfo.cpp	23 lines
clang/test/CodeGenHIP/amdgpu_hostcall.cpp	48 lines

Diff 392577

clang/lib/CodeGen/TargetInfo.cpp

[... 9,188 lines not shown ...]
   llvm::SyncScope::ID getLLVMSyncScopeID(const LangOptions &LangOpts,
                                          llvm::AtomicOrdering Ordering,
                                          llvm::LLVMContext &Ctx) const override;
   llvm::Function *
   createEnqueuedBlockKernel(CodeGenFunction &CGF,
                             llvm::Function *BlockInvokeFunc,
                             llvm::Value *BlockLiteral) const override;
   bool shouldEmitStaticExternCAliases() const override;
   void setCUDAKernelCallingConvention(const FunctionType *&FT) const override;

+  virtual void checkFunctionCallABI(CodeGenModule &CGM, SourceLocation CallLoc,
+                                    const FunctionDecl *Caller,
+                                    const FunctionDecl *Callee,
+                                    const CallArgList &Args) const override;
 };
 }

 static bool requiresAMDGPUProtectedVisibility(const Decl *D,
                                               llvm::GlobalValue *GV) {
   if (GV->getVisibility() != llvm::GlobalValue::HiddenVisibility)
     return false;

[... 207 lines not shown ...]
 }

 void AMDGPUTargetCodeGenInfo::setCUDAKernelCallingConvention(
     const FunctionType *&FT) const {
   FT = getABIInfo().getContext().adjustFunctionType(
       FT, FT->getExtInfo().withCallingConv(CC_OpenCLKernel));
 }

+void AMDGPUTargetCodeGenInfo::checkFunctionCallABI(CodeGenModule &CGM,
    Lint: Pre-merge checks: clang-format: please reformat the code. Suggested form:
        void AMDGPUTargetCodeGenInfo::checkFunctionCallABI(
            CodeGenModule &CGM, SourceLocation CallLoc, const FunctionDecl *Caller,
            const FunctionDecl *Callee, const CallArgList &Args) const {
    JonChesterfield: Doesn't seem to check anything. Can we tag this patch up module flags onto an existing target specific patch up module hook?
+                                                   SourceLocation CallLoc,
+                                                   const FunctionDecl *Caller,
+                                                   const FunctionDecl *Callee,
+                                                   const CallArgList &Args) const
+{
+  // Set the "amdgpu_hostcall" module flag if "Callee" is a library function
+  // that uses AMDGPU hostcall mechanism.
+  if (Callee &&
    Lint: Pre-merge checks: clang-format: please reformat the code. Suggested form:
        if (Callee && (Callee->getName() == "__ockl_call_host_function" ||
                       Callee->getName() == "__ockl_fprintf_stderr_begin")) {
+      (Callee->getName() == "__ockl_call_host_function" ||
+       Callee->getName() == "__ockl_fprintf_stderr_begin")) {
+    llvm::Module &M = CGM.getModule();
+    if (!M.getModuleFlag("amdgpu_hostcall")) {
+      M.addModuleFlag(llvm::Module::Override, "amdgpu_hostcall", 1);
+    }
+  }
+}

 //===----------------------------------------------------------------------===//
 // SPARC v8 ABI Implementation.
 // Based on the SPARC Compliance Definition version 2.4.1.
 //
 // Ensures that complex values are passed in registers.
 //
 namespace {
 class SparcV8ABIInfo : public DefaultABIInfo {
[... 2,016 lines not shown ...]

clang/test/CodeGenHIP/amdgpu_hostcall.cpp

This file was added.


// RUN: %clang_cc1 -triple amdgcn-amd-amdhsa -x hip -emit-llvm -fcuda-is-device -DFN_HOSTCALL \
// RUN: -o - %s | FileCheck --enable-var-scope %s

// RUN: %clang_cc1 -triple amdgcn-amd-amdhsa -x hip -emit-llvm -fcuda-is-device -DFN_PRINTF \
// RUN: -o - %s | FileCheck --enable-var-scope %s

// CHECK: !llvm.module.flags
// CHECK: "amdgpu_hostcall"


typedef unsigned long int uint64_t;

#define __device__ __attribute__((device))

template<typename T, unsigned int n> struct HIP_vector_base;

template<typename T>
struct HIP_vector_base<T, 2> { using Native_vec_ = T __attribute__((ext_vector_type(2))); };


extern "C" __device__ uint64_t __ockl_fprintf_stderr_begin();

extern "C" __device__ HIP_vector_base<long long, 2>::Native_vec_ __ockl_call_host_function(
    uint64_t fptr, uint64_t arg0, uint64_t arg1, uint64_t arg2, uint64_t arg3, uint64_t arg4, uint64_t arg5, uint64_t arg6);


#ifdef FN_HOSTCALL
__device__ void fn_hostcall(uint64_t fptr, uint64_t* retval0, uint64_t* retval1) {
  uint64_t arg0 = (uint64_t)fptr;
  uint64_t arg1 = 0;
  uint64_t arg2 = 0;
  uint64_t arg3 = 0;
  uint64_t arg4 = 0;
  uint64_t arg5 = 0;
  uint64_t arg6 = 0;
  uint64_t arg7 = 0;

  __ockl_call_host_function(arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7);
}
#endif

#ifdef FN_PRINTF
__device__ void fn_printf() {
  auto msg = __ockl_fprintf_stderr_begin();
}
#endif


[AMDGPU] Set "amdgpu_hostcall" module flag if an AMDGPU function has calls to device lib functions that use hostcalls.
Needs Review • Public
