This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Set "amdgpu_hostcall" module flag if an AMDGPU function has calls to device lib functions that use hostcalls.
Needs ReviewPublic

Authored by kpyzhov on Dec 7 2021, 1:53 PM.

Details

Reviewers
yaxunl

Diff Detail

Event Timeline

kpyzhov created this revision.Dec 7 2021, 1:53 PM
kpyzhov requested review of this revision.Dec 7 2021, 1:53 PM
Herald added a project: Restricted Project. · View Herald TranscriptDec 7 2021, 1:53 PM

Needs a test.

Needs a test.

Yes, good point, thanks!

JonChesterfield added inline comments.
clang/lib/CodeGen/TargetInfo.cpp
9425

Doesn't seem to check anything. Can we tag this patch up module flags onto an existing target specific patch up module hook?

kpyzhov updated this revision to Diff 392577.Dec 7 2021, 3:50 PM
dfukalov added inline comments.Dec 8 2021, 1:37 AM
clang/test/CodeGenHIP/amdgpu_hostcall.cpp
2–6

Am I right we don't actually need two runs here, the test may be executed with one run, removed #ifdefs and, possible, multiplied CHECK: lines?
I would suggest to use the llvm/utils/update_cc_test_checks.py script in such tests.

yaxunl added a comment.Dec 8 2021, 7:52 AM

One drawback of this approach is that it does not work for LLVM modules generated from assembly or programmatically e.g. Tensorflow XLA.

Another drawback is that if __ockl_call_host_function or __ockl_fprintf_stderr_begin are eliminated by optimizer, the module flag is still kept. This could happen if users use printf in assert.

Is there a way to detect use of hostcall later in LLVM IR not by calling of these functions?

One drawback of this approach is that it does not work for LLVM modules generated from assembly or programmatically e.g. Tensorflow XLA.

Another drawback is that if __ockl_call_host_function or __ockl_fprintf_stderr_begin are eliminated by optimizer, the module flag is still kept. This could happen if users use printf in assert.

Is there a way to detect use of hostcall later in LLVM IR not by calling of these functions?

Two other possible solutions that come to my mind:

  1. Make a separate pass that would check if there is an instance of the ockl_hostcall_internal() function in the module and set the module flag if so.
  1. Add a new "amdgpu_hostcall" function attribute. The "ockl_hostcall_internal()" function should be declared with attribute(("amdgpu_hostcall")) in the device libs. Then, in the Code Gen pass, we just set the "hidden_hostcall" kernel arg attribute if any function with "amdgpu_hostcall" is present. I think this is the best solution since we don't rely on the particular function name, and it would work ideally with any optimizations.

Please let me know what you think.

kpyzhov added inline comments.Dec 8 2021, 11:02 AM
clang/test/CodeGenHIP/amdgpu_hostcall.cpp
2–6

Well, it may be executed with one run, but in that case we won't be able to catch an error if one of the functions is broken, because the 2nd one will set the module flag.
Why do you think I should use the script for this test?

yaxunl added a comment.Dec 8 2021, 1:21 PM

If we only need to check whether __ockl_hostcall_internal exists in the final module in LLVM codegen to determine whether we need the hostcall metadata, probably we don't even need a function attribute or even module flag.

Openmp defines a weak symbol in the hostcall_invoke function. Optimisation and deadstripping friendly, no compiler support necessary.

kpyzhov added a comment.EditedDec 8 2021, 2:06 PM

If we only need to check whether __ockl_hostcall_internal exists in the final module in LLVM codegen to determine whether we need the hostcall metadata, probably we don't even need a function attribute or even module flag.

Right, we used to do exactly that (just check at the CodeGen phase if 'ockl_hostcall_internal()' is present in the module), but then it turned out that it does not work with -fgpu-rdc since IPO may rename the 'ockl_hostcall_internal()'.

Not exactly that. The weak symbol isn't the function name, as that gets renamed or inlined.

yaxunl added a comment.EditedDec 9 2021, 7:25 AM

Not exactly that. The weak symbol isn't the function name, as that gets renamed or inlined.

We discussed this before. As code object ABI use runtime metadata to represent hostcall_buffer, we need to check whether hostcall is needed by IR.

This approach will require checking asm instructions inside a function to determine whether this function requires hostcall. It is hacky for IR representation.

yaxunl added a comment.Dec 9 2021, 7:34 AM

If we only need to check whether __ockl_hostcall_internal exists in the final module in LLVM codegen to determine whether we need the hostcall metadata, probably we don't even need a function attribute or even module flag.

Right, we used to do exactly that (just check at the CodeGen phase if 'ockl_hostcall_internal()' is present in the module), but then it turned out that it does not work with -fgpu-rdc since IPO may rename the 'ockl_hostcall_internal()'.

Sorry I forgot that.

Then I agree that a function attribute seems a better way to represent hostcall requirement in IR. It is needed in both source and IR. This avoids checking hostcall requirements by function names. It works for all frontends as long as they use device libs or mark their own hostcall function with the attribute. It also can result in more efficient code object if useless hostcall functions are removed by optimizers. Overall it will result in a cleaner IR representation.

Not exactly that. The weak symbol isn't the function name, as that gets renamed or inlined.

We discussed this before. As code object ABI use runtime metadata to represent hostcall_buffer, we need to check whether hostcall is needed by IR.

This approach will require checking asm instructions inside a function to determine whether this function requires hostcall. It is hacky for IR representation.

There are two approaches here:
1/ Tag the function using inline asm and totally ignore it in the compiler. HSA/etc tests per-code-object if the symbol is present
2/ Tag the function (in source or in compiler), propagate information to llc, embed it in msgpack data, HSA/etc tests per-function if the field is present

2/ is somewhat useful if we elide the 8 byte slot of kernarg memory for functions that don't use it, otherwise it just increases work done by the runtime. Instead of checking for presence of one symbol (a hashtable lookup), it's a linear scan through msgpack data. We don't currently elide those 8 bytes, so right now this is making the compiler more complicated in exchange for making the runtime slower.

1/ has the benefit of being dead simple and totally compiler agnostic, and the cost of passing the 8 byte hostcall thing to every function in a code object that asked for it.

dfukalov added inline comments.Dec 9 2021, 11:56 AM
clang/test/CodeGenHIP/amdgpu_hostcall.cpp
2–6

Oh, I see, that indeed should be run with two separate checks.

Regarding the script - it generates CHECK-NEXT sequences so we can be assured that substring "amdgpu_hostcall" is not caught from any other place. Of course, you can make the test stronger with hand-written -NEXT checks.

Not exactly that. The weak symbol isn't the function name, as that gets renamed or inlined.

We discussed this before. As code object ABI use runtime metadata to represent hostcall_buffer, we need to check whether hostcall is needed by IR.

This approach will require checking asm instructions inside a function to determine whether this function requires hostcall. It is hacky for IR representation.

There are two approaches here:
1/ Tag the function using inline asm and totally ignore it in the compiler. HSA/etc tests per-code-object if the symbol is present
2/ Tag the function (in source or in compiler), propagate information to llc, embed it in msgpack data, HSA/etc tests per-function if the field is present

2/ is somewhat useful if we elide the 8 byte slot of kernarg memory for functions that don't use it, otherwise it just increases work done by the runtime. Instead of checking for presence of one symbol (a hashtable lookup), it's a linear scan through msgpack data. We don't currently elide those 8 bytes, so right now this is making the compiler more complicated in exchange for making the runtime slower.

1/ has the benefit of being dead simple and totally compiler agnostic, and the cost of passing the 8 byte hostcall thing to every function in a code object that asked for it.

Option 1 needs code object ABI change that does not work with old ROCm runtime. We need to maintain certain stability and backward compatibility with old ROCm as we have customers who use trunk clang/llvm with older ROCm runtime.

We could discuss option 1 for the next version of code object format. However, before that happens, we still need to fix the bug within the current ABI.

I don't see a clear explain the motivation for this change - can you confirm my understanding or provide clarification? It looks like the issue is that D110337 caused a regression for cases when user code directly calls a device library function that requires hostcall services, right? If so, I think this issue highlights a weakness in the module flag approach implemented in D110337 - i.e., now the compiler needs to know every library function that may require hostcall services.

We have this same issue with our proprietary compiler, where we have our own device runtime library that makes use of printf. The prior approach of detecting the ockl_hostcall_internal function definition handles this case just fine (with the caveat of the potential LTO/inlining issues mentioned in D110337). But with the new approach to use the amdgpu_hostcall module flag, we need to modify our compiler to emit that flag for all of our own library calls, too.

Another concern with using a module flag is that is isn't as easily eliminated once it has been inserted, even if the call that triggered insertion is ultimately eliminated through optimization. E.g., a printf call might be eliminated if it is under a condition that can be statically evaluated to false...but, the amdgpu_hostcall module flag may already have been inserted.

Is there an approach that can avoid the LTO/inlining issues, like implementing the hostcall implementation with an intrinsic to access the hostcall buffer pointer? - then the AMDGPU backend could easily detect use of that intrinsic to trigger setup of the implicit kernel arg, and inlining could not eliminate that intrinsic.

JonChesterfield added a comment.EditedJan 21 2022, 10:22 AM

The asm variable used by rocm openmp is zero overhead, needs no compiler support and works exactly as one would wish under inlining or code elimination. The main argument against that approach seems to be it's an abi break, much like this patch was, and that it is per-code-object instead of per-function, which I still think is a benefit.

sameerds added inline comments.
clang/lib/CodeGen/TargetInfo.cpp
9434

Just to confirm what others have probably disovered, the only function whose presence should be checked is `__ockl_hostcall_internal`. All others are wrappers that are free to disappear during optimization.