This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP] Emit offloading entries for indirect target variables
Closed · Public

Authored by jhuber6 on Aug 11 2023, 10:44 AM.

Details

Summary

OpenMP 5.1 allows the indirect clause on declare target functions, see
https://www.openmp.org/spec-html/5.1/openmpsu70.html#x98-1080002.14.7.
The intended use of this is to permit calling device functions via their
associated host pointer. The first step toward this is building a map that
associates each host pointer with its device function. Doing this requires
the same offloading entry handling we use for other kernels and globals.
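
For readers unfamiliar with the clause, here is a minimal usage sketch of what the feature is meant to enable. The names add_one and call_through_pointer are hypothetical and do not come from the patch. Since this patch only emits the offloading entries and the runtime does not yet consume them, this shows the intended usage, not behavior this revision guarantees on a device.

```cpp
// Sketch only: offloading needs a compiler with OpenMP 5.1 `indirect`
// support. Without -fopenmp the pragmas are ignored and everything runs on
// the host.

// Mark add_one as callable on the device through its host address.
#pragma omp begin declare target indirect
int add_one(int x) { return x + 1; }
#pragma omp end declare target

// Hypothetical helper: passes a host function pointer into a target region
// and calls through it there.
int call_through_pointer(int (*fn)(int), int value) {
  int result = 0;
#pragma omp target map(tofrom : result) firstprivate(fn, value)
  { result = fn(value); }
  return result;
}
```

A caller would write call_through_pointer(&add_one, 41), where &add_one is the host address of the function.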

We intentionally emit a new global on the device side. Although it's
possible to look up the device function's address directly, this would
require changing the visibility and would prevent us from making static
functions indirect. Also, the CUDA toolchain will optimize out unused
functions and using a global prevents that. The downside is that the
runtime will need to read the global and copy its value, but there
shouldn't be any other costs.
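
The idea in this paragraph can be illustrated with a host-only sketch. It assumes the emitted global is simply a variable holding the function's address. The names scale, hypothetical_indirect_scale and read_device_address are made up for illustration and are not the symbols Clang generates.

```cpp
#include <cstdint>

namespace {
// Internal linkage: the function itself is not visible by name outside this
// translation unit, like a static function.
int scale(int x) { return 3 * x; }
} // namespace

// Hypothetical stand-in for the compiler-emitted global. It has external
// linkage, so it can be looked up by name, and because it references scale
// the function cannot be discarded as unused.
int (*hypothetical_indirect_scale)(int) = &scale;

// Hypothetical stand-in for the runtime step: read the global and copy out
// the address it holds.
std::uintptr_t read_device_address(int (*const *global)(int)) {
  return reinterpret_cast<std::uintptr_t>(*global);
}
```

The extra read of the global is the cost the summary mentions. The function keeps its original visibility.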

Note that this patch only performs the codegen; this new offloading entry
type is currently unused and will be ignored by the runtime.

Event Timeline

jhuber6 created this revision. Aug 11 2023, 10:44 AM
Herald added a project: Restricted Project. Aug 11 2023, 10:44 AM
jhuber6 requested review of this revision. Aug 11 2023, 10:44 AM
Herald added projects: Restricted Project, Restricted Project. Aug 11 2023, 10:44 AM

calling device functions via their associated host pointer

What does this mean? Defining a function foo such that the host and each individual target each have their own machine code for it, such that &foo on the host can be copied over to the target and then invoked to mean call the function on the local target with the same name?

If so, calling through the pointer &foo on the GPU doing a logarithmic search through a table to choose a function address to branch to sounds like something that will codegen into very slow code. Does it do that search on every call?

Is there an ambition to have &foo on the host and &foo on the (each) target return the same value, in the pointer equality sense?

Searching the linked spec for indirect finds the following

If the indirect clause is present and invoked-by-fptr evaluates to true, any procedures that appear in a to clause on the directive may be called with an indirect device invocation. If the indirect clause is present and invoked-by-fptr does not evaluate to true, any procedures that appear in a to clause on the directive may not be called with an indirect device invocation. Unless otherwise specified by an indirect clause, procedures may not be called with an indirect device invocation.

Which tells me that the indirect clause means procedures can be called with an indirect device invocation. Searching for the expression "indirect device invocation" finds that paragraph and nothing else. So... where does the spec say what this thing is?

That's exactly what it means. The mapping is only done for declare target functions with indirect declared on them. As for the indirect calls themselves, I think @jdoerfert is implementing some specialization. He just asked me to implement this since it's related to copying of virtual classes to the device.

Only indirect calls on the device will search the table. The spec does not say how it should be implemented. One could do the translation at the target region when it is mapped on the host, but this will not handle all cases.

If calling an indirect function pointer on the GPU requires a table lookup (keyed by host function addresses, which I didn't think we knew at GPU compile time), and we cannot distinguish indirect function pointers from ordinary function pointers, then this feature must send _every_ indirect call on the GPU through the table search in case it hits in the table, and then branch to the original pointer value if it doesn't.

So if we have a few indirect openmp functions and then we call a function which makes indirect calls of a function pointer which happened to be unrelated to this feature, it's going to search that table anyway. Say qsort on a non-inlined comparison function.

This feature, as I understand it so far, therefore induces a global slowdown on every call to an unknown function. Slow function calls are a bad thing.

What am I missing?
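
To make the concern concrete, here is a sketch of the scheme being described, under assumptions: a table sorted by host address, a binary search on each call, and a fallback to the unmodified pointer on a miss. IndirectEntry and translate_callee are hypothetical names, and nothing here is taken from the patch or the runtime.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical table entry pairing a host function address with the address
// of the same function on the current device.
struct IndirectEntry {
  std::uintptr_t host_addr;
  std::uintptr_t device_addr;
};

// Returns the device address if `callee` is a known host address, otherwise
// returns `callee` unchanged. `table` must be sorted by host_addr.
std::uintptr_t translate_callee(const std::vector<IndirectEntry> &table,
                                std::uintptr_t callee) {
  // Binary search: O(log n) comparisons for every call through a pointer.
  auto it = std::lower_bound(table.begin(), table.end(), callee,
                             [](const IndirectEntry &e, std::uintptr_t key) {
                               return e.host_addr < key;
                             });
  if (it != table.end() && it->host_addr == callee)
    return it->device_addr; // hit: pointer came from the host
  return callee;            // miss: ordinary device pointer, search was wasted
}
```

In this sketch an unrelated function pointer, such as a qsort comparator, still pays for the search before falling through to the miss path.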

arsenm added a subscriber: arsenm. Aug 14 2023, 3:02 PM
arsenm added inline comments.
clang/lib/CodeGen/CGOpenMPRuntime.cpp
1900–1901

not a huge fan of std::optional<pointer>

1921

target global address space

1926

isn't there a store size?

JonChesterfield added a comment. Edited Aug 14 2023, 3:10 PM

Ok, I'm really sure this needs to be reflected in the type system. &foo for some target indirect foo gets to be larger than 8 bytes and we use a different calling convention for it. Otherwise, however we carve this, the type erasure is going to make unrelated calls acquire the dynamic search overhead.

e.g. store an integer in [0, N) in the 'function pointer' and use that as the index into a jump table. Host and various targets each get their own jump table. O(1) overhead, zero cost for existing/normal indirect calls.
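
A rough sketch of that alternative, assuming a distinct handle type that stores an index and one table per target. IndirectHandle, kLocalJumpTable and call_indirect are hypothetical names, and only a single table is shown here.

```cpp
#include <array>
#include <cstddef>

// Hypothetical distinct type for an "indirect" function handle: it carries an
// index in [0, N) rather than a code address, so the type system separates it
// from ordinary function pointers.
struct IndirectHandle {
  std::size_t index;
};

using Fn = int (*)(int);

int twice(int x) { return 2 * x; }
int negated(int x) { return -x; }

// Each target (and the host) would have its own table with the same ordering,
// holding that target's addresses for the same functions.
constexpr std::size_t kNumIndirect = 2;
const std::array<Fn, kNumIndirect> kLocalJumpTable = {&twice, &negated};

// O(1) dispatch: index the local table, then call. Ordinary function
// pointers never go through this path.
int call_indirect(IndirectHandle handle, int arg) {
  return kLocalJumpTable.at(handle.index)(arg);
}
```

The handle means the same thing on every target because only the index is shared. Each table supplies its own local address for that index.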

jhuber6 marked an inline comment as done. Aug 14 2023, 3:43 PM
jhuber6 added inline comments.
clang/lib/CodeGen/CGOpenMPRuntime.cpp
1900–1901

This is pretty far entrenched in the Clang handling for this attribute so I don't intend to change it here.

1926

Yeah I can use that instead.

jhuber6 updated this revision to Diff 550140. Aug 14 2023, 4:53 PM

Address comments

ThorBl added a subscriber: ThorBl. Aug 16 2023, 7:27 AM
jdoerfert accepted this revision. Aug 24 2023, 3:31 PM
This revision is now accepted and ready to land. Aug 24 2023, 3:31 PM