This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP] Introduce an assumption to ignore possible external callers
Needs ReviewPublic

Authored by jdoerfert on Dec 10 2020, 4:47 PM.

Details

Summary

In this patch we utilize the new assumption function attributes to
improve the OpenMPOpt behavior in the presence of externally visible
functions containing parallel regions when we compile for the GPU.

The existing OpenMP target state machine optimization will replace the
use of parallel region function pointers with dedicated handles if all
uses of the parallel region function pointer can be replaced. External
callers that are inside of OpenMP target regions will require the
function pointer to be used as there will be an indirect call. We now
allow the user to assume that external callers do not exist, or that
they might exist but not call from within an OpenMP (target) region.
The assumption in the IR is "omp_no_external_caller_in_target_region"
which can be created with OpenMP assumes

omp begin assumes ext_omp_no_external_caller_in_target_region
void foo() {
  #pragma omp parallel
  { ... }
}
omp end assumes

or the generic clang assume attribute

void foo() __attribute__((assume("omp_no_external_caller_in_target_region"))) {
  #pragma omp parallel
  { ... }
}

Diff Detail

Event Timeline

jdoerfert created this revision.Dec 10 2020, 4:47 PM
jdoerfert requested review of this revision.Dec 10 2020, 4:47 PM
Herald added a project: Restricted Project. · View Herald TranscriptDec 10 2020, 4:47 PM
jdoerfert updated this revision to Diff 311105.Dec 10 2020, 8:26 PM

Add remark and documentation

Herald added projects: Restricted Project, Restricted Project. · View Herald Transcript
Herald added a subscriber: cfe-commits. · View Herald Transcript
jdoerfert updated this revision to Diff 311106.Dec 10 2020, 8:37 PM

Add remark summary

Is this for cases where we are compiling a subset of the target code, i.e. without link time optimisation?

It's interesting that we might want a static function on the gpu and an external one on the cpu. The user could presumably make it static and provide a different, host-only function which calls it as a workaround.

I'm nervous about this one because it looks like a property the compiler should already be able to derive (given LTO, anyway), and I don't see how we can diagnose when the user annotation is inaccurate, e.g. because of changes elsewhere in the codebase after it was added.

Is this for cases where we are compiling a subset of the target code, i.e. without link time optimisation?

Upstream cannot to LTO on any target code as of now :(

It's interesting that we might want a static function on the gpu and an external one on the cpu. The user could presumably make it static and provide a different, host-only function which calls it as a workaround.

If they have control. Code in a header for example might not be changeable. Though I will add it to the remark description for sure.

I'm nervous about this one because it looks like a property the compiler should already be able to derive (given LTO, anyway), and I don't see how we can diagnose when the user annotation is inaccurate, e.g. because of changes elsewhere in the codebase after it was added.

This is always an issue with assumes/assertions/annotations/... Unsure what to say. I'd be fine with retiring this one as soon as we can do LTO upstream, and I am all in favor of making that the only/default mode then too.

jdoerfert updated this revision to Diff 312535.Dec 17 2020, 9:27 AM

Rebase on top of D93439

Ah, this may be one of the things aomp changed for nvptx. We did something around linking but I've never looked into exactly what. In that case fine by me - it's a sharp edge, but I'd expect it to make a difference in benchmarks until whole program optimisation is online.

I think the strategy we want to converge on is equivalent to:

N openmp source files -> bitcode -> llvm-link -> mlink-builtin-bitcode=nvptx.devicertl -> opt -> ptxas

which interacts poorly with the compile/link separation. So presumably we should blend in the devicertl in an lld plugin, as that's when users expect the whole program to be available.