Page MenuHomePhabricator

[OpenMP] Codegen aggregate for outlined function captures
Needs ReviewPublic

Authored by ggeorgakoudis on May 8 2021, 8:16 AM.

Details

Reviewers
jdoerfert
jhuber6
Summary

Parallel regions are outlined as functions with capture variables explicitly generated as distinct parameters in the function's argument list. That complicates the fork_call interface in the OpenMP runtime: (1) the fork_call is variadic since there is a variable number of arguments to forward to the outlined function, (2) wrapping/unwrapping arguments happens in the OpenMP runtime, which is sub-optimal, has been a source of ABI bugs, and has a hardcoded limit (16) in the number of arguments, (3) forwarded arguments must cast to pointer types, which complicates debugging. This patch avoids those issues by aggregating captured arguments in a struct to pass to the fork_call.

Diff Detail

Event Timeline

ggeorgakoudis created this revision.May 8 2021, 8:16 AM
ggeorgakoudis requested review of this revision.May 8 2021, 8:16 AM
Herald added a project: Restricted Project. · View Herald Transcript
ggeorgakoudis edited the summary of this revision. (Show Details)May 8 2021, 8:17 AM
ggeorgakoudis edited the summary of this revision. (Show Details)May 22 2021, 9:55 AM

Use non-aggregate captured args and outlining for ordered codegen

This allows us to remove the switch in the device runtime, right?

This allows us to remove the switch in the device runtime, right?

Yes, with a complication: for combined directives of worksharing loops (distributed parallel for) clang emits the lower bound and upper bound as distinct parameters besides the global tid, bound tid and the captured variables (aggregated in the struct of this patch). We will need to have a flag in parallel_51 to unwrap the global args array for this case.

Rebase to NewGlobalization and amend

Herald added a project: Restricted Project. · View Herald TranscriptTue, Jun 15, 10:31 AM
lebedev.ri added a subscriber: lebedev.ri.EditedTue, Jun 15, 10:47 AM

(This is not offload-specific, right?)
This does not bring any compatibility issues, right?

(This is not offload-specific, right?)
This does not bring any compatibility issues, right?

It's not offload-specific, so the patch applies to host OpenMP too. The patch, as is, does not introduce any incompatibilities, the OpenMP runtime parallel call interface stays the same and the only change is the aggregation of the outlined function arguments in a struct. However, it makes sense to simplify the OpenMP parallel call runtime interface to avoid variadic arguments. This change is not necessary for this patch, but I'm considering to include it too.

Update non-remark tests

(answered now, I forgot to submit this yesterday -.- )

(This is not offload-specific, right?)

It is not. It applies to all parallel regions. We still use the variadic call on the CPU but we can now replace it with a non-variadic interface.

This does not bring any compatibility issues, right?

We'll keep the old interface (kmpc_fork_call) for the host (libomp) and code that uses it will continue to work. New code will use it also in a compatible way
because we change caller and callee of kmpc_fork_call to adjust for the aggregate. This will also hold as the new non-variadic interface is introduced.

The device side is simpler, we are only compatible with a device runtime that machtes (as we link it into the app), so there is no compatibility issue to be expected.

We used this kind of codegen initially but later found out that it causes a large overhead when gathering pointers into a record. What about hybrid scheme where the first args are passed as arguments and others (if any) are gathered into a record?

We used this kind of codegen initially but later found out that it causes a large overhead when gathering pointers into a record. What about hybrid scheme where the first args are passed as arguments and others (if any) are gathered into a record?

I'm confused, maybe I misunderstand the problem. The parallel function arguments need to go from the main thread to the workers somehow, I don't see how this is done w/o a record. This patch makes it explicit though.

We used this kind of codegen initially but later found out that it causes a large overhead when gathering pointers into a record. What about hybrid scheme where the first args are passed as arguments and others (if any) are gathered into a record?

I'm confused, maybe I misunderstand the problem. The parallel function arguments need to go from the main thread to the workers somehow, I don't see how this is done w/o a record. This patch makes it explicit though.

Pass it in a record for workers only? And use a hybrid scheme for all other parallel regions.

cchen added a subscriber: cchen.Fri, Jun 18, 3:45 PM
josemonsalve2 removed a subscriber: josemonsalve2.
josemonsalve2 added a subscriber: josemonsalve2.