Page MenuHomePhabricator

[CodeExtractor] Enable partial aggregate arguments
Needs ReviewPublic

Authored by ggeorgakoudis on Feb 17 2021, 3:43 AM.

Details

Reviewers
jdoerfert
vsk
Summary

Enable CodeExtractor to construct output functions that partially aggregate inputs/outputs in their argument list. A use case is the OMPIRBuilder to create outlined functions for parallel regions that aggregate in a struct the payload variables for the region while passing as scalars thread and bound identifiers.

Diff Detail

Event Timeline

ggeorgakoudis created this revision.Feb 17 2021, 3:43 AM
ggeorgakoudis requested review of this revision.Feb 17 2021, 3:43 AM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 17 2021, 3:43 AM
ggeorgakoudis edited the summary of this revision. (Show Details)Feb 17 2021, 3:48 AM
ggeorgakoudis added reviewers: jdoerfert, vsk.
vsk added a comment.Mar 1 2021, 10:33 AM

Sorry it's taken me so long to get to this.

partially aggregate inputs/outputs in their argument list

Could you explain what this means, and what the pros/cons might be compared to any alternatives? It'd also help to see a test case.

ggeorgakoudis added a comment.EditedMar 4 2021, 12:22 AM
In D96854#2594840, @vsk wrote:

Sorry it's taken me so long to get to this.

partially aggregate inputs/outputs in their argument list

Could you explain what this means, and what the pros/cons might be compared to any alternatives? It'd also help to see a test case.

Hi @vsk,

No problem! Let me make the use case of the OpenMP IR builder concrete, I'll do some simplifications that do not affect the point. Currently, the OMPIRBuilder uses CodeExtractor to outline an OpenMP callback as:

void omp.outlined(int global_tid, int bound_tid, int* arg0, int* arg1, ..., int* argn)

where global_tid, bound_tid are OpenMP runtime filled values passed to the outlined functions and arg0, arg1, ...,argn are inputs/outputs found by CodeExtractor. To implement parallel execution calling the outlined call function, OMPIRBuilder emits the call to the OpenMP runtime fork_join function, which use ellipsis to pass the variadic number of parameters to the OpenMP runtime:

__kmpc_fork_call(int argc, omp.outlined, ..,)

so the ellipsis contains the arguments to the outlined function (arg0, arg1, ..., argn). The OpenMP runtime library fills the values for the preceding arguments global_tid, bound_tid when calling omp.outlined and forwards the rest of the arguments through a cumbersome dispatch function that unwraps the variadic arguments and uses a switch-case to call the function pointer of the callback as in:

switch(argc) {
  case 1: fp_to_omp.outlined(global_tid, bound_tid, vararg[0]); return;
  case 2: fp_to_omp.outlined(global_tid, bound_tid, vararg[0], vararg[1]); return;
  ...
}

We would like to remove this ellipsis interface because it creates various problems: there is a hardcoded limit on the number of arguments that the runtime forwards (limited by the switch-case style unwrapping), it has been the source of ABI bugs, and makes hard to analyze and optimize OpenMP code in LLVM. For this we would like to aggregate the input/output arguments to the outlined function but leave the runtime-filled arguments unaggregated:

void omp.outlined(int global_tid, int bound_tid, struct structArg)

This patch enables to exclude arguments from the aggregate by extending extractCodeRegion in CodeExtractor with a parameter of which arguments to exclude (assuming AggregateArgs has been set when creating the CodeExtractor). In this specific use case for OpenMP, the exclude arguments are global_tid and bound_tid.

I have added a unit test that tests this functionality. When we complete the change in OMPIRBuilder to use partial aggregation from CodeExtractor we will add also IR tests that will test this functionality too.

vsk added a comment.Tue, Mar 23, 11:13 AM

Thanks for explaining. I'd suggest making ExcludedAggArgs part of a CodeExtractor instances internal state: e.g. the client may call CE.addArgExludedFromAggregate(Value *) some number of times before CE.extractCodeRegion(). This way, the client doesn't need to maintain a SetVector, and the rest of the interface isn't polluted with an option that's specific to the AggregateArg case.

In D96854#2645458, @vsk wrote:

Thanks for explaining. I'd suggest making ExcludedAggArgs part of a CodeExtractor instances internal state: e.g. the client may call CE.addArgExludedFromAggregate(Value *) some number of times before CE.extractCodeRegion(). This way, the client doesn't need to maintain a SetVector, and the rest of the interface isn't polluted with an option that's specific to the AggregateArg case.

Hi @vsk, thank you for your feedback. I think the proposed interface is more flexible and easier to support than making exclusions internal to CodeExtractor and having repeated calls through addArgExludedFromAggregate to set them. My line of thinking is that (1) it is more flexible to let the client manage the state and request extraction through extractCodeRegion with it, for example it makes possible to extract the same region with different exclusions without creating another CodeExtractor instance, (2) there is only one, backwards behavior compatible change to the external interface of CodeExtractor, that is for extractCodeRegion, so changes in other methods (constructFunction, emitCallAndSwitchStatement) are internal and changes do not pollute this external API. Please let me know what you think.