Page MenuHomePhabricator

[CodeExtractor] Enable partial aggregate arguments
ClosedPublic

Authored by ggeorgakoudis on Feb 17 2021, 3:43 AM.

Details

Summary

Enable CodeExtractor to construct output functions that partially aggregate inputs/outputs in their argument list. A use case is the OMPIRBuilder to create outlined functions for parallel regions that aggregate in a struct the payload variables for the region while passing as scalars thread and bound identifiers.

Diff Detail

Event Timeline

ggeorgakoudis created this revision.Feb 17 2021, 3:43 AM
ggeorgakoudis requested review of this revision.Feb 17 2021, 3:43 AM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 17 2021, 3:43 AM
ggeorgakoudis edited the summary of this revision. (Show Details)Feb 17 2021, 3:48 AM
ggeorgakoudis added reviewers: jdoerfert, vsk.
vsk added a comment.Mar 1 2021, 10:33 AM

Sorry it's taken me so long to get to this.

partially aggregate inputs/outputs in their argument list

Could you explain what this means, and what the pros/cons might be compared to any alternatives? It'd also help to see a test case.

ggeorgakoudis added a comment.EditedMar 4 2021, 12:22 AM
In D96854#2594840, @vsk wrote:

Sorry it's taken me so long to get to this.

partially aggregate inputs/outputs in their argument list

Could you explain what this means, and what the pros/cons might be compared to any alternatives? It'd also help to see a test case.

Hi @vsk,

No problem! Let me make the use case of the OpenMP IR builder concrete, I'll do some simplifications that do not affect the point. Currently, the OMPIRBuilder uses CodeExtractor to outline an OpenMP callback as:

void omp.outlined(int global_tid, int bound_tid, int* arg0, int* arg1, ..., int* argn)

where global_tid, bound_tid are OpenMP runtime filled values passed to the outlined functions and arg0, arg1, ...,argn are inputs/outputs found by CodeExtractor. To implement parallel execution calling the outlined call function, OMPIRBuilder emits the call to the OpenMP runtime fork_join function, which use ellipsis to pass the variadic number of parameters to the OpenMP runtime:

__kmpc_fork_call(int argc, omp.outlined, ..,)

so the ellipsis contains the arguments to the outlined function (arg0, arg1, ..., argn). The OpenMP runtime library fills the values for the preceding arguments global_tid, bound_tid when calling omp.outlined and forwards the rest of the arguments through a cumbersome dispatch function that unwraps the variadic arguments and uses a switch-case to call the function pointer of the callback as in:

switch(argc) {
  case 1: fp_to_omp.outlined(global_tid, bound_tid, vararg[0]); return;
  case 2: fp_to_omp.outlined(global_tid, bound_tid, vararg[0], vararg[1]); return;
  ...
}

We would like to remove this ellipsis interface because it creates various problems: there is a hardcoded limit on the number of arguments that the runtime forwards (limited by the switch-case style unwrapping), it has been the source of ABI bugs, and makes hard to analyze and optimize OpenMP code in LLVM. For this we would like to aggregate the input/output arguments to the outlined function but leave the runtime-filled arguments unaggregated:

void omp.outlined(int global_tid, int bound_tid, struct structArg)

This patch enables to exclude arguments from the aggregate by extending extractCodeRegion in CodeExtractor with a parameter of which arguments to exclude (assuming AggregateArgs has been set when creating the CodeExtractor). In this specific use case for OpenMP, the exclude arguments are global_tid and bound_tid.

I have added a unit test that tests this functionality. When we complete the change in OMPIRBuilder to use partial aggregation from CodeExtractor we will add also IR tests that will test this functionality too.

vsk added a comment.Mar 23 2021, 11:13 AM

Thanks for explaining. I'd suggest making ExcludedAggArgs part of a CodeExtractor instances internal state: e.g. the client may call CE.addArgExludedFromAggregate(Value *) some number of times before CE.extractCodeRegion(). This way, the client doesn't need to maintain a SetVector, and the rest of the interface isn't polluted with an option that's specific to the AggregateArg case.

In D96854#2645458, @vsk wrote:

Thanks for explaining. I'd suggest making ExcludedAggArgs part of a CodeExtractor instances internal state: e.g. the client may call CE.addArgExludedFromAggregate(Value *) some number of times before CE.extractCodeRegion(). This way, the client doesn't need to maintain a SetVector, and the rest of the interface isn't polluted with an option that's specific to the AggregateArg case.

Hi @vsk, thank you for your feedback. I think the proposed interface is more flexible and easier to support than making exclusions internal to CodeExtractor and having repeated calls through addArgExludedFromAggregate to set them. My line of thinking is that (1) it is more flexible to let the client manage the state and request extraction through extractCodeRegion with it, for example it makes possible to extract the same region with different exclusions without creating another CodeExtractor instance, (2) there is only one, backwards behavior compatible change to the external interface of CodeExtractor, that is for extractCodeRegion, so changes in other methods (constructFunction, emitCallAndSwitchStatement) are internal and changes do not pollute this external API. Please let me know what you think.

vsk added a comment.Apr 16 2021, 11:30 AM

for example it makes possible to extract the same region with different exclusions without creating another CodeExtractor instance,

As extractCodeRegion mutates the original function, I assumed it was not possible to reuse a CodeExtractor instance in this way. Is there an in-tree example of CE instance reuse I can take a look at?

so changes in other methods (constructFunction, emitCallAndSwitchStatement) are internal and changes do not pollute this external API

I don't quite follow. Wouldn't introducing a separate API for adding arg exclusions would also be backwards compatible?

In D96854#2695426, @vsk wrote:

for example it makes possible to extract the same region with different exclusions without creating another CodeExtractor instance,

As extractCodeRegion mutates the original function, I assumed it was not possible to reuse a CodeExtractor instance in this way. Is there an in-tree example of CE instance reuse I can take a look at?

Unfortunately, there isn't an example so far. The CodeExtractor instance can be re-used because the analysis of CE stays the same. The exclusion from aggregates affects only the extraction of the outlined function when calling extractCodeRegion.

so changes in other methods (constructFunction, emitCallAndSwitchStatement) are internal and changes do not pollute this external API

I don't quite follow. Wouldn't introducing a separate API for adding arg exclusions would also be backwards compatible?

It will be backwards compatible but IMO it unnecessarily binds the CE instance to a specific way of argument creation for the outlined function. In the same sense, the AggregateArgs that is part of the CE constructor could be only an argument to extractCodeRegion since CE analysis is orthogonal to it.

@vsk I'll re-structure the implementation to use internal state and an addArgExludedFromAggregate interface. I don't have/can't think right now a use case for the flexibility I'm proposing.

vsk added a comment.Apr 22 2021, 11:46 AM
In D96854#2695426, @vsk wrote:

for example it makes possible to extract the same region with different exclusions without creating another CodeExtractor instance,

As extractCodeRegion mutates the original function, I assumed it was not possible to reuse a CodeExtractor instance in this way. Is there an in-tree example of CE instance reuse I can take a look at?

Unfortunately, there isn't an example so far. The CodeExtractor instance can be re-used because the analysis of CE stays the same. The exclusion from aggregates affects only the extraction of the outlined function when calling extractCodeRegion.

Oh, I think I see what you meant - was it that the CE analysis cache can be re-used? The motivation there was eliminating quadratic compile-time for repeated outlining (D68616); with that change some cached analysis of the caller became reusable, but not the extractor instance itself.

so changes in other methods (constructFunction, emitCallAndSwitchStatement) are internal and changes do not pollute this external API

I don't quite follow. Wouldn't introducing a separate API for adding arg exclusions would also be backwards compatible?

It will be backwards compatible but IMO it unnecessarily binds the CE instance to a specific way of argument creation for the outlined function.
[snip]

I believe this would be true of any API we pick today.

ggeorgakoudis added a comment.EditedMay 9 2021, 12:54 AM
In D96854#2709726, @vsk wrote:

Oh, I think I see what you meant - was it that the CE analysis cache can be re-used? The motivation there was eliminating quadratic compile-time for repeated outlining (D68616); with that change some cached analysis of the caller became reusable, but not the extractor instance itself.

Yes, that's what I meant! The CE analysis is reused and the extractCodeRegion should only define the way to extract the analyzed region to an outlined function.

I believe this would be true of any API we pick today.

I think it doesn't have to be this way, if we move the AggregateArgs flag to be an argument to extractCodeRegion along with the exclusions. For the use case in the OpenMP IR builder either interface will work. Since you have a more holistic use of the clients of CodeExtractor, it makes sense that you make the call. Shall I go forth with the addArgExludedFromAggregate interface?

Ping @vsk! Waiting for your decision :)

vsk added a comment.Jun 10 2021, 2:21 PM

@ggeorgakoudis apologies again for the delay here.

My opinion hasn't changed, I still feel it'd make for a simpler interface to introduce a "CE.addArgExcludedFromAggregate(Value *)" API (though I admit that that API name isn't ideal).

Update interface for comments

vsk added inline comments.
llvm/lib/Transforms/Utils/CodeExtractor.cpp
833

I notice there's a paramTy = ScalarParamTy: can we delete paramTy entirely? Having two copies of the same thing creates a risk of the copies diverging, adding complexity.

860

Suggest - assert(StructValues.empty() || AggregateArgs && "StructValues updated for arg structs only")

Update for comments

ggeorgakoudis marked 2 inline comments as done.Nov 1 2021, 11:05 AM
ggeorgakoudis added inline comments.
llvm/lib/Transforms/Utils/CodeExtractor.cpp
833

In paramTy we concatenate the scalar parameter types and the aggregate type. The comment on diverging copies makes sense. I'll change paramTy to ParamTy, remove ScalarParamTy, and directly push non-aggregate arguments and concat the aggregate type.

860

Makes sense, will add the check

ggeorgakoudis marked 2 inline comments as done.Dec 1 2021, 9:25 AM

Ping @vsk :)

vsk added a comment.Dec 6 2021, 2:58 PM

Looks good to me, thanks.
(And apologies for more delay: I'm not working on llvm much these days, Jessica or Jon (both cc'd) may be able to provide better turnaround time going forward.)

jdoerfert accepted this revision.Jan 25 2022, 4:05 PM

Was accepted before, now mark it as such.

This revision is now accepted and ready to land.Jan 25 2022, 4:05 PM
This revision was landed with ongoing or failed builds.Jan 25 2022, 6:25 PM
This revision was automatically updated to reflect the committed changes.