Enable CodeExtractor to construct output functions that partially aggregate inputs/outputs in their argument list. A use case is the OMPIRBuilder to create outlined functions for parallel regions that aggregate in a struct the payload variables for the region while passing as scalars thread and bound identifiers.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
Sorry it's taken me so long to get to this.
partially aggregate inputs/outputs in their argument list
Could you explain what this means, and what the pros/cons might be compared to any alternatives? It'd also help to see a test case.
Hi @vsk,
No problem! Let me make the use case of the OpenMP IR builder concrete, I'll do some simplifications that do not affect the point. Currently, the OMPIRBuilder uses CodeExtractor to outline an OpenMP callback as:
void omp.outlined(int global_tid, int bound_tid, int* arg0, int* arg1, ..., int* argn)
where global_tid, bound_tid are OpenMP runtime filled values passed to the outlined functions and arg0, arg1, ...,argn are inputs/outputs found by CodeExtractor. To implement parallel execution calling the outlined call function, OMPIRBuilder emits the call to the OpenMP runtime fork_join function, which use ellipsis to pass the variadic number of parameters to the OpenMP runtime:
__kmpc_fork_call(int argc, omp.outlined, ..,)
so the ellipsis contains the arguments to the outlined function (arg0, arg1, ..., argn). The OpenMP runtime library fills the values for the preceding arguments global_tid, bound_tid when calling omp.outlined and forwards the rest of the arguments through a cumbersome dispatch function that unwraps the variadic arguments and uses a switch-case to call the function pointer of the callback as in:
switch(argc) { case 1: fp_to_omp.outlined(global_tid, bound_tid, vararg[0]); return; case 2: fp_to_omp.outlined(global_tid, bound_tid, vararg[0], vararg[1]); return; ... }
We would like to remove this ellipsis interface because it creates various problems: there is a hardcoded limit on the number of arguments that the runtime forwards (limited by the switch-case style unwrapping), it has been the source of ABI bugs, and makes hard to analyze and optimize OpenMP code in LLVM. For this we would like to aggregate the input/output arguments to the outlined function but leave the runtime-filled arguments unaggregated:
void omp.outlined(int global_tid, int bound_tid, struct structArg)
This patch enables to exclude arguments from the aggregate by extending extractCodeRegion in CodeExtractor with a parameter of which arguments to exclude (assuming AggregateArgs has been set when creating the CodeExtractor). In this specific use case for OpenMP, the exclude arguments are global_tid and bound_tid.
I have added a unit test that tests this functionality. When we complete the change in OMPIRBuilder to use partial aggregation from CodeExtractor we will add also IR tests that will test this functionality too.
Thanks for explaining. I'd suggest making ExcludedAggArgs part of a CodeExtractor instances internal state: e.g. the client may call CE.addArgExludedFromAggregate(Value *) some number of times before CE.extractCodeRegion(). This way, the client doesn't need to maintain a SetVector, and the rest of the interface isn't polluted with an option that's specific to the AggregateArg case.
Hi @vsk, thank you for your feedback. I think the proposed interface is more flexible and easier to support than making exclusions internal to CodeExtractor and having repeated calls through addArgExludedFromAggregate to set them. My line of thinking is that (1) it is more flexible to let the client manage the state and request extraction through extractCodeRegion with it, for example it makes possible to extract the same region with different exclusions without creating another CodeExtractor instance, (2) there is only one, backwards behavior compatible change to the external interface of CodeExtractor, that is for extractCodeRegion, so changes in other methods (constructFunction, emitCallAndSwitchStatement) are internal and changes do not pollute this external API. Please let me know what you think.
for example it makes possible to extract the same region with different exclusions without creating another CodeExtractor instance,
As extractCodeRegion mutates the original function, I assumed it was not possible to reuse a CodeExtractor instance in this way. Is there an in-tree example of CE instance reuse I can take a look at?
so changes in other methods (constructFunction, emitCallAndSwitchStatement) are internal and changes do not pollute this external API
I don't quite follow. Wouldn't introducing a separate API for adding arg exclusions would also be backwards compatible?
Unfortunately, there isn't an example so far. The CodeExtractor instance can be re-used because the analysis of CE stays the same. The exclusion from aggregates affects only the extraction of the outlined function when calling extractCodeRegion.
so changes in other methods (constructFunction, emitCallAndSwitchStatement) are internal and changes do not pollute this external API
I don't quite follow. Wouldn't introducing a separate API for adding arg exclusions would also be backwards compatible?
It will be backwards compatible but IMO it unnecessarily binds the CE instance to a specific way of argument creation for the outlined function. In the same sense, the AggregateArgs that is part of the CE constructor could be only an argument to extractCodeRegion since CE analysis is orthogonal to it.
@vsk I'll re-structure the implementation to use internal state and an addArgExludedFromAggregate interface. I don't have/can't think right now a use case for the flexibility I'm proposing.
Oh, I think I see what you meant - was it that the CE analysis cache can be re-used? The motivation there was eliminating quadratic compile-time for repeated outlining (D68616); with that change some cached analysis of the caller became reusable, but not the extractor instance itself.
so changes in other methods (constructFunction, emitCallAndSwitchStatement) are internal and changes do not pollute this external API
I don't quite follow. Wouldn't introducing a separate API for adding arg exclusions would also be backwards compatible?
It will be backwards compatible but IMO it unnecessarily binds the CE instance to a specific way of argument creation for the outlined function.
[snip]
I believe this would be true of any API we pick today.
Yes, that's what I meant! The CE analysis is reused and the extractCodeRegion should only define the way to extract the analyzed region to an outlined function.
I believe this would be true of any API we pick today.
I think it doesn't have to be this way, if we move the AggregateArgs flag to be an argument to extractCodeRegion along with the exclusions. For the use case in the OpenMP IR builder either interface will work. Since you have a more holistic use of the clients of CodeExtractor, it makes sense that you make the call. Shall I go forth with the addArgExludedFromAggregate interface?
@ggeorgakoudis apologies again for the delay here.
My opinion hasn't changed, I still feel it'd make for a simpler interface to introduce a "CE.addArgExcludedFromAggregate(Value *)" API (though I admit that that API name isn't ideal).
llvm/lib/Transforms/Utils/CodeExtractor.cpp | ||
---|---|---|
833 | I notice there's a paramTy = ScalarParamTy: can we delete paramTy entirely? Having two copies of the same thing creates a risk of the copies diverging, adding complexity. | |
860 | Suggest - assert(StructValues.empty() || AggregateArgs && "StructValues updated for arg structs only") |
llvm/lib/Transforms/Utils/CodeExtractor.cpp | ||
---|---|---|
833 | In paramTy we concatenate the scalar parameter types and the aggregate type. The comment on diverging copies makes sense. I'll change paramTy to ParamTy, remove ScalarParamTy, and directly push non-aggregate arguments and concat the aggregate type. | |
860 | Makes sense, will add the check |
Looks good to me, thanks.
(And apologies for more delay: I'm not working on llvm much these days, Jessica or Jon (both cc'd) may be able to provide better turnaround time going forward.)
I notice there's a paramTy = ScalarParamTy: can we delete paramTy entirely? Having two copies of the same thing creates a risk of the copies diverging, adding complexity.