This is an archive of the discontinued LLVM Phabricator instance.

[OpenMPIRBuilder] forward arguments as pointers to outlined function
ClosedPublic

Authored by ftynse on Nov 26 2020, 9:33 AM.

Details

Summary

OpenMPIRBuilder::createParallel outlines the body region of the parallel
construct into a new function that accepts any value previously defined outside
the region as a function argument. This function is called back by OpenMP
runtime function __kmpc_fork_call, which expects trailing arguments to be
pointers. If the region uses a value that is not of a pointer type, e.g. a
struct, the produced code would be invalid. In such cases, make createParallel
emit IR that stores the value on stack and pass the pointer to the outlined
function instead. The outlined function then loads the value back and uses as
normal.

Diff Detail

Event Timeline

ftynse created this revision.Nov 26 2020, 9:33 AM
ftynse requested review of this revision.Nov 26 2020, 9:33 AM

I ran into this when trying to generate OpenMP code from MLIR (not Fortran). We use structs for our rich pointer abstraction and using that struct in the outlined code led to memory corruptions because it was not passed by pointer. I suppose the intent is the same as https://reviews.llvm.org/D91556, so consider me commandeering that revision.

ftynse updated this revision to Diff 307992.Nov 27 2020, 1:56 AM

Update tests

Looking at CodeGenFunction::GenerateOpenMPCapturedVars, the condition if (!CurField->getType()->isAnyPointerType()) seems to match what is done here, but I wonder whether it is always necessary. Is the additional indirection necessary for pointer-sized data types (e.g. intptr_t, or smaller)?

Aside this optimization, the change looks like something has to be done. Can you update the naming conventions pointed-out by clang-tidy?

llitchev accepted this revision.EditedNov 28 2020, 9:12 AM

This looks fine. Except the failing tests. Once the tests are fixed it should be good to go.

I have the changes on https://reviews.llvm.org/D91556 made (talked offline with Johannes and made changes to create the struct and capture the values just before the extraction as well - the parameters captured are the ones that the definition of the used vars are parented by a different block than the block of the region. Spending time doing bunch of testing. The approach in this PR works as well.

This revision is now accepted and ready to land.Nov 28 2020, 9:12 AM
fghanim added a subscriber: fghanim.

Thank you for the patch :)

To be honest, I am not exactly sure what is the problem you are trying to address here, is it about passing shared variables as proper pointers? or is it about creating locals for said arguments

Based on the patch description; I think it is the first; and AFAIK, the current implementation will always pass shared variables correctly as pointers. is that not the case? I mean I have never came across a case where it didn't
If yes, do you have a c/c++ example I can try? I'd be very interested in seeing for which cases this doesn't work.

However, implementation and checks makes me think it is the second, Which I always assumed was going to be a part of how a specific frontend handles privatization; but I am very much not against doing so here
Just will want to make sure to do it in a way that works for the OMP-IR, and I don't know if there is or what is the plan to handle private/shared variables is on their end.

llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
701

Why not use InnerAllocaIP instead? it's in the entry block of the parallel region.

707

NIT: IPGuard maybe better?

712

For the alloca, should use OuterAllocaIP or InnerAllocaIP based on which function the alloca belongs to.

allocations in InsertBB will end right before the fork call, which suggests you probably should use outerAllocaIP. but why do you need two allocations for same variable. or is this meant to be the allocations where we store the arguments in the outlined function? if that's the case, then you should use InnerAllocaIP

713

Nit: if you could specify a name for the alloca, a name is easier to track than a number in IR :)

jdoerfert requested changes to this revision.Nov 30 2020, 9:42 PM
jdoerfert added a subscriber: ggeorgakoudis.

Nit: Clang tidy warning wrt variable names should be addressed. Some other nits inlined.

My only concern is described in the longer inline comment.


To be honest, I am not exactly sure what is the problem you are trying to address here, is it about passing shared variables as proper pointers? or is it about creating locals for said arguments

The problem was discussed in the Flang + OpenMP call lately and got attention by at least 4 people since then, as far as I can tell independently. I'm rather confused TBH, but it's a "good thing", as far as bug fixes can be good ;)

Long story short, kmpc_fork_call expects variadic arguments to be in integer registers, anything else passed is lost and replaced with garbage. Clang deals with this already but since the problem is not unique to Clang we should do it here. If anyone wants to create a parallel region they should not need to know the implicit calling convention rules of the OpenMP runtime. For an example how clang "optimizes" things, see https://godbolt.org/z/cja5TE, especially the float/i32 bitcasts.

Based on the patch description; I think it is the first; and AFAIK, the current implementation will always pass shared variables correctly as pointers. is that not the case? I mean I have never came across a case where it didn't
If yes, do you have a c/c++ example I can try? I'd be very interested in seeing for which cases this doesn't work.

It's not about shared variables but basically the equivalent of "firstprivate" (=pass by value), though it occurs in non-frontend contexts as well, e.g. when merging/expanding parallel regions.

However, implementation and checks makes me think it is the second, Which I always assumed was going to be a part of how a specific frontend handles privatization; but I am very much not against doing so here

It belongs here and I asked them to put it here instead of Flang, especially since @ggeorgakoudis exposed the same problem in OpenMP opt. Now instead of fixing up pointers 3 times and exposing a very fragile API to non-OpenMP users, we will ensure the calling convention is not leaking out of the IRBuilder.

Just will want to make sure to do it in a way that works for the OMP-IR, and I don't know if there is or what is the plan to handle private/shared variables is on their end.

Privatization (in all its glory) has still to happen in the frontend. This is only partially related.

llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
742

I was expecting the above logic to be placed here, after the PrivCB callback as I assumed we would privatize in the sequential part. Clang does not and we do not either (for now), which is unfortunate. It does duplicate the privatization and makes this patch more complicated than I anticipated. I though we can simply replace ReplacementValue by Reloaded if ReplacementValue wasn't a pointer and therefor put in an alloca. The fact that we need to reload V and then "replace twice" seems really unfortunate.

What would happen if you do not have Reloaded at all but instead make it V = createlodoad(alloca)? After all, Reloaded is equivalent to V in all aspects, right? Would this work? Would we still need the code below? I feel like there must be a minimal solution as we only want to replace the value once on the other side of the call edge.

760

Nit: Assertion message please.

780

Nit: -"we"

This revision now requires changes to proceed.Nov 30 2020, 9:42 PM

Looking at CodeGenFunction::GenerateOpenMPCapturedVars, the condition if (!CurField->getType()->isAnyPointerType()) seems to match what is done here, but I wonder whether it is always necessary. Is the additional indirection necessary for pointer-sized data types (e.g. intptr_t, or smaller)?

Aside this optimization, the change looks like something has to be done. Can you update the naming conventions pointed-out by clang-tidy?

Let's keep it simple. TBH, I am not convinced Clang's "optimization" is not hurting us in the long run. I mean, we'd need to understand this: bitcast float to i32 -> zext ---fork_call---> trunc -> bitcast i32 to float while the pointer indirection is common also for shared values so we will be (actually are downstream) able to handle those.

jdoerfert added inline comments.Nov 30 2020, 9:48 PM
llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
742

-"duplicate privatization" +"duplicate replace all uses with"

ftynse updated this revision to Diff 308622.Dec 1 2020, 5:27 AM
ftynse marked 3 inline comments as done.

Address reviews.

ftynse added inline comments.Dec 1 2020, 5:35 AM
llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
701

I started with that, but it did not work out. InnerAllocaIP is used by PrivCB to construct IR that may be using the value defined at ReloadIP. If I literally replace the use of ReloadIP with InnerAllocaIP, the instructions with have the wrong order. I also considered inserting this at the top of the entry block, but this messes up the order of arguments in the outlined function. Suggestions on how to structure this better are welcome.

712

This belongs to the outer function, but OuterAllocaIP does not seem valid at this point anymore. Using it leads to segfaults, I can investigate later, after we decide whether this code should remain here or move below as Johannes suggested.

742

I am not sure I fully follow what you suggest, so I will elaborate on the two versions I had considered.

  1. Move the code that loads back the value (currently lines 709-725) after this line. This will not remove the need for two separate "replace all uses with": there uses of the original V in the body that need to be replaced with ReplacementValue, and there are uses of V that could have been introduced by PrivCB for the purpose of defining ReplacementValue which should be replaced with Reloaded instead. This doesn't look like it would address your concern about having double replacement.
  1. I can keep the code that loads back the value in its current place and pass either V or *Reloaded to PrivCB. This will make sure any instructions created in PrivCB use the Reloaded and don't need to be update via "replace all uses with" pattern. However, this exposes the pointer mechanism to the caller of CreateParallel. The Value & that PrivCB receives is not necessarily a value that exists in the user code, it can be the alloca we constructed in builder. So the implementation of PrivCB needs to be aware of it and can no longer rely on, e.g., keeping a list of values/instructions that need privatization or directly rely on types since the type would change. I decided that I did not want to change the contract that PrivCB has with CreateParallel because I was not aware of all its use cases (and have a preference for more "isolated" APIs) . However, if you think it is acceptable for the builder in order to reduce the complexity/overhead of the code, I can adapt.

There is a flavor of (2) that changes PrivCB to take both V and Replacement so that the implementation of PrivCB can easily detect when the pointer mechanism is active.

ftynse marked an inline comment as done.Dec 1 2020, 5:36 AM
ftynse updated this revision to Diff 308643.Dec 1 2020, 6:48 AM

Fix clang tests. The order of arguments is switched in the internal outlined function.

Herald added a project: Restricted Project. · View Herald TranscriptDec 1 2020, 6:48 AM
Herald added a subscriber: cfe-commits. · View Herald Transcript
ftynse added inline comments.Dec 1 2020, 7:01 AM
clang/test/OpenMP/parallel_codegen.cpp
143

The order of arguments changes here because we create a use of the promoted-to-pointer argument before any other uses in the body and the outliner finds it first. This should be fine because it's just an internal outlined function that the builder created and the calls to it are emitted accordingly and in the same builder. I can add a comment that explains this if desired.

If we go with Michael's suggestion not to turn into pointers the integer values whose size is equal to or smaller than pointer size, this change will not be necessary. I seem to remember seeing some documentation that says that trailing arguments to fork_call should be _pointers_, but I can't find it anymore.

While only partially related, can you leave a FIXME saying that more than 15 arguments need to be packed in a structure?

clang/test/OpenMP/parallel_codegen.cpp
143

I seem to remember seeing some documentation that says that trailing arguments to fork_call should be _pointers_, but I can't find it anymore.

Technically, anything that is passed in the same register as a void * is fine. The documentation on this is thin at best. As I mentioned in my response to @Meinersbur, I think turning everything into pointers is a reasonable way to handle this. I gave some rational there (https://reviews.llvm.org/D92189#2424690)

llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
742

Appreciate the detailed explanations!

There is a flavor of (2) that changes PrivCB to take both V and Replacement so that the implementation of PrivCB can easily detect when the pointer mechanism is active.

This looks like a sweet-spot. We can avoid adding more replacement logic but the PrivCB is aware of what is going on. If you are OK with this solution as well I'd prefer it.

ftynse planned changes to this revision.Dec 1 2020, 8:58 AM
ftynse added inline comments.
llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
742

Great, will do this.

llitchev added a comment.EditedDec 1 2020, 9:10 AM

While only partially related, can you leave a FIXME saying that more than 15 arguments need to be packed in a structure?

Yes, the number of varargs that is guarantee to work is 15, if more are needed, there is a need to wrap the parameters in a struct.
As we discussed (bring it here too) the issue here is how parameters are passed following the calling convention. The varargs assumes that parameters are passed in GPR (INTEGER classifications based on the calling conventions). The varargs approach works fine for the INTEGER classified args. It breaks for the SSE(UP) and vector clarified parameters. If we pass the pointers to the SSE params (floats for example) everything will work - pointers are INTEGER classification. For Windows things are a bit different of course, based on the specific calling convention used there.

ftynse updated this revision to Diff 308698.Dec 1 2020, 10:07 AM

Simplify the code by adapting the PrivatizationCallback signature.

ftynse marked 2 inline comments as done.Dec 1 2020, 10:08 AM
jdoerfert accepted this revision.Dec 1 2020, 12:34 PM

LGTM, one comment and two nits. Feel free to commit if you agree with the suggestions or come back with concerns. Thanks for working on this and taking the time to fix it like this!

llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h
124

It is not a pointer to original, is it? It is an equivalent but potentially not the same IR value.

llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
738

Nit: Rename Reloaded to Inner to make it consistent with the callback (description). With Value* Inner = V; you can avoid the select above.

745

Nit: This could go back up again I guess.

This revision is now accepted and ready to land.Dec 1 2020, 12:34 PM

Thanks; just two more mediocre things if possible. if not. you are good to go. :)

@jdoerfert

To be honest, I am not exactly sure what is the problem you are trying to address here, is it about passing shared variables as proper pointers? or is it about creating locals for said arguments

The problem was discussed in the Flang + OpenMP call lately and got attention by at least 4 people since then, as far as I can tell independently. I'm rather confused TBH, but it's a "good thing", as far as bug fixes can be good ;)

I don't disagree. As I said, "I am very much not against doing so here"

llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
701

OK; The only instructions generated at InnerAllocaIP up to this point are the things generated in bodyCB, are these the users that you are talking about?

In any case, and since you already reset InnerAllocaIP to ReloadIP, I prefer we do that before privatization (i.e. reset InnerAllocaIP, to ZeroAddrUse->getNextNode() directly), and use that instead.

712

That's probably because the OuterAllocaIP iterator got invalidated between the beginning and here.

If you want; get the insertion BasicBlock early on, and then use that to reupdate OuterAllocaIP and use that to place whatever you want. right now those alloca.s are not in an entry block.

ftynse updated this revision to Diff 308933.Dec 2 2020, 4:23 AM
ftynse marked 6 inline comments as done.

Address all remaining review comments.

ftynse updated this revision to Diff 308938.Dec 2 2020, 5:21 AM

Fix a runaway MLIR use of createParallel

Herald added a project: Restricted Project. · View Herald TranscriptDec 2 2020, 5:21 AM
ftynse added a comment.Dec 2 2020, 5:59 AM

Windows failure is unrelated, broken by an earlier commit to MLIR PDL.

This revision was landed with ongoing or failed builds.Dec 2 2020, 5:59 AM
This revision was automatically updated to reflect the committed changes.