OpenMPIRBuilder::createParallel outlines the body region of the parallel
construct into a new function that accepts any value previously defined outside
the region as a function argument. This function is called back by OpenMP
runtime function __kmpc_fork_call, which expects trailing arguments to be
pointers. If the region uses a value that is not of a pointer type, e.g. a
struct, the produced code would be invalid. In such cases, make createParallel
emit IR that stores the value on stack and pass the pointer to the outlined
function instead. The outlined function then loads the value back and uses as
normal.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
I ran into this when trying to generate OpenMP code from MLIR (not Fortran). We use structs for our rich pointer abstraction and using that struct in the outlined code led to memory corruptions because it was not passed by pointer. I suppose the intent is the same as https://reviews.llvm.org/D91556, so consider me commandeering that revision.
Looking at CodeGenFunction::GenerateOpenMPCapturedVars, the condition if (!CurField->getType()->isAnyPointerType()) seems to match what is done here, but I wonder whether it is always necessary. Is the additional indirection necessary for pointer-sized data types (e.g. intptr_t, or smaller)?
Aside this optimization, the change looks like something has to be done. Can you update the naming conventions pointed-out by clang-tidy?
This looks fine. Except the failing tests. Once the tests are fixed it should be good to go.
I have the changes on https://reviews.llvm.org/D91556 made (talked offline with Johannes and made changes to create the struct and capture the values just before the extraction as well - the parameters captured are the ones that the definition of the used vars are parented by a different block than the block of the region. Spending time doing bunch of testing. The approach in this PR works as well.
Thank you for the patch :)
To be honest, I am not exactly sure what is the problem you are trying to address here, is it about passing shared variables as proper pointers? or is it about creating locals for said arguments
Based on the patch description; I think it is the first; and AFAIK, the current implementation will always pass shared variables correctly as pointers. is that not the case? I mean I have never came across a case where it didn't
If yes, do you have a c/c++ example I can try? I'd be very interested in seeing for which cases this doesn't work.
However, implementation and checks makes me think it is the second, Which I always assumed was going to be a part of how a specific frontend handles privatization; but I am very much not against doing so here
Just will want to make sure to do it in a way that works for the OMP-IR, and I don't know if there is or what is the plan to handle private/shared variables is on their end.
| llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp | ||
|---|---|---|
| 696 | Why not use InnerAllocaIP instead? it's in the entry block of the parallel region. | |
| 702 | NIT: IPGuard maybe better? | |
| 707 | For the alloca, should use OuterAllocaIP or InnerAllocaIP based on which function the alloca belongs to. allocations in InsertBB will end right before the fork call, which suggests you probably should use outerAllocaIP. but why do you need two allocations for same variable. or is this meant to be the allocations where we store the arguments in the outlined function? if that's the case, then you should use InnerAllocaIP | |
| 708 | Nit: if you could specify a name for the alloca, a name is easier to track than a number in IR :) | |
Nit: Clang tidy warning wrt variable names should be addressed. Some other nits inlined.
My only concern is described in the longer inline comment.
The problem was discussed in the Flang + OpenMP call lately and got attention by at least 4 people since then, as far as I can tell independently. I'm rather confused TBH, but it's a "good thing", as far as bug fixes can be good ;)
Long story short, kmpc_fork_call expects variadic arguments to be in integer registers, anything else passed is lost and replaced with garbage. Clang deals with this already but since the problem is not unique to Clang we should do it here. If anyone wants to create a parallel region they should not need to know the implicit calling convention rules of the OpenMP runtime. For an example how clang "optimizes" things, see https://godbolt.org/z/cja5TE, especially the float/i32 bitcasts.
Based on the patch description; I think it is the first; and AFAIK, the current implementation will always pass shared variables correctly as pointers. is that not the case? I mean I have never came across a case where it didn't
If yes, do you have a c/c++ example I can try? I'd be very interested in seeing for which cases this doesn't work.
It's not about shared variables but basically the equivalent of "firstprivate" (=pass by value), though it occurs in non-frontend contexts as well, e.g. when merging/expanding parallel regions.
However, implementation and checks makes me think it is the second, Which I always assumed was going to be a part of how a specific frontend handles privatization; but I am very much not against doing so here
It belongs here and I asked them to put it here instead of Flang, especially since @ggeorgakoudis exposed the same problem in OpenMP opt. Now instead of fixing up pointers 3 times and exposing a very fragile API to non-OpenMP users, we will ensure the calling convention is not leaking out of the IRBuilder.
Just will want to make sure to do it in a way that works for the OMP-IR, and I don't know if there is or what is the plan to handle private/shared variables is on their end.
Privatization (in all its glory) has still to happen in the frontend. This is only partially related.
| llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp | ||
|---|---|---|
| 709 | I was expecting the above logic to be placed here, after the PrivCB callback as I assumed we would privatize in the sequential part. Clang does not and we do not either (for now), which is unfortunate. It does duplicate the privatization and makes this patch more complicated than I anticipated. I though we can simply replace ReplacementValue by Reloaded if ReplacementValue wasn't a pointer and therefor put in an alloca. The fact that we need to reload V and then "replace twice" seems really unfortunate. What would happen if you do not have Reloaded at all but instead make it V = createlodoad(alloca)? After all, Reloaded is equivalent to V in all aspects, right? Would this work? Would we still need the code below? I feel like there must be a minimal solution as we only want to replace the value once on the other side of the call edge. | |
| 727 | Nit: Assertion message please. | |
| 747 | Nit: -"we" | |
Let's keep it simple. TBH, I am not convinced Clang's "optimization" is not hurting us in the long run. I mean, we'd need to understand this: bitcast float to i32 -> zext ---fork_call---> trunc -> bitcast i32 to float while the pointer indirection is common also for shared values so we will be (actually are downstream) able to handle those.
| llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp | ||
|---|---|---|
| 709 | -"duplicate privatization" +"duplicate replace all uses with" | |
| llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp | ||
|---|---|---|
| 696 | I started with that, but it did not work out. InnerAllocaIP is used by PrivCB to construct IR that may be using the value defined at ReloadIP. If I literally replace the use of ReloadIP with InnerAllocaIP, the instructions with have the wrong order. I also considered inserting this at the top of the entry block, but this messes up the order of arguments in the outlined function. Suggestions on how to structure this better are welcome. | |
| 707 | This belongs to the outer function, but OuterAllocaIP does not seem valid at this point anymore. Using it leads to segfaults, I can investigate later, after we decide whether this code should remain here or move below as Johannes suggested. | |
| 709 | I am not sure I fully follow what you suggest, so I will elaborate on the two versions I had considered. 
 
 There is a flavor of (2) that changes PrivCB to take both V and Replacement so that the implementation of PrivCB can easily detect when the pointer mechanism is active. | |
Fix clang tests. The order of arguments is switched in the internal outlined function.
| clang/test/OpenMP/parallel_codegen.cpp | ||
|---|---|---|
| 139 ↗ | (On Diff #308643) | The order of arguments changes here because we create a use of the promoted-to-pointer argument before any other uses in the body and the outliner finds it first. This should be fine because it's just an internal outlined function that the builder created and the calls to it are emitted accordingly and in the same builder. I can add a comment that explains this if desired. If we go with Michael's suggestion not to turn into pointers the integer values whose size is equal to or smaller than pointer size, this change will not be necessary. I seem to remember seeing some documentation that says that trailing arguments to fork_call should be _pointers_, but I can't find it anymore. | 
While only partially related, can you leave a FIXME saying that more than 15 arguments need to be packed in a structure?
| clang/test/OpenMP/parallel_codegen.cpp | ||
|---|---|---|
| 139 ↗ | (On Diff #308643) | 
 Technically, anything that is passed in the same register as a void * is fine. The documentation on this is thin at best. As I mentioned in my response to @Meinersbur, I think turning everything into pointers is a reasonable way to handle this. I gave some rational there (https://reviews.llvm.org/D92189#2424690) | 
| llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp | ||
| 709 | Appreciate the detailed explanations! 
 This looks like a sweet-spot. We can avoid adding more replacement logic but the PrivCB is aware of what is going on. If you are OK with this solution as well I'd prefer it. | |
| llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp | ||
|---|---|---|
| 709 | Great, will do this. | |
Yes, the number of varargs that is guarantee to work is 15, if more are needed, there is a need to wrap the parameters in a struct.
As we discussed (bring it here too) the issue here is how parameters are passed following the calling convention. The varargs assumes that parameters are passed in GPR (INTEGER classifications based on the calling conventions). The varargs approach works fine for the INTEGER classified args. It breaks for the SSE(UP) and vector clarified parameters. If we pass the pointers to the SSE params (floats for example) everything will work - pointers are INTEGER classification. For Windows things are a bit different of course, based on the specific calling convention used there.
LGTM, one comment and two nits. Feel free to commit if you agree with the suggestions or come back with concerns. Thanks for working on this and taking the time to fix it like this!
| llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h | ||
|---|---|---|
| 124 ↗ | (On Diff #308698) | It is not a pointer to original, is it? It is an equivalent but potentially not the same IR value. | 
| llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp | ||
| 707 | Nit: Rename Reloaded to Inner to make it consistent with the callback (description). With Value* Inner = V; you can avoid the select above. | |
| 760 | Nit: This could go back up again I guess. | |
Thanks; just two more mediocre things if possible. if not. you are good to go. :)
I don't disagree. As I said, "I am very much not against doing so here"
| llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp | ||
|---|---|---|
| 696 | OK; The only instructions generated at InnerAllocaIP up to this point are the things generated in bodyCB, are these the users that you are talking about? In any case, and since you already reset InnerAllocaIP to ReloadIP, I prefer we do that before privatization (i.e. reset InnerAllocaIP, to ZeroAddrUse->getNextNode() directly), and use that instead. | |
| 707 | That's probably because the OuterAllocaIP iterator got invalidated between the beginning and here. If you want; get the insertion BasicBlock early on, and then use that to reupdate OuterAllocaIP and use that to place whatever you want. right now those alloca.s are not in an entry block. | |
Why not use InnerAllocaIP instead? it's in the entry block of the parallel region.