This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP][IRBuilder] Support allocas in nested parallel regions
ClosedPublic

Authored by jdoerfert on Jun 24 2020, 8:34 AM.

Details

Summary

We need to keep track of the alloca insertion point (which we already
communicate via the callback to the user) as we place allocas as well.

Diff Detail

Event Timeline

jdoerfert created this revision.Jun 24 2020, 8:34 AM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptJun 24 2020, 8:34 AM

Thanks for the Patch. I have few questions to help me understand what's going on.

llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h
282

What's the benefit of this over just maintaining an alloca insertion point?

With this we will have 3 IRBuilders to maintain and keep track of: the clang (or flang) IRBuilder, the OMP IRBuilder and the allocaBuilder

294

I see two benefits of passing entry and exit blocks, as opposed to what we used to do:

  1. less memory, but in return we collect blocks twice (i.e. O(N) mem & O(N+E) work vs O(1) mem and 2 * O(N+E) work ). Do you expect that the vector is likely to become large enough where it is a problem? if not, what's the benefit of the change?
  1. If some blocks are added later, then this becomes a correctness issue. Which is unlikely since it happens after the body codegen is complete. However, if I am mistaken, shouldn't we also delay searching for inputs/outputs?
299

What is the benefit of passing blockSet when it is exclusively used inside of collectBlocks?
I don't think I saw a usage of it in calling functions. am I missing something?

jdoerfert marked 3 inline comments as done.Jun 24 2020, 1:45 PM
jdoerfert added inline comments.
llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h
282

No functional difference. Basically the same reason why clang has an AllocIRBuilder (I think), we can use it independently of the other one. The alternative is to save the builder IP, set it to the alloca IP, do alloca stuff, reset the builder IP, when you use the AllocaIRBuilder below. Again, no functional difference, just less setting/resetting of IPs.

294

It is 2. Here, finalization steps will outline the inner region and introduce split blocks in the process. Those were not in the outer regions blocks vector which caused problems. The inputs/outputs are not allowed to change though, that invariant seems to hold fine.
(Once we do OpenMP loop transformations on this level other new blocks would be introduced during finalize.)

299

It's used in line 684 outside of collectBlocks.

fghanim added inline comments.Jun 24 2020, 4:32 PM
llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h
282

I understand where you are coming from, and I understand that functionally there is mostly no change - for now. But I prefer we do a small struct with some RAII, over adding an entire builder. I think it's a bit overkill and will be a source to lots of trouble.

Clang maintains an instruction AllocaInsertPt, not a specific builder. Speaking of which, This is completely from memory (i.e. check to make sure); I think we already save the previous AllocaInsertionPt in clang as part of the bodyGenCB. In which case, this is not needed; a nested parallel will always be generated by clang as part of the bodyGenCB, which in turn is going to save the Alloca insertion point for us.
To enable the creation of the allocas for TID and ZeroAddr in the outer function, Why not pass current AllocaInsertionPt as an arg. to createParallel

294

Oh right. Thanks for explaining that. :)
Nit: if possible, add a simple assert to check that inputs/outputs didn't change

299

Oh, Thanks. I missed that. :)
In this case; I think both the Blockset, and BlockVector have the same contents, correct? can't we use the vector instead on line 684, and keep the set local?

jdoerfert marked 3 inline comments as done.Jun 24 2020, 9:48 PM
jdoerfert added inline comments.
llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h
282

I understand where you are coming from, and I understand that functionally there is mostly no change - for now. But I prefer we do a small struct with some RAII, over adding an entire builder. I think it's a bit overkill and will be a source to lots of trouble.

Could you explain what kind of overkill you see here? And maybe also the lots of trouble you expect from a builder versus a insertion point? It is the same information, just packed in a different struct, right?

Clang maintains an instruction AllocaInsertPt, not a specific builder.

Right.

Speaking of which, This is completely from memory (i.e. check to make sure); I think we already save the previous AllocaInsertionPt in clang as part of the bodyGenCB.

I think so.

In which case, this is not needed;

Not strictly, no.

a nested parallel will always be generated by clang as part of the bodyGenCB,

Yes.

which in turn is going to save the Alloca insertion point for us.

Yes.

To enable the creation of the allocas for TID and ZeroAddr in the outer function, Why not pass current AllocaInsertionPt as an arg. to createParallel

So far, because this will not be the only one that needs to create allocas. If you want to add the allocaIP to all runtime calls that create allocas, I'm fine with that too.

294

We can only if we remember them. I'll do it in debug builds I guess.

299

We could but we don't want to:
(1) that will make the currently logarithmic query in 684 linear, and
(2) we already compute the set during the vector construction because the we would get the same linear instead of logarithmic behavior there.

The change looks good to me. Thanks!

Overall this seems great Thanks! I have one minor concern:
This patch seems to do 2 things: Support for nested parallel regions(which was crashing earlier) and some infrastructure change(introducing AllocBuilder..).
I'm not sure of this, but is it possible to separate these as 2( or more) patches ? 1 for Nested parallel region support and other patch as a infrastructure change ?

Main benefit of this approach would be that this would be easier to track/maintain changes in future.
Thanks!

Overall this seems great Thanks! I have one minor concern:
This patch seems to do 2 things: Support for nested parallel regions(which was crashing earlier) and some infrastructure change(introducing AllocBuilder..).
I'm not sure of this, but is it possible to separate these as 2( or more) patches ? 1 for Nested parallel region support and other patch as a infrastructure change ?

I'll try.

jdoerfert retitled this revision from [OpenMP][IRBuilder] Support nested parallel regions to [OpenMP][IRBuilder] Support allocas in nested parallel regions.Jun 28 2020, 11:48 AM
jdoerfert edited the summary of this revision. (Show Details)
jdoerfert edited the summary of this revision. (Show Details)

restrict this to the alloca support

SouraVX accepted this revision.Jun 28 2020, 1:07 PM

Overall this seems great Thanks! I have one minor concern:
This patch seems to do 2 things: Support for nested parallel regions(which was crashing earlier) and some infrastructure change(introducing AllocBuilder..).
I'm not sure of this, but is it possible to separate these as 2( or more) patches ? 1 for Nested parallel region support and other patch as a infrastructure change ?

I'll try.

Thanks a lot for separating this :) both LGTM!

This revision is now accepted and ready to land.Jun 28 2020, 1:07 PM
fghanim requested changes to this revision.Jun 29 2020, 1:12 AM
fghanim added inline comments.
llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h
282

Could you explain what kind of overkill you see here?

Using an IRBuilder when an insertion point suffices

And maybe also the lots of trouble you expect from a builder versus a insertion point?

Currently, when using the OMPBuilder you need to juggle two IRBuilders, let's not make them 3 :)

It is the same information, just packed in a different struct, right?

no it is not. InsetrionPoint vs struct that contains Insertionpoint + other things. But that's beside the point

I mean, is the AllocaBuilder providing a new functionality other than a convenient way to keep track of the Alloca InsertionPoint?
Personally, I prefer some suitable structure to your needs and a stack (i.e. the way clang does it) . This should also resolve the todo below. Alternatively, pass an allocation point as an argument as suggested earlier. I am open to any third way if you have any.

This revision now requires changes to proceed.Jun 29 2020, 1:12 AM
jdoerfert marked an inline comment as done.Jun 29 2020, 5:00 PM
jdoerfert added inline comments.
llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h
282

Personally, I prefer some suitable structure to your needs and a stack (i.e. the way clang does it).

The stack is already here, implicitly, and actually on the stack. With
IRBuilder<>::InsertPointGuard AIPG(AllocaBuilder);
we have all the stack we need without leaking the state.
The updates to the insertion point are done for use and the AllocaBuilder is always ready to be used. No need to create a builder or update an insertion point (other than the places that create the new alloca insertion points). Note that the suitable structure is the builder, as it is the only interaction with the insertion point, e.g. we do not pass the insertion point to other functions.

InsetrionPoint vs struct that contains Insertionpoint + other things. But that's beside the point

I disagree. The "other things" is the only interaction we have with the insertion point. So storing or accepting an insertion point is fine for the only purpose of creating a Builder and interacting with the builder. The difference is that we need to add more churn to update and reset the insertion point (what happens behind the scenes with the one line I quoted above).

Alternatively, pass an allocation point as an argument as suggested earlier.

I did not go this route because I think it complicates the API with little gain. I was also unsure what the flang situation looks like here. Do we have an alloca insertion point there too?
At the end of the day, this makes it easier for the user.

fghanim added inline comments.Jul 3 2020, 1:26 AM
llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h
282

But we haven't solved a problem, we just kicked it down the road with the TODO you added, with the only trade-off being convenience now. The AllocaBuilder is not usable by the frontend - the frontend doesn't & shouldn't know or care about any builders other than its own. However, an insertion point can be passed to the frontend (i.e. bodyCB). So keeping track of an Alloca insertion point across nests -similar to how clang does it- is what we need. I still prefer having a stack to push to - maybe even piggy back on the finalization stack by creating a struct that contains all relevant info per nest level. But you don't like it, fine.

Then just pass an argument for allocaIP and be done with it. Flang (or clang or whichever other frontend) are required to emit allocas into function entry block by all LLVM passes that follow. As such, we can safely assume the outermost call is always going to have the AllocaIP in the entry block of the enclosing function. Which means the first time we call CreateParallel they can pass that as arg.. or pass an empty insertion point, in which case we already detect it.

Adding a builder for every insertion point that we need to pass across nests or regions is not something we want to do for the reasons I mentioned earlier. Whether we like it or not, it is a builder we have to juggle along the other two, no matter how much we try to specialize its use, or keep it "hidden".

Used an extra argument now. Please let me know if this is OK.

Thanks for the update. Just a couple of Nits, and a quick note

llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h
288

Nit: isn't this supposed to be part of one of the other patches you split of this?

llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
433–450

Here and elsewhere: You seem to have forgotten to undo the changes introduced by using a separate AllocaBuilder?

Also, Nit: before using AllocaIP, either add an assert that it is not empty, or alternatively, add logic to just set it to entry block of outerfn if it is

502

NIT: I think it would be better to give each a different name, to avoid unnecessary confusion. maybe OuterAllocaIP and innerAllocaIP? Also, we don't overwrite the outerAllocaIP in case it's needed later?

jdoerfert marked an inline comment as done.Jul 14 2020, 8:11 AM

I'll address the nits. Unsure if that is all. I also don't get the one comment.

llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
433–450

Here and elsewhere: You seem to have forgotten to undo the changes introduced by using a separate AllocaBuilder?

I don't get this. These changes are on purpose.

I'll address the nits.

Thanks :)

Unsure if that is all.

depends on the comment about AllocaBuilder

llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
433–450

Oh, my bad. I assumed that since we now pass allocaip to communicate where is the insertion point, using builder the way we used to is sufficient, and this was leftover code. So now what purpose does AllocaBuilder serve?

jdoerfert marked an inline comment as done.Jul 14 2020, 5:11 PM
jdoerfert added inline comments.
llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
433–450
  1. No switching the IP back and forth in one builder all the time.
  2. Showing by the name of the builder directly where the instructions will be created.
fghanim added inline comments.Jul 16 2020, 10:54 AM
llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
433–450

No switching the IP back and forth in one builder all the time.

  • Is this about something you expect in the future? because unless I am missing something, and looking at it side by side, I don't see any less setting /reseting of IPs going on, except in the new version we to set the debug info IP in allocabuilder in addition to in builder.
  • we don't use either AllocaIP after returning from bodyCB
  • Even if we did, both bodyCB, and FiniCB have IPGuards inside them.
  • Also, since IRBuilders do insertbefore an IP, both AllocaIP should be up-to-date except if the iterator for an IP gets invalidated, the risk of which is sort of proportional to increase in number of IRbuilders.

Showing by the name of the builder directly where the instructions will be created.

Wouldn't a comment be better?
Personally, I find it confusing when something called AllocaBuilder, is being used to emit loads and stores.

jdoerfert updated this revision to Diff 278952.Jul 17 2020, 6:20 PM

Reuse builder

fghanim accepted this revision.Jul 17 2020, 10:16 PM

Great. Thank you!
LGTM

This revision is now accepted and ready to land.Jul 17 2020, 10:16 PM
This revision was landed with ongoing or failed builds.Jul 30 2020, 8:21 AM
This revision was automatically updated to reflect the committed changes.