Page MenuHomePhabricator

[OpenMP][MLIR] WIP : Fix for AllocaIP
Needs ReviewPublic

Authored by kiranchandramohan on Oct 1 2020, 3:26 PM.

Details

Summary

Fix for nested parallel regions. Does the following,

  1. Switch to OpenMPIRBuilder version which maintains allocaIP. (An initial version of https://reviews.llvm.org/D82470)
  2. Create jump to continuation block by converting the terminator

Note:

  1. This patch should be applied after reverting 19756ef53a498b7aa1fbac9e3a7cd3aa8e110fad.
  2. This fix is required in the lowering of Master Op (https://reviews.llvm.org/D87247)

Diff Detail

Event Timeline

Herald added projects: Restricted Project, Restricted Project. · View Herald Transcript
kiranchandramohan requested review of this revision.Oct 1 2020, 3:26 PM
fghanim added a subscriber: fghanim.

After reading the patch for master you referred to, I don't understand why do we need the OMPBuilder to maintain the insertion point. As far as master is concerned, we will emit any alloca's contained inside its region into the entry block of the enclosing outlined region (e.g. innermost parallel).
FWIW, the master directive in clang already uses the OMPBuilder and just relies on clang to handle the insertion of any non-omp code (including alloca's). Is there a reason why a similar approach wouldn't work here?

If this is indeed needed for master, then please don't create extra IRBuilders needlessly. As you mentioned, D82470 had this exact approach implemented, and we ended up not going through with it.

llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
724–727

These are unrelated to maintaining alloca insertion point. I don't recall back then whether these were removed completely, or added as part of other (2?) patches that were split off of this one.

Thanks @fghanim for the review. While debugging the nested parallel region issues, we saw some difference between where the allocas are placed by the OpenMP IRBuilder in the clang usage and the MLIR usage. Moving to the version where the OpenMP IRBuilder maintains allocaIP fixed the difference. In MLIR module translation there is no alloca insertion point. So what we can provide as allocaIP is the current insertionPoint. Is that OK? What is the allocaIP used for? Why do we need a separate allocaIP, why cannot we treat it like a normal instruction? Is it because all alloca instructions should be together at the top of the function? The langref for alloca did not have such a requirement.

I have created another review without the alloca changes and that also works correctly for the nested parallel case.
https://reviews.llvm.org/D88720

@fghanim I just saw your following comment in https://reviews.llvm.org/D87247.
"All llvm passes expect Allocas to be in the entry block of the function. In this case, the soon-to-be-outlined region. is builder's current insertion point in the entry block? Also, is it guaranteed to not be empty?"

I belive all allocas in the LLVM dialect will also be in the entry block of the OpenMP operation. But these will added only in the bodygen call-back. So they will be added to omp.par.region (actually omp.par.region1 since a dummy branch is created). But all these are trivial branches (not conditional) and can't they be inlined into the entry block if required? See example below for,

llvm.func @test_omp_parallel_4() -> () {
  omp.parallel {
    omp.barrier
    omp.terminator
  }
  llvm.return
}
define internal void @test_omp_parallel_4..omp_par(i32* noalias %tid.addr, i32* noalias %zero.addr) #0 !dbg !10 {
omp.par.entry:
  %tid.addr.local = alloca i32, align 4
  %0 = load i32, i32* %tid.addr, align 4
  store i32 %0, i32* %tid.addr.local, align 4
  %tid = load i32, i32* %tid.addr.local, align 4
  br label %omp.par.region

omp.par.outlined.exit.exitStub:                   ; preds = %omp.par.pre_finalize
  ret void

omp.par.region:                                   ; preds = %omp.par.entry
  br label %omp.par.region1

omp.par.region1:                                  ; preds = %omp.par.region
  // ALLOCAS IN the OpenMP parallel region will appear HERE.
  %omp_global_thread_num2 = call i32 @__kmpc_global_thread_num(%struct.ident_t* @4)
  call void @__kmpc_barrier(%struct.ident_t* @3, i32 %omp_global_thread_num2)
  br label %omp.par.pre_finalize, !dbg !11

omp.par.pre_finalize:                             ; preds = %omp.par.region1
  br label %omp.par.outlined.exit.exitStub
}

Found that they should be in the entry block for optimizations. @fghanim is this what you are suggesting?
http://llvm.org/docs/Frontend/PerformanceTips.html#use-of-allocas

While debugging the nested parallel region issues, we saw some difference between where the allocas are placed by the OpenMP IRBuilder in the clang usage and the MLIR usage. Moving to the version where the OpenMP IRBuilder maintains allocaIP fixed the difference. In MLIR module translation there is no alloca insertion point. So what we can provide as allocaIP is the current insertionPoint.

Yes, this is the expected behavior. in clang we follow the convention of setting the insertion point for allocas to proper entry block, and let clang handle all of the none OMP code generation, and we let clang pass to us where it wants us to put its alloca instructions. Here, you are passing the current insertion point of your IRBuilder at the time the bodyCB is called, and all allocas are output there.

Is that OK?

Depends on your goal. is it illegal? no it is not. you can put allocas wherever you want
Does it adversely affect performance as you point out in a later comment? absolutely
Is it possible to have an effect on correctness? possible. Please check below for why I believe it may.

What is the allocaIP used for?

to specify where you want all allocas to go.

Why do we need a separate allocaIP, why cannot we treat it like a normal instruction?

two reasons:

  1. there are allocas that go inside the outlined region, and allocas that go outside of it, and we need to be able to chose when to codegen each.
  2. We try to generate allocas into the entry block of the relevant function. and treating it like a normal instruction means we cannot.

Is it because all alloca instructions should be together at the top of the function?

yes, but as I mention above, not just that.

The langref for alloca did not have such a requirement.

It is a requirement for optimization and possible the backend, not for correctness of IR which is what the langref is concerned about.

I have created another review without the alloca changes and that also works correctly for the nested parallel case.
https://reviews.llvm.org/D88720

I saw that this patch resolved the problem you were seeing. I am glad it did. It worked, because, as I and JDoerfert pointed out in the original patch, your problem had nothing to do with alloca locations but has everything to do with your outlined region not being completely contained within the entry and exit blocks of the parallel region. D88720 seems to have fixed that.

I belive all allocas in the LLVM dialect will also be in the entry block of the OpenMP operation. But these will added only in the bodygen call-back. So they will be added to omp.par.region (actually omp.par.region1 since a dummy branch is created). But all these are trivial branches (not conditional) and can't they be inlined into the entry block if required? See example below for,
{ ... code ...}

Sure. but what happens when you generate Copyin for example? the allocas in the body are very likely going to end up after the copyin CFG structure, which means these are not going to be in the entry block. (to get what I mean check the createcopyinblocks in the OMPBuilder)
Furthermore, while there is the CFGSimplify pass which should remove all extra branches, are you guaranteed to run that everytime? what happens when you run the frontend with -O0?

Here is a question regarding allocas in the llvm dialect; by the time they are translated to llvm ir proper, are they already located in the entry block in the dialect itself, or are they located in different places but you have something like an AllocaIP where you move allocas to entry block during the translation to proper llvm IR?
What is FIR planning to do about their stack allocations? if they are not guaranteeing it's in the entry BB then that's an issue.
is OMP-IR meant to also work alongside a future C/C++ dialect? If yes, then clang people are very unlikely to be ok with not having allocas in the entry block, and OMP-IR needs to follow the conventions of the relevant frontend when it comes variable declarations, etc., regardless of what that frontend is.

Found that they should be in the entry block for optimizations. @fghanim is this what you are suggesting?
http://llvm.org/docs/Frontend/PerformanceTips.html#use-of-allocas

mem2reg pass will look for all allocas in a function and move them into entry block. However, when you run a frontend with -O0, we don't run any optimization passes. I remember working with llvm passes that ignored Allocas not in the entry block. so yes, it is about optimizations but not only that. Many of the various backends take advantage of the fact that allocas are in the entry block to reserve stack space upon entry into a function. Unless you can guarantee that everyone of those has a way to handle not having all allocas in entry block, we should guarantee that for them - at which point it's a correctness issue.


To be clear; I am NOT against keeping track of Alloca insertion points in the OMPBuilder if there is a reason to. As can be seen in D82470 where I suggested multiple other ways to do so. However, I have a huge problem with creating a special IRBuilder just to do so.
I do have a small preference towards not keeping state of anything that we don't need, just out of consistency with other IRBuilders (i.e. you tell an IRBuilder what do you want it to do and where to do it, rather than it knowing that).

SouraVX added inline comments.Oct 4 2020, 11:02 PM
llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
724

This must be un-intentional! This is what asserted in first place and present in trunk.

SouraVX added a subscriber: schweitz.EditedOct 5 2020, 12:40 AM

Here is a question regarding allocas in the llvm dialect; by the time they are translated to llvm ir proper, are they already located in the entry block in the dialect itself, or are they located in different places but you have something like an AllocaIP where you move allocas to entry block during the translation to proper llvm IR?
What is FIR planning to do about their stack allocations? if they are not guaranteeing it's in the entry BB then that's an issue.

AFAIK both of these cases we end up having alloca in the entry block.
@schweitz can confirm this WRT FIR.

Overall I think this might be a phase ordering problem, i.e we are outlining late(every thing is already at LLVMIR Dialect level) we've lost the insertionPoint abstraction.
clang handle all this differently.

Thanks @fghanim for the detailed response. Very helpful.

As @SouraVX is suggesting, when FIR is created I believe there are entry blocks where allocas are placed. These would be converted to llvm dialect allocas in MLIR. Since blocks are translated in topological order, the entry block would have been processed (converted to LLVM IR allocas) when we reach an OpenMP operation. So I believe a location in the entry block (last alloca, or first alloca) can be passed as the outer allocaIP. We can also either structure the region inside the OpenMP operation to have an entry block and pass it to the OpenMP IRBuilder as the inner allocaIP.

schweitz added a comment.EditedOct 5 2020, 4:03 PM

fir.alloca ops should be hoisted to the entry block. Because Fortran is pass-by-reference, correctness will often simply require stack allocations. However, that said, in cases where alloca ops can be promoted to registers, they will be although that is disabled at the moment.

fir.alloca ops should be hoisted to the entry block. Because Fortran is pass-by-reference, correctness will often simply require stack allocations. However, that said, in cases where alloca ops can be promoted to registers, they will be although that is disabled at the moment.

I guess this is "often" correct but that is beyond the point. The OpenMP-IR-Builder introduces the allocas in question and they cannot go into the function entry block. That is simply not sound.
The allocas need to be placed at the last alloca point provided by the OpenMP-IR-Builder since those points will become the entry blocks of new functions and those entry blocks might be executed by more threads making "pass-by-reference" reuse unsound.

That said, I doubt that the allocas caused by "user Fortran code" can *always* go into the function entry either, though that is a discussion for another day.

kiranchandramohan retitled this revision from [OpenMP][MLIR] WIP : Fix for nested parallel region to [OpenMP][MLIR] WIP : Fix for AllocaIP.Fri, Nov 20, 7:28 AM