Page MenuHomePhabricator

[mlir] support collapsed loops in OpenMP-to-LLVM translation
ClosedPublic

Authored by ftynse on Jul 9 2021, 8:49 AM.

Diff Detail

Event Timeline

ftynse created this revision.Jul 9 2021, 8:49 AM
ftynse requested review of this revision.Jul 9 2021, 8:49 AM
wsmoses added a subscriber: wsmoses.Jul 9 2021, 8:54 AM
jdoerfert added inline comments.Jul 10 2021, 9:21 AM
mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
316

Maybe compute the location first, the push is complex enough.
doNothingBodyGen should probably check loopInfos size and in the innermost case
execute bodyGen for the innermost loop. We can deal with non-perfectly nested ones later
a TODO would suffice for now I guess.

ftynse updated this revision to Diff 358954.Jul 15 2021, 7:00 AM
ftynse marked an inline comment as done.

Update

mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
316

MLIR modeling won't let you have non-perfectly nested collapsed loops at the moment.

ftynse retitled this revision from [mlir] WIP: support collapsed loops in OpenMP-to-LLVM translation to [mlir] support collapsed loops in OpenMP-to-LLVM translation.Jul 15 2021, 7:01 AM

@MatsPetersson is working on the Flang side flow of collapse. Adding him as a reviewer.

mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
316

We will probably sink the code to the body in such situations in the frontend in the Flang flow.

mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
254–255

Am I missing something here? When I try using collapsed loops from flang, this catches me, and my compilation fails...

Code on line 261 seems to imply it isn't intended to be 1 always?

ftynse marked an inline comment as done.Jul 16 2021, 6:28 AM
ftynse added inline comments.
mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
254–255

Just a misrebase, this should have been removed.

mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
277–286

It's getting late on a Friday, but is is possible that this is only mapping a single value?

I'm seeing a crash from translating this:

omp.wsloop (%arg0, %arg1, %arg2) : i32 = (%4, %4, %4) to (%28, %29, %30) step (%4, %4, %4) collapse(3) inclusive {
  %31 = llvm.load %20 : !llvm.ptr<i32>
  %32 = llvm.add %31, %arg0  : i32
  %33 = llvm.add %32, %arg1  : i32
  %34 = llvm.add %33, %arg2  : i32
  llvm.store %34, %20 : !llvm.ptr<i32>
  omp.yield
}

The crash happens when it gets to the second add, because arg1 can't be found and a null pointer is returned, which we try to get a type from.

Sorry if I'm completely off-track here, I'm about to go home, and I haven't stepped through all the code to figure out what is really going on. The crash happens on a call from line 271 below - but there's about 20 layers of call-stack to the crash-point...

I'm not working on Monday-Tuesday, but will be back here on Wednesday.

Meinersbur added a comment.EditedJul 23 2021, 1:25 PM

The use of OpenMPIRBuilder::collapseLoops looks fine, but don't really know the omp.wsloop modelling.

No unit test coming with this?

Sorry for the rather lengthy comment. I am not sure if I'm expecting the wrong thing, I'm doing something wrong, or this patch is incorrect, so I'll try to put as much information into this comment as I can.
Any suggestions, hints or similar would be welcome. I fully expect some of my workarounds and fixes to be called out as wrong - and I don't mean my global variable, which is clearly just a workaround for not understanding how to get the correct value out of where it is known.

Background: I'm currently implement collapse in the Flang lowering to MLIR. I _think_ this is working correctly (I do not have a public patch at this time, as I was hoping to have a least some sort of functional test as part of that patch).

So, the first issue, which I did mention in a comment on line 242, I worked around the failure to find symbols with this patch:

index 1d55d0230b8f..f3bb23520d87 100644
--- a/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
+++ b/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
@@ -213,8 +213,8 @@ convertOmpWsLoop(Operation &opInst, llvm::IRBuilderBase &builder,
   if (loop.lowerBound().empty())
     return failure();
 
-  if (loop.getNumLoops() != 1)
-    return opInst.emitOpError("collapsed loops not yet supported");
+  //  if (loop.getNumLoops() != 1)
+  //    return opInst.emitOpError("collapsed loops not yet supported");
 
   // Static is the default.
   omp::ClauseScheduleKind schedule = omp::ClauseScheduleKind::Static;
@@ -236,13 +236,15 @@ convertOmpWsLoop(Operation &opInst, llvm::IRBuilderBase &builder,
   SmallVector<llvm::CanonicalLoopInfo *> loopInfos;
   LogicalResult bodyGenStatus = success();
   auto bodyGen = [&](llvm::OpenMPIRBuilder::InsertPointTy ip, llvm::Value *iv) {
-    if (loopInfos.size() != loop.getNumLoops() - 1)
+    auto index = loopInfos.size();
+    // Make sure further conversions know about the induction variable.
+    moduleTranslation.mapValue(loop.getRegion().front().getArgument(index), iv);
+
+    if (index != loop.getNumLoops() - 1)
       return;
 
     llvm::IRBuilder<>::InsertPointGuard guard(builder);
 
-    // Make sure further conversions know about the induction variable.
-    moduleTranslation.mapValue(loop.getRegion().front().getArgument(0), iv);
 
     llvm::BasicBlock *entryBlock = ip.getBlock();
     llvm::BasicBlock *exitBlock =

This solves the crash that I talked about om the earlier comment, but then exposes a different problem. Calling collapseLoops in convertOmpWsLoop sets the AfterIP to the wrong place - this causes incorrect LLVM-IR to be generated - an empty block and one with two unconditional branches in a row, like this:

  br label %omp_collapsed.after, !dbg !23

omp_collapsed.after:                              ; preds = %omp_collapsed.exit
  br label %omp.par.pre_finalize, !dbg !24
  br label %omp_loop.after, !dbg !23

omp_loop.after:                                   ; preds = %omp_collapsed.after

omp.wsloop.region:                                ; preds = %omp_loop.body17

where, I think, the first branch in omp_collapsed.after should be in omp_loop.after. The incorrect code is causing a crash in the findAllocas function, because some function is returning an empty set of blocks, and the next layer of code expects that to be non-empty, and falls over on the null-pointer returned when it IS empty.

I added a very ugly hack to store the OrigAfter in the collapseLoops into a global variable, and them in the end of convertOMPWsLoop set the AfterIP to the relevant stored value.

diff --git a/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp b/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
index 43c6dd9ab997..cbaa77360cd5 100644
--- a/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
+++ b/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
@@ -32,6 +32,8 @@
 using namespace llvm;
 using namespace omp;
 
+BasicBlock *GlobalAfter;
+
 static cl::opt<bool>
     OptimisticAttributes("openmp-ir-builder-optimistic-attributes", cl::Hidden,
                          cl::desc("Use optimistic attributes describing "
@@ -1566,6 +1568,7 @@ OpenMPIRBuilder::collapseLoops(DebugLoc DL, ArrayRef<CanonicalLoopInfo *> Loops,
   CanonicalLoopInfo *Innermost = Loops.back();
   BasicBlock *OrigPreheader = Outermost->getPreheader();
   BasicBlock *OrigAfter = Outermost->getAfter();
+  GlobalAfter = OrigAfter;
   Function *F = OrigPreheader->getParent();
 
   // Setup the IRBuilder for inserting the trip count computation.
diff --git a/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp b/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
index f3bb23520d87..c9d83358de62 100644
--- a/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
+++ b/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
@@ -347,6 +347,12 @@ convertOmpWsLoop(Operation &opInst, llvm::IRBuilderBase &builder,
         ompLoc, loopInfo, allocaIP, schedType, !loop.nowait(), chunk);
   }
 
+  // HACK!
+  extern llvm::BasicBlock *GlobalAfter;
+  if (GlobalAfter) {
+         afterIP = {GlobalAfter, GlobalAfter->begin()};
+         GlobalAfter = nullptr;
+  }
   // Continue building IR after the loop.
   builder.restoreIP(afterIP);
   return success();

This allows the code to get a bit further, but it's still not right, and bails out with this:

Instruction does not dominate all uses!
  %omp_loop.tripcount2 = select i1 %37, i32 0, i32 %39, !dbg !29
  %22 = mul nuw i32 %omp_loop.tripcount, %omp_loop.tripcount2
Instruction does not dominate all uses!
  %omp_loop.tripcount13 = select i1 %43, i32 0, i32 %45, !dbg !29
  %23 = mul nuw i32 %22, %omp_loop.tripcount13
Instruction does not dominate all uses!
  %omp_loop.tripcount13 = select i1 %43, i32 0, i32 %45, !dbg !29
  %30 = urem i32 %29, %omp_loop.tripcount13, !dbg !29
Instruction does not dominate all uses!
  %omp_loop.tripcount13 = select i1 %43, i32 0, i32 %45, !dbg !29
  %31 = udiv i32 %29, %omp_loop.tripcount13, !dbg !29
Instruction does not dominate all uses!
  %omp_loop.tripcount2 = select i1 %37, i32 0, i32 %39, !dbg !29
  %32 = urem i32 %31, %omp_loop.tripcount2, !dbg !29
Instruction does not dominate all uses!
  %omp_loop.tripcount2 = select i1 %37, i32 0, i32 %39, !dbg !29
  %33 = udiv i32 %31, %omp_loop.tripcount2, !dbg !29
Instruction does not dominate all uses!
  %51 = add i32 %50, 1
  %53 = add i32 %52, %51, !dbg !32
Instruction does not dominate all uses!
  %49 = add i32 %48, 1
  %54 = add i32 %53, %49, !dbg !33
error: could not emit LLVM-IR

I'm not sure if there's something wrong in the collapseLoop or in the way that it is used?

For reference: the MLIR used to reproduce this (I don't think this needs any of my code to process):

llvm.func @_QQmain() {
  %0 = llvm.mlir.constant(3 : i32) : i32
  %1 = llvm.mlir.constant(2 : i32) : i32
  %2 = llvm.mlir.constant(5 : i32) : i32
  %3 = llvm.mlir.constant(0 : i32) : i32
  %4 = llvm.mlir.constant(1 : i32) : i32
  %5 = llvm.mlir.constant(-1 : i32) : i32
  %6 = llvm.mlir.constant(19 : i32) : i32
  %7 = llvm.mlir.constant(1 : i64) : i64
  %8 = llvm.alloca %7 x i32 {bindc_name = "a", in_type = i32, operand_segment_sizes = dense<0> : vector<2xi32>, uniq_name = "_QEa"} : (i64) -> !llvm.ptr<i32>
  %9 = llvm.mlir.constant(1 : i64) : i64
  %10 = llvm.alloca %9 x i32 {bindc_name = "b", in_type = i32, operand_segment_sizes = dense<0> : vector<2xi32>, uniq_name = "_QEb"} : (i64) -> !llvm.ptr<i32>
  %11 = llvm.mlir.constant(1 : i64) : i64
  %12 = llvm.alloca %11 x i32 {bindc_name = "c", in_type = i32, operand_segment_sizes = dense<0> : vector<2xi32>, uniq_name = "_QEc"} : (i64) -> !llvm.ptr<i32>
  %13 = llvm.mlir.constant(1 : i64) : i64
  %14 = llvm.alloca %13 x i32 {bindc_name = "i", in_type = i32, operand_segment_sizes = dense<0> : vector<2xi32>, uniq_name = "_QEi"} : (i64) -> !llvm.ptr<i32>
  %15 = llvm.mlir.constant(1 : i64) : i64
  %16 = llvm.alloca %15 x i32 {bindc_name = "j", in_type = i32, operand_segment_sizes = dense<0> : vector<2xi32>, uniq_name = "_QEj"} : (i64) -> !llvm.ptr<i32>
  %17 = llvm.mlir.constant(1 : i64) : i64
  %18 = llvm.alloca %17 x i32 {bindc_name = "k", in_type = i32, operand_segment_sizes = dense<0> : vector<2xi32>, uniq_name = "_QEk"} : (i64) -> !llvm.ptr<i32>
  %19 = llvm.mlir.constant(1 : i64) : i64
  %20 = llvm.alloca %19 x i32 {bindc_name = "x", in_type = i32, operand_segment_sizes = dense<0> : vector<2xi32>, uniq_name = "_QEx"} : (i64) -> !llvm.ptr<i32>
  llvm.store %0, %8 : !llvm.ptr<i32>
  llvm.store %1, %10 : !llvm.ptr<i32>
  llvm.store %2, %12 : !llvm.ptr<i32>
  llvm.store %3, %20 : !llvm.ptr<i32>
  omp.parallel {
    %28 = llvm.load %8 : !llvm.ptr<i32>
    %29 = llvm.load %10 : !llvm.ptr<i32>
    %30 = llvm.load %12 : !llvm.ptr<i32>
    omp.wsloop (%arg0, %arg1, %arg2) : i32 = (%4, %4, %4) to (%28, %29, %30) step (%4, %4, %4) collapse(3) inclusive {
      %31 = llvm.load %20 : !llvm.ptr<i32>
      %32 = llvm.add %31, %arg0  : i32
      %33 = llvm.add %32, %arg1  : i32
      %34 = llvm.add %33, %arg2  : i32
      llvm.store %34, %20 : !llvm.ptr<i32>
      omp.yield
    }
    omp.terminator
  }
  %21 = llvm.mlir.addressof @_QQcl.2E2F636F6C6C617073652E66393000 : !llvm.ptr<array<15 x i8>>
  %22 = llvm.bitcast %21 : !llvm.ptr<array<15 x i8>> to !llvm.ptr<i8>
  %23 = llvm.call @_FortranAioBeginExternalListOutput(%5, %22, %6) : (i32, !llvm.ptr<i8>, i32) -> !llvm.ptr<i8>
  %24 = llvm.load %20 : !llvm.ptr<i32>
  %25 = llvm.sext %24 : i32 to i64
  %26 = llvm.call @_FortranAioOutputInteger64(%23, %25) : (!llvm.ptr<i8>, i64) -> i1
  %27 = llvm.call @_FortranAioEndIoStatement(%23) : (!llvm.ptr<i8>) -> i32
  llvm.return
}
llvm.func @_FortranAioBeginExternalListOutput(i32, !llvm.ptr<i8>, i32) -> !llvm.ptr<i8> attributes {fir.io, fir.runtime, sym_visibility = "private"}
llvm.mlir.global linkonce constant @_QQcl.2E2F636F6C6C617073652E66393000() : !llvm.array<15 x i8> {
  %0 = llvm.mlir.constant("./collapse.f90\00") : !llvm.array<15 x i8>
  llvm.return %0 : !llvm.array<15 x i8>
}
llvm.func @_FortranAioOutputInteger64(!llvm.ptr<i8>, i64) -> i1 attributes {fir.io, fir.runtime, sym_visibility = "private"}
llvm.func @_FortranAioEndIoStatement(!llvm.ptr<i8>) -> i32 attributes {fir.io, fir.runtime, sym_visibility = "private"}

Calling collapseLoops in convertOmpWsLoop sets the AfterIP to the wrong place - this causes incorrect LLVM-IR to be generated - an empty block and one with two unconditional branches in a row, like this:

Where does AfterIP point to and where do you think it should point to after collapseLoops? Should the result of collapseLoops be changed or is it an issue of how convertOmpWsLoop uses CanonicalLoopInfo?

I would appreciate a test case to the patch that would help me reproduce the problem.

Calling collapseLoops in convertOmpWsLoop sets the AfterIP to the wrong place - this causes incorrect LLVM-IR to be generated - an empty block and one with two unconditional branches in a row, like this:

Where does AfterIP point to and where do you think it should point to after collapseLoops? Should the result of collapseLoops be changed or is it an issue of how convertOmpWsLoop uses CanonicalLoopInfo?

It points at omp_collapsed.after, I think it should point at omp_loop.after. This branch is inserted by convertOmpRegions on the second pass through (first one is the one that converts the body or some such - I'm still not 100% sure how all this hangs together to be honest).

I'm not at all sure this is right, but I'm just trying to make my stuff [which generates the MLIR as attached in the original long comment] work all the way to LLVM-IR.

Perhaps the problem is that we should have two different CanonicalLoopInfo variables, one for the outer scope, and one for the collapsed loops? Or maybe the problem is with my understanding of what is supposed to happen [although there appears to be a problem with the way that loops are collapsed and the arguments bound, as my initial attempt crashes during the code-generation just before collapseLoops.

I would appreciate a test case to the patch that would help me reproduce the problem.

I would love to have a better test than the MLIR I attached, but at present, that's as far as I have got. I'm currently just doing build/bin/tco collapse.mlir -o collapse.ll with the MLIR that I put at the end of the long comment.

Meinersbur added inline comments.Jul 27 2021, 12:45 PM
mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
371

I think this is wrong. The CanonicalLoopInfo returned by createStaticWorkshareLoop represents the inner loop over chunks, not the contruct as a whole. Don't use the returned CanonicalLoopInfo unless you want to process the loop. The workshare reagion itself ends after kmpc_for_static_fini which is not part of the inner canonical loop.

ftynse updated this revision to Diff 362501.Jul 28 2021, 12:42 PM
ftynse marked 3 inline comments as done.

Fix insertion points.

Perhaps the problem is that we should have two different CanonicalLoopInfo variables, one for the outer scope, and one for the collapsed loops? Or maybe the problem is with my understanding of what is supposed to happen [although there appears to be a problem with the way that loops are collapsed and the arguments bound, as my initial attempt crashes during the code-generation just before collapseLoops.

Yeah, what looks disturbing to me is the behavior of collapseLoops. I expected it to either mutate the outermost canonical loop, or create an entirely new loop and move the bodies of the old loop there. What it does, however, is to create an additional canonical loop half-way inside the outermost loop so that the canonical loop has an After block distinct from that of the outermost loop and, what's worse, preceding it in the control flow.

The rest, i.e. the dominance issues, is due to CanonicalLoopInfo::getBodyIP pointing to a different location (entry of the first body block) than the bodyIP of the body builder codegen (before the terminator of the first body block). This is not an issue if the loop is known to have a zero lower bound and a unit step, which was the case for my test cases.

mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
371

If you meant "returned by collapseLoops" instead of "returned by createStaticWorkshareLoop", then it was indeed wrong. createStaticWorkshareLoop returns a mutated canonical loop info that is perfectly fine to use here. collapseLoops does not, which was highly unintuitive for me, given that virtually all the other builder APIs return an object that contains the insertion point for next operations.

Thanks for the updated patch, this seems to work for my simple use-case.

Meinersbur added a comment.EditedJul 29 2021, 10:09 AM

Thank you for the test case

Before collapse:

After collapse:

After workshare:

Meinersbur added a comment.EditedJul 29 2021, 11:27 AM

I expected it to either mutate the outermost canonical loop, or create an entirely new loop and move the bodies of the old loop there. What it does, however, is to create an additional canonical loop half-way inside the outermost loop so that the canonical loop has an After block distinct from that of the outermost loop and, what's worse, preceding it in the control flow.

There might be interleaving code between the loops that cannot be discarded. The OpenMP spec basically allows it to be moved into the loop body of the collapsed loop (executing it multiple times per pre-collapse execution) which is what collapseLoop does. An alternative is to protect it with a conditional to only execute interleaving code when the inner loop's logical loop counter is zero.

The outermost CanonicalLoopInfo could have been reused, so could have the innermost. From an interface perspective, the caller should not make any assumptions on what happens with the original loop blocks (the ones that are under control of CanonicalLoopInfo) or the CanonicalLoopInfo itself.

The rest, i.e. the dominance issues, is due to CanonicalLoopInfo::getBodyIP pointing to a different location (entry of the first body block) than the bodyIP of the body builder codegen (before the terminator of the first body block). This is not an issue if the loop is known to have a zero lower bound and a unit step, which was the case for my test cases.

CanonicalLoopInfo::getBody has to return the block that follows the loop condition branch. If you want the original Body BB, store it in a variable and use that.

mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
371

I meant "returned by createStaticWorkshareLoop" (although I now think it is not the issue here).

A canonical loop is a construct that that executes the body a number of times without itself having any side-effects per OpenMP specification(*). This is not the case for calls to the runtime library such as @__kmpc_for_static_fini. Hence, it should NOT be part of the CanonicalLoopInfo and AfterIP return an InsertionPoint where there is no site-effect between the last execution of the body and whatever follows the loop.

Calls that take a CanonicalLoopInfo as input are different to the other Create Calls in that it does not insert something new at an InsertionPoint, but takes an existing piece that is already part of the control flow and transforms it. Additional code should be appended after the code that the original Create inserted, not where the code has been transformed.

Ity might have been better if we names those two kinds of Create methods difference. Such as InsertXYZ and ApplyXYZ.

(*) Per specification, if there are side-effects, it is unspecified when they happen or how often.

Thanks for the visualization! My question is basically why can't we coalesce omp_loop.preheader/omp_collapsed.preheader omp_collapsed.after/omp_loop.after pairs?

The outermost CanonicalLoopInfo could have been reused, so could have the innermost. From an interface perspective, the caller should not make any assumptions on what happens with the original loop blocks (the ones that are under control of CanonicalLoopInfo) or the CanonicalLoopInfo itself.

Yes, that's why it feels strange to me that I can just store the InsertPoint of the builder (or, alternatively, use OutermostCanonicalLoopInfo->getAfterIP()) if it points to the block managed by CanonicalLoopInfo that was passed into the call and therefore potentially invalidated.

CanonicalLoopInfo::getBody has to return the block that follows the loop condition branch. If you want the original Body BB, store it in a variable and use that.

It's the same block, but the insertion iterator is different. getBodyIP() returns the begin() of the block, but there may be a mul/add pair that computes the non-canonical induction variable.

mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
371

Yes, using a different naming scheme and a clear documentation that the previous insertion point should still be used sounds good.

My question is basically why can't we coalesce omp_loop.preheader/omp_collapsed.preheader omp_collapsed.after/omp_loop.after pairs?

Implementation-wise, the new preheader+after BBs are created by createLoopSkeleton and are just connected to the rest of the CFG.
omp_loop.preheader could be reused by appending instructions from omp_collapsed.preheader instead of creating that BB.
omp_loop.after could be reused instead of creating omp_collapsed.after by preprending instructions.
However, I don't see what gain the extra complexity would bring.

Yes, that's why it feels strange to me that I can just store the InsertPoint of the builder (or, alternatively, use OutermostCanonicalLoopInfo->getAfterIP()) if it points to the block managed by CanonicalLoopInfo that was passed into the call and therefore potentially invalidated.

Insertion points to user code should not be invalidated.
The After (and Body and Preheader) BBs contain user-code (Preheader=any code before the canonical loop; After=any code after the loop; Body=iteration code) and therefore are not under control of CanonicalLoopInfo. My goal was to not modify these except to connect them to new BBs. This should ensure that all InsertionPoints into these BBs remain valid.

At this point, I think it is less risky to not modify existing block and just redirect branch instructions. For instance, someone has an InsertionPoint to the terminator of omp_loop.preheader. I.e. it will insert instructions to the end of the preheader. If collapseLoop would insert instructions to the preheader (calculate the new trip count) and necessarily before the branch instruction (if it did not replace the branch instruction itself which would invalidate the InsertionPoint), the InsectionPoint would insert after the collapseLoop's control instruction when the original meaning was to insert before the loop and its control statements.

In case of collapseLoop, these control instructions should be sideeffect-free so it may not matter, but for CreateWorkshareLoop, an InsertionPoint before the loop should point continue to be before any control instructions intruduced, especially __kmpc_for_static_init.

CanonicalLoopInfo::getBody has to return the block that follows the loop condition branch. If you want the original Body BB, store it in a variable and use that.

It's the same block, but the insertion iterator is different. getBodyIP() returns the begin() of the block, but there may be a mul/add pair that computes the non-canonical induction variable.

It is the begin of the transformed canonical loop. If access to the (untransformed) induction variables is needed, use the Body BB from the loop that introduced it.

Anything else would not be composable. Let's say we apply one more transformation on the collapsed loop, say unrolling. For unrolling the only relevant loop induction variable is the one introduced by collapseLoop, starting to be valid at the new BodyIP. From that standpoint evertyhing from Body to Latch is "user code", what original induction variables have been computed from it is not its concern. Unrolling will also need to insert code that computes the value of the collapsed induction variable from the induction variable of the unrolled loop (after duplicating it). It needs to insert that before any code that uses the collapsed induction variable (and then do a RUOW) which is represented by ... getBodyIP().

If this is still unclear, let's have a conference call conversation.

ftynse added a comment.Aug 2 2021, 5:26 AM

Yes, that's why it feels strange to me that I can just store the InsertPoint of the builder (or, alternatively, use OutermostCanonicalLoopInfo->getAfterIP()) if it points to the block managed by CanonicalLoopInfo that was passed into the call and therefore potentially invalidated.

Insertion points to user code should not be invalidated.
The After (and Body and Preheader) BBs contain user-code (Preheader=any code before the canonical loop; After=any code after the loop; Body=iteration code) and therefore are not under control of CanonicalLoopInfo. My goal was to not modify these except to connect them to new BBs. This should ensure that all InsertionPoints into these BBs remain valid.

At this point, I think it is less risky to not modify existing block and just redirect branch instructions. For instance, someone has an InsertionPoint to the terminator of omp_loop.preheader. I.e. it will insert instructions to the end of the preheader. If collapseLoop would insert instructions to the preheader (calculate the new trip count) and necessarily before the branch instruction (if it did not replace the branch instruction itself which would invalidate the InsertionPoint), the InsectionPoint would insert after the collapseLoop's control instruction when the original meaning was to insert before the loop and its control statements.

It would be great if these expectations had been documented next to CanonicalLoopInfo, especially what is considered "user code" and what is not.

CanonicalLoopInfo::getBody has to return the block that follows the loop condition branch. If you want the original Body BB, store it in a variable and use that.

It's the same block, but the insertion iterator is different. getBodyIP() returns the begin() of the block, but there may be a mul/add pair that computes the non-canonical induction variable.

It is the begin of the transformed canonical loop. If access to the (untransformed) induction variables is needed, use the Body BB from the loop that introduced it.

Anything else would not be composable. Let's say we apply one more transformation on the collapsed loop, say unrolling. For unrolling the only relevant loop induction variable is the one introduced by collapseLoop, starting to be valid at the new BodyIP. From that standpoint evertyhing from Body to Latch is "user code", what original induction variables have been computed from it is not its concern. Unrolling will also need to insert code that computes the value of the collapsed induction variable from the induction variable of the unrolled loop (after duplicating it). It needs to insert that before any code that uses the collapsed induction variable (and then do a RUOW) which is represented by ... getBodyIP().

Ditto, it's better when this reasoning is provided in the inline documentation.

Let me work on improving the documentation. For "user-code", I didn't want to commit to what InsertPoint guarantees to make, as seen by CanonicalLoopInfo::collectControlBlocks which only excludes the Body block. I expected this to be developed over time. I didn't not expect people to use CanonicalLoopInfo for code regions that are not canonical loops.

ftynse added a comment.Aug 3 2021, 1:15 AM

Let me work on improving the documentation.

Thanks!

For "user-code", I didn't want to commit to what InsertPoint guarantees to make, as seen by CanonicalLoopInfo::collectControlBlocks which only excludes the Body block. I expected this to be developed over time.

IMO, it's totally fine to say "currently, this does <...>, but this may change in the future".

Do you think this patch is okay to go or do you see other issues?

Do you think this patch is okay to go or do you see other issues?

There is this one issue of not using stale CanonicalLoopInfo.

mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
371

Please don't any of the loopInfos anymore they have been consumed by collapseLoops. With the documentation patch I am adding and invalidation flag that ensures that the CanonicalLoopInfo is not used anymore after it does not exist anymore in the IR.

ftynse marked an inline comment as done.Aug 4 2021, 4:01 AM
This revision is now accepted and ready to land.Aug 5 2021, 3:23 PM
This revision was landed with ongoing or failed builds.Aug 6 2021, 8:13 AM
This revision was automatically updated to reflect the committed changes.