This is an archive of the discontinued LLVM Phabricator instance.

[LoopInterchange] Allow some loops with PHI nodes in the exit block.
ClosedPublic

Authored by fhahn on Feb 13 2018, 7:36 AM.

Details

Summary

We currently support LCSSA PHI nodes in the outer loop exit, if their
incoming values do not come from the outer loop latch or if the
outer loop latch has a single predecessor. In that case, the outer loop latch
will be executed only if the inner loop gets executed. If we have multiple
predecessors for the outer loop latch, it may be executed even if the inner
loop does not get executed.

This is a first step to support the case described in
https://bugs.llvm.org/show_bug.cgi?id=30472

Diff Detail

Event Timeline

Can you given an example which isn't safe to interchange due to a PHI node in the outer loop's exit block? I'm not following the logic here.

Thank you very much for having a look Eli. PHI nodes in the outer loop exit block are used as a proxy for uses of values produced in the loop nest. As we are doing interchanging ATM, I think it is only safe to interchange, if the values used outside of the loop nest are produced in the inner loop. Before and after interchanging, the inner loop block will be executed if both inner and outer loop conditions are true.

The same might not be true for values produced in the outer loop I think. For example, if the inner loop body is never executed. Consider the example below, where the outer loop induction variable is returned. In case br i1 undef, label %for.body4, label %for.cond.cleanup3 is always false, we would execute the outer loop latch 1024 times and return 1024. In the interchanged version, we would never execute the outer loop latch, and it is not clear what the values the PHI node should use I think.

Please let me know if that makes sense :)

define i64 @no_deps_interchange([1024 x i32]* nocapture %Arr, i32 %k, i64 %B) {
entry:
  br label %for.body

for.body:                                         ; preds = %entry, %for.cond.cleanup3
  %indvars.iv19 = phi i64 [ 0, %entry ], [ %indvars.iv.next20, %for.cond.cleanup3 ]
  br i1 undef, label %for.body4, label %for.cond.cleanup3

for.body4:                                        ; preds = %for.body, %for.body4
  %indvars.iv = phi i64 [ 0, %for.body ], [ %indvars.iv.next, %for.body4 ]
  %arrayidx6 = getelementptr inbounds [1024 x i32], [1024 x i32]* %Arr, i64 %indvars.iv, i64 %indvars.iv19
  store i32 0, i32* %arrayidx6, align 4
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %exitcond = icmp ne i64 %indvars.iv.next, %B
  br i1 %exitcond, label %for.body4, label %for.cond.cleanup3

for.cond.cleanup3:                                ; preds = %for.body4
  %indvars.iv.next20 = add nuw nsw i64 %indvars.iv19, 1
  %exitcond21 = icmp ne i64 %indvars.iv.next20, 1024
  br i1 %exitcond21, label %for.body, label %for.cond.cleanup


for.cond.cleanup:                                 ; preds = %for.cond.cleanup3
  ret i64 %indvars.iv.next20
}

Which, given we allow PHI nodes with values produced in the outer loop latch, produces the following code after interchanging

define i64 @no_deps_interchange([1024 x i32]* nocapture %Arr, i32 %k, i64 %B) {
entry:
  br label %for.body4.preheader

for.body.preheader:                               ; preds = %for.body4
  br label %for.body

for.body:                                         ; preds = %for.body.preheader, %for.cond.cleanup3
  %indvars.iv19 = phi i64 [ %indvars.iv.next20, %for.cond.cleanup3 ], [ 0, %for.body.preheader ]
  br i1 undef, label %for.body4.split1, label %for.cond.cleanup

for.body4.preheader:                              ; preds = %entry
  br label %for.body4

for.body4:                                        ; preds = %for.body4.preheader, %for.body4.split
  %indvars.iv = phi i64 [ %indvars.iv.next, %for.body4.split ], [ 0, %for.body4.preheader ]
  br label %for.body.preheader

for.body4.split1:                                 ; preds = %for.body
  %arrayidx6 = getelementptr inbounds [1024 x i32], [1024 x i32]* %Arr, i64 %indvars.iv, i64 %indvars.iv19
  store i32 0, i32* %arrayidx6, align 4
  br label %for.cond.cleanup3.loopexit

for.body4.split:                                  ; preds = %for.cond.cleanup3
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %exitcond = icmp ne i64 %indvars.iv.next, %B
  br i1 %exitcond, label %for.body4, label %for.cond.cleanup

for.cond.cleanup3.loopexit:                       ; preds = %for.body4.split1
  br label %for.cond.cleanup3

for.cond.cleanup3:                                ; preds = %for.cond.cleanup3.loopexit
  %indvars.iv.next20 = add nuw nsw i64 %indvars.iv19, 1
  %exitcond21 = icmp ne i64 %indvars.iv.next20, 1024
  br i1 %exitcond21, label %for.body, label %for.body4.split

for.cond.cleanup:                                 ; preds = %for.body4.split, %for.body
  %indvars.iv.next20.lcssa = phi i64 [ %indvars.iv.next20, %for.body4.split ]  ; invalid PHI node
  ret i64 %indvars.iv.next20.lcssa
}

So the problem is specifically with values produced in the loop latch? That makes sense, I guess, but it seems like you could check for that more directly. (It might also be possible to hoist the instruction out of the latch, but I guess that's complicated in general.)

fhahn added a comment.Mar 12 2018, 4:52 PM

So the problem is specifically with values produced in the loop latch? That makes sense, I guess, but it seems like you could check for that more directly. (It might also be possible to hoist the instruction out of the latch, but I guess that's complicated in general.)

Yes, I think currently only values produced in the outer loop latch can be a problem, because the outer loop latch must be the inner loop exit block at the moment IIRC. Although if we loosen that restriction at some point, values in other blocks of the outer loop could become problematic as well.

I guess we could simplify the check by just checking if any incoming values to the phis in the exit block are non-phi nodes in the outer loop latch and prevent interchanging in that case. That way, we would also allow values produced in the outer loop header, which should be fine. With some assertions, ensuring the outer loop latch's predecessors are only blocks exiting the inner loop or the outer loop header. Is that along the lines you were thinking?

There's potential follow-up work to loosen that restriction if we know that the inner loop body gets executed at least once or if we can hoist the offending statements.

I guess, something like that, yes.

fhahn updated this revision to Diff 138244.Mar 13 2018, 12:35 PM
fhahn edited the summary of this revision. (Show Details)

Made check slightly more direct. I went ahead and limited it loop latches with a single predecessor, because we need additional logic to update the PHI nodes in that case. The LI tests also changed a bit after D35430, so I updated the tests this patch touches.

This revision is now accepted and ready to land.Mar 23 2018, 3:35 PM
fhahn added a comment.Apr 3 2018, 7:21 AM

Thank you very much for having a look Eli! I have submitted D45207 and D45206 for review, to restore most of the loop interchange test coverage after D35430 when in. I will commit this change after the test changes went in

fhahn added a comment.Apr 5 2018, 7:36 AM

After the recent changes I committed, the test-suite passes with loopinterchange enabled + LTO. It looks like this patch breaks a few benchmarks again. I will look into that before submitting it

fhahn updated this revision to Diff 144143.Apr 26 2018, 9:46 AM

Sorry for the delay! Rebased after recent LoopInterchange changes.

There has been a minor change to fix a compilation issue with the test suite: floating point LCSSA phi nodes are not supported, as they currently are the only way to detect FP reductions.

I plan to commit this tomorrow, in case there are no further objections.

This revision was automatically updated to reflect the committed changes.
lib/Transforms/Scalar/LoopInterchange.cpp