Page MenuHomePhabricator

[LoopUnrollAndJam] Correctly update LoopInfo when unroll and jam more than 2-levels loop nests.
Needs ReviewPublic

Authored by Whitney on Jan 21 2020, 12:01 PM.

Details

Summary

Before unroll and jam:

for i
  for j
    for k
      A[i][j][k] = 0

After unroll and jam loop-i by a factor of 2:

for i +=2
  for j {
    for k
      A[i][j][k] = 0
    for k'
      A[i+1][j][k] = 0
  }

Notice that there exists a new child loop loop-k' of loop-j, after unroll and jam loop-i by 2.
This patch correctly update LoopInfo with new loops created during unroll and jamming.

Side discussion:
With the example above, we can see that changing the order of how loops are traverse in a loop nest doesn't actually solve the only one subloop limitation. We need to think of other ways to have more loops able to unroll and jam. One way is try to fuse subloops before giving up. Another way could be unroll and jam the whole loop nest in one attempt, i.e. no need to create the loop-k'. Other ideas are welcome.

Diff Detail

Event Timeline

Whitney created this revision.Jan 21 2020, 12:01 PM

Does it make sense to split the cloneBasicBlocksInLoop changes and review them separatly? It would make the diff smaller and easier to read.

llvm/test/Transforms/LoopUnrollAndJam/loopnest.ll
7

I would generate the check lines with the update_checks script. This doesn't test much.

Whitney marked an inline comment as done.Jan 21 2020, 8:53 PM

Does it make sense to split the cloneBasicBlocksInLoop changes and review them separatly? It would make the diff smaller and easier to read.

Sounds good, I will move all simple changes unrelated to cloneBasicBlocksInLoop to another review.

llvm/test/Transforms/LoopUnrollAndJam/loopnest.ll
7

I tried to follow the same pattern as the other unroll and jam lit tests. The main purpose of this test is to check if loop info verified correctly. I will update the test with check lines generated by use update_checks script.

dmgreen added inline comments.Jan 22 2020, 5:59 AM
llvm/lib/Transforms/Utils/LoopUnrollAndJam.cpp
626

I think the #ifndef NDEBUG checks below will re-use L after it has been deleted.

llvm/test/Transforms/LoopUnrollAndJam/loopnest.ll
7

Maybe just show control flow, like the test that was changed below?

Whitney updated this revision to Diff 239587.Jan 22 2020, 7:17 AM

Addressed Johannes's comments.

jdoerfert added inline comments.Jan 22 2020, 8:18 AM
llvm/test/Transforms/LoopUnrollAndJam/loopnest.ll
7

Maybe just show control flow, like the test that was changed below?

The problem there is that the conditions are missing and you don't know where the loop bodies end up.

It's debatable what is best but this representation has some advantages when it comes to changes, it makes them automatically applicable and explicitly shows everything.

That said, I'm fine with manually cutting it down if ppl prefer that.

Whitney marked 4 inline comments as done.Jan 22 2020, 8:27 AM
Whitney added inline comments.
llvm/lib/Transforms/Utils/LoopUnrollAndJam.cpp
626

Good catch, will address this concern in https://reviews.llvm.org/D73204

llvm/test/Transforms/LoopUnrollAndJam/loopnest.ll
7

How about I only keep the control flow, conditions, and phi nodes?

fhahn added a subscriber: fhahn.Jan 22 2020, 8:43 AM
fhahn added inline comments.
llvm/test/Transforms/LoopUnrollAndJam/loopnest.ll
7

If this test is intended to check that we generate the correct control flow, why not keep the inner loop body to a bare minimum (e.g. just the induction variables and maybe a function call that is passed the induction variables? We force unrolling, so the loop body, so the body is not really important, right?

That would reduce the clutter to a bare minimum, even if we use the script to auto-generate the CHECK lines.

Whitney marked 2 inline comments as done.Jan 22 2020, 8:58 AM
Whitney added inline comments.
llvm/test/Transforms/LoopUnrollAndJam/loopnest.ll
7

The intention of the test is to check loop info verify after unroll and jamming and 3 level loop nests. I will change A[i][j][k] = 0 to bar(i, j, k).

Whitney updated this revision to Diff 239634.Jan 22 2020, 9:52 AM
Whitney marked an inline comment as done.

Simplified the test case. A[i][j][k]=0 to bar(i,j,k).

fhahn added inline comments.Jan 22 2020, 10:54 AM
llvm/test/Transforms/LoopUnrollAndJam/loopnest.ll
95

if that's readnone nounwind and the result is unused, it will just be removed (by simplifyLoopAfterUnroll probably). If you need actual loop body, it's probably better to drop the attributes.

Whitney marked 2 inline comments as done.Jan 22 2020, 11:59 AM
Whitney added inline comments.
llvm/test/Transforms/LoopUnrollAndJam/loopnest.ll
95

They are required for unroll and jam to be safely performed. i.e. Without those attributes, unroll and jam is not performed.

jdoerfert added inline comments.Jan 22 2020, 12:30 PM
llvm/test/Transforms/LoopUnrollAndJam/loopnest.ll
95

That makes sense but @fhahn has a point. We should try not to introduce test cases that are "trivial" in some sense.

I think A[i][j][k] = 0 was an OK choice. I personally have a harder time understanding check lines and differences if I don't see all of it. Also, the above is arguably the minimal test case for 3 loops already, with A[i][j][k] = 0 the simplest useful example I can think of.

Nit: Make the loop bounds different for each loop. That way you can spot them more easily.

Whitney updated this revision to Diff 239701.Jan 22 2020, 2:26 PM
Whitney marked 2 inline comments as done.

Change the test case back to A[i][j][k] = 0 as I agreed that it is the minimal meaningful body for a three level loop nests.
Also use different upper bound for each loop as suggested by Johannes.

As for what to put for the checks. We could
(1) one check for the induction variable (original patch)
(2) only control flow (suggested by @dmgreen)
(3) all checks (suggested by @jdoerfert)
(4) all checks except the loop body
I don't mind any of them, as long as we know the loop nest is unroll and jammed, and LoopInfo verified, which all 4 suggestion satisfy,
I will wait to see if we can get to a conclusion before doing further changes. Currently is (3).

I'm happy with 3.

Nothing in this code needs the blocks to be processed in RPO? (or at least some defined order)? I'm a little surprised, but I don't spot anything.

llvm/lib/Transforms/Utils/LoopUnrollAndJam.cpp
146

BasicBlockSet &Blocks,

186

I think that it will matter that this isn't a deterministic order

Whitney marked 3 inline comments as done.Jan 27 2020, 3:03 PM

Nothing in this code needs the blocks to be processed in RPO? (or at least some defined order)? I'm a little surprised, but I don't spot anything.

Functionally should be fine without an order, as now we are replacing values for all block for one unroll copy at the same time.

However, changing to deterministic order can make testing easier.

llvm/lib/Transforms/Utils/LoopUnrollAndJam.cpp
186

Doesn't matter functionally, but it may affect the order in LIT tests. That's why I used CHECK-DAG in unroll-and-jam.ll. I agree that could be problematic, thinking to change Blocks from a set to a vector, so the caller of the function decide on a order.

I'd expect

#pragma unrollandjam(2)
for i
  for j
    for k
      A[i][j][k] = 0

to generate this:

for i +=2
  for j
    for k
      A[i][j][k] = 0
      A[i+1][j][k] = 0

(as by the side-note by @Whitney)

That is, if the jam loop is not specified, to jam the innermost by default, which is should be what you want in most cases. For some reason I was assuming LoopUnrollAndJam was rejecting this case as not-yet-implemented. Would that be an option instead of fixing a behavior that probably is not the expected effect?

I'd expect

#pragma unrollandjam(2)
for i
  for j
    for k
      A[i][j][k] = 0

to generate this:

for i +=2
  for j
    for k
      A[i][j][k] = 0
      A[i+1][j][k] = 0

(as by the side-note by @Whitney)

That is, if the jam loop is not specified, to jam the innermost by default, which is should be what you want in most cases. For some reason I was assuming LoopUnrollAndJam was rejecting this case as not-yet-implemented. Would that be an option instead of fixing a behavior that probably is not the expected effect?

That's an interesting idea. If we want to do that, we will need to modify multiple places. Mainly outermost loop needs to be partition into a list of ForeBlocks, a list of AftBlocks, and InnermostLoopBlocks, instead of ForeBlocks, SubLoopBlocks, and AftBlocks. ForeBlocks are the block in outermost loop but outside of innermost loop, and before the innermost loop. AftBlocks are block outside of innermost loop, and not in ForeBlocks. Safety check and code generation both need to be modified accordingly. And again, we will iterate the loops from outer to inner. Maybe we can simplify the transformation by only considering perfect nest? What does everyone think?

Put more generally, I was expecting this:

for i
  A(i)
  for j
    B(i, j)
    for k
      C(i, j, k)
    D(i, j)
  E(i)

To be unrolled in i:

for i +=2
  A(i)
  for j
    B(i, j)
    for k
      C(i, j, k)
    D(i, j)
  E(i)
  A(i+1)
  for j
    B(i+1, j)
    for k
      C(i+1, j, k)
    D(i+1, j)
  E(i+1)
for i remainder
  A(i)
  for j
    B(i, j)
    for k
      C(i, j, k)
    D(i, j)
  E(i)

And then the j loops to be jammed:

for i +=2
  A(i)
  A(i+1)
  for j
    B(i, j)
    for k
      C(i, j, k)
    D(i, j)
    B(i+1, j)
    for k
      C(i+1, j, k)
    D(i+1, j)
  E(i)
  E(i+1)
for i remainder
  A(i)
  for j
    B(i, j)
    for k
      C(i, j, k)
    D(i, j)
  E(i)

You are saying that we should also fuse the inner loops? That sounds sensible if we can do it and prove legality (and if we can somehow prove that UnJing i is better than UnJing j, which I can see could come up in places and might out-weight the extra code bloat).

When Unroll And Jam was written we did not have general loop fusion. We now do. Can we make use of it here to fuse any sub-loops together? I believe that is how gcc writes their algorithm, but last I looked they only supported perfectly nested loops which would be a big regression over what is here now. We might just be able to attempt sub-loop fusing, using the loop fusion infrastructure we have?

The alternative like you said would be trying to prove it is valid beforehand, which would mean checking that more blocks inside subloops can be moved past each other and all the extra memory dependencies are safe.

llvm/lib/Transforms/Utils/LoopUnrollAndJam.cpp
186

Non-determinism is usually considered a problem, even if both outputs are equally valid (as they would be in this case). Especially in a functional safety context where you need to be testing what you are running.

Maybe use a SetVector? A vector may be fine too, if we don't need constant time lookup/removal.

Option 1: Use LoopFusion infrastructure to jam innerloops recursively.

  • we need this patch to update LI correctly.
  • modify LoopFusion to make its functionality utilities for other passes.
  • we may only know the innerloops cannot be fuse after unroll and jam the parent loops.

FYI, @kbarton opinion?

Option 2: Prove safety beforehand, and unroll and jamming without creating new loops.

  • we no longer need this patch
  • modify safety checks
  • modify codegen
  • not doing unroll and jam as long as one subloop pair is not safe to fuse.
  • should be faster to compile

Depending on whether we want to unroll and jam loop-i if loop-k is not safe to fuse, we will prefer option 1 or 2.

for i
  A(i)
  for j
    B(i, j)
    for k
      C(i, j, k)
    D(i, j)
  E(i)

I will continue updating this patch, when we have an idea of which option to take.

Meinersbur added a comment.EditedFeb 4 2020, 4:41 PM

Sorry for the break, I got the flu.

Put more generally, I was expecting this:
...
You are saying that we should also fuse the inner loops?

Yes.

The purposes of unroll-and-jam is to improve instruction-level-parallelism and reduce hot loop overhead. For performance-optimization, we should only consider the innermost body to be relevant (Statement C in your example). IMHO not jamming the innermost loop does not improve ILP nor overhead, so would be quite useless.

A way to define Unroll-And-Jam is to first tile by (unroll-factor,1,1) (all except the outermost tile factors are 1, so don't really need a loop) and the (fully) unroll the tile. As a side-effect, unroll-and-jam on a single loop would be identical to partial unrolling. Tiling is usually only defined for perfect loop nests, and so I would not necessarily assume that unroll-and-jam over non-perfectly nested loops is even defined. If we do, I'd expect something like:

for i += 2
  A(i)
  A(i+1)
  for j
    B(i, j)
    B(i+1, j)
    for k 
      C(i, j, k)
      C(i+1, j, k)
    D(i, j)
    D(i+1, j)
  E(i)
  E(i+1)
for i remainder:
  A(i)
  for j
    B(i, j)
    for k 
      C(i, j, k)
    D(i, j)
  E(i)

Caveat: What if A,B,D or E contain loops themselves? I'd just not allow it.

When Unroll And Jam was written we did not have general loop fusion. We now do. Can we make use of it here to fuse any sub-loops together? I believe that is how gcc writes their algorithm, but last I looked they only supported perfectly nested loops which would be a big regression over what is here now. We might just be able to attempt sub-loop fusing, using the loop fusion infrastructure we have?

I think using the loop fusion here would make the implementation more complicated.

The alternative like you said would be trying to prove it is valid beforehand, which would mean checking that more blocks inside subloops can be moved past each other and all the extra memory dependencies are safe.

I think the generalization of the legality check from "does the dependency violate jump over one loop" to "does it violate jumping n loops" NOT to be hard.

Edit: Added NOT

Option 1: Use LoopFusion infrastructure to jam innerloops recursively.
Option 2: Prove safety beforehand, and unroll and jamming without creating new loops.

IMHO option 2 is preferable and more robust.

Option 1: Use LoopFusion infrastructure to jam innerloops recursively.
Option 2: Prove safety beforehand, and unroll and jamming without creating new loops.

IMHO option 2 is preferable and more robust.

Another way to look at unroll-and-jam is a thin pass that invokes loop unroll + loop fusion under the hood. If we structure it that way, we maximize software reuse and any improvements to the loop fusion required to do unroll-and-jam (such as dealing with intervening code) would directly benefit use-cases that are not unroll-and-jam-specific, without duplication of effort. Similarly safety checks and cost modeling can be reused between the two transforms. I realize that may involve more work and take longer to make it functionally equivalent to what is currently available, but seems like the ideal long-term solution.

Option 1: Use LoopFusion infrastructure to jam innerloops recursively.
Option 2: Prove safety beforehand, and unroll and jamming without creating new loops.

IMHO option 2 is preferable and more robust.

I apologize for my delay in commenting here.

I also think that Option 2 is more preferable.I agree with @Meinersbur that we shouldn't be doing UnrollAndJam if the Jam part is not possible, otherwise you just end up with another pass of unrolling, which will likely end up causing confusion. Thus, it makes more sense for UnrollAndJam to be a stand-alone pass that doesn't rely on another pass to complete.

That said, there may be parts of the analysis that is done in fusion that can be reused (or generalized and commoned out) that could be useful here. I don't know the internal of UnrollAndJam to know whether that is easy to do, or would end up with more work than benefit it provides.

Thanks everyone for the inputs!
https://reviews.llvm.org/D76132 is created to implement the safety checks needed for option 2.

Would it help if a wrote/sketched the code for dependency violation?

Would it help if a wrote/sketched the code for dependency violation?

Is this for https://reviews.llvm.org/D76132 or other review?
If it is for D76132, it would be very helpful. I agree dependency checks can be improved for unroll and jam. Do you think it make sense to do it as a separate patch after?