This is an archive of the discontinued LLVM Phabricator instance.

[LoopUnrollAndJam] Changed safety checks to consider more than 2-levels loop nest.
ClosedPublic

Authored by Whitney on Mar 13 2020, 7:18 AM.

Details

Summary

As discussed in https://reviews.llvm.org/D73129.

Example
Before unroll and jam:

for
  A
  for
    B
    for
      C
    D
  E

After unroll and jam (currently):

for
  A
  A'
  for
    B
    for
      C
    D
    B'
    for
      C'
    D'
  E
  E'

After unroll and jam (Ideal):

for
  A
  A'
  for
    B
    B'
    for
      C
      C'
    D
    D'
  E
  E'

This is the first patch to change unroll and jam to work in the ideal way.
This patch change the safety checks needed to make sure is safe to unroll and jam in the ideal way.

Diff Detail

Event Timeline

Whitney created this revision.Mar 13 2020, 7:18 AM
Whitney marked 3 inline comments as done.Mar 13 2020, 7:21 AM
Whitney added inline comments.
llvm/lib/Transforms/Scalar/LoopUnrollAndJamPass.cpp
283

Not needed as they are already checked in isSafeToUnrollAndJam

438–439

Just clang-format

llvm/lib/Transforms/Utils/LoopUnrollAndJam.cpp
758

Now done in isEligibleLoopForm

Nice work!

Do you have a test case?

I need to think about how dependencies violation are detected. Could you write something about how you intent to do it?

llvm/lib/Transforms/Scalar/LoopUnrollAndJamPass.cpp
438–439

Could you commit this separately?

I.e. just push a NFC commit explaining you are working on this file.

llvm/lib/Transforms/Utils/LoopUnrollAndJam.cpp
102

Could you add a doxygen description for this function?

119

[nit] TODO's should not be doxygen comments

822

[style] [[ https://llvm.org/docs/CodingStandards.html#don-t-evaluate-end-every-time-through-a-loop | Don't evaluate Cur->getLoopDepth() every time through the loop ]]. Consider using llvm::seq

829–830

Could you write a comment about what kind of dependencies this is looking for/are not allowed?

840

Why is rotated form necessary?

847

[style] [[ https://llvm.org/docs/CodingStandards.html#don-t-evaluate-end-every-time-through-a-loop | Don’t evaluate end() every time through a loop ]]

859

[style] (*I).begin() -> I->begin()?

868

Can the following be handled correctly?

  • Multiple exits
875

Does the comment need updating?

Whitney marked 3 inline comments as done.Mar 13 2020, 4:53 PM
Whitney added inline comments.
llvm/lib/Transforms/Utils/LoopUnrollAndJam.cpp
840

It was there before:

if (Latch != Exit)
  return false;
if (SubLoopLatch != SubLoopExit)
  return false;

But at that time the function isRotatedForm wasn't exist.

868

I think it should, but I will write a LIT test to be sure. Only one AftBlock is allowed in the outermostloop.

Whitney updated this revision to Diff 250333.Mar 13 2020, 5:52 PM
Whitney marked 9 inline comments as done.

Addressed review comments from Michael. Still need to add a LIT test.

Whitney edited the summary of this revision. (Show Details)Mar 16 2020, 5:47 AM
dmgreen added inline comments.Mar 16 2020, 5:51 AM
llvm/lib/Transforms/Utils/LoopUnrollAndJam.cpp
854–855

This looks like it would recalculate LoadsAndStores a lot? Src and Dst don't seem like the right nomenclature too, but that's just a small nit.

863

size() != 0 -> !empty() ?

917

I realize you didn't write the original, but tranform -> transform

I believe the old message was intending to show array indices, so Fi was F(i) and Si_j was S(i,j), showing that the second unrolled iteration need to be able to move past the first, in terms of runtime execution.

971

Should this have a check for a 2 deep loop nest at the moment (like before), if the remainder of the analysis/transform code hasn't been updated yet? It looks like the count calculation might just exclude anything with multiple subloop blocks at the moment anyway, so is possibly not a problem in practice, without pragma's.

Whitney updated this revision to Diff 250562.Mar 16 2020, 8:02 AM
Whitney marked 7 inline comments as done.

Addressed Dave's comments.

llvm/lib/Transforms/Utils/LoopUnrollAndJam.cpp
854–855

Renamed, and removed the calculation of one of the LoadsAndStores in the outer loop. I considered creating a map, but the code may get complicated. Is the current change okay with you, or you prefer a map?

917

fixed the typo, and added braces, is not any clearer?

971

Not sure I understand, this check is already like before.

Whitney marked an inline comment as done.Mar 16 2020, 8:04 AM
Whitney added inline comments.
llvm/lib/Transforms/Utils/LoopUnrollAndJam.cpp
854–855

Renamed, and moved the calculation of one of the LoadsAndStores in the outer loop. I considered creating a map, but the code may get complicated. Is the current change okay with you, or you prefer a map?

dmgreen added inline comments.Mar 16 2020, 4:42 PM
llvm/lib/Transforms/Utils/LoopUnrollAndJam.cpp
854–855

You may be right. It looks O(n^2), but perhaps n will not be too high?

971

Sorry, the actual line I was pointing to was semi-random. That wasn't clear.

Does the processHeaderPhiOperands check below need to check each level? IIRC it's testing data-dependencies (as in ssa/use-def dependencies, as opposed to the memory dependencies in checkDependencies. Can we physically move any instruction we need from aft to fore). If we end up moving multiple levels past one another, do we have to make the same checks at each level?

My general point was that some of the code still only handles 2-deep loop nests. Should we have a check somewhere (perhaps with a fixme next it) that still tests for that condition, until the rest of the code has caught up?

Whitney marked 4 inline comments as done.Mar 17 2020, 5:54 AM
Whitney added inline comments.
llvm/lib/Transforms/Utils/LoopUnrollAndJam.cpp
854–855

n is (loopdepth - 1) * 2 + 1. I personally think is ok.

971
Because of the way we rearrange basic blocks, we also require that
the Fore blocks of L on all unrolled iterations are safe to move before the
blocks of the direct child of L of all iterations. So we require that the
phi node looping operands of ForeHeader can be moved to at least the end of
ForeEnd, so that we can arrange cloned Fore Blocks before the subloop and
match up Phi's correctly.

As we are only unrolling L, not its child, we don't need to move instructions from non-L AftBlock to non-L ForeBlock, so we don't need to check if the moves are safe.

My general point was that some of the code still only handles 2-deep loop nests. Should we have a check somewhere (perhaps with a fixme next it) that still tests for that condition, until the rest of the code has caught up?

I modified all checks I found needed in isSafeToUnrollAndJam, am I missing something?

Whitney updated this revision to Diff 250943.Mar 17 2020, 5:08 PM
Whitney marked 4 inline comments as done.

Added LITs for dependencies.

llvm/lib/Transforms/Utils/LoopUnrollAndJam.cpp
868

Multiple exits are not allowed, and that's blocked in partitionOuterLoopBlocks

Whitney updated this revision to Diff 250956.Mar 17 2020, 6:17 PM

clang-format

Meinersbur added inline comments.Mar 17 2020, 9:33 PM
llvm/lib/Transforms/Utils/LoopUnrollAndJam.cpp
659–661

i don't see this being enforced.

819–820

This looks like a bail-out; might improve this by setting CurLoop to innermost loop both instructions are in.

I'd prefer the following structure:

if (D->getDirection(LoopDepth) & Dependence::DVEntry::GT) { // That is, as by the other comment, check whter the root loop carries this dependency
  if (!CurLoop)
    return false; // Bail-out

  ...
}
820

I think this is should be looking for whether the first GT direction is due to the root loop (i.e. whether the root loop is the cause for the dependence to be fulfilled). Not an correctness issue.

The bitwise & comparison also matches NE and GE. Is this intended?

823–824

Isn't this equivalent to D->getDirection(CurLoopDepth ) == EQ?

827

Again, the & comparison feels wrong. Did you consider a switch over the values of Dependence::DVEntry?

llvm/test/Transforms/LoopUnrollAndJam/dependencies.ll
513

Does it otherwise unroll-and-jam the middle loop? Should we add a mechanism that stops unrolling of nests of already unrolled loops (e.g. add llvm.loop.unroll_and_jam.disable to all nested loops)?

Whitney updated this revision to Diff 250993.Mar 17 2020, 10:47 PM
Whitney marked 12 inline comments as done.

Harbormaster failed remote builds in B49526: Diff 250956!

Looks like some llc test cases failed. Doesn't seem related to this patch to me. Anyone know if it is safe to ignore?

llvm/lib/Transforms/Utils/LoopUnrollAndJam.cpp
659–661

if (D->getDirection(LoopDepth) & Dependence::DVEntry::GT) return false; => allowing only = or < (not >)

819–820

might improve this by setting CurLoop to innermost loop both instructions are in.

Do you mean passing LI into this function, and calculating CurLoop here?

820

I think this is should be looking for whether the first GT direction is due to the root loop (i.e. whether the root loop is the cause for the dependence to be fulfilled). Not an correctness issue.

Here LoopDepth is assumed to be the loop depth of the unroll loop.

The bitwise & comparison also matches NE and GE. Is this intended?

The code before my change is using &, so I don't want to change the behaviour.

823–824

Could also be ALL. I think isScalar is clearer.

827

The code before my change is using &. I considered changing to use dependence distance instead of direction to allow more dependence, but I want to keep this patch as changes needed for safety checks for more than 2-levels loop nest.
There is a FIXME: // FIXME: Allow > so long as distance is less than unroll width
What do you think?

llvm/test/Transforms/LoopUnrollAndJam/dependencies.ll
513

Currently unroll and jam add llvm.loop.unroll_and_jam.disable to the loop it unroll and jammed (not all nested loops).
As we are traversing from inner to outer,
for.k is not considered as it doesn't have a inner loop
for.j is safe to unroll and jam, but is not my intension for this test case, so I added the disable pragma
for.i is unsafe only after my change.

Even if we change to traverse from outer to inner, we should allow the middle loop to unroll and jam when proven profitable.

Whitney updated this revision to Diff 250994.Mar 17 2020, 10:51 PM

fixed typo.

Harbormaster completed remote builds in B49548: Diff 250994.
Meinersbur added inline comments.Mar 18 2020, 10:20 AM
llvm/lib/Transforms/Utils/LoopUnrollAndJam.cpp
819–820

Where the innermost common loop is computed is unimportant.

820

Here LoopDepth is assumed to be the loop depth of the unroll loop.

I know. However, the dependency might also be fullfiled by surrounding loops. eg:

for i = 0 ... n-1
  #pragma unrollandjam
  for j = 0 ... n-1
    for j = 0 ... n-1:
      A[i][j][k] = f(A[i-1][j-1][k]);

The dependence vector is (>,>,=).
Verifying the direction of the j-dependence is not necessary because for the same i, the inner loops are parallel. In other words: the i-loop already ensures that the dependency is fulfilled.

823–824

ALL would be * in the dependence vector, i.e. the analysis could not prove any direction.
However, it's also not EQ. Sorry for the confusion.

I am not convinced it can be ignored.

#pragma unrollandjam
for i = ...
  for j = ...
    sum = f(sum, ...);

The dependence induced by sum is obviously scalar, but it cannot be jammed (it's sequential). The dependence vector is (=>,>)

827

I think taking the dependence distance into account is something for a different patch.

837

It could also be ALL (or GE), meaning we don't know what direction it is. This check gives it a pass.

Whitney updated this revision to Diff 251192.Mar 18 2020, 3:19 PM
Whitney marked 9 inline comments as done.

Addressed Michael's comments.

llvm/lib/Transforms/Utils/LoopUnrollAndJam.cpp
819–820

Then I think I misunderstood, currently CurLoop is only set when both instructions belongs to the same loop.

820

Added code to allow accesses of different base.

823–824

Why the above example cannot be jammed?
Before jam:

for i = ...
  for j = ...
    sum = f(sum, ...);
  for j = ...
    sum = f(sum, ...);

After jam:

for i = ...
  for j = ...
    sum = f(sum, ...);
    sum = f(sum, ...);

The access pattern seems to be the same to me.

837

Changed to if (D->getDirection(CurLoopDepth) == Dependence::DVEntry::GT)

dmgreen added inline comments.Mar 19 2020, 4:20 AM
llvm/lib/Transforms/Utils/LoopUnrollAndJam.cpp
971

In the summary you have B' moving past D. And we need to be sure that B' doesn't depend on anything from D.

I think of it as B(1,0) needs to move past D(0,0). the "j" level loop isn't unrolled, but there still some movement needed at the "i" level.

llvm/test/Transforms/LoopUnrollAndJam/dependencies.ll
2

Why is this now using da-disable-delinearization-checks, and why have some of these existing tests been changed to use constant size arrays?

Whitney marked 7 inline comments as done.Mar 19 2020, 5:35 AM
Whitney added inline comments.
llvm/lib/Transforms/Utils/LoopUnrollAndJam.cpp
971

Here we are talking about def-use dependence.
If an instruction in B' (x2) depend on an instruction in D (y),
means there must be an instruction in B (x) that depend on instruction y in D,
as B' is clone from B.

B:
  x = phi [y, D]...
D:
  y =

As we are placing B' after B, and y is available for B, then y must also be available for B'.

Please correct me if I am wrong.

llvm/test/Transforms/LoopUnrollAndJam/dependencies.ll
2

-da-disable-delinearization-checks is added to more accurately delinearization of fixed-size multi-dimensional arrays. See
https://reviews.llvm.org/D72178 more detail explaination.

why have some of these existing tests been changed to use constant size arrays

They were originally testing single dimensional arrays, which may not be ideal for testing sub-sub portion of code.

363

This test was orignally testing

for i
  for j
    A[i]
    A[i-1]

which should be safe to unroll and jam.

I think it actually want to test the code for sub with sub with the dependence distance of the inner loop is less.

for i
  for j
    A[i][j]
    A[i+1][j-1]
401

This test was orignally testing

for i
  for j
    A[i]
    A[i]

I think it actually want to test the code for sub with sub with the dependence distance of the inner loop is eq.

for i
  for j
    A[i][j]
    A[i+1][j]
439

This test was orignally testing

for i
  for j
    A[i]
    A[i+1]

which should be safe to unroll and jam.

I think it actually want to test the code for sub with sub with the dependence distance of the inner loop is more.

for i
  for j
    A[i][j]
    A[i+1][j+1]
fhahn added a subscriber: fhahn.Mar 19 2020, 6:25 AM
fhahn added inline comments.
llvm/test/Transforms/LoopUnrollAndJam/dependencies.ll
2

-da-disable-delinearization-checks is added to more accurately delinearization of fixed-size multi-dimensional arrays.

Well, it returns more optimistic results at the loss of soundness IIRC. I think we should definitely keep test coverage without -da-disable-delinearization-checks. This is the common case, we should definitely handle correctly. Ideally we would have multi-dimensional tests that *do not* need -da-disable-delinearization-checks. IIRC constant loop bounds might help with that.

IMO tests that really require -da-disable-delinearization-checks should be additional, maybe in a separate file.

401

FWIW I think it would be better to keep the original test as is and add the new case as an additional test, as they seem to test different scenarios.

Whitney updated this revision to Diff 251437.Mar 19 2020, 12:12 PM
Whitney marked 3 inline comments as done.
Whitney edited the summary of this revision. (Show Details)

Kept the original test as is and added the new case as an additional test.

bmahjour added inline comments.Mar 19 2020, 1:13 PM
llvm/test/Transforms/LoopUnrollAndJam/dependencies.ll
2

Ideally we would have multi-dimensional tests that *do not* need -da-disable-delinearization-checks.

It's very difficult to come up with multi-dimensional tests that result in accurate dependence vectors. That's one of the main reasons why this option was added, to be able to let us test/exercise code paths that would otherwise not be taken due to overpessimistic dependence.

IIRC constant loop bounds might help with that.

The delinearization validity checks compare the subscripts against parameteric terms in the subscript that are believed to be the size of a given dimension. If the arrays have dynamic sizes then the constant loop bound won't help because the subscripts still contain parameteric terms. If the arrays have fixed sizes, then we don't even try to delinearize them unless -da-disable-delinearization-checks is enabled.

fhahn added a comment.Mar 19 2020, 2:59 PM

Kept the original test as is and added the new case as an additional test.

Thanks!

Whitney marked 2 inline comments as done.Mar 19 2020, 6:18 PM
dmgreen added inline comments.Mar 22 2020, 2:47 PM
llvm/lib/Transforms/Utils/LoopUnrollAndJam.cpp
971

Yeah, kind of. But with an extra level of unrolling in there. Using your ` notation from before, normal unrolling would do:

B:
  x = phi [y', D'], [x, A]
D:
  y =
B':
  x' = phi [y, D]
D:
  y' =

As we need to move B' before D, we also need to be able to hoist y into B, so the phi in B' can point at the correct value. That's what this processHeaderPhiOperands is doing. Note the x` phi only has one operand after unrolling, so will be simplified away.

It comes up quite a bit from the increment of the IV variable. The second IV will become an increment of the first after we have pushed the add up into the header from the latch.

llvm/test/Transforms/LoopUnrollAndJam/dependencies.ll
363

Hmm. I don't remember what this was trying to test. It feels like a very long time ago now.

Thanks for splitting the new tests out. More are always a good thing.

Whitney marked 4 inline comments as done.Mar 22 2020, 3:13 PM
Whitney added inline comments.
llvm/lib/Transforms/Utils/LoopUnrollAndJam.cpp
971

We are still only unrolling one loop each time, but fuse LoopDepth-1 times.
Instructions in B should be the same as B', except all i changed to i+1, when unrolling by 2.
Likewise, all instructions in D should be the same as D', except all i changed to i+1, when unrolling by 2.
In that case, how can an instruction defined in D' used in B?
I think processHeaderPhiOperands is used for the induction variable of the unrolled loop.

llvm/test/Transforms/LoopUnrollAndJam/dependencies.ll
363

For sure. I guess is hard to speculate what it was trying to test now.

dmgreen added inline comments.Mar 23 2020, 3:17 AM
llvm/lib/Transforms/Utils/LoopUnrollAndJam.cpp
971

Ah I see. Even though we are moving the blocks past one another, at that level there will not be any instructions that needs to move. OK fair enough :)

Whitney marked 3 inline comments as done.Mar 23 2020, 4:42 AM

Here is my suggested dependency checker:

// Check whether it is semantically safe Src and Dst considering any potential
// dependency between them.
//
// @param UnrollLevel The level of the loop being unrolled
// @param JamLevel    The level of the loop being jammed; if Src and Dst are on
// different levels, the outermost common loop counts as jammed level
//
// @return true if is safe and false if there is a dependency violation.
static bool checkDependency(Instruction *Src, Instruction *Dst,
                            unsigned UnrollLevel, unsigned JamLevel,
                            bool Sequentialized, DependenceInfo &DI) {
  assert(UnrollLevel <= JamLevel);

  if (Src == Dst)
    return true;
  if (isa<LoadInst>(Src) && isa<LoadInst>(Dst))
    return true;

  // Check whether unroll-and-jam may violate a dependency.
  // By construction, every dependency will be lexicographically non-negative
  // (if it was, it would violate the current execution order), such as
  //   (0,0,>,*,*)
  // Unroll-and-jam changes the GT execution of two executions to the same
  // iteration of the chosen unroll level. That is, a GT dependence becomes a GE
  // dependence (or EQ, if we fully unrolled the loop) at the loop's position:
  //   (0,0,>=,*,*)
  // Now, the dependency is not necessarily non-negative anymore, i.e.
  // unroll-and-jam may violate correctness.
  std::unique_ptr<Dependence> D = DI.depends(Src, Dst, true);
  if (!D)
    return true;
  assert(D->isOrdered() && "Expected an output, flow or anti dep.");

  // Quick bail-out.
  if (D->isConfused())
    return false;

  for (unsigned d = 1; d < UnrollLevel; ++d) {
    // Check if dependence is carried by an outer loop.
    // That is, changing
    //   (0,>,>,*,*)
    // to
    //   (0,>,>=,*,*)
    // will still not violate the dependency.
    if (D->getDirection(d) == Dependence::DVEntry::GT)
      return true;
  }

  if (!(D->getDirection(UnrollLevel) & Dependence::DVEntry::GT)) {
    // If the unrolled loop did not carry the dependency in the first place
    // (i.e. being <, <= or =), it will also not need after unroll-and-jam.
    // Note: The standard case is the dependence vector
    //   (0,0,=,>,*)
    // where the unrolled level is already EQ. Changing LT and LE should also
    // not affect the semantics, since these didn't help to fulfill the
    // dependence in the first place.
    return true;
  }

  // Check if unrolled level becomes an EQ dependence at the unroll level,
  // whether one of the inner loops would carry the dependence.
  for (unsigned d = UnrollLevel + 1; d <= JamLevel; ++d) {
    unsigned Dir = D->getDirection(d);

    // Check whether the jammed level will carry the dependency.
    if (Dir == Dependence::DVEntry::GT)
      return true;

    // A possible backwards direction will violate the dependency
    if (Dir & Dependence::DVEntry::LT)
      return false;
  }

  // We previously already already checked whether the instructions are in the
  // same region. Reaching this means that they are not, hence we have a
  // dependency violation.
  return Sequentialized;
}

static bool
checkDependencies(Loop &Root, const BasicBlockSet &SubLoopBlocks,
                  const DenseMap<Loop *, BasicBlockSet> &ForeBlocksMap,
                  const DenseMap<Loop *, BasicBlockSet> &AftBlocksMap,
                  DependenceInfo &DI, LoopInfo &LI) {
  SmallVector<BasicBlockSet, 8> AllBlocks;
  for (Loop *L : Root.getLoopsInPreorder())
    if (ForeBlocksMap.find(L) != ForeBlocksMap.end())
      AllBlocks.push_back(ForeBlocksMap.lookup(L));
  AllBlocks.push_back(SubLoopBlocks);
  for (Loop *L : Root.getLoopsInPreorder())
    if (AftBlocksMap.find(L) != AftBlocksMap.end())
      AllBlocks.push_back(AftBlocksMap.lookup(L));

  unsigned LoopDepth = Root.getLoopDepth();
  SmallVector<Instruction *, 4> EarlierLoadsAndStores;
  SmallVector<Instruction *, 4> CurrentLoadsAndStores;
  for (BasicBlockSet &Blocks : AllBlocks) {
    CurrentLoadsAndStores.clear();
    if (!getLoadsAndStores(Blocks, CurrentLoadsAndStores))
      return false;

    Loop *CurLoop = LI.getLoopFor((*Blocks.begin())->front().getParent());
    unsigned CurLoopDepth = CurLoop->getLoopDepth();

    for (auto Earlier : EarlierLoadsAndStores) {
      Loop *EarlierLoop = LI.getLoopFor(Earlier->getParent());
      unsigned EarlierDepth = EarlierLoop->getLoopDepth();
      unsigned CommonLoopDepth = std::min(EarlierDepth, CurLoopDepth);
      for (auto Later : CurrentLoadsAndStores) {
        if (!checkDependency(Earlier, Later, LoopDepth, CommonLoopDepth, false,
                             DI))
          return false;
      }
    }

    size_t NumInsts = CurrentLoadsAndStores.size();
    for (size_t i = 0; i < NumInsts; ++i) {
      for (size_t j = i + 1; j < NumInsts; ++j) {
        if (!checkDependency(CurrentLoadsAndStores[i], CurrentLoadsAndStores[j],
                             LoopDepth, CurLoopDepth, true, DI))
          return false;
      }
    }

    EarlierLoadsAndStores.append(CurrentLoadsAndStores.begin(),
                                 CurrentLoadsAndStores.end());
  }
  return true;
}

I think I discovered a bug in the current dependency checker. LoopUnrollAndJam transforms sub_sub_more in the check dependencies.ll which I dint think it is allowed to do.

llvm/test/Transforms/LoopUnrollAndJam/dependencies.ll
439

I think this is NOT safe to unroll-and jam. The unrolled equivalent is:

for i += 2
  for j
    S1:  A[i]
    S2:  A[i+1]
    S1': A[i+1]
    S2': A[i+2]

At S1', the lasst access of A[i+1] is S2 from the same itertation. instead if S1 from the previous j-iteration is in the original loop.

Whitney updated this revision to Diff 252851.Mar 26 2020, 8:22 AM

Thanks @Meinersbur ! I mostly used your code directly, except

for (unsigned d = 1; d < UnrollLevel; ++d) {
      // Check if dependence is carried by an outer loop.
      // That is, changing
      //   (0,>,>,*,*)
      // to
      //   (0,>,>=,*,*)
      // will still not violate the dependency.
      if (D->getDirection(d) == Dependence::DVEntry::GT)
        return true;
    }

which I think should be safe as long as the one dependence is not EQ then should be safe.

for i
  for j        <= unroll loop
    for k
       A[i][j][k]
       A[i-1][j+1][k]

Loop-j should be safe to unroll and jam. Am I right?

Whitney marked an inline comment as done.Mar 26 2020, 8:22 AM

Thanks @Meinersbur ! I mostly used your code directly, except

for (unsigned d = 1; d < UnrollLevel; ++d) {
      // Check if dependence is carried by an outer loop.
      // That is, changing
      //   (0,>,>,*,*)
      // to
      //   (0,>,>=,*,*)
      // will still not violate the dependency.
      if (D->getDirection(d) == Dependence::DVEntry::GT)
        return true;
    }

which I think should be safe as long as the one dependence is not EQ then should be safe.

for i
  for j        <= unroll loop
    for k
       A[i][j][k]
       A[i-1][j+1][k]

Loop-j should be safe to unroll and jam. Am I right?

Yes, that's what the cod above would be testing.

llvm/lib/Transforms/Utils/LoopUnrollAndJam.cpp
715–718

[serious] This does not correspond to the getDirection() == GT from my version. In particular, this version lets' the loop pass if the outer loops have NE or LT dependenies. These do not ensure lexicographic greater-than and therefore cause a misoptimization.

[remark] While some style guides prefer this functional style, I prefer the version with explicit loops and the comment that explains what the code is doing.

Thanks @Meinersbur ! I mostly used your code directly, except

for (unsigned d = 1; d < UnrollLevel; ++d) {
      // Check if dependence is carried by an outer loop.
      // That is, changing
      //   (0,>,>,*,*)
      // to
      //   (0,>,>=,*,*)
      // will still not violate the dependency.
      if (D->getDirection(d) == Dependence::DVEntry::GT)
        return true;
    }

which I think should be safe as long as the one dependence is not EQ then should be safe.

for i
  for j        <= unroll loop
    for k
       A[i][j][k]
       A[i-1][j+1][k]

Loop-j should be safe to unroll and jam. Am I right?

Yes, that's what the cod above would be testing.

My above example was not good.
Consider this example:

for i
  for j        <= unroll loop
    for k
       A[i][j][k]
       A[i-1][j+1][k-1]

The dependence direction vector would be [LT, GT, LT].

After unroll and jam loop-j:

for i
   for j  += 2
     for k
        A[i][j][k]
        A[i-1][j+1][k-1]
        A[i][j+1][k]
        A[i-1][j+2][k-1]

I think loop-j is safe to unroll and jam.
However if we only allow GT for loop-i, then this would be consider not safe.

My above example was not good.
Consider this example:

for i
  for j        <= unroll loop
    for k
       A[i][j][k]
       A[i-1][j+1][k-1]

The dependence direction vector would be [LT, GT, LT].

This dependency vector would be an unconditional dependency violation (lexicographically negative; backwards in time)

Let's say the statements are S1 and S2.

No dependence between S1 and itself.
No dependence between S2 and itself.

S2(i2,j2,k2) depends on S1(i1,j1,k1) iff i1==i2-1(A[i1] and A[i2-1] access the same element), j1==j2+1, k1==k2-1 and (i1,j1,k1) <=_{lexicographic} (i2,j2,k2),
in other words, the dependence vector is (+1,-1,+1) or (GT,LT,GT) [direction from S1->S2, since S1 is the source and S2 is the consumer].

For the iteration in the i-loop, S1 and S2 will never access the same element, hence we can reorder S1 and S2 within the outermost loops as we like.

After unroll and jam loop-j:

for i
   for j  += 2
     for k
        A[i][j][k]
        A[i-1][j+1][k-1]
        A[i][j+1][k]
        A[i-1][j+2][k-1]

I think loop-j is safe to unroll and jam.
However if we only allow GT for loop-i, then this would be consider not safe.

But the dependency is strictly GT in the first element ?!?

Consider a case where the dependency in the i-loop is GE:

for i
  for j        <= unroll loop
    for k
       A[i][j][k]
       A[c ? i : i-1][j-1][k]

Here S2 conditionally accessed the element from the previous i-iteration or the same, depending on `c`. Thus the possibility of "EQ" requires us to check further.

Michael

Consider a case where the dependency in the i-loop is GE:

for i
  for j        <= unroll loop
   for k
      A[i][j][k]>
      A[c ? i : i-1][j-1][k]

Here S2 conditionally accessed the element from the previous i-iteration or the same, depending on `c`. Thus the possibility of "EQ" requires us to check further.

Thanks for your reply.
I am still confused. Your version only allow GT (by checking == GT). My version only allow GT, LT, NE (by checking ! &EQ). So both versions requires us to check further with GE.
I was trying to use

for i
  for j        <= unroll loop
    for k
       S1: A[i][j][k]
       S2: A[i-1][j+1][k-1]

as an example, where your version would consider as unsafe and my version would consider as safe.
Am I correct to think loop-j is safe to unroll and jam?
If yes, do you have an example where LT or NE for loop-i should be considered as unsafe for unroll and jam loop-j?

I am still confused. Your version only allow GT (by checking == GT). My version only allow GT, LT, NE (by checking ! &EQ). So both versions requires us to check further with GE.
I was trying to use

for i
  for j        <= unroll loop
    for k
       S1: A[i][j][k]
       S2: A[i-1][j+1][k-1]

as an example, where your version would consider as unsafe and my version would consider as safe.
Am I correct to think loop-j is safe to unroll and jam?

Yes

If yes, do you have an example where LT or NE for loop-i should be considered as unsafe for unroll and jam loop-j?

My justification, as laid out in the comments, is that a GT dependency ensures that the dependency vector is lexicographic positive. An LT or NE dependency does not do that. I would need to see an argument -- not an example -- why LT or NE provide safety. Also consider that DependencyInfo can overapproximate direction bits, so there may not be an example with exactly these direction vectors.

S2(i2,j2,k2) depends on S1(i1,j1,k1) iff i1==i2-1(A[i1] and A[i2-1] access the same element), j1==j2+1, k1==k2-1 and (i1,j1,k1) <=_{lexicographic} (i2,j2,k2),
in other words, the dependence vector is (+1,-1,+1) or (GT,LT,GT) [direction from S1->S2, since S1 is the source and S2 is the consumer].

This is a bit counter intuitive, but a positive dependence direction, is represented as LT (not GT). See section 4.2.1 in G.G, Ken Kennedy, C.W. Tseng, 1990. Practical Dependence Testing.
The corresponding direction vector for (+1,-1,+1) is (LT, GT, LT) which would be a lexicographically positive direction vector.

I think loop-j is safe to unroll and jam.
However if we only allow GT for loop-i, then this would be consider not safe.

But the dependency is strictly GT in the first element ?!?

Given that the direction vector is (LT, GT, LT), it follows that only allowing GT will disqualify the above example as an unsafe case.

I am still confused. Your version only allow GT (by checking == GT). My version only allow GT, LT, NE (by checking ! &EQ). So both versions requires us to check further with GE.
I was trying to use

for i
  for j        <= unroll loop
    for k
       S1: A[i][j][k]
       S2: A[i-1][j+1][k-1]

as an example, where your version would consider as unsafe and my version would consider as safe.
Am I correct to think loop-j is safe to unroll and jam?

Yes

If yes, do you have an example where LT or NE for loop-i should be considered as unsafe for unroll and jam loop-j?

My justification, as laid out in the comments, is that a GT dependency ensures that the dependency vector is lexicographic positive. An LT or NE dependency does not do that. I would need to see an argument -- not an example -- why LT or NE provide safety. Also consider that DependencyInfo can overapproximate direction bits, so there may not be an example with exactly these direction vectors.

I believe Whitney's reasoning is based on the assumption that if outer levels (levels enclosing the loop being unroll-and-jammed) have a non-equal direction, then the locations accessed in the inner levels cannot overlap in memory. One counter example to that I can think of is where the indexes overlap into neighboring dimensions. @Meinersbur I'm just curious, is that what you had in mind too? I agree we need to be conservative, but I just want to know the concern so we can document/think about it.

S2(i2,j2,k2) depends on S1(i1,j1,k1) iff i1==i2-1(A[i1] and A[i2-1] access the same element), j1==j2+1, k1==k2-1 and (i1,j1,k1) <=_{lexicographic} (i2,j2,k2),
in other words, the dependence vector is (+1,-1,+1) or (GT,LT,GT) [direction from S1->S2, since S1 is the source and S2 is the consumer].

This is a bit counter intuitive, but a positive dependence direction, is represented as LT (not GT). See section 4.2.1 in G.G, Ken Kennedy, C.W. Tseng, 1990. Practical Dependence Testing.
The corresponding direction vector for (+1,-1,+1) is (LT, GT, LT) which would be a lexicographically positive direction vector.

I indeed seem to have been confused the directions, maybe influence by the original implementation predominantly testing against GT. Thanks for pointing this out. It also means that some assumptions I had while writing the code were wrong.

One other is that DependenceAnalysis would not return a [> ...] dependence, as this would be a dependence of Dst to a Src in the future. Well:

for (int i = 0; i < n; ++i) {
        A[i] = 42;
        sum += A[i+1];
}
$ opt -da
Src:  store double 4.200000e+01, double* %arrayidx, align 8 --> Dst:  %0 = load double, double* %arrayidx2, align 8
  da analyze - consistent flow [>]!

It's not a flow dependence, but an anti-dependence. This the result of DA being invoked as analyze(store,load), but DA is not able to determine from the direction whether it's a flow or an anti-dependence.

That is, one DA call returns two results mixed into a single dependence vector: Src --> Dst and Dst --> Src. That doesn't look ideal and I am not sure about the consequences.

I believe Whitney's reasoning is based on the assumption that if outer levels (levels enclosing the loop being unroll-and-jammed) have a non-equal direction, then the locations accessed in the inner levels cannot overlap in memory.

Could this be commented, with a bit more detail, in the source code?

One counter example to that I can think of is where the indexes overlap into neighboring dimensions. @Meinersbur I'm just curious, is that what you had in mind too? I agree we need to be conservative, but I just want to know the concern so we can document/think about it.

What I had in mind was if DependencyInfo was combining multiple cases into a single Dependence result. For instance, depending on a condition c the dependence vector is either (+1,-1) or (0,+1). The combined direction flags would be (<=,<>). On the other side, if a particular instance has a reverse (i.e. negative, >, GT) direction, then one previous dimensions must have been positive (otherwise the value would have time-traveled). That could also serve a a justification.

Meinersbur added inline comments.Mar 30 2020, 10:08 PM
llvm/lib/Transforms/Utils/LoopUnrollAndJam.cpp
792

This should be for (size_t j = i; j < NumInsts; ++j) to not skip over self-dependencies of a store.

Given that the DependencyAnalysis result has to be interpreted different than I initially though, I came up with a new legality check.

Since DA returns (lexical) forward and backwards dependency in a single result, both have to be checked. There is also a symmetry between them as the alternative would be to invoke ->depends(Src,Dst) and well as ->depends(Dst,Src) and check them separately.

static bool preservesForwardDependence(Instruction *Src, Instruction *Dst,
                                       unsigned UnrollLevel, unsigned JamLevel,
                                       bool Sequentialized, Dependence *D) {
  // UnrollLevel might carry the dependency Src --> Dst
  // Does a different loop after unrolling?
  for (unsigned d = UnrollLevel + 1; d <= JamLevel; ++d) {
    auto JammedDir = D->getDirection(d);
    if (JammedDir == Dependence::DVEntry::LT)
      return true;

    if (JammedDir & Dependence::DVEntry::GT)
      return false;
  }

  return true;
}

static bool preservesBackwardDependence(Instruction *Src, Instruction *Dst,
                                        unsigned UnrollLevel, unsigned JamLevel,
                                        bool Sequentialized, Dependence *D) {
  // UnrollLevel might carry the dependency Dst --> Src
  for (unsigned d = UnrollLevel + 1; d <= JamLevel; ++d) {
    auto JammedDir = D->getDirection(d);
    if (JammedDir == Dependence::DVEntry::GT)
      return true;

    if (JammedDir & Dependence::DVEntry::LT)
      return false;
  }

  // Backward dependencies are only preserved if not interleaved.
  return Sequentialized;
}

/// Also a forward-dependency, but not carried by UnrollLoop.
static bool preservesNonCarriedDependence(Instruction *Src, Instruction *Dst,
                                          unsigned UnrollLevel,
                                          unsigned JamLevel,
                                          bool Sequentialized, Dependence *D) {
  // There might be dependency Src --> Dst that is not carried by UnrollLoop.
  for (unsigned d = UnrollLevel + 1; d <= JamLevel; ++d) {
    // TODO: Justify this; without it, sub_sub_eq fails
    if (D->isScalar(d))
      continue;

    auto JammedDir = D->getDirection(d);
    if (JammedDir == Dependence::DVEntry::LT)
      return true;

    if (JammedDir & Dependence::DVEntry::GT)
      return false;
  }

  return true;
}

static bool checkDependency(Instruction *Src, Instruction *Dst,
                            unsigned UnrollLevel, unsigned JamLevel,
                            bool Sequentialized, DependenceInfo &DI) {
  assert(UnrollLevel <= JamLevel);

  if (Src == Dst)
    return true;
  if (isa<LoadInst>(Src) && isa<LoadInst>(Dst))
    return true;

  std::unique_ptr<Dependence> D = DI.depends(Src, Dst, true);
  if (!D)
    return true;
  assert(D->isOrdered() && "Expected an output, flow or anti dep.");

  // Quick bail-out.
  if (D->isConfused())
    return false;

  for (unsigned d = 1; d < UnrollLevel; ++d) {
    // Insert comment justifying this here
    if (!(D->getDirection(d) & Dependence::DVEntry::EQ))
      return true;
  }

  auto UnrollDir = D->getDirection(UnrollLevel);
  if (UnrollDir & Dependence::DVEntry::LT &&
      !preservesForwardDependence(Src, Dst, UnrollLevel, JamLevel,
                                  Sequentialized, D.get()))
    return false;

  if (UnrollDir & Dependence::DVEntry::GT &&
      !preservesBackwardDependence(Src, Dst, UnrollLevel, JamLevel,
                                   Sequentialized, D.get()))
    return false;

  if (UnrollDir & Dependence::DVEntry::EQ &&
      !preservesNonCarriedDependence(Src, Dst, UnrollLevel, JamLevel,
                                     Sequentialized, D.get()))
    return false;

  return true;
}

This seem to work nicely with that checks, but I am not sure whether it is correct to ignore isScalar in the EQ case and why. It seems obvious that the sub_sub_eq test case can be unroll-and-jammed. Bit if we add the same skip to preservesForwardDependence, the test case sub_sub_less is unroll-and-jammed which it must not,

llvm/test/Transforms/LoopUnrollAndJam/dependencies.ll
363

This is not safe to unroll-and-jam. For %N == 2 the excution sequence is (Sa being the first access in the loop body, Sb the second)

Sa(0,0): A[0]
Sb(0,0): A[-1]
Sa(0,1): A[0]
Sb(0,1): A[-1]
Sa(1,0): A[1]
Sb(1,0): A[0]
Sa(1,1): A[1]
Sb(1,1): A[0]

After unroll-and-jam by 2:

Sa(0,0): A[0]
Sb(0,0): A[-1]
Sa(1,0): A[1]
Sb(1,0): A[0]
Sa(0,1): A[0]
Sb(0,1): A[-1]
Sa(1,1): A[1]
Sb(1,1): A[0]

That is, the dependency chain Sa(0,0)->Sa(0,1)->Sb(1,0)->Sb(1,1) has become Sa(0,0)->Sb(1,0)->Sa(0,1)->Sb(1,1) and therefore has been violated.

Whitney updated this revision to Diff 254388.Apr 1 2020, 6:35 PM
Whitney marked 2 inline comments as done.

Thanks Michael for the suggested change, I am still thinking/understanding it.

After a tiring search through the literature I've finally found a paper that states a theorem specifically about safety of unroll-and-jam. See Callahan et al. 1988. Estimating Interlock And Improving Balance For Pipelined Architectures, section 3.5, theorem 4. The theorem generally agrees with the approach of checking for lexicographical positivity, but unfortunately it doesn't say anything about cases where the direction vector elements in between the k (unrolled level) and j (inner level carrying a negative dependence) are non-zero. Another paper, S. Carr and K. Kennedy. 1994. Improving the Ratio of Memory operations to Floating-Point Operations in Loops appears to interpret it as if any negative entry in the direction vector between the unrolled level and the inner-most level is not legal. I find the latter too conservative, for instance if we have:

loop i
  loop j
    loop k
      A(i+1, j+1, k-1) = A(i, j, k)

the direction vector would be [< < >] and the unroll-and-jam would be legal. Checking for lexicographical positivity, as proposed in @Meinersbur 's solution, correctly identifies it as a legal case. The first paper (and others too) also suggest that unroll-and-jam can be viewed as an interchange followed by inner-loop unrolling followed by another interchange. The legality checks for interchange seem to also be overly conservative, for example in this case:

loop i
  loop j
    A(i, j+1) = A(i, j)

After the first interchange and inner loop unrolling we get:

loop j
  loop i
    A(i, j+1) = A(i, j)
    A(i+1, j+1) = A(i+1, j)

now there is an interchange preventing dependence from the first statement to the second with direction vector [< >]. We know, however that if we unrolled i in the original nest, there would be no fusion-preventing dependencies between the unrolled iterations at j-level, so unrol-and-jam is legal.

In conclusion, I find the lexicographical positivity test to be the most accurate and I cannot come up with a counter example that exposes a correctness bug, so I tend to agree with it more than anything else.

This seem to work nicely with that checks, but I am not sure whether it is correct to ignore isScalar in the EQ case and why. It seems obvious that the sub_sub_eq test case can be unroll-and-jammed. Bit if we add the same skip to preservesForwardDependence, the test case sub_sub_less is unroll-and-jammed which it must not

It would not be correct to ignore scalar dependencies, as they carry dependencies across all iterations of a loop, but in cases where the direction at the unrolled level is exactly EQ (eg in sub_sub_eq) we can assume safety without considering the inner levels. The reason is that if the unrolled level is exactly EQ, it will become LT (or GT) after unrolling which causes the dependencies in the inner loops to become non-fusion-preventing because the memory accesses would be disjoint. On the other hand, for cases like sub_sub_less, the dependence carried by the outer loop is LT but the dependence carried by the inner loop is scalar, which implies a lexicographical negativity. You get this check for free in your suggested implementation above because, by design, the direction bits in the DependenceInfo are all set when we have a scalar dependence, and preservesForwardDependence returns false causing the test to identify this case as illegal. We can add an explicit check for isScalar to be more clear, but I think just checking for the direction bits is simpler and good enough.

I don't think we need preservesNonCarriedDependence but we'd need a test for the EQ case, so I'd suggest removing preservesNonCarriedDependence and replacing checkDependency with the following which works well on all the tests and avoids the inexplicable skipping of scalar dependencies.

static bool checkDependency(Instruction *Src, Instruction *Dst,
                            unsigned UnrollLevel, unsigned JamLevel,
                            bool Sequentialized, DependenceInfo &DI) {
  assert(UnrollLevel <= JamLevel);

  if (Src == Dst)
    return true;
  if (isa<LoadInst>(Src) && isa<LoadInst>(Dst))
    return true;

  std::unique_ptr<Dependence> D = DI.depends(Src, Dst, true);
  if (!D)
    return true;
  assert(D->isOrdered() && "Expected an output, flow or anti dep.");

  // Quick bail-out.
  if (D->isConfused())
    return false;

  for (unsigned d = 1; d < UnrollLevel; ++d) {
    // Insert comment justifying this here
    if (!(D->getDirection(d) & Dependence::DVEntry::EQ))
      return true;
  }

  auto UnrollDir = D->getDirection(UnrollLevel);

  // If the distance carried by the unrolled loop is 0, then after unrolling
  // that distance will become non-zero resulting in non-overlapping accesses in
  // the inner loops.
  if (UnrollDir == Dependence::DVEntry::EQ)
    return true;

  if (UnrollDir & Dependence::DVEntry::LT &&
      !preservesForwardDependence(Src, Dst, UnrollLevel, JamLevel,
                                  Sequentialized, D.get()))
    return false;

  if (UnrollDir & Dependence::DVEntry::GT &&
      !preservesBackwardDependence(Src, Dst, UnrollLevel, JamLevel,
                                   Sequentialized, D.get()))
    return false;

  return true;
}

We should also add a test for the following illegal case involving scalar dependence at the outer level:

// loop k
//   loop i
//     loop j
//       A(i-1, j) = A(i, j)

Sounds right. @Whitney Can you update this patch?

Whitney updated this revision to Diff 261716.May 3 2020, 11:51 AM

Updated patch as suggested by @bmahjour and @Meinersbur. Thanks!

Meinersbur added inline comments.May 4 2020, 9:09 AM
llvm/lib/Transforms/Utils/LoopUnrollAndJam.cpp
792

Could you please address this comment?

Whitney updated this revision to Diff 261856.May 4 2020, 9:58 AM
Whitney marked 2 inline comments as done.

Addressed the last comment.

Meinersbur accepted this revision.May 4 2020, 3:22 PM

LGTM, thank you.

This revision is now accepted and ready to land.May 4 2020, 3:22 PM
This revision was automatically updated to reflect the committed changes.