This is an archive of the discontinued LLVM Phabricator instance.

[mlir] Enable cleanup of single iteration reduction loops being sibling-fused maximally
ClosedPublic

Authored by sumesh13 on Jun 14 2021, 11:22 AM.

Details

Summary

Changes include the following:

  1. Single iteration reduction loops being sibling fused at innermost insertion level are skipped from being considered as sequential loops. Otherwise, the slice bounds of these loops is reset.
  2. Promote loops that are skipped in previous step into outer loops.
  3. Two utility function - buildSliceTripCountMap, getSliceIterationCount - are moved from

mlir/lib/Transforms/Utils/LoopFusionUtils.cpp to mlir/lib/Analysis/Utils.cpp

Diff Detail

Event Timeline

sumesh13 created this revision.Jun 14 2021, 11:22 AM
sumesh13 requested review of this revision.Jun 14 2021, 11:22 AM
vinayaka-polymage requested changes to this revision.Jun 16 2021, 9:17 AM

Thanks for this, I am posting comments based on a very superficial glance, I will go through it in more detail.

  1. I have posted some minor inline comments.
  2. It would be good to wrap the commit msg to the standard wrap width.

Also, moving some of the slice utils from `LoopFusionUtils.h looks good too.

mlir/include/mlir/Transforms/LoopFusionUtils.h
118

Nit: isInnermostInsertionFusion close backtick.

mlir/lib/Analysis/Utils.cpp
1112–1115

This can directly become a ternary operator ?

This revision now requires changes to proceed.Jun 16 2021, 9:17 AM
bondhugula requested changes to this revision.Jun 16 2021, 11:56 AM
bondhugula added inline comments.
mlir/include/mlir/Analysis/Utils.h
190

in the given slice <-- This is not really for a given generic slice but given a slice trip count map where the counts have already been determined to be constants.

mlir/include/mlir/Dialect/Affine/IR/AffineOps.h
407

added is confusing here. Please clarify as to whether it's appended or replaced. Or you can simply drop added to indicate they are completely replaced.

414

Nit: return values of the cloned loop ...

420

Nit: drop blank line.

mlir/lib/Analysis/Utils.cpp
1351

!reductions.empty()

mlir/lib/Dialect/Affine/IR/AffineOps.cpp
1899–1926

It's actually possible to clone the loop here without cloning the individual ops in the old loop's body. You can have the new loop just take the old loop's body - the old arguments just come along with the body. You need no remapping for the args nor does the IV need any remapping. The new block arguments can be appended and so there is no value remapping needed. With a move, the code becomes much shorter and this also saves a lot of reallocations for large loop bodies.

mlir/lib/Transforms/Utils/LoopFusionUtils.cpp
367–368

Reflow.

sumesh13 updated this revision to Diff 353403.Jun 21 2021, 9:59 AM

Address Code review comments

sumesh13 updated this revision to Diff 353414.Jun 21 2021, 10:15 AM
sumesh13 marked 6 inline comments as done.

Address Code review comments

sumesh13 marked 3 inline comments as done.Jun 21 2021, 10:17 AM

Thanks for the feedback. I have addressed the review comments. Please take a look.

bondhugula added inline comments.Jun 21 2021, 11:15 PM
mlir/lib/Analysis/Utils.cpp
1108–1109

This is a stale comment - you may as well drop it in this revision since it's close to your changes.

1130–1131

Typo in this clause near "... and have been ...".

1351

No need of parentheses?

mlir/lib/Dialect/Affine/IR/AffineOps.cpp
1903

Nit: instruction -> operation

1909

I think you don't need to materialize the yield operands into a SmallVector - the operand_range will convert to a ValueRange and the builder should take a ValueRange.

mlir/lib/Transforms/Utils/LoopFusionUtils.cpp
370

Please return a LogicalResult (just like promoteIfSingleIteration).

419

I don't think you need getOperations() - just block.splice may work.

468

Drop blank line.

vinayaka-polymage requested changes to this revision.Jun 22 2021, 3:42 AM

I think, in summary, attempt here is to fuse two loops that are reducing under sibling fusion, which is good. I think, the conditions presented in this patch need to be stricter to make it work. Here is a counter example (output results in incorrect code):

func @reduce_add_f32_f32(%arg0: memref<64x64xf32, 1>, %arg1: memref<1x64xf32, 1>, %arg2: memref<1x64xf32, 1>) {
  %cst_0 = constant 0.000000e+00 : f32
  %cst_1 = constant 1.000000e+00 : f32
  %0 = memref.alloca() : memref<f32, 1>
  %1 = memref.alloca() : memref<f32, 1>
  affine.for %arg3 = 0 to 1 {
    affine.for %arg4 = 0 to 64 {
      %accum = affine.for %arg5 = 0 to 64 iter_args (%prevAccum = %cst_0) -> f32 {
        %4 = affine.load %arg0[%arg5, %arg4] : memref<64x64xf32, 1>
        %5 = addf %prevAccum, %4 : f32
        affine.yield %5 : f32
      }
      %accum_dbl = addf %accum, %accum : f32
      affine.store %accum_dbl, %arg1[%arg3, %arg4] : memref<1x64xf32, 1>
    }
  }
  affine.for %arg3 = 0 to 1 {
    affine.for %arg4 = 0 to 64 {
      %accum = affine.for %arg5 = 0 to 32 iter_args (%prevAccum = %cst_1) -> f32 { // modified the test case on this line 64 => 32
        %4 = affine.load %arg0[%arg5, %arg4] : memref<64x64xf32, 1>
        %5 = mulf %prevAccum, %4 : f32
        affine.yield %5 : f32
      }
      %accum_sqr = mulf %accum, %accum : f32
      affine.store %accum_sqr, %arg2[%arg3, %arg4] : memref<1x64xf32, 1>
    }
  }
  return
}

Result with the proposed change will be :

module  {
  func @reduce_add_f32_f32(%arg0: memref<64x64xf32, 1>, %arg1: memref<1x64xf32, 1>, %arg2: memref<1x64xf32, 1>) {
    %c0 = constant 0 : index
    %cst = constant 0.000000e+00 : f32
    %cst_0 = constant 1.000000e+00 : f32
    %0 = memref.alloca() : memref<f32, 1>
    %1 = memref.alloca() : memref<f32, 1>
    affine.for %arg3 = 0 to 1 {
      affine.for %arg4 = 0 to 64 {
        %2:2 = affine.for %arg5 = 0 to 32 iter_args(%arg6 = %cst_0, %arg7 = %cst) -> (f32, f32) {
          %5 = affine.load %arg0[%arg5, %arg4] : memref<64x64xf32, 1>
          %6 = addf %arg7, %5 : f32
          %7 = affine.load %arg0[%arg5, %arg4] : memref<64x64xf32, 1>
          %8 = mulf %arg6, %7 : f32
          affine.yield %8, %6 : f32, f32
        }
        %3 = addf %2#1, %2#1 : f32
        affine.store %3, %arg1[%c0, %arg4] : memref<1x64xf32, 1>
        %4 = mulf %2#0, %2#0 : f32
        affine.store %4, %arg2[%arg3, %arg4] : memref<1x64xf32, 1>
      }
    }
    return
  }
}

Note that reduction on addition was supposed to happen on 64 elements, but is now happening only on 32.

This revision now requires changes to proceed.Jun 22 2021, 3:42 AM
sumesh13 updated this revision to Diff 354049.Jun 23 2021, 12:12 PM

Address review comments and add check to only do cleanup if the loop trip counts match.

sumesh13 marked 6 inline comments as done.Jun 23 2021, 12:15 PM

Thansk for the additional comments.

I think, in summary, attempt here is to fuse two loops that are reducing under sibling fusion, which is good. I think, the conditions presented in this patch need to be stricter to make it work. Here is a counter example (output results in incorrect code):

func @reduce_add_f32_f32(%arg0: memref<64x64xf32, 1>, %arg1: memref<1x64xf32, 1>, %arg2: memref<1x64xf32, 1>) {
  %cst_0 = constant 0.000000e+00 : f32
  %cst_1 = constant 1.000000e+00 : f32
  %0 = memref.alloca() : memref<f32, 1>
  %1 = memref.alloca() : memref<f32, 1>
  affine.for %arg3 = 0 to 1 {
    affine.for %arg4 = 0 to 64 {
      %accum = affine.for %arg5 = 0 to 64 iter_args (%prevAccum = %cst_0) -> f32 {
        %4 = affine.load %arg0[%arg5, %arg4] : memref<64x64xf32, 1>
        %5 = addf %prevAccum, %4 : f32
        affine.yield %5 : f32
      }
      %accum_dbl = addf %accum, %accum : f32
      affine.store %accum_dbl, %arg1[%arg3, %arg4] : memref<1x64xf32, 1>
    }
  }
  affine.for %arg3 = 0 to 1 {
    affine.for %arg4 = 0 to 64 {
      %accum = affine.for %arg5 = 0 to 32 iter_args (%prevAccum = %cst_1) -> f32 { // modified the test case on this line 64 => 32
        %4 = affine.load %arg0[%arg5, %arg4] : memref<64x64xf32, 1>
        %5 = mulf %prevAccum, %4 : f32
        affine.yield %5 : f32
      }
      %accum_sqr = mulf %accum, %accum : f32
      affine.store %accum_sqr, %arg2[%arg3, %arg4] : memref<1x64xf32, 1>
    }
  }
  return
}

Result with the proposed change will be :

module  {
  func @reduce_add_f32_f32(%arg0: memref<64x64xf32, 1>, %arg1: memref<1x64xf32, 1>, %arg2: memref<1x64xf32, 1>) {
    %c0 = constant 0 : index
    %cst = constant 0.000000e+00 : f32
    %cst_0 = constant 1.000000e+00 : f32
    %0 = memref.alloca() : memref<f32, 1>
    %1 = memref.alloca() : memref<f32, 1>
    affine.for %arg3 = 0 to 1 {
      affine.for %arg4 = 0 to 64 {
        %2:2 = affine.for %arg5 = 0 to 32 iter_args(%arg6 = %cst_0, %arg7 = %cst) -> (f32, f32) {
          %5 = affine.load %arg0[%arg5, %arg4] : memref<64x64xf32, 1>
          %6 = addf %arg7, %5 : f32
          %7 = affine.load %arg0[%arg5, %arg4] : memref<64x64xf32, 1>
          %8 = mulf %arg6, %7 : f32
          affine.yield %8, %6 : f32, f32
        }
        %3 = addf %2#1, %2#1 : f32
        affine.store %3, %arg1[%c0, %arg4] : memref<1x64xf32, 1>
        %4 = mulf %2#0, %2#0 : f32
        affine.store %4, %arg2[%arg3, %arg4] : memref<1x64xf32, 1>
      }
    }
    return
  }
}

Note that reduction on addition was supposed to happen on 64 elements, but is now happening only on 32.

Thanks for pointing this out. I have introduced an additional check. Would this suffice ?

mlir/lib/Dialect/Affine/IR/AffineOps.cpp
1909

Can you pls clarify? am also appending the operands of the yeild operation from the inner for loop.

mlir/lib/Transforms/Utils/LoopFusionUtils.cpp
419

I see there is no splice in block itself.

bondhugula added inline comments.Jun 23 2021, 10:51 PM
mlir/lib/Dialect/Affine/IR/AffineOps.cpp
1909

I missed that - sorry. It should be fine then.

@sumesh13 Could you also please mention whether some of the methods were just moved without any change? That's useful in the commit summary.

mlir/lib/Transforms/Utils/LoopFusionUtils.cpp
388–390

Reflow - please use whole width.

instruction -> operation

bondhugula added inline comments.Jun 23 2021, 11:02 PM
mlir/include/mlir/Dialect/Affine/IR/AffineOps.h
417

After the change to takeBody, this would no longer be a clone! The input operation loop is no longer the same and will have no body. We should document this and the name of this method should ideally change. Do you want to erase loop here as well here? In that case this can be called: replaceWithNewYields?

bondhugula added inline comments.Jun 24 2021, 5:59 AM
mlir/lib/Dialect/Affine/IR/AffineOps.cpp
1878

Since this is exposed in mlir::, you'll need to name it something like cloneForOpWithNewYields.

sumesh13 updated this revision to Diff 354299.Jun 24 2021, 10:37 AM
sumesh13 edited the summary of this revision. (Show Details)

Address code review comments including rename API to replaceForOpWithNewYields

sumesh13 marked 2 inline comments as done.Jun 24 2021, 10:40 AM

@sumesh13 Could you also please mention whether some of the methods were just moved without any change? That's useful in the commit summary.

Thanks for all the comments. Updated commit message and have addressed other comments.

mlir/lib/Dialect/Affine/IR/AffineOps.cpp
1878

Thanks for pointing that out. I have renamed the function to now replaceForOpWithNewYields.

sumesh13 updated this revision to Diff 354322.Jun 24 2021, 11:39 AM
sumesh13 marked an inline comment as done.

Reflow a comment

Reflow a comment

Thanks for addressing comments, sorry for getting back on this so late.

I have added requests for clarifications in a few places. Can you please clarify those ?

mlir/include/mlir/Analysis/Utils.h
52

This function actually requires the outer loop to be parallel, but also to contain reductions inside it, correct ?

mlir/lib/Analysis/Utils.cpp
1137

You can directly use isMaximal here and update the comment above.

1346

Added a comment above. I might be confused here, but it seems like the function checks more that whether the loop is a reduction loop or not.

mlir/lib/Transforms/Utils/LoopFusionUtils.cpp
369

I think, such a merge of an inner loop should happen into its parent loop, when the bounds exactly match. Is this already being taken care of somewhere else ?

401

As ops are being moved outside, care must be taken to make sure that no dependences are being violated. In sibling fusion, it is OK, as there are read-read dependences. So, can you please clarify if such a promotion of loops should happen only in sibling fusions ?

433

Depending on the question above, if promotion of reducing loops should happen only in sibling fusion, the option here needs to be renamed.

sumesh13 updated this revision to Diff 354961.Jun 28 2021, 11:42 AM

Address comments

sumesh13 marked 2 inline comments as done.Jun 28 2021, 11:50 AM

@vinayaka-polymage Thanks for additional comments. Can you pls check if my clarifications address your concerns?

mlir/include/mlir/Analysis/Utils.h
52

No, this check only does a limited if a loop is reduction. The caller only calls it for loops determined to be sequential. Pls let me know if its not clear and some clarification might be helpful.

52

No, the check is specifcially if its a reduction loop. There is outer logic i utils.cpp that is only invoking it for sequential loops. Similarly while fusing, its only needed to check if the particular loop is reduction. Please let me know if there is clarifciation that can help here.

mlir/lib/Analysis/Utils.cpp
1346

Addressed the comment before.

mlir/lib/Transforms/Utils/LoopFusionUtils.cpp
369

The loop bounds are reset earlier if the bounds do not match. And in this case the merge is done only for unit slice. Do you still see it necessary to check the bounds?

401

Yeah, I have renamed the option in the caller.

sumesh13 updated this revision to Diff 354974.Jun 28 2021, 11:57 AM

Add an extra clarification for when the promotion API can be safely used.

sumesh13 marked an inline comment as done.Jun 28 2021, 11:58 AM
sumesh13 added inline comments.
mlir/lib/Transforms/Utils/LoopFusionUtils.cpp
369

I have added a clarification.

bondhugula added inline comments.Jul 3 2021, 10:17 AM
mlir/test/Transforms/loop-fusion.mlir
3153

This whole test case has become too long (3144 lines) and we should really split this file - perhaps write after this PR in an NFC PR.

bondhugula requested changes to this revision.Jul 3 2021, 10:25 AM
bondhugula added inline comments.
mlir/lib/Transforms/Utils/LoopFusionUtils.cpp
368–369

This doesn't look really clean to have this method named and implemented this way with a generic single argument, and then having a comment and TODO stating this only should be used for sibling fusion. If we do this, we should have a boolean argument siblingFusion set to true/false and it should work for both cases.

369

Reflow to use the whole line - here and anywhere else. Please make a sweep as reviewers really can't comment everywhere.

401–402

We shouldn't be having such TODOs - we should address this via a proper function signature.

This revision now requires changes to proceed.Jul 3 2021, 10:25 AM
sumesh13 updated this revision to Diff 356784.Jul 6 2021, 12:44 PM
sumesh13 set the repository for this revision to rG LLVM Github Monorepo.

Address comments

  • Reflow all coments
    • Additional argument to promoteSingleIterReduction to control when operations are moved out
sumesh13 marked 2 inline comments as done and an inline comment as not done.Jul 6 2021, 12:46 PM
sumesh13 added inline comments.
mlir/test/Transforms/loop-fusion.mlir
3153

Sure...I can move the purely sibling fusion related tests to a new file.

Thanks for addressing the comments I had. A few more minor comments. I'll just defer to @vinayaka-polymage for the other logic part. Please do wait for his acceptance before merging.

mlir/lib/Transforms/LoopFusion.cpp
1782–1785

This change requires a comment.

mlir/lib/Transforms/Utils/LoopFusionUtils.cpp
382–383

You don't need a static_cast here.

416

Nit: Use auto *it to avoid clang-tidy warning.

mlir/test/Transforms/loop-fusion.mlir
3265

This trailing comment isn't clear. Please use a separate line and rephrase a bit.

bondhugula accepted this revision.Jul 8 2021, 7:54 PM
sumesh13 updated this revision to Diff 357419.Jul 8 2021, 10:58 PM
sumesh13 marked an inline comment as not done.

Adrress review comments

sumesh13 marked 6 inline comments as done.Jul 8 2021, 11:02 PM
sumesh13 added inline comments.
mlir/lib/Transforms/Utils/LoopFusionUtils.cpp
416

Didnt see a clang-tidy warning in this case.

Changes look good now, minor clarification request.

mlir/lib/Analysis/Utils.cpp
1346

Minor:
I think, the function checks that the loop forOp is parallel and also contains parallel reductions loops. Line 1348 checks that the outer loop is parallel. Can you please clarify?
A sequential outer loop containing a reduction loop inside it will result in false when passed to this function, right ?

sumesh13 added inline comments.Jul 11 2021, 10:35 PM
mlir/lib/Analysis/Utils.cpp
1346

Yes, the intent here was to say that the loop if not for the reduction is parallel and hence need not have its bounds reset. So a sequential outer loop with a reduction inner loop will result in false. Pls let me know if you see an opportunity to broaden the case.

mlir/lib/Analysis/Utils.cpp
1346

OK, so in that case, function name and its description in comments will have to change. As this is a utility, anyone might use it in future for detecting reduction loops. If the check is a custom one, name and description will have to reflect that. Does this make sense ?

sumesh13 added inline comments.Jul 12 2021, 7:18 PM
mlir/lib/Analysis/Utils.cpp
1346

Sure..I think that makes sense. Any suggestion? How about isNonSequentialDueToReduction?

Suggestion

mlir/lib/Analysis/Utils.cpp
1346

My suggestion is isLoopParallelAndContainsReductions.

sumesh13 updated this revision to Diff 358157.Jul 12 2021, 10:29 PM

isReductionLoop->isLoopParallelAndContainsReductionLoop

sumesh13 marked an inline comment as done.Jul 12 2021, 10:29 PM

Thanks for addressing all the comments, I will make a final pass on it in a day.

Thanks for addressing all the comments, sorry once again for taking so long to review. As some ops were moving out from one for op to its parent, I wanted to closely review the scenarios.

I was mainly considering cases with bounds that don't align well, like the following example, fusion produces incorrect code (even without changes suggested here). So, I think, we should separately address them, they are not introduced by this change.

func @reduce_add_f32_f32(%arg0: memref<64x64xf32, 1>, %arg1: memref<1x63xf32, 1>, %arg2: memref<1x64xf32, 1>) {
  %cst_0 = constant 0.000000e+00 : f32
  %cst_1 = constant 1.000000e+00 : f32
  %0 = memref.alloca() : memref<f32, 1>
  %1 = memref.alloca() : memref<f32, 1>
  affine.for %arg3 = 0 to 1 {
    affine.for %arg4 = 0 to 63 {
      %accum = affine.for %arg5 = 0 to 64 iter_args (%prevAccum = %cst_0) -> f32 {
        %4 = affine.load %arg0[%arg5, %arg4] : memref<64x64xf32, 1>
        %5 = addf %prevAccum, %4 : f32
        affine.yield %5 : f32
      }
      %accum_dbl = addf %accum, %accum : f32
      affine.store %accum_dbl, %arg1[%arg3, %arg4] : memref<1x63xf32, 1>
    }
  }
  affine.for %arg3 = 0 to 1 {
    affine.for %arg4 = 0 to 32 {
      %accum = affine.for %arg5 = 0 to 64 iter_args (%prevAccum = %cst_1) -> f32 {
        %4 = affine.load %arg0[%arg5, %arg4] : memref<64x64xf32, 1>
        %5 = mulf %prevAccum, %4 : f32
        affine.yield %5 : f32
      }
      %accum_sqr = mulf %accum, %accum : f32
      affine.store %accum_sqr, %arg2[%arg3, %arg4] : memref<1x64xf32, 1>
    }
  }
  return
}

This set of changes look good to me!

This revision is now accepted and ready to land.Jul 14 2021, 10:49 AM

Thanks @bondhugula @vinayaka-polymage for the review and helpful comments!!

sumesh13 updated this revision to Diff 358687.Jul 14 2021, 11:58 AM

FIx a clang-format issue

sumesh13 updated this revision to Diff 358688.Jul 14 2021, 11:59 AM

Fix a clang-format issue

sumesh13 updated this revision to Diff 359046.Jul 15 2021, 10:37 AM

One asan fix