This is an archive of the discontinued LLVM Phabricator instance.

[MLIR][affine-loop-fusion] Handle defining ops between the source and dest loops
ClosedPublic

Authored by tungld on Feb 18 2021, 8:55 PM.

Details

Summary

This patch handles defining ops between the source and dest loop nests, and prevents loop nests with iter_args from being fused.

If there is any SSA value in the dest loop nest whose defining op has dependence from the source loop nest, we cannot fuse the loop nests.

If there is a affine.for with iter_args, prevent it from being fused.

Diff Detail

Event Timeline

tungld created this revision.Feb 18 2021, 8:55 PM
tungld requested review of this revision.Feb 18 2021, 8:55 PM
tungld edited the summary of this revision. (Show Details)
tungld updated this revision to Diff 324889.Feb 18 2021, 10:58 PM

Remove a redundant method

tungld updated this revision to Diff 324908.Feb 19 2021, 1:08 AM

Clean the code

tungld edited the summary of this revision. (Show Details)Feb 19 2021, 1:15 AM
tungld updated this revision to Diff 324957.Feb 19 2021, 5:07 AM

Check causal dependence

Hey tungld,

Thanks for the patch! There is a different but somehow related issue that is being address here: https://reviews.llvm.org/D97032.
I would like to know what @bondhugula and @andydavis1 think but I think we might want to add the proper %0 -> %i1 dependence to the MDG,
instead of special-casing the load scenario. Even though it's not strictly a memref dependence, it's a memref-related dependence between two graph nodes.

Interestingly, we should also consider more generic cases. For example:

affine.for %i0 = 0 to 10 {
  affine.store %cst, %b[] : memref<f32>
  affine.store %cst, %a[%i0] : memref<10xf32>
}

%0 = "unknown_op" (%b) : (memref<f32>) -> f32

affine.for %i1 = 0 to 10 {
  %1 = affine.load %a[%i1] : memref<10xf32>
  %2 = divf %0, %1 : f32
}
affine.for %i0 = 0 to 10 {
  affine.store %cst, %b[] : memref<f32>
  affine.store %cst, %a[%i0] : memref<10xf32>
}

%0 = affine.for ...

affine.for %i1 = 0 to 10 {
  %1 = affine.load %a[%i1] : memref<10xf32>
  %2 = divf %0, %1 : f32
}

I think the former would be addressed by https://reviews.llvm.org/D97032 and the latter would need support for iter_args in the fusion algorithm
so I wouldn't try to handle them right now. However, we have to make sure that this solution paves the way to support the more complex cases.

Thanks,
Diego

bondhugula added a comment.EditedFeb 20 2021, 8:43 AM

Hey tungld,

Thanks for the patch! There is a different but somehow related issue that is being address here: https://reviews.llvm.org/D97032.
I would like to know what @bondhugula and @andydavis1 think but I think we might want to add the proper %0 -> %i1 dependence to the MDG,
instead of special-casing the load scenario. Even though it's not strictly a memref dependence, it's a memref-related dependence between two graph nodes.

Not really. This one is just a pure SSA dependence and nothing to do with memory dependences. The value may be loaded into from a memref but it's a value. The memref user op is a load which may participate in separate/other memref dependences.

Interestingly, we should also consider more generic cases. For example:

affine.for %i0 = 0 to 10 {
  affine.store %cst, %b[] : memref<f32>
  affine.store %cst, %a[%i0] : memref<10xf32>
}

%0 = "unknown_op" (%b) : (memref<f32>) -> f32

affine.for %i1 = 0 to 10 {
  %1 = affine.load %a[%i1] : memref<10xf32>
  %2 = divf %0, %1 : f32
}
affine.for %i0 = 0 to 10 {
  affine.store %cst, %b[] : memref<f32>
  affine.store %cst, %a[%i0] : memref<10xf32>
}

%0 = affine.for ...

affine.for %i1 = 0 to 10 {
  %1 = affine.load %a[%i1] : memref<10xf32>
  %2 = divf %0, %1 : f32
}

I think the former would be addressed by https://reviews.llvm.org/D97032 and the latter would need support for iter_args in the fusion algorithm

The author of D97032 is already taking care of all side-effecting intervening ops (like writes, frees, unknown side-effecting ops like call) in two revisions in a generic way using memory effects / side-effect interfaces; those are pretty simple patches. The first of those is D97032 that takes care of memref writes/frees. But this patch addresses a different issue. This is not a memory dependence but an SSA dependence like other SSA edges.

so I wouldn't try to handle them right now. However, we have to make sure that this solution paves the way to support the more complex cases.

Thanks,
Diego

bondhugula requested changes to this revision.Feb 20 2021, 8:52 AM

Some comments on clarifying doc / code comments.

mlir/lib/Transforms/LoopFusion.cpp
812–813

This line always needed a comment. Stores anyway don't define SSA values.

1490–1491

Nit on terminology: please either use "... depends on ..." or "has a dependence from" / "has a dependence to".

1496–1497

in between the loops -> in between the loop nests ? It's otherwise ambiguous.

"having dependence on the source loop" -> I think this needs to be rephrased for accuracy as well. "... with a user in .."?

This revision now requires changes to proceed.Feb 20 2021, 8:52 AM
tungld updated this revision to Diff 325357.Feb 21 2021, 5:25 PM

Reflect reviewers' comments

tungld marked 3 inline comments as done.EditedFeb 21 2021, 5:49 PM

Hi @bondhugula, @dcaballe,

Thanks for your comments! I just updated the patch.

This patch is to add one additional node in to MDG, which is from a defining op of an SSA value to its users. So, I think it can handle this

affine.for %i0 = 0 to 10 {
  affine.store %cst, %b[] : memref<f32>
  affine.store %cst, %a[%i0] : memref<10xf32>
}

%0 = affine.for ...

affine.for %i1 = 0 to 10 {
  %1 = affine.load %a[%i1] : memref<10xf32>
  %2 = divf %0, %1 : f32
}

because %0 = affine.for defines an SSA value.

I tried to run with the following program:

affine.for %arg2 = 0 to 10 {
   affine.store %cst, %arg1[] : memref<f32>
   affine.store %cst, %arg0[%arg2] : memref<10xf32>
}
%0 = affine.for %arg2 = 0 to 10 step 2 iter_args(%arg3 = %cst_0) -> (f32) {
   %1 = affine.load %arg0[%arg2] : memref<10xf32>
   %2 = addf %arg3, %1 : f32
   affine.yield %2 : f32
}
affine.for %arg2 = 0 to 10 {
    %1 = affine.load %arg0[%arg2] : memref<10xf32>
    %2 = divf %0, %1 : f32
}

and got

%0 = affine.for %arg2 = 0 to 10 step 2 iter_args(%arg3 = %cst_0) -> (f32) {
  affine.store %cst, %arg1[] : memref<f32>
  affine.store %cst, %arg0[%arg2] : memref<10xf32>
  %1 = affine.load %arg0[%arg2] : memref<10xf32>
  %2 = addf %arg3, %1 : f32
  affine.yield %2 : f32
}
affine.for %arg2 = 0 to 10 {
   %1 = affine.load %arg0[%arg2] : memref<10xf32>
   %2 = divf %0, %1 : f32
}

It seemed it could identify the defining op %0 = affine.for..., but the output program was not semantically similar to the input program due to step in affine.for. I think this is somewhat related to what @dcaballe mentioned

so I wouldn't try to handle them right now. However, we have to make sure that this solution paves the way to support the more complex cases.

dcaballe requested changes to this revision.Feb 21 2021, 11:43 PM

I would like to know what @bondhugula and @andydavis1 think but I think we might want to add the proper
%0 -> %i1 dependence to the MDG,

Actually, as you could infer from my answer, I completely misunderstood the implementation of the patch! As
@tungld mentioned, it seems we were already adding edges for some SSA dependences:

Add dependence edges between nodes which produce SSA values and their
users. Load ops can be considered as the ones producing SSA values.

This is what I was referring to. I was not aware of it :).

However, the following code would still fail after the patch, right @tungld?

affine.for %arg2 = 0 to 10 {
  %2 = affine.load %arg1[] : memref<f32>
  affine.store %2, %arg0[%arg2] : memref<10xf32>
}
%0 = affine.load %arg1[] : memref<f32>
%1 = addf %0, %0 : f32
affine.for %arg2 = 0 to 10 {
  %2 = affine.load %arg0[%arg2] : memref<10xf32>
  %3 = divf %1, %2 : f32
}

Maybe we should create a single node that includes all the top-level ops between loops?
That would require further discussion, though. The current fix should be good for now.

It seemed it could identify the defining op %0 = affine.for..., but the output program was not
semantically similar to the input program due to step in affine.for. I think this is somewhat
related to what @dcaballe mentioned

Sorry, I should've mentioned that loops with iter_args are not supported. Affine analysis is only looking at
memory dependences and not taking SSA loop-carried dependences into account. We should prevent loops
with iter_args to be fused. IIRC, @bondhugula suggested that we could demote these SSA values to memrefs
(reg2mem) before fusion so that all the analysis happens through memory.

The patch looks great! Just one more comment.

mlir/lib/Transforms/LoopFusion.cpp
982

Thanks!

1499

We are just checking for a dependence that would prevent finding an insertion point for the fused loop. Could we please move this to getFusedLoopNestInsertionPoint? In that way, sibling fusion will also get it.

This revision now requires changes to proceed.Feb 21 2021, 11:43 PM

However, the following code would still fail after the patch, right @tungld?

affine.for %arg2 = 0 to 10 {
  %2 = affine.load %arg1[] : memref<f32>
  affine.store %2, %arg0[%arg2] : memref<10xf32>
}
%0 = affine.load %arg1[] : memref<f32>
%1 = addf %0, %0 : f32
affine.for %arg2 = 0 to 10 {
  %2 = affine.load %arg0[%arg2] : memref<10xf32>
  %3 = divf %1, %2 : f32
}

It worked because we had load in the first loop which has no write-read dependence to %0 = affine.load %arg1[] : memref<f32>. I got this output using this patch:

%0 = affine.load %arg1[] : memref<f32>
%1 = addf %0, %0 : f32
affine.for %arg2 = 0 to 10 {
  %2 = affine.load %arg1[] : memref<f32>
  affine.store %2, %arg0[%arg2] : memref<10xf32>
  %3 = affine.load %arg0[%arg2] : memref<10xf32>
  %4 = divf %1, %3 : f32
}

To make dependence, I changed to use store as follows:

%cst = constant 0.000000e+00 : f32
  affine.for %arg2 = 0 to 10 {
    affine.store %cst, %arg1[] : memref<f32>
    affine.store %cst, %arg0[%arg2] : memref<10xf32>
  }
  %0 = affine.load %arg1[] : memref<f32>
  %1 = addf %0, %0 : f32
  affine.for %arg2 = 0 to 10 {
      %2 = affine.load %arg0[%arg2] : memref<10xf32>
        %3 = divf %1, %2 : f32
  }

and the output was correct:

%cst = constant 0.000000e+00 : f32
    affine.for %arg2 = 0 to 10 {
      affine.store %cst, %arg1[] : memref<f32>
      affine.store %cst, %arg0[%arg2] : memref<10xf32>
    }
    %0 = affine.load %arg1[] : memref<f32>
    %1 = addf %0, %0 : f32
    affine.for %arg2 = 0 to 10 {
      %2 = affine.load %arg0[%arg2] : memref<10xf32>
      %3 = divf %1, %2 : f32
    }

I think the add between affine.load and affine.for was handled by hasNonAffineUsersOnThePath.

Could we please move this to getFusedLoopNestInsertionPoint? In that way, sibling fusion will also get it.

Yes, let me do this, and come back soon.

We should prevent loops with iter_args to be fused

Hi Diego, do you prefer to have it in this patch? I could do it by simply checking the source and dest affine.for, if one of them uses iter_args (or returns a SSA value), we prevent them to be fused.

tungld updated this revision to Diff 325421.Feb 22 2021, 5:18 AM

Move the check into getFusedLoopNestInsertionPoint and prevent fusing loop nests that return values

tungld marked an inline comment as done.Feb 22 2021, 5:24 AM

I moved the check of defining ops into getFusedLoopNestInsertionPoint and prevented loop nests with values from being fused. Also added TODO for loop nests with values. Thanks!

bondhugula added a comment.EditedFeb 22 2021, 5:25 AM

We should prevent loops with iter_args to be fused

Hi Diego, do you prefer to have it in this patch? I could do it by simply checking the source and dest affine.for, if one of them uses iter_args (or returns a SSA value), we prevent them to be fused.

Several affine passes need to be updated to correctly handle / bail out on iter_args. Loop unroll was recently fixed in its presence. I think it could be done in another patch.

bondhugula accepted this revision.Feb 22 2021, 8:47 AM

LGTM.

mlir/lib/Transforms/LoopFusion.cpp
373

Should this be named gatherNonMemRefDefiningNodes? I'm a bit confused. Or is it clear from other context?

mlir/test/Transforms/loop-fusion.mlir
2959

This TODO could be confusing. It's not clear whether handling return values/iter_args in fusion is worthwhile; it may be better/simpler to do a reg2mem, do the fusion without any iter_args, and then again do mem2reg. You can either update the comment to reflect this or simply drop the comment.

If you have fixed for the iter_args handling, please also update the commit summary to reflect that.

dcaballe accepted this revision.Feb 22 2021, 10:08 AM

I think the add between affine.load and affine.for was handled by hasNonAffineUsersOnThePath.

Great!

LGTM. It should be good to go after addressing the remaining minor comments. Thanks!

mlir/test/Transforms/loop-fusion.mlir
2938

typo

2952

Please, add checks for ops in the loop body. Loops could be fused without removing the src loop.

2959

+1

This revision is now accepted and ready to land.Feb 22 2021, 10:08 AM
tungld updated this revision to Diff 325738.Feb 23 2021, 4:15 AM

Edit comments and lit tests

tungld edited the summary of this revision. (Show Details)Feb 23 2021, 4:39 AM
tungld marked 3 inline comments as done.Feb 23 2021, 4:50 AM

Hi Uday and Diego,

Thank you so much for your comments! I just updated the patch.

I don't have permission to land this, please help me to do it. My "name (email)" is "Tung D. Le (tung@jp.ibm.com)".

Thanks,
Tung.

mlir/lib/Transforms/LoopFusion.cpp
373

By definition MDG edge, non-memref value will refer to the dependence that one op defines values used in another op. Anyway, I updated comments here to make it clearer.

Hi Uday and Diego,

Thank you so much for your comments! I just updated the patch.

I don't have permission to land this, please help me to do it. My "name (email)" is "Tung D. Le (tung@jp.ibm.com)".

Thanks,
Tung.

@tungld could you please rebase and fix a conflict due to another revision that updated fusion?

tungld updated this revision to Diff 325936.Feb 23 2021, 4:47 PM
tungld marked an inline comment as done.

Rebase and fix a conflict

@tungld could you please rebase and fix a conflict due to another revision that updated fusion?

Done. Thanks!

@dcaballe @bondhugula any further comments? Thanks!

dcaballe accepted this revision.Feb 24 2021, 6:00 PM

LGTM! Just minor comments. I think Uday planned to commit it. Otherwise, I could commit it tomorrow together with one of mine.

mlir/test/Transforms/loop-fusion.mlir
2985

You can remove the %{{.*}} = if you are not capturing the return value.

2987

Just matching the ops here would suffice:

// CHECK-NEXT:    affine.load
// CHECK-NEXT:    affine.yield
tungld updated this revision to Diff 326267.Feb 24 2021, 7:08 PM

Edit lit tests according to @dcaballe's comments.

tungld marked 2 inline comments as done.Feb 24 2021, 7:09 PM

Just minor comments.

Done. Thanks!