This is an archive of the discontinued LLVM Phabricator instance.

[mlir][Linalg] Add loop.parallel lowering for all Linalg Ops.
ClosedPublic

Authored by mravishankar on Apr 7 2020, 1:50 PM.

Details

Summary

The outer parallel loops of a linalg operation are lowered to
loop.parallel, with the remaining loops lowered to loop.for. This brings
the loop.parallel lowering on par with the loop.for lowering. In the
future the reduction loop could also be lowered to loop.parallel.
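
As a sketch of the intended output (not taken verbatim from the patch's tests), consider linalg.matmul, whose iterator types are (parallel, parallel, reduction); the values %M, %N, %K, %c0 and %c1 are assumed to be defined by the usual dim and constant ops:

```mlir
// Input (iterator types: parallel, parallel, reduction):
linalg.matmul(%A, %B, %C) : memref<?x?xf32>, memref<?x?xf32>, memref<?x?xf32>

// Output: the two outer parallel loops become a single loop.parallel;
// the reduction dimension %k remains a loop.for.
loop.parallel (%i, %j) = (%c0, %c0) to (%M, %N) step (%c1, %c1) {
  loop.for %k = %c0 to %K step %c1 {
    %a = load %A[%i, %k] : memref<?x?xf32>
    %b = load %B[%k, %j] : memref<?x?xf32>
    %c = load %C[%i, %j] : memref<?x?xf32>
    %p = mulf %a, %b : f32
    %s = addf %c, %p : f32
    store %s, %C[%i, %j] : memref<?x?xf32>
  }
}
```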

Also add a utility function that returns the loops that are
created. This requires a change to the EDSC builders to return the
created ops.

Event Timeline

mravishankar created this revision. Apr 7 2020, 1:50 PM

The utility function part looks like quite a massive change for just the purpose of getting the xxxForOp from their respective induction variables.
Why is this a good tradeoff (also, it is not tested FWICT)?

Could it be separated from this revision and evaluated independently as it seems quite orthogonal to the part that handles more cases of Linalg -> ploops ?

The utility function part looks like quite a massive change for just the purpose of getting the xxxForOp from their respective induction variables.
Why is this a good tradeoff (also, it is not tested FWICT)?

If I understand the comment correctly, are you suggesting not changing the EDSC builders to return the operation handles as well? I think all the induction variables are block arguments of the entry block of the loop operations, so I can get the parent op of the block and not modify the EDSC builders. I started down the path of getting the op directly from the builders and just finished it, but I am fine with that approach as well. My initial thought was that getting the op from the induction variables is more of a workaround (WAR), but I don't have enough background here to know whether that is indeed the case.

Could it be separated from this revision and evaluated independently as it seems quite orthogonal to the part that handles more cases of Linalg -> ploops ?

Maybe. I want to do both: lower linalg to ploops, and have the utility function that does so also return the loops created, just like the tileLinalgOp method returns the resulting tiled op and the loops created.

asaadaldien added inline comments. Apr 7 2020, 4:14 PM
mlir/include/mlir/EDSC/Builders.h
222

Can we move this block back to line 79?

bondhugula added a subscriber: andydavis1. (Edited) Apr 7 2020, 8:51 PM

@mravishankar A general question on the direction this is taking: why are we even lowering all of this to loop.parallel and loop.for instead of affine.parallel and affine.for? The conversion from affine.* to loop.* is guaranteed to *always* succeed by design in all cases, and it already does for affine.for -> loop.for. So you get your loop.for's for free (similarly affine.parallel -> loop.parallel). It looks like you are adding more and more code that skips a level of abstraction and introduces a parallel lowering path that would ultimately be redundant / subsumed. I feel this is taking the design and the infrastructure in the wrong direction, more so at each step. @mehdi_amini, @nicolasvasilache, @andydavis1 - has there been any thought and a clear design direction on this? If you go down this path, you'd be forced to duplicate even more of the infrastructure that exists for affine.for onto loop.for, in strictly less powerful ways and without a good reason. There may be a *few* things that you may want to do on loop.for rather than on affine.for, but you could do those anyway after having passed through the affine dialect.
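
For concreteness, a minimal sketch of that conversion on a single loop (assuming %N is a valid affine symbol and the affine-to-std lowering of the time):

```mlir
affine.for %i = 0 to %N step 2 {
  // ... body ...
}

// After lowering, the bound and step become explicit index values:
%c0 = constant 0 : index
%c2 = constant 2 : index
loop.for %i = %c0 to %N step %c2 {
  // ... body ...
}
```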

On a less major note, is there an example here that can't be represented via the affine dialect straightaway - the way it is today? Even all your loop steps are one (at least the ones I can immediately tell from the test cases) - if there are some cases you need that aren't, they could always be normalized to one via affine (without even needing grayboxes for the cases you have).

bondhugula requested changes to this revision. Apr 7 2020, 8:54 PM
bondhugula added inline comments.
mlir/test/Dialect/Linalg/loops.mlir
128

Is there a need to match all of the trailing 'step %{{.*}}'? You always print the step, right?

This revision now requires changes to proceed. Apr 7 2020, 8:54 PM
mehdi_amini added a comment. (Edited) Apr 7 2020, 9:45 PM

FWIW, I entirely agree with @bondhugula's sentiment here!

@bondhugula: Thanks for the comments. I certainly cannot think of a case in Linalg today where the path of going from linalg -> affine -> loop would not cover everything that is currently supported in Linalg. All I was trying to achieve with this change was to finish the work started a while ago of lowering linalg to loop.parallel. From my perspective, loop.parallel and loop.for are in the same group. I certainly didn't intend to make a design- or infrastructure-level decision here.

I think it would be good to reach some consensus here on whether linalg should always be lowered to affine and then to the loop dialect, and whether lowering linalg to loops directly is discouraged. As you mentioned, going from affine.for -> loop.for is always guaranteed, and so is going from affine.parallel -> loop.parallel, so we can always decide to lower from linalg to affine and then to loops for both the parallel and non-parallel versions. My only concern is that if there exists a lowering from linalg to loops, there should also be a way to target loop.parallel, since without it you are dropping semantic information about the computation. This information is really important when lowering from the loop dialect to the GPU dialect (which, AFAIK, was one of the main motivations for having loop.parallel in the first place).
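
To illustrate what is dropped: the matmul from the sketch above, lowered only to loop.for, becomes three structurally identical sequential loops (hypothetical values as before):

```mlir
loop.for %i = %c0 to %M step %c1 {
  loop.for %j = %c0 to %N step %c1 {
    loop.for %k = %c0 to %K step %c1 {
      // ... same scalar body as in the matmul sketch ...
    }
  }
}
```

Nothing here marks %i and %j as parallel, so a later loop-to-GPU mapping would have to re-derive that fact.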

Adding @herhut, who is probably also interested in this and has comments. I am assuming @nicolasvasilache will also provide his thoughts. FWIW, I am fine going through the affine dialect instead of going directly to the loop dialect. I hadn't given thought to using the affine dialect for my use cases, but I definitely don't see an issue with it.

Another point, off the top of my head, if the recommendation is to go through the affine dialect: there is already a mechanism to generate loop.parallel when tiling linalg operations. AFAIK the tile size can be dynamic, and therefore cannot be expressed using affine.parallel loops. So if the code-generation process tiles linalg ops and then lowers the tiled ops to loops, you can end up with the outer loops in the loop dialect but the inner loops in the affine dialect. I am not sure there is an issue with that, because eventually you can lower the affine loops to the loop dialect, but it's something I haven't fully reasoned about myself.

Side note: This change seems bigger than it is cause it is also doing some changes to EDSC which are probably not needed. Will try to remove those.

bondhugula added a comment. (Edited) Apr 8 2020, 12:17 AM

Another point, off the top of my head, if the recommendation is to go through the affine dialect: there is already a mechanism to generate loop.parallel when tiling linalg operations. AFAIK the tile size can be dynamic, and therefore cannot be expressed using affine.parallel loops.

I've pointed out a couple of times that this isn't accurate - you can represent non-constant tile sizes using either affine.parallel or affine.for (https://llvm.discourse.group/t/beginner-q-help-with-loops-affine-linalg/707/4).
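
A minimal sketch of such an encoding (not taken verbatim from the linked post; %N is a hypothetical problem size and %ts a runtime tile size, both used as affine symbols, which makes expressions like d0 * s0 semi-affine):

```mlir
#ntiles = affine_map<()[s0, s1] -> (s0 ceildiv s1)>
#lb = affine_map<(d0)[s0] -> (d0 * s0)>
#ub = affine_map<(d0)[s0, s1] -> (d0 * s0 + s0, s1)>

affine.for %ii = 0 to #ntiles()[%N, %ts] {
  // One parametric tile; the min clamps the last tile to %N.
  affine.for %i = #lb(%ii)[%ts] to min #ub(%ii)[%ts, %N] {
    // ... tile body ...
  }
}
```

(As @ftynse notes below, semi-affine maps are still poorly supported in some lowerings, so treat this as a representational argument rather than a working pipeline.)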

So if the code-generation process tiles linalg ops and then lowers the tiled ops to loops, you can end up with the outer loops in the loop dialect but the inner loops in the affine dialect. I am not sure there is an issue with that, because eventually you can lower the affine loops to the loop dialect, but it's something I haven't fully reasoned about myself.

Second, there is no issue with using a mix of affine and loop dialect ops - '-convert-affine-to-std' should be able to handle it by design. From a mix of affine.for and loop.for, it'll take you to just loop.for's. Please file a bug report if it doesn't!
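
A sketch of such a mix (hypothetical values; the runtime step %ts is fine for loop.parallel, while the inner constant-bound loop stays affine):

```mlir
loop.parallel (%i) = (%c0) to (%ub) step (%ts) {
  affine.for %j = 0 to 128 {
    // ... body ...
  }
}
```

The affine-to-std conversion rewrites only the affine.for, leaving a pure loop-dialect nest.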

ftynse added a comment. Apr 8 2020, 3:41 AM

@mehdi_amini, @nicolasvasilache, @andydavis1 - has there been any thought and a clear design direction on this? If you go down this path, you'd be forced to duplicate even more of the infrastructure that exists for affine.for onto loop.for, in strictly less powerful ways and without a good reason. There may be a *few* things that you may want to do on loop.for rather than on affine.for, but you could do those anyway after having passed through the affine dialect.

I did think about this, and we even had a document, back in the time when we had access to those ;) The discussion you want to have here is mostly independent of this patch and pertains to the motivation for having the loop dialect in the first place. We had that discussion when the dialect was introduced.

Loop dialect was split out from Linalg, where the loop-related ops had been introduced to remove some of the affine constraints that were irrelevant and/or constraining for Linalg's use case. One of the constraints is the need for what I call "affine provenance", i.e. the set of rules spread out in the code that define which SSA values are allowed to be used as dimensions or as symbols in affine constructs. Supporting non-constant steps can be seen as a consequence of lifting those constraints. Linalg had (and still has) a forward-looking design that accounted for things like non-dense buffers and custom types. Plumbing all that through the affine machinery is hard (trust me, I tried).

While one can, in many cases, wiggle their way out of the representation problem, like you suggest with parametric steps, the question of whether one should remains pertinent. It's a complexity trade-off question. We can introduce extra operations and affine maps to model non-constant steps, call this an "affine idiom for parametric steps" and try to discover it when we reason about steps. We can introduce another idiom for another case that doesn't fit affine (let's take indirect accesses). And so on. This introduces extra complexity to the IR and to the code that manipulates it. What's the counterpart? Linalg-based flow does not intend to run affine transformations, so we cannot claim we pay the complexity price for having better optimization. We can spare some lowering code by... writing some other lowering code with more complex abstractions.

On a minor note, have you actually tried running the example you proposed in the linked forum post? :) There are many places where semi-affine maps are poorly supported, or not supported at all. Conversion to LLVM is one of them.

That being said, I was the one who has been arguing that Linalg lowering should go through affine when it can (and so did the design document). The problem is when it cannot, or when doing so would just increase the complexity of the system without visible benefit. Let's assume there are cases where it cannot (today, examples would be: using linalg ops with values that don't qualify as affine symbols, or reductions that explicitly want to go through SSA values; tomorrow, we'll have sparse buffers). Then we need the ability to emit at least some non-affine loops, which this patch contributes. Now, if one specific loop in a Linalg op does not fit into affine, do we actually want a mix of affine and non-affine loops, or do we prefer a single non-affine loop nest that, e.g., preserves the idea of permutability that would no longer be discoverable by affine analysis? I can see value in having both options.

The actual duplication here is between the Linalg->loop.for and Linalg->loop.parallel lowerings, which I pointed out in one of the previous patches. Given that we have the lowering from loop.parallel to loop.for, we should remove Linalg->loop.for and replace it with this. My recollection is that this was the plan, but it requires the lowering to loop.parallel to also support reductions, which this patch does not do.

Transformations on different kinds of loops are another question, unrelated to this patch. Again, I see value in removing affine restrictions or, conversely, having stricter restrictions such as hyper-rectangular, and separating the legality and profitability analysis (likely based on those restrictions) from the IR manipulation logic.

mravishankar marked an inline comment as done.

Removing EDSC-related changes to focus the change only on lowering to loop.parallel ops.

Another point, off the top of my head, if the recommendation is to go through the affine dialect: there is already a mechanism to generate loop.parallel when tiling linalg operations. AFAIK the tile size can be dynamic, and therefore cannot be expressed using affine.parallel loops.

I've pointed out a couple of times that this isn't accurate - you can represent non-constant tile sizes using either affine.parallel or affine.for (https://llvm.discourse.group/t/beginner-q-help-with-loops-affine-linalg/707/4).

Thanks for the pointer. As was done in that post, I had just looked at the op definition and reached that conclusion about parametric tiling. I haven't worked with the affine dialect enough to know about such things; it's definitely something I want to look into in due course.

So if the code-generation process tiles linalg ops and then lowers the tiled ops to loops, you can end up with the outer loops in the loop dialect but the inner loops in the affine dialect. I am not sure there is an issue with that, because eventually you can lower the affine loops to the loop dialect, but it's something I haven't fully reasoned about myself.

Second, there is no issue with using a mix of affine and loop dialect ops - '-convert-affine-to-std' should be able to handle it by design. From a mix of affine.for and loop.for, it'll take you to just loop.for's. Please file a bug report if it doesn't!

Agreed (and said so earlier). It should be OK to mix loop.parallel/loop.for with affine.for/affine.parallel. But based on your post, is it possible to generate affine.for/affine.parallel while tiling linalg ops as well? That way the same benefit of going to affine.for/affine.parallel would be available for the inter-tile loops as well.

@mehdi_amini, @nicolasvasilache, @andydavis1 - has there been any thought and a clear design direction on this? If you go down this path, you'd be forced to duplicate even more of the infrastructure that exists for affine.for onto loop.for, in strictly less powerful ways and without a good reason. There may be a *few* things that you may want to do on loop.for rather than on affine.for, but you could do those anyway after having passed through the affine dialect.

I did think about this, and we even had a document, back in the time when we had access to those ;) The discussion you want to have here is mostly independent of this patch and pertains to the motivation for having the loop dialect in the first place. We had that discussion when the dialect was introduced.

Loop dialect was split out from Linalg, where the loop-related ops had been introduced to remove some of the affine constraints that were irrelevant and/or constraining for Linalg's use case. One of the constraints is the need for what I call "affine provenance", i.e. the set of rules spread out in the code that define which SSA values are allowed to be used as dimensions or as symbols in affine constructs. Supporting non-constant steps can be seen as a consequence of lifting those constraints. Linalg had (and still has) a forward-looking design that accounted for things like non-dense buffers and custom types. Plumbing all that through the affine machinery is hard (trust me, I tried).

While one can, in many cases, wiggle their way out of the representation problem, like you suggest with parametric steps, the question of whether one should remains pertinent. It's a complexity trade-off question. We can introduce extra operations and affine maps to model non-constant steps, call this an "affine idiom for parametric steps" and try to discover it when we reason about steps. We can introduce another idiom for another case that doesn't fit affine (let's take indirect accesses). And so on. This introduces extra complexity to the IR and to the code that manipulates it. What's the counterpart? Linalg-based flow does not intend to run affine transformations, so we cannot claim we pay the complexity price for having better optimization. We can spare some lowering code by... writing some other lowering code with more complex abstractions.

Thanks @ftynse for the really useful background; I was certainly unaware of the discussion here. It would be really good to surface this back up on the discussion forum. But as you mentioned, I hope this patch can be seen as independent of that discussion. I am not trying to weigh the scales one way or the other, but rather just filling in missing pieces where I can and when I need them.

The actual duplication here is between the Linalg->loop.for and Linalg->loop.parallel lowerings, which I pointed out in one of the previous patches. Given that we have the lowering from loop.parallel to loop.for, we should remove Linalg->loop.for and replace it with this. My recollection is that this was the plan, but it requires the lowering to loop.parallel to also support reductions, which this patch does not do.

Agreed that the lowering from linalg to loop.for should eventually become redundant, but right now the lowering to loop.parallel does not support reductions (apologies for misrepresenting earlier that I am "finishing" the linalg to loop.parallel lowering; there are a couple of cases missing). As it stands, we "can" remove the linalg -> loop.for lowering with this patch itself: for the unhandled cases that's the fallback used anyway, so there is no change in functionality from merging the lowerings to loop.for and loop.parallel.

mlir/test/Dialect/Linalg/loops.mlir
128

Probably not. I didn't change what was already there, just the check-prefix. I would rather keep it as is.

nicolasvasilache accepted this revision. Apr 8 2020, 10:51 AM

Thanks Mahesh!

Re Linalg, affine and loop (structured control flow), different transformations can be done at different levels.
Determining where a given transformation should be done is still a very open problem: just because you *can* perform transformation A on representation B does not necessarily mean you *should*.
Tradeoffs include complexity, scalability, maintainability, driving transformations, composability, deriving profitability metrics, etc.
The interesting part IMO is being able to mix these different systems and essentially compose Halide-style, affine-style and Allen-Kennedy-style transformations, declaratively, and evaluate based on concrete data.

All this is completely orthogonal to the fact that we need to build the different paths.
Linalg -> affine -> loops / SCF is one valid path.
Linalg -> loops / SCF is another valid path.
Affine -> loops / SCF is another valid path.
None should be discouraged.

bondhugula added a comment. (Edited) Apr 10 2020, 7:38 PM

Another point, off the top of my head, if the recommendation is to go through the affine dialect: there is already a mechanism to generate loop.parallel when tiling linalg operations. AFAIK the tile size can be dynamic, and therefore cannot be expressed using affine.parallel loops.

I've pointed out a couple of times that this isn't accurate - you can represent non-constant tile sizes using either affine.parallel or affine.for (https://llvm.discourse.group/t/beginner-q-help-with-loops-affine-linalg/707/4).

Thanks for the pointer. As was done in that post, I had just looked at the op definition and reached that conclusion about parametric tiling. I haven't worked with the affine dialect enough to know about such things; it's definitely something I want to look into in due course.

So if the code-generation process tiles linalg ops and then lowers the tiled ops to loops, you can end up with the outer loops in the loop dialect but the inner loops in the affine dialect. I am not sure there is an issue with that, because eventually you can lower the affine loops to the loop dialect, but it's something I haven't fully reasoned about myself.

Second, there is no issue with using a mix of affine and loop dialect ops - '-convert-affine-to-std' should be able to handle it by design. From a mix of affine.for and loop.for, it'll take you to just loop.for's. Please file a bug report if it doesn't!

Agreed (and said so earlier). It should be OK to mix loop.parallel/loop.for with affine.for/affine.parallel. But based on your post, is it possible to generate affine.for/affine.parallel while tiling linalg ops as well? That way the same benefit of going to affine.for/affine.parallel would be available for the inter-tile loops as well.

Yes, of course. That's exactly what I've been pointing out: loop.for/parallel is currently being unnecessarily used in all these cases where it is possible to just go through affine.for/affine.parallel. Even for the more general computation you may need in the future (where dim/symbol restrictions get in the way), with affine.scope ops (renamed from grayboxes) you'd never need to use loop.for/if to lower any tensor/memref indexing computation, and when you lower to loop.for/if, you'd get what you want. Is there a need to maintain a non-unit symbolic step on loop.for? Your bounds are already arbitrary index-typed SSA values. Wouldn't your code be simplified if you just canonicalized to a unit stride? With semi-affine maps, even with the current support, you get strictly *more* than what you get with loop.for's.
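
A sketch of that canonicalization on a single loop (hypothetical values; the ceildiv is spelled out with std ops and assumes a positive step):

```mlir
// Before: non-unit symbolic step.
loop.for %i = %lb to %ub step %s {
  // ... body using %i ...
}

// After: unit stride; %i is reconstructed inside the body.
%range = subi %ub, %lb : index
%sm1   = subi %s, %c1 : index
%tmp   = addi %range, %sm1 : index
%count = divi_signed %tmp, %s : index   // ceildiv(%ub - %lb, %s)
loop.for %ii = %c0 to %count step %c1 {
  %off = muli %ii, %s : index
  %i   = addi %lb, %off : index
  // ... body using %i ...
}
```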

Adding explicit instantiation of linalgLowerOpToLoops for all linalg
named ops.

@bondhugula I am submitting this change now. Based on the comments here, there might be larger discussions to be had about how different dialects interact, which are separate from this change. Please let me know if you have any other comments on the patch itself.

This revision was not accepted when it landed; it landed in state Needs Review. Apr 13 2020, 1:36 PM
This revision was automatically updated to reflect the committed changes.