
Reorder linalg.conv indexing_maps loop order
ClosedPublic

Authored by asaadaldien on Wed, Sep 16, 11:25 AM.

Details

Summary

Change the indexing maps to iterate over (b, x0, x1, z0, z1, q, k) instead of (b, x0, x1, k, q, z0, z1) when evaluating the convolution expression:
Y[b, x0, x1, k] = sum(W[z0, z1, q, k] * X[b, x0 + z0, x1 + z1, q], z0, z1, q)

This allows LLVM auto-vectorization to kick in and improves locality, resulting in significant performance improvements.

Diff Detail

Event Timeline

asaadaldien created this revision.Wed, Sep 16, 11:25 AM
Herald added a project: Restricted Project. · View Herald TranscriptWed, Sep 16, 11:25 AM
asaadaldien requested review of this revision.Wed, Sep 16, 11:25 AM
mravishankar accepted this revision.Fri, Sep 18, 6:29 AM

Looks good after comment is addressed.

mlir/include/mlir/Dialect/Linalg/IR/LinalgStructuredOps.td
312

Is this more than 80 characters? Please adjust this.

Also could you change the order of statements to this

SmallVector<StringRef, 8> iters;
iters.reserve(...);
iters.resize(npar - getNumOutputFeatureDimensions(), getParallelIteratorTypeName());
iters.append(..);
iters.append(..);
This revision is now accepted and ready to land.Fri, Sep 18, 6:29 AM
nicolasvasilache requested changes to this revision.Fri, Sep 18, 7:01 AM

Putting a blocker to get some discussion here.

The original definition was following TF; it is not anymore.
Do we consider this to not be a problem?

Putting the loops in the "right" order is independent from the op spec.
You should be able to just use tiling without any op definition change (tile by 1-1-1-1-1 + perm for scalar loop-level).

More generally, the way I am approaching this longer term is that this magic "embed the world" op would disappear in favor of more, smaller named ops.
Transformations would look like: named_op1 -> generic form -> named_op2 and I think a lot of these things can be automated from the TC form.
My overall concern is whether we are going towards relying more and more on an op with very fat one-off semantics.

I won't oppose it strongly for now but the conv op will most likely be a prime candidate for deletion when the infra is more mature, so I'd be cautious on putting a lot of effort on transformations / optimizations / perf results if they don't generalize to linalg.generic.

Also, the doc would need to be updated too.

This revision now requires changes to proceed.Fri, Sep 18, 7:01 AM


Actually, a better way to have handled this is to just lower it to generic form and do the interchange while lowering, i.e. use the LinalgLoopInterchangePattern on the linalg.conv operation. The issue with this, AFAICS, is

  1. Since the indexing maps and iterator types are not mutable for named ops (as it should be) the linalg.conv needs to be lowered to its generic form while interchanging loops.
  2. To do (1) there needs to be a way to get the body of the generic op. For the "legacy" named ops, these are hard-coded when they are lowered to loops. Instead the named ops need an interface that can be used to generate the body. If we had that, then I would favor the approach of keeping the conv definition as is and reordering the loops as a transformation.

Putting a blocker to get some discussion here.

The original definition was following TF; it is not anymore.
Do we consider this to not be a problem?

Which definition did we follow? I am not familiar with an explicit loop-order definition; the only one I can see is the expression in the Linalg doc, which doesn't imply the order, no?

Putting the loops in the "right" order is independent from the op spec.
You should be able to just use tiling without any op definition change (tile by 1-1-1-1-1 + perm for scalar loop-level).

Maybe I am not following, but this will require tiling the entire loop nest by 1 to reorder it?

More generally, the way I am approaching this longer term is that this magic "embed the world" op would disappear in favor of more, smaller named ops.
Transformations would look like: named_op1 -> generic form -> named_op2 and I think a lot of these things can be automated from the TC form.
My overall concern is whether we are going towards relying more and more on an op with very fat one-off semantics.

I won't oppose it strongly for now but the conv op will most likely be a prime candidate for deletion when the infra is more mature, so I'd be cautious on putting a lot of effort on transformations / optimizations / perf results if they don't generalize to linalg.generic.

The reason I am adding this is to make it competitive with the alternatives; more specifically, I am working on Conv -> im2col + matmul for IREE. When convolving a 1x7x7x512xf32 input with a 3x3x512x512 filter (bottom layers of ResNet-50, with massive filter banks), the conv e2e CPU time without this change is 1718 ms; with this change it is 28 ms, which is better than my current im2col implementation (~60 ms/img).

Can you please elaborate more on "don't generalize to linalg.generic"?

Also, the doc would need to be updated too.


Actually, a better way to have handled this is to just lower it to generic form and do the interchange while lowering, i.e. use the LinalgLoopInterchangePattern on the linalg.conv operation. The issue with this, AFAICS, is

  1. Since the indexing maps and iterator types are not mutable for named ops (as it should be) the linalg.conv needs to be lowered to its generic form while interchanging loops.
  2. To do (1) there needs to be a way to get the body of the generic op. For the "legacy" named ops, these are hard-coded when they are lowered to loops. Instead the named ops need an interface that can be used to generate the body. If we had that, then I would favor the approach of keeping the conv definition as is and reordering the loops as a transformation.

+1, I am favoring this approach. I think having named ops implement the same traits as generic ops (indexing_maps attribute, iterator_types attribute, body block) will make applying transformations to LinalgOp fairly easy.

Thanks for discussing, some comments.

Actually, a better way to have handled this is to just lower it to generic form and do the interchange while lowering, i.e. use the LinalgLoopInterchangePattern on the linalg.conv operation.
...
Since the indexing maps and iterator types are not mutable for named ops (as it should be) the linalg.conv needs to be lowered to its generic form while interchanging loops.

Yes, this is what I meant with named -> generic -> named.
Additionally we would "match" back from generic to named (if we needed it).
Both emitter and matchers should be auto-generated.

Maybe I am not following, but this will require tiling the entire loop nest by 1 to reorder it?

For that very specific case yes, but lowering to linalg.generic and permuting there would be more general.

Instead the named ops need an interface that can be used to generate the body.

We have that already, see here
What is not there is the matcher going from generic to a known named form (which may not be strictly needed in this use case).
I expect this matcher to just use the regionBuilder and compare the regions while allowing for permutations to be a little more robust.

Which definition did we follow? I am not familiar with an explicit loop-order definition; the only one I can see is the expression in the Linalg doc, which doesn't imply the order, no?

Indeed, the source of truth is just about layout not loop order.

Can you please elaborate more on "don't generalize to linalg.generic"?

Basically, assume you have named_op1 and you want to get to X + named_op2 (examples: transposes for layout, peeling of some loops to rewrite conv as matmul, canonicalize some 1s away, permute some iterators etc).
The claim is we will likely want to go through linalg.generic, perform the transformation there, and then come back to named form.
The emitter / matcher from named to generic forms should be auto-generated as sketched above.

+1, I am favoring this approach. I think having named ops implement the same traits as generic ops (indexing_maps attribute, iterator_types attribute, body block) will make applying transformations to LinalgOp fairly easy.

We already have that, which is what I was trying to nudge you to think about :)

In general, I'd just say: do what is the simplest for now, with the understanding that we may have to change when we unify things.
But don't get too attached to a particular impl if it requires specific op knowledge and does not go through the generic path, as things will evolve as we understand more.

Of course, if you have cycles to explore having named ops implement the same traits as generic ops (which will make applying transformations to LinalgOp fairly easy), that would be a very cool (and much needed) step forward :)!

Rescinding my requested changes, I'll let you guys decide the priorities.

Thanks!

nicolasvasilache accepted this revision.Fri, Sep 18, 8:37 AM
This revision is now accepted and ready to land.Fri, Sep 18, 8:37 AM

It's a bit odd to see the loop ordering of a *lowering conversion* change because the output improves performance with a specific backend/compiler and for specific reasons. Moreover, that way, this would keep changing/evolving and be sensitive to common downstream paths and sensitive to LLVM's opt pipeline. In the absence of any target / scheduling info, the order to choose is typically expected to be just the most intuitive / canonical and shouldn't keep changing. You need an optimization mechanism if you need a better one.

bondhugula added inline comments.Fri, Sep 18, 1:03 PM
mlir/test/Dialect/Linalg/tile_conv.mlir
1

Why has the number of tile sizes increased here? If this is fixing another bug, please mention that in the summary.

It's a bit odd to see the loop ordering of a *lowering conversion* change because the output improves performance with a specific backend/compiler and for specific reasons. Moreover, that way, this would keep changing/evolving and be sensitive to common downstream paths and sensitive to LLVM's opt pipeline. In the absence of any target / scheduling info, the order to choose is typically expected to be just the most intuitive / canonical and shouldn't keep changing. You need an optimization mechanism if you need a better one.

The current lowering is expecting an NHWC o


The more I think about global Linalg transformations (something that can work on the base LinalgOp type), the more I think this refactoring is important. I will sneak in cycles to do it =)


I agree with you that a lowering conversion shouldn't change because of specific perf/backend improvements, but this particular change takes care of the defaults (linear layouts, NHWC order, etc.), so it sets the natural iteration order (n, h, w, kh, kw, ci, co) of the naive C++ multi-array expression:
Y[n, h, w, co] += W[kh, kw, ci, co] * X[n, h + kh, w + kw, ci];


This revision was landed with ongoing or failed builds.Mon, Sep 21, 9:54 PM
This revision was automatically updated to reflect the committed changes.