This is an archive of the discontinued LLVM Phabricator instance.

[mlir][linalg] Unify generic vectorization interface.
ClosedPublic

Authored by hanchung on May 12 2023, 4:12 PM.

Details

Summary

This revision breaks the logic of maskedVectorize (on tensor.pad ops) into
precondition checks and a vectorization implementation, and unifies the
interface.

The revision also renames vectorizeLinalgOpPrecondition to
vectorizeOpPrecondition because we can vectorize ops other
than LinalgOps.
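
A rough sketch of how a caller might exercise the unified interface after this change (the names follow the summary above, but the exact signatures are assumptions rather than copies from the diff):

// Hypothetical caller sketch, assuming a unified linalg::vectorize entry
// point that accepts any supported op (a LinalgOp or a tensor.pad) plus
// optional vector sizes for masked vectorization.
#include "mlir/Dialect/Linalg/Transforms/Transforms.h"
#include "mlir/IR/PatternMatch.h"

using namespace mlir;

static LogicalResult tryVectorize(RewriterBase &rewriter, Operation *op,
                                  ArrayRef<int64_t> vectorSizes) {
  // The precondition check now covers ops other than LinalgOps as well
  // (assumed name per the rename described above).
  if (failed(linalg::vectorizeOpPrecondition(op)))
    return failure();
  return linalg::vectorize(rewriter, op, vectorSizes);
}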

Diff Detail

Event Timeline

hanchung created this revision. May 12 2023, 4:12 PM
Herald added a project: Restricted Project.
hanchung requested review of this revision. May 12 2023, 4:12 PM

Thank you so much for looking into this! I really appreciate it. LG, leaving some comments for now

mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp
1286

I think we need the notifyMatchFailure for the transform dialect. We should be able to use FailureOr<vector::TransferWriteOp> and return a TransferWriteOp from the vectorize side, I think. (Assuming that was the reason to remove FailureOr<vector::TransferWriteOp>.)

1451

return failure()?

1503

Perhaps it's worth creating a vectorizeOpPreconditions utility or similar

hanchung added inline comments. May 12 2023, 4:46 PM
mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp
1286

I can add notifyMatchFailure back. RE TransferWriteOp, the transform dialect does not use the TransferWriteOp operation, so I removed it. @nicolasvasilache do we need to return the op for the transform dialect use case?

hanchung updated this revision to Diff 521843. May 12 2023, 5:05 PM

address comments -- creating a vectorizeOpPreconditions utility

hanchung edited the summary of this revision. May 12 2023, 5:06 PM

Thanks a ton for this clean-up! Just a few comments inline :)

mlir/include/mlir/Dialect/Linalg/Transforms/Transforms.h
579
mlir/lib/Dialect/Linalg/TransformOps/LinalgTransformOps.cpp
2932

Could this diagnostic be a bit more descriptive? For example, "Unsupported Op, cannot vectorize".

2938

[nit] I wouldn't list any specific Ops here. There's already linalg.generic, linalg.conv, tensor.pad, tensor.extract
[nit2] How about making this more descriptive: "Attempted to vectorize, but failed" (I think that it helps if the vectorizer communicates _roughly_ where it failed)

mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp
1466

Would adding tensorExtractVectorizationPrecondition to this list make sense? This would allow us to get rid of customPreconditions, but then ExtractOp is never considered in isolation, is it?

1480
1524–1525

General question - would it make sense to completely separate convolutions from GenericOp? Basically, there seem to be 3 major cases in this file:

  • linalg.generic
  • linalg.conv
  • tensor.pad

It would be nice if the implementation reflected that - I know, easier said than done :) Also, perhaps my interpretation is wrong?

nicolasvasilache added a subscriber: springerm. Edited May 16 2023, 1:04 AM

I am looking at this PR and the following comment in https://reviews.llvm.org/D148261:

Most of the code in maskedVectorize is duplicated/ad-hoc. I think a first step would be to unify maskedVectorize and vectorize into a single public API that can handle an Operation *, even if this public interface just dispatches to the existing maskedVectorize and vectorize initially. Then we can think about how to move tensor.pad support into the main vectorization path that uses the vectorization state, etc.

and I am wondering what the intended end state is.

It seems there is desire to fuse the various implementations and "sink the switch" deeper into the "main vectorization path that uses the vectorization state, etc".

IMO such a deep switch into the implementation seems to go against the philosophy of external models that we have been using successfully e.g. in bufferization (@springerm in case he wants to chime in).

Instead, my hope was that we'd keep the switch at the top level and evolve towards more pluggable vectorization and external model + ops that define their own (masked) vectorization.

Could you comment on the approaches and tradeoffs @dcaballe ?

Hey, Nicolas! Thanks for the feedback!

One of the challenges we are facing right now with a split API is that we end up generating code like the following on the client side:

for (auto op : candidates) {
  SmallVector<int64_t> vectorSizes;
  if (auto linalgOp = dyn_cast<linalg::LinalgOp>(op)) {
    if (enableVectorMasking) {
      vectorSizes.append(getVectorSizes(linalgOp, canonicalVectorShape));
    }
    (void)linalg::vectorize(rewriter, linalgOp, vectorSizes,
                            vectorizeGatherAccesses);
  } else if (auto padOp = dyn_cast<tensor::PadOp>(op)) {
    if (!enableVectorMasking) continue;
    auto ty = padOp.getResultType();
    // TODO(hanchung): Infer the vector sizes for pad op after
    // maskedVectorize method allows dynamic result shapes.
    if (!ty.hasStaticShape()) continue;
    SmallVector<int64_t> vectorSizes(ty.getShape().begin(),
                                     ty.getShape().end());
    FailureOr<vector::TransferWriteOp> maybeWriteOp =
        linalg::maskedVectorize(rewriter, padOp, vectorSizes);
    if (failed(maybeWriteOp)) {
      continue;
    }
  }
};

This has several drawbacks: a) both APIs are essentially identical but we have to call them individually, b) the client needs to know about all the APIs and dispatch between them itself, and c) we have to add redundant support for flags, which can also lead to different "versions" of the APIs if they are not all updated at the same time.
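
With a unified entry point, the loop above could collapse to roughly the following (an illustration only; getVectorSizes, canonicalVectorShape, enableVectorMasking and vectorizeGatherAccesses are the client-side helpers and flags from the snippet above, and the vectorize signature is an assumption):

// Sketch: one call path for both LinalgOps and tensor.pad ops; the client no
// longer dispatches on the op type itself.
for (Operation *op : candidates) {
  SmallVector<int64_t> vectorSizes;
  if (enableVectorMasking)
    vectorSizes = getVectorSizes(op, canonicalVectorShape);
  (void)linalg::vectorize(rewriter, op, vectorSizes, vectorizeGatherAccesses);
}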

The second big problem is masking. It is complicated to do it right. It requires a vectorization state and logic to determine whether an operation actually needs masks or not, so having everything go through the main vectorization path should help with all of this without replicating that logic. To give you a more specific example, the masks introduced in vectorizeAsTensorPadOp are causing some "silent" performance issues because we are masking all the pad ops, including those that don't require masks. That leads to some canonicalization patterns not triggering.

I think having a unified API and vectorization path should help with all of this, as long as we allow the flexibility that we need. Regarding extensibility, I thought we already had that with the "hook" mechanism. Perhaps it's just a matter of extending it to support the external model approach? In that sense, bufferization seems like a much more general transformation that spans multiple dialects. For vectorization we have multiple vectorizers (e.g., Linalg and Affine), so within the Linalg vectorizer things would be kind of contained, modulo some experimental operations that we may have off-tree.

Another option would be to expose the vectorization state and all the masking logic in the public API but that doesn't sound good to me either. What are your thoughts about this?

I think having a unified API and vectorization path should help with all of this, as long as we allow the flexibility that we need. Regarding extensibility, I thought we already had that with the "hook" mechanism. Perhaps it's just a matter of extending it to support the external model approach?

Precisely, I am talking about how to evolve this towards a better unified path, hopefully avoiding the pitfalls I see us heading into while at the same time paying off some older technical debt.

Reading the quoted sentence above, I think I understand where the confusion comes from:

  1. the hook mechanism is meant for extending vectorization to support new ops that appear *within* a linalg.generic region. It is one possible extension mechanism; it may or may not need to move to an external model (mostly depending on how much extensibility we need in the future).
  2. the top-level switch is about supporting new ops that are unrelated to linalg.generic (e.g. tensor.pad, memref.copy). Similarly here, this may or may not need to move to an external model.

A unified API is good, but I think I am also seeing early signs of trying to shoehorn 2. into 1., and I would caution against that: not everything should be represented in terms of / thought of as a single linalg.generic.

You can already see that e.g. tensor.pad has a different tiling external model and a different bufferization external model than a linalg.generic.
Some of the difference is due to the fact that tensor.pad is not DPS and we should likely evolve it to become DPS, but even with DPS this is still far from a linalg.generic.
Side historic note: early modelings tried to represent tensor.pad and tensor.concat as linalg.generic, but that was a bridge too far: interfaces (e.g. for tiling, for bufferization, ...) were the way to go here.

At the top-level today, we have:

  1. linalg.generic ops that implement specific flavors of convolution we know how to handle
  2. linalg.generic with projected permutations for which we have a general algorithm
  3. memref.copy (which is unnecessary duplication at this level but exists for other reasons)
  4. tensor.pad
  5. other ops we'd want to support

Separately, we have the hook for the body of a linalg.generic with projected permutations that handles tensor.extract etc.

Strawman proposal

We have likely reached a point where we need to revisit some of this and attack the deeper, first-order design issue: why do we have 2 linalg.generic vectorizations?

My TL;DR is that complexity increased because of trying to do too many things at the same time that are not materialized in the IR.
We have reached the point where we could split this better between:

  1. pluggable top-level op vectorization (with and without masking); this would capture tensor.pad, any linalg.generic and other ops at that level.
  2. better separate masked transfers from masked/unmasked compute payload by introducing a vector.generic that we have been missing for a while: vector.generic is a programmable vector abstraction that carries good arithmetic intensity (i.e. it is a "better vector.contract" that also supports in-register reuse with sliding windows etc)
  3. progressive lowerings and canonicalization between masked/unmasked vector.generic and other vector ops. The current hooks we have will then become progressive lowerings within the vector dialect and should work quite seamlessly with masking (i.e. pull one op out of a masked/unmasked vector.generic, rinse, repeat)

I believe this will set us off on a more composable longer-term path and nicely intercepts some of the ongoing work on higher-dimension vector programming for GPU.

Thoughts?

mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp
1524–1525

For this particular point, I would want to see a real cost/benefit discussion before an effort is attempted to "completely separate convolutions from GenericOp", which has potentially deep implications.
The fact that the current implementation of vectorization of Linalg ops with a*i + b*j is separate is to me more a symptom of missing abstractions in the vector dialect and the complexities that this creates.
I elaborate more below.

@dcaballe @hanchung just to be clear in case this wasn't: this is not a request to hold off on this particular PR.
You have legitimate usability needs that seem addressed here.

The points I raised are about looking around the next corner, reevaluating the current state more holistically, and evolving towards a better future: I believe this first implementation of vectorization, which we have kept adding to and which predates a lot of the MLIR infrastructure, has reached its limit.

hanchung updated this revision to Diff 523100. May 17 2023, 10:19 AM
hanchung marked 5 inline comments as done.

Thanks for the review!

mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp
1466

The old public method, vectorizeLinalgOpPrecondition, has this input argument, and I'd like to keep it the same for users. It exposes the check to users; I assume they would use the method to do some analysis before they configure pipelines. Users should consider this case when they are using it, right?

1524–1525

We can separate it into linalg::ConvInterface, linalg::LinalgOp, and tensor::PadOp, but there are some concerns. What if an operation implements ConvInterface but can't be vectorized by the vectorizeConvolution method? Should we fall back to trying vectorizeAsLinalgGeneric? I think we need a wider discussion about what the final vectorization model looks like before we completely separate them.
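
For context, the fallback pattern under discussion is roughly what the LinalgOp path already does today; a sketch, abbreviated rather than copied verbatim from Vectorization.cpp:

// Try the specialized convolution vectorizer first; if the op is not a
// supported convolution flavor, fall back to the generic algorithm.
FailureOr<Operation *> convOr = vectorizeConvolution(rewriter, linalgOp);
if (succeeded(convOr)) {
  llvm::append_range(newResults, (*convOr)->getResults());
  return success();
}
// Not a recognized convolution: use the projected-permutation generic path.
return vectorizeAsLinalgGeneric(rewriter, state, linalgOp, newResults);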

Thanks @nicolasvasilache for the details! Yes, I agree that we should evaluate the future state of vectorization and make progress toward it, though I don't know what it would look like. My intention is to have a unified API. I believe that this would be a good start toward a better future. (At least, we're starting the discussion here.)

I think we are pretty aligned but there are some misunderstandings. The goal here was not to vectorize tensor.pad as a linalg.generic, but just to bring tensor.pad vectorization into the main vectorization path so that we can reuse the vectorization state for masking, etc. (that is not done by this patch), while preserving its separate vectorization logic. That's all :).

I don't expect that everything can be vectorized using linalg.generic, but the fact that we have to extend the "Linalg vectorizer" to support tensor ops kind of indicates that we have a representation gap in Linalg. Maybe we need a new flavor of linalg.generic where inputs can have different bounds, or maybe we can extend linalg.generic to support "modifiers" of the iteration space (i.e., affine.if ops). This limitation also impacts convolution vectorization, which is kind of special-cased because: 1) we are kind of decomposing conv ops in place as we vectorize, and 2) conv indexing maps are not pure affine but semi-affine. We can certainly move decomposition outside of the vectorizer and improve vectorization support for semi-affine maps, but that won't fix the abstraction gap in Linalg. We are mostly relying on LinalgOpInterface, but that doesn't seem to be enough. We could extend the Linalg representation to fill the gaps, or think about a new "VectorizableOpInterface" that can be implemented by ops outside Linalg and provide all the information that the vectorizer needs. WDYT?
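
To make the last idea a bit more concrete, here is a purely hypothetical sketch of what such an interface could expose; nothing like this exists in this revision, and a real MLIR interface would be declared in TableGen rather than as a plain C++ struct:

// Hypothetical "VectorizableOpInterface" sketch, only to illustrate the idea
// above; not part of this patch.
struct VectorizableOpInterfaceSketch {
  // Report whether the op can be vectorized with the given (possibly empty)
  // vector sizes, without modifying the IR.
  virtual LogicalResult
  vectorizePrecondition(Operation *op, ArrayRef<int64_t> vectorSizes) const = 0;
  // Rewrite the op into vector form, reusing the shared vectorization state
  // so that masking decisions stay consistent across ops.
  virtual LogicalResult vectorize(RewriterBase &rewriter, Operation *op,
                                  ArrayRef<int64_t> vectorSizes) const = 0;
  virtual ~VectorizableOpInterfaceSketch() = default;
};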

vector.generic sounds great as an old TODO, indeed, but I think that's something to use after vectorization. It won't solve the problem that we have before vectorization.

dcaballe accepted this revision. May 18 2023, 10:54 AM

Ok, I think we can continue the design discussion in a different venue. Nicolas is fine with proceeding, so if all the feedback has been addressed, it should be ok to move forward with this. Thanks for working on this!

This revision is now accepted and ready to land. May 18 2023, 10:54 AM
This revision was automatically updated to reflect the committed changes.