This is an archive of the discontinued LLVM Phabricator instance.

[mlir][Vector] Improve default lowering of vector transpose operations
ClosedPublic

Authored by dcaballe on Feb 25 2022, 4:38 PM.

Details

Summary

The default lowering of vector transpose operations generates a large sequence of
scalar extract/insert operations, one pair for each scalar element in the input tensor.
In other words, the vector transpose is scalarized. However, there are transpose
patterns where one or more adjacent high-order dimensions are not transposed (for
example, in the transpose pattern [1, 0, 2, 3], dimensions 2 and 3 are not transposed).
This patch improves the lowering of those cases by not scalarizing them and extracting/
inserting a full n-D vector, where 'n' is the number of adjacent high-order dimensions
not being transposed. By doing so, we prevent the scalarization of the code and generate a
more performant vector version.

Paradoxically, this patch shouldn't improve the performance of transpose operations if
we are using LLVM. The LLVM pipeline is able to optimize away some of the extract/insert
operations and the SLP vectorizer is converting the scalar operations back to its vector
form. However, scalarizing a vector version of the code in MLIR and relying on the SLP
vectorizer to reconstruct the vector code again is highly undesirable for several reasons.

Diff Detail

Event Timeline

dcaballe created this revision.Feb 25 2022, 4:38 PM
dcaballe requested review of this revision.Feb 25 2022, 4:38 PM
Herald added a project: Restricted Project. · View Herald TranscriptMar 3 2022, 6:35 PM
This revision is now accepted and ready to land.Mar 3 2022, 7:22 PM
nicolasvasilache accepted this revision.Mar 4 2022, 1:32 AM

Not blocking you for internal reasons but please take the opportunity to clean up this older implementation as I outline in the review.

Thanks!

mlir/lib/Dialect/Vector/Transforms/VectorTransforms.cpp
344

Side note: have you looked into generalizing the shuffle based approach?

I think most of the code below could go away if the shuffle approach works well enough.

Basically the idea is:
vector n-d -> vector 1-d by shape cast -> shuffle -> transpose vector n-d by shape cast.

The shuffle mask is a simple (linear scan -> delinearize -> transpose -> linearize) for each element.

This works very well in 2-D but I have not tried in higher-D myself as there was no concrete use case.

370

A helper findFirstIndex would be appropriate to avoid cluttering the main pattern.

379

Can you use VectorType::Builder(vt).setShape(vt.getShape().drop_front(leftmostTransposedDims)) here ?

393

Hmm the recursive aspect is quite old and due for a refresh.
We have better ways of doing this that I described above: (linear scan -> delinearize -> transpose -> linearize) for each element.

The helpers to linearize/delinearize already exist here: mlir/Dialect/Utils/IndexingUtils.h.
You may want to extend the utils a bit to allow some transpose logic too and make this one-off thing obsolete.
Note that more general transpose behavior would be useful factoring out for the "full shuffle" approach I described above.

dcaballe updated this revision to Diff 413088.Mar 4 2022, 11:49 AM
dcaballe marked 2 inline comments as done.

Addressed feedback.

Thanks for the feedback! I addressed the trivial one. The cleanup/generalization would need more thought before we move forward. More info below.
I'll move forward with this if no more comments.

mlir/lib/Dialect/Vector/Transforms/VectorTransforms.cpp
344

Yes, that's the first thing I tried but, as you mentioned, there are implementation gaps.
We can think about generalizing this in the future but I see some value on the multi-dim insert/extract approach. We keep more control on the lowering. Flattening the n-d vector to 1-d and then generating a single big shuffle would work but it's up to LLVM what to do with that. For example, it could lower it to zmm registers when you really want to stay in ymm registers (actually, this just happened to me with a similar problem). We need to run some experiments but my first impression is that flattening the n-d vector by default would bring other challenges.

393

Same as before. We would need to run more experiments. Flattening the n-d vector by default would bring other challenges and we would lose control on the lowering.

mlir/lib/Dialect/Vector/Transforms/VectorTransforms.cpp
393

just to be sure we are on the same page, there is a refactoring aspect that is NFC to use more established logic and avoif locally reinventing (linear scan -> delinearize -> transpose -> linearize) for each element: this is pure C++ code improvement, I suspect it can also be reused in other places.

vector.shuffle itself is separate and indeed subject to more experiments.

dcaballe added inline comments.Mar 7 2022, 9:39 AM
mlir/lib/Dialect/Vector/Transforms/VectorTransforms.cpp
393

Ok, let me land this and address this refactoring later this week. Thanks!