The default lowering of vector transpose operations generates a large sequence of
scalar extract/insert operations, one pair for each scalar element in the input vector.
In other words, the vector transpose is scalarized. However, there are transpose
patterns where one or more adjacent high-order dimensions are not transposed (for
example, in the transpose pattern [1, 0, 2, 3], dimensions 2 and 3 are not transposed).
This patch improves the lowering of those cases by extracting/inserting full n-D
vectors instead, where 'n' is the number of adjacent high-order dimensions not being
transposed. By doing so, we avoid scalarizing the code and generate a more performant
vector version.
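As a rough illustration of the idea (a hedged Python sketch, not the actual MLIR lowering; the function name and nested-list representation are made up for this example), a [1, 0, 2, 3] transpose only needs to move whole 2-D inner blocks, so the number of extract/insert pairs drops from d0*d1*d2*d3 scalars to d0*d1 sub-vectors:

```python
# Hypothetical sketch: for permutation [1, 0, 2, 3], dimensions 2 and 3
# are untouched, so each move copies a full 2-D sub-vector.
def transpose_1023(v):
    """v is a nested list of shape (d0, d1, d2, d3)."""
    d0, d1 = len(v), len(v[0])
    out = [[None] * d0 for _ in range(d1)]
    for i in range(d0):
        for j in range(d1):
            # One extract/insert of a whole 2-D block per (i, j) pair:
            # d0*d1 moves instead of d0*d1*d2*d3 scalar extract/inserts.
            out[j][i] = v[i][j]
    return out

v = [[[[(i, j, k, l) for l in range(5)] for k in range(4)]
      for j in range(3)] for i in range(2)]
t = transpose_1023(v)
assert all(t[j][i] == v[i][j] for i in range(2) for j in range(3))
```

For a 2x3x4x5 input this performs 6 block moves rather than 120 scalar extract/insert pairs.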
Paradoxically, this patch shouldn't improve the performance of transpose operations when
targeting LLVM: the LLVM pipeline is able to optimize away some of the extract/insert
operations, and the SLP vectorizer converts the scalar operations back to their vector
form. However, scalarizing a vector version of the code in MLIR and relying on the SLP
vectorizer to reconstruct the vector code is highly undesirable for several reasons.
Side note: have you looked into generalizing the shuffle based approach?
I think most of the code below could go away if the shuffle approach works well enough.
Basically the idea is:
vector n-D -> vector 1-D via shape cast -> shuffle -> transposed vector n-D via shape cast.
The shuffle mask is a simple (linear scan -> delinearize -> transpose -> linearize) for each element.
This works very well in 2-D, but I have not tried it in higher dimensions myself as there was no concrete use case.
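The mask computation described above can be sketched as follows (a hedged Python illustration under row-major linearization; the function name is made up, and the permutation convention assumed is that result dimension d corresponds to source dimension perm[d], as in `vector.transpose`):

```python
def transpose_shuffle_mask(src_shape, perm):
    """Build the 1-D shuffle mask realizing an n-D transpose:
    for each result element (linear scan), delinearize its position
    in the result shape, permute the coordinates back to source
    order, and linearize in the source shape."""
    res_shape = [src_shape[p] for p in perm]
    total = 1
    for d in src_shape:
        total *= d
    mask = []
    for lin in range(total):
        # Delinearize lin in the result shape (row-major).
        coords, rem = [], lin
        for d in reversed(res_shape):
            coords.append(rem % d)
            rem //= d
        coords.reverse()
        # Result dim d reads source dim perm[d]: src[perm[d]] = res[d].
        src_coords = [0] * len(perm)
        for d, p in enumerate(perm):
            src_coords[p] = coords[d]
        # Linearize in the source shape.
        s = 0
        for c, d in zip(src_coords, src_shape):
            s = s * d + c
        mask.append(s)
    return mask

# 2x3 -> 3x2 transpose: the mask gathers column-by-column.
assert transpose_shuffle_mask([2, 3], [1, 0]) == [0, 3, 1, 4, 2, 5]
```

Applying this mask to the flattened source (e.g. [0, 1, 2, 3, 4, 5] for a row-major 2x3) yields the flattened transpose, which the final shape cast then reinterprets as the n-D result.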