This handles cases where the linalg input needs to be transposed.
if its isElementwise && isMinorIdentity we still need to broadcast. Can we push this down to merge the two path by inserting transposes if !isMinorIdentity ?
Correct, I was considering merging those two cases. That would simplify slightly the vectorization however that could generate less efficient code as right now the broadcasting is done lazily and if some operations don't need to be expanded on all the output dimensions it may pick a vector of smaller rank.
Note that this could also be optimized away by canonicalizations later on. I don't have a strong opinion, so I can change it based on what you, Nicolas and Aart think.
Nit: Add a comment broadcasting into inner most dims.
If we have can one of operands at vectorized to a lower-rank that means all its consumers are at most at the same lower-rank. In other words we can split the linalg.generic into two independent ops operating at lower-rank & higher rank domains.....
The vector.transfer_read / write operation have support for expressing the permutation and broadcast semantics as part of the permutation_map.
The transformation you are doing here looks like the kind of more general stuff needed to rewrite that permutation_map into vector.transpose + broadcast.
The tradeoff is that vector.transfer + permutation can lower better as loop + memory indexing reordering.
Introducing transpose + broadcast at the time of Linalg -> Vector is probably lowering too quickly.
OTOH we could also think of implementing some of this as canonicalizations of vector.transfer + transpose + broadcast.
This also has to be thought of in the context of vector.shape_cast and future scalable vectors.
So bottom line: I am a conflicted.
This looks more generally applicable and writing it in 2 steps as: 1) Linalg -> vector.transfer + permutation followed by 2) better support of permutation_map when lowering vector.transfer better satisfies my intuitions
OTOH it is also possible that the permutation_map in vector.transfer is more complicated to use; but this has to be weighed off against the alternative.
Can you put an extra time to chat in my cal tomorrow so we go a bit deeper with higher BW?
I'll put a blocker in the meantime because this sits at the intersection of a bunch of different things.
args_in/out attributes are not needed anymore
As discussed I switched to using affine map in vector transfer instead. I'll prepare another patch once we agree on this review to do the lowering of those transfer ops with affine maps.