This PR adds a pack_greedily transform operation that infers the packing for gemm
subcomputations embedded within any LinalgOp and packs accordingly.
A normalization step guarantees that the innermost op dimensions appear in one of 8
possible (m, n, k) orders, specified as a parameter, from which all packed forms
can be emitted.
The current implementation takes an arbitrary LinalgOp and tries to pack it along
dimensions ordered as (kk, mm, nn) and with sizes (32, 8, 16).
These are arbitrary defaults that can later be lifted into parameters of a
transform.pack_greedily op.
This achieves a new level of normalization and generalization for any n-D
LinalgOp that contains a contraction hidden within it:
we will always see a predictable packed form for any of these ops.
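As a purely hypothetical sketch (the lifted op does not exist yet per this PR; the op name is taken from the description above, but the attribute names, the dim-order encoding, and the surrounding transform-dialect syntax are all assumptions for illustration), the parameterized form might look like:

```mlir
transform.sequence failures(propagate) {
^bb0(%module: !transform.any_op):
  // Match any LinalgOp that may embed a gemm subcomputation.
  %op = transform.structured.match interface{LinalgOp} in %module
      : (!transform.any_op) -> !transform.any_op
  // Hypothetical lifted op: the packed sizes and the (m, n, k) inner-dim
  // order become parameters instead of the hardcoded (kk, mm, nn) order
  // with sizes (32, 8, 16).
  %packed = transform.structured.pack_greedily %op
      matmul_packed_sizes = [8, 16, 32]
      matmul_inner_dims_order = [2, 0, 1]
      : (!transform.any_op) -> !transform.any_op
}
```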
Wouldn't it just be better to use a bitset instead of a DenseSet for `seen`? It looks like you're iterating `nextPos` from 0 to `permSize` here, so it's pretty bounded!
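To illustrate the reviewer's suggestion, here is a minimal standalone sketch (the function name `completePermutation` and its shape are assumptions, not the PR's actual code): since positions are bounded by `permSize`, a flat bit vector can replace a hash set when completing a partial permutation.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical sketch: complete a partial permutation of [0, permSize) by
// appending the unused positions in increasing order. Because positions are
// bounded by permSize, a bit vector (here std::vector<bool>) is enough to
// track "seen" positions -- no hash set required.
std::vector<int64_t> completePermutation(std::vector<int64_t> perm,
                                         int64_t permSize) {
  std::vector<bool> seen(permSize, false);
  for (int64_t p : perm) {
    assert(p >= 0 && p < permSize && !seen[p] && "invalid partial perm");
    seen[p] = true;
  }
  // Scan nextPos from 0 to permSize, exactly the bounded loop in question.
  for (int64_t nextPos = 0; nextPos < permSize; ++nextPos)
    if (!seen[nextPos])
      perm.push_back(nextPos);
  return perm;
}
```

The trade-off is the usual one: a bit vector is O(permSize) bits with no hashing or allocation per element, which wins easily when the universe is small and dense, as it is for permutations of op dimensions.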